Ep. 64: Building "Shazam for Sheet Music": Songscription CEO Andrew Carlins

Episode Description:

Andrew Carlins is the CEO and co-founder of Songscription, a company that uses AI models to automate music transcription from audio recordings. The company is still in the early stages, having launched only a few months ago at the time of this recording. But they’ve got big ideas about how to use technology to revolutionize music education, and it’s a conversation you won’t want to miss.

Featured On This Episode:
Andrew Carlins

Andrew is a co-founder and CEO of Songscription.ai, a web-based application that automates music transcription. He is currently an MBA and MA in Education student at Stanford University and grew up playing the piano and baritone saxophone.

Episode Transcript:

*Episode transcripts are automatically generated and have NOT been proofread.*

Andrew Carlins, welcome to Selling Sheet Music. How are you doing?

Doing really well. Thanks for having me, Garrett.

I appreciate you taking the time to come on the show and talk about yourself and your company and AI music. That’s something that a lot of us artists or creators in the music industry have a lot of feelings about, right?

And there’s a lot of things that are unknown about it that we’re trying to figure out. And so I’m really grateful for you taking the time to walk us through some of it.

Before we get into the weeds with the technology, could you just introduce yourself briefly to our listeners? Tell us about your musical background and what it is that you hope to achieve with Songscription.

Yeah, absolutely. I grew up playing the piano and baritone saxophone. I was drawn to music mostly because I grew up with a stutter.

And when you sing, you don’t stutter. And certainly when I played my saxophone and my piano, I didn’t stutter either.

There is this great organization called the Stuttering Association for the Young that empowers children who stutter through the arts, recognizing that when you sing or play an instrument, it’s someone’s one unique moment of fluency.

And because of that, the arts really helped me find my voice. I participated in musical theater and played in marching band, symphonic band, jazz band, and classical piano.

I really, really loved music from a very young age. And in terms of what we’re building at Songscription, our mission statement, our lofty goal, is to empower musicians worldwide to play, share, learn, and create the songs that they love.

Songscription’s approach is a little bit different. Garrett, you mentioned AI music, and people have varying feelings about different segments of the market. I would segment the market into generative AI music, so creating songs from scratch.

And then there’s this suite of productivity tools that help musicians, creators, and arrangers do their jobs faster, or allow them to focus on the truly human parts of music as opposed to the more mechanical parts.

We focus on that latter set: a productivity tool for creators, learners, and educators.

Help me to understand the kind of technology that’s involved in something like Songscription. I’ve used this service a couple of times. It’s actually worked.

It’s possibly the best one I’ve seen yet when it comes to AI transcription. Basically, you can upload an MP3 file, or you can give it a YouTube link, and it transcribes whatever instrument you ask it to.

Right now, I think it’s limited to piano, violin, and maybe guitar. I’ve used it on a couple of piano tracks to notate things, and it’s done surprisingly well.

But when people hear about AI, they are generally thinking of, like you said, the generative AI, the ChatGPT, and the text-based large language models.

So what kind of AI technology are you using, and how does that compare to the other types of things that you hear about in the news?

For sure. So there’s this sub-field of AI called Music Information Retrieval. And it turns out that taking an audio file and mapping it to sheet music is a really, really difficult problem.

One that is completely different from the kind of problem that your standard LLM solves. And that’s why, if you upload a song to, say, ChatGPT, it does a really bad job of mapping it to sheet music.

It’s because the problem is fundamentally different, and the way the models are trained is fundamentally different. So, actually, we’ve had to train our own AI models from scratch, in-house, with a team of four of us.

And that’s why, Garrett, it sounds like you tried some other services out there. It’s why ours produces fundamentally different results.

So does that mean you’re literally showing it a PDF of a score and a recording and telling it to connect the dots? Or what does that training look like?

Yeah, yeah. So our Chief Science Officer, our Head of AI, is named Tim Bayer. He discovered a real breakthrough way of creating quite accurate models.

And the model is actually trained in two steps. Tim Bayer’s paper was published last year, and it was presented at ISMIR, the International Society for Music Information Retrieval.

So for yourself, or for any of the viewers out there who are more curious or on the more academic side, the paper is public. We’ve since refined the model and changed some of the underlying approaches.

But the overall approach, at a high level, is public information. To you, Garrett, it looks like one model: you upload some audio, which could be recorded on your phone, and it gives you the sheet music for it.

It’s actually a two-step process. In the first step, we’ve trained a model that takes audio and turns it into MIDI. In the second step, we have a second model that takes MIDI and maps it onto sheet music for the given instrument.

And just like any AI model, you have a lot of inputs and outputs, a lot of examples, and you run those through an algorithm, such that the algorithm picks up that if it hears this particular pitch, it maps to this particular MIDI output.

And then the second model, given MIDI output X, in the context of this greater scheme, then it maps to this particular note.
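To make that audio-to-MIDI mapping concrete, here is a minimal sketch in Python. This is not Songscription’s model, just the standard arithmetic relating pitch to MIDI note numbers (A4 = 440 Hz = MIDI 69), the kind of input-to-output pairing a trained model learns at scale.

```python
import math

# Standard conversion from a detected fundamental frequency (Hz) to a
# MIDI note number: A4 = 440 Hz = MIDI note 69, 12 semitones per octave.
def freq_to_midi(freq_hz: float) -> int:
    return round(69 + 12 * math.log2(freq_hz / 440.0))

# Step two at toy scale: map a MIDI number to a note name for notation.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def midi_to_note_name(midi: int) -> str:
    octave = midi // 12 - 1  # MIDI 60 is C4, "middle C"
    return f"{NOTE_NAMES[midi % 12]}{octave}"

print(midi_to_note_name(freq_to_midi(261.63)))  # prints "C4"
```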

And does the AI, what do we call it? Do you have a name for it? Does your algorithm have a name?

We could call it the AI algorithm.

We don’t have a name.

You don’t call it Cathy or something? Okay, well, there’s something you could do. You can name your AI.

What should we name it?

Oh, gosh.

Yeah, listeners, put it in the comments. What do you think they should name their AI chatbot? But so the AI models, do they see audio as audio, or are certain types of instruments easier to recognize than others?

It’s a very interesting question. Garrett, you mentioned that you used our piano model, and our piano model is by far the best because we focused on it.

Piano is a harder problem to solve, so we thought if we could solve it with our model architecture, then solving something like trumpet becomes a walk in the park.

It’s also a lot easier to find piano data to train on than, say, data for the piccolo. There are just more piano players out there, and there’s a lot more demand; a lot more people want piano.

Anyway, because our piano model is trained so well, you could upload a piano piece that was played in a cafe with a singer and a bass in the background, and the model is good enough to strip out just the piano audio.

That isn’t yet true for the guitar, flute, and violin models, where you kind of need to upload just a solo. They’re getting better, but they’re not yet at the piano quality that we aim to train all our models to. The piano model is trained to a point where it recognizes just the piano’s voice.

And it’s robust enough to understand that sometimes the piano is slightly out of tune, and sometimes you’ll hear background noise that it needs to filter out. There are stem-splitting models out there.

That’s not what we do, at least presently on the site. For piano, we’ve been able to train a model that’s effectively voice recognition for what a piano sounds like, and it just maps those sounds to different notes.

So it sounds like it’s almost a manual process for each instrument and each style of music. Like you have to give it feedback and explain like, this is what jazz is, and this is what club music is, and this is what classical music is.

How do you, I know we’re getting in the weeds, but how do you give that feedback to the model? Is it like a yes-no thing? Is it showing it the correct answer when it missed something?

When you’re training AI models, you’ll have the training set, and then the validation set.

And, as it was very astute of you to point out, we have to have a diverse set of data.

One of the hardest parts about training a really good, robust model in this space is that a lot of the existing models are trained very heavily on just classical music.

And that’s great if you’re playing Bach, or even some of the more cut-and-dried pop songs. But once you get into jazz or theater, you get a lot of these great flourishes.

And, you know, your typical eighth note in jazz is just fundamentally different. There’s this tempo rubato, and the notation is written very, very differently.

So you have to have those kinds of genres represented in the data set for the model to actually work. In terms of how it works fundamentally, you segment the data into a training set and a validation set.

The validation set you think of as your ground truth: this is what the model should output given input X. The training set is what the model learns from.

So it says, here’s input X and output Y. Would I have guessed that? Yes, great.

And it moves on and continues improving. Then you test the model against the validation set and see if it improves over time. If it does, then we know we’ve trained a robust model.

And then we could publish that to the site. And so I mentioned the model architecture is unique. That is true.

The other side of it that’s unique is thinking through a data pipeline: exactly what you said about how to label the data, what kind of validation set makes sense, how much of that set needs to be jazz, and how much needs to be classical.

How much needs to have background noise in it so that the results are good enough to be commercially viable, so that someone like yourself could upload and be happy with the results?
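To make the training-set/validation-set mechanics concrete, here is a toy sketch in Python. The “model” is just a least-squares line from log-frequency to MIDI number, a stand-in for a real neural network, but the hold-out logic is the same: learn only from the training set, then score against the validation set’s ground truth.

```python
import math
import random

# Toy stand-in for labeled data: (frequency in Hz, ground-truth MIDI note)
# pairs for the 88 piano keys. Real training pairs would be (audio,
# ground-truth MIDI), but the train/validation mechanics are identical.
data = [(440.0 * 2 ** ((m - 69) / 12), m) for m in range(21, 109)]
random.seed(0)
random.shuffle(data)

# Hold out 20% as the validation set: the ground truth the model is
# scored against but never learns from.
split = int(0.8 * len(data))
train_set, val_set = data[:split], data[split:]

# "Train" a trivial model on the training set only: a least-squares fit
# of midi = a * log2(freq) + b.
xs = [math.log2(f) for f, _ in train_set]
ys = [float(m) for _, m in train_set]
n, sx, sy = len(xs), sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

# Test against the held-out validation set: if accuracy is high here,
# the model generalized instead of memorizing the training set.
correct = sum(round(a * math.log2(f) + b) == m for f, m in val_set)
print(f"validation accuracy: {correct}/{len(val_set)}")
```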

And where do you get the training data? Are you working with artists to get recordings? Are you picking specific things out?

What does that process look like? You hear about the LLMs just scooping up everything on the internet and dumping it through their systems. It sounds like that’s not what’s happening here.

Yeah.

We get data through a variety of sources, and we’re also able to create some synthetic data. We’ve partnered, as you mentioned, with individual artists to get their performance data.

We also have business partnerships with music companies that have helped us get data and label data at scale. And then, as I mentioned, we have some synthetic data.

So if an artist maybe sends us one particular recording, we’re able to augment that.

You can imagine putting one version of it through the model, just the regular performance, and then artificially putting cafe sounds in the background and putting that same performance through the model.

And then maybe artificially putting birds chirping in the background, for example. And so that allows the model to be robust against various background sounds, even though it’s the same underlying performance.
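Here is a minimal sketch of that style of augmentation, assuming waveforms as NumPy arrays; the function and the signal-to-noise levels are illustrative, not Songscription’s actual pipeline.

```python
import numpy as np

def mix_background(performance: np.ndarray, background: np.ndarray,
                   snr_db: float) -> np.ndarray:
    """Overlay background noise onto a clean recording at a target
    signal-to-noise ratio, yielding a new training example that keeps
    the same ground-truth labels as the original performance."""
    # Loop or trim the background to match the performance length.
    reps = int(np.ceil(len(performance) / len(background)))
    background = np.tile(background, reps)[:len(performance)]

    # Scale the background so the mix hits the requested SNR.
    sig_power = np.mean(performance ** 2)
    noise_power = np.mean(background ** 2) + 1e-12
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return performance + scale * background

# One clean performance can become many training examples: cafe chatter
# at 10 dB SNR, birdsong at 20 dB SNR, and so on.
rng = np.random.default_rng(0)
piano = rng.standard_normal(44100)  # stand-in for a real recording
cafe = rng.standard_normal(22050)   # stand-in for cafe ambience
augmented = mix_background(piano, cafe, snr_db=10.0)
```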

So we aren’t creating new performances for two reasons. One, we want to empower artists.

And two, there’s something uniquely human about playing music that generative AI models have not yet come close to mimicking, and we aren’t sure they ever will.

So having that human performance allows the models to perform at the highest level possible.

How similar is what you’re doing to the optical music recognition, or OMR, companies out there? We’ve been promised for a decade that you could scan your sheet music and turn it into a MIDI file, and it still hasn’t materialized.

I’m almost wondering if you’ll get there first, right? Is there something about audio that makes it easier? Or does it sort of go hand in hand?

So those are actually two totally different fields.

And as a team, we’re a bunch of music nerds, we’re a bunch of passionate musicians and also tech nerds. And so the OMR field is something that we’re always reading up on, and we’re excited about the different developments there.

But it’s fundamentally a different problem. Our AI model, to just focus on piano for example, is trained specifically to recognize the piano’s voice and then map that to sheet music.

The OMR models are more computer vision, and image recognition requires a totally different model architecture and a totally different set of data. At that point, you’re saying input PDF X, output MusicXML Y.

In terms of why we chose to focus on music information retrieval and audio-to-score versus OMR, it’s because audio-to-score is something that really relates to us. As I mentioned, we’re passionate musicians.

We’re also not musical prodigies. We just love, love, love the greats and we love to practice. And we realized that a lot of music is inaccessible for learners and educators worldwide.

In high school, for example, I mentioned I played the bari sax.

I played in pit orchestras and jazz bands and was given scores written for, say, the tenor saxophone, which was a pain in the neck to actually sit down and transcribe.

Or I’d be in jazz band and hear this awesome, awesome solo that was recorded, I don’t know, three decades ago by one of the greats, and I was like, how the heck do I play that? And there was no way that I ever could.

So we thought as a team, how could we solve this for this generation of artists? You know, it’s the case that 50% of children put their instrument down before graduating high school.

And in our experience, a lot of the reason we don’t play as much is that the music we want to play, on our instrument, at our level, is inaccessible. So what if you could create a world where that’s not necessarily the case?

Or a world where a composer, like when I was composing, doesn’t have to sit down and write everything out by hand.

I think most people, myself included, operate under the assumption that with automatic music transcription, it’s not a question of if, it’s a question of when. We all just kind of assume that it’s going to happen eventually, that someone will figure it out.

Maybe it’ll be you, maybe it’ll be someone else. When that does happen though, it’s going to have enormous consequences on the music industry. I make a sizable amount of my income transcribing music for people, or arranging music that doesn’t exist.

There are numerous publishing companies that are putting out music that could be obsolete if anyone could just take a recording and throw it into an AI and get sheet music like that.

So, I’d like to hear from you, just your vision of what the industry would look like if everything went the way that you hope it will.

Writing down music is more of an art than a science, at least at that last 10% stage, right? And there’s no one right way to display a given piece. Even if you take a very basic example of common time versus cut time, right?

The piece has a certain feel to it. AI models can get good at identifying it, but they will never be as good as a human, because there’s such artistry to it.

What we imagine building is a tool and a platform that could get somebody 90, 95, even 99 percent of the way there.

But for the human arrangers and composers out there, the tool could speed up their workflow considerably, because many of them, at least the ones we’ve spoken to, currently spend hours going through their compositions, listening and figuring out what the heck they played. And for the transcribers and arrangers out there, there’s a long tail of transcribers who are transcribing or arranging things that are rather basic.

Maybe instead of spending 80, 90 percent of their time writing down note by note, they could take on a greater volume of requests.

Because they only need to focus on that last mile delivery, on making sure that it is in common time, not cut time, and making sure that the arpeggio is linked and tied together.

That’s something that’s focusing on the more human parts of the song, as opposed to the more mechanical parts. It is our view, Garrett, too, that transcription is the bottleneck in the industry. It requires specialized labor.

It’s quite expensive, and it’s all done by hand, such that if transcription were made more accessible, the demand for transcription would skyrocket. So we don’t see ourselves replacing human transcribers.

Our ideal world is one where human transcribers and arrangers could focus entirely on the more human musical elements of a song, and there’ll be more demand for transcription because more people will realize that it’s right at their fingertips.

So obviously, you’re early in the development of the company. But where do you see the future of the business? Are you aiming to become a major player in music publishing?

Are you hoping to license this technology to the existing players? Where do you see your role in the music industry going forward?

Yeah, so we are quite young, Garrett, and we’re grateful that TechCrunch and a number of other major media outlets covered us in June.

As a team, we’ve been working together since only last December, and we all met, or three of us met, in a Stanford class called Lean Launchpad for Education.

And that class empowered teams of students to focus on solving some problem troubling the field of education.

And for us as musicians ourselves, the education angle, empowering your amateur composer or your amateur piano player, for example, is something that resonates deep in our hearts. So that angle seems really interesting longer term.

Think about the education components, how this tool could be used by music educators. Say, your average high school band teacher could get their band’s favorite songs arranged at the individual level of play of each of their band members. That kind of potential future really excites us.

It’s part of a problem that we could relate to, and also one that we know hundreds of millions of people around the world can relate to.

In terms of what that looks like, and whether it’s a product that goes directly to consumers, or a product that works in tandem with sheet music providers, that’s an open question.

I’d say we’re too early, and we’re too new to the space, to have a definite answer. But the goal of making transcription accessible, recognizing that transcription is the bottleneck to music education, and using transcription as a way of making music education more accessible, that’s been core to our mission from day one. That’s what attracted us to take the class in the first place.

And it’s also something we see in the users coming to our site. A lot of them are looking to either learn or create songs.

I definitely see your point about transcription being a bottleneck. I’m on a bit of a sort of personal crusade, I guess, to get artists to release sheet music of their songs, right? That’s a revenue stream I think most artists ignore.

And it’s for the same reasons you say, right? There are so many people who want to perform these songs, and the sheet music isn’t always available.

And I think the reason they’re not always available is just because the cost-benefit analysis doesn’t weigh out, right?

Like, the time it would take to hire somebody to transcribe everything, or to do it yourself, wouldn’t necessarily provide a big enough return.

And so it’s interesting to hear you talk about hundreds of millions of people demanding this music, because we’re not seeing those kinds of sales in the industry right now. We’re not selling hundreds of millions of units of a particular song, right?

Like, not in the same way that you see recorded music being consumed. Obviously, that’s a whole different thing. But I do wonder about the implications for artists if this does pan out the way that you describe.

Because on the one hand, it would be great to have that music available. On the other hand, as an artist, you would lose control of how your songs are presented. If you’re publishing a score of your song, you can make sure it looks the way you want.

You can make sure that all the details are there correctly. Like you said, the art of it. You can make sure that is all the way you want it to be.

This would undercut that potentially, right? Because anyone could just take the recording and create their own version instead of buying the one that the artist provides. To me, that’s almost a moral question, right?

There’s sort of the copyright legal question over here, but then there’s also just the moral question of like, is that right for artists? And does that help artists? Or I don’t know, how do you view that tension?

Yeah, for sure.

And it’s something that we’ve spoken quite a bit about as a team. We’ve made a very conscious decision not to tackle generative music. That’s a wonderful field, but one where we think there are a ton of ethical and moral questions.

And that’s why we’ve chosen to focus on this productivity tool that we envisioned, at least, as a way of empowering musicians, music educators, and music learners, instead of replacing anyone in that space.

What we also see on our site is that a majority of users are uploading songs that fall into what we call this long tail of requests.

So because sheet music is such a bottleneck, and, you know, Garrett, you mentioned the cost and time components of transcribing, I would actually add another major limiter in there. And this, we think, is the core limiter.

It’s that it requires a very specific set of skills. You have very famous musicians, the Beatles probably most famously, who didn’t know how to read sheet music.

And so even if they did want to write it down, they quite literally couldn’t.

You know, when we think about the long-term future of music and what streaming has done to the industry, one of the most positive things streaming has done, my Discover Weekly on Spotify for example, is create a platform where your indie artists, your up-and-coming artists, can be discovered without having to go through the traditional route of getting lucky, signing a record deal, and hoping that the label, which has a lot of power over them, promotes them.

So you see this long tail of artists finally being discovered, and we imagine a future where that’s also true for music performance and music education. Right now, it only makes sense for your most famous artists to have their sheet music transcribed, because everything is done manually and, you’re right, it’s so expensive and time-consuming to write down just one given song. And those songs, by the way, the ones that artists write down and have full control over, the ones that are up on websites and distributed by Hal Leonard, that quality of sheet music requires that last 10% of human intervention. So again, we aren’t making a play to come in as direct competition.

Our tool is really good at giving a first pass at the sheet music, which is particularly valuable for the hundreds of thousands of artists who don’t have their music transcribed today because they can’t afford it. They can use our platform for their own compositions and then sell the sheet music on their own.

Or it’s for their tens of thousands or hundreds of thousands of fans who are looking for another way to connect with them. Performing an artist’s music, we’ve found, and have even been told by the artists we work with, is one of the most profound ways that fans connect.

Do you think copyright law needs to change to accommodate what you’re trying to do or what you envision AI music recognition to be in the future?

I’ll start by saying I’m not a lawyer and I’m not a legal expert.

And the “should” question is also outside of my scope. I don’t want to say that laws ought to change or ought not to change. We want to make sure that artists remain empowered, that they remain in control.

We’re just building tools so that musicians, and the entire industry, so music learners, music composers, music educators, can focus on the human part of playing music as opposed to the more mundane parts of sitting down, writing it out, transposing it, all of that. Should copyright laws change? That’s not my purview. I don’t want to speak for other people in the space, but I know that copyright laws exist for a reason, and that’s to protect artists.

And that’s not something that we at Songscription ever hope will change. We’re also aware that the question of AI and copyright is an open one, at least in the current court system.

You have the Anthropic case, for example. You even have the famous YouTube case that tested the DMCA.

And then the current administration recently released an AI memo that seems to be more in favor of fair use for AI. But that’s an open legal question, and whether or not it should be, I’m not an expert on.

I think whichever world allows artists and musicians to feel truly empowered would be one that I’m behind. And I know the copyright laws today exist to protect artists.

Well, the reason I bring it up is because within copyright law, there are very different ways of applying it to different formats of music, right?

Getting a license for recorded music is very different from getting permission to do a printed arrangement, for example. You know, the language of the copyright law is so archaic.

I mean, it’s talking about phonorecords and stuff like that, you know, and it’s in such a different universe from the technology that you’re describing.

I guess the question, ultimately, is whether the right approach is to have courts define how this stuff applies to new technology, or whether it’s better to just create new laws.

You’re actually teaching me things, Garrett. Admittedly, I haven’t read the exact text of the law.

And we’ve been working really closely with lawyers from a ton of different backgrounds: people who are veterans in the music industry space, and people who are also veterans, well, as veteran as you can be, in the AI and start-up world.

And it’s been really cool to see how the different, how those two different fields have evolved, even in the past 10 months, that we’ve been building Songscription.

Your question’s really making me think about whether courts are the best arbiter for the future of music and music rights. I think, pragmatically, that’s what we’re seeing happen in the AI field, because it’s such an evolving question.

It would be wonderful, I think, if artists, composers, creatives were able to be more proactive in shaping policy.

What we’ve seen, and we’ve experienced this a ton, Garrett, is that musicians or folks in the field hear “AI” and many of them have this defensive reaction, because they think of your Sunos and your Udios of the world that are generating new music, and they feel a direct threat. And for us, it’s a little bit sad, because we’re not in that space. We are an AI company, but we’re more of a productivity and education tool platform.

Longer term, the education application is really fascinating for us, not the generation.

So one world I could imagine is artists coming together with publishers and folks in the industry, and widening their circle to include this up-and-coming generation of productivity-oriented tools.

Then Songscription, and folks like the OMR companies you mentioned, could work together with the creatives in a way that allows them to create more high-quality work faster. And I think there are a lot of interesting synergies that could come out of creating policy that is inclusive of this next generation of technology and protective of artists’ rights on the generative side.

That’s a perfect segue into my final question, which is just how, like literally, how can musicians prepare for this future that’s coming when we don’t understand the technology?

Like how do we learn about it? How do we get up to speed? Are there places where we can find developments as they happen?

Where are these things being discussed? It just feels like such a huge unknown. And the music industry has a long, proud history of resisting change, and ignoring change, and fighting change.

And I don’t know, this time feels different because the change feels so drastic. Or at least the potential for change feels so drastic to what we’ve seen in the past.

And I know a lot of us are just sort of holding our breath, waiting to see what happens. But if there’s a way to see what’s coming and sort of future-proof our business as musicians, I think that would be really valuable.

So what would your advice be?

Yeah, one is: reach out. This goes for yourself and for anyone listening to the podcast. My email is andrew@songscription.ai.

This is something that my team and I are really passionate about, so we’re happy to have conversations with anybody. In terms of other ways to get educated, I mentioned Tim Bayer, our Chief Science Officer, whose paper is published.

And there’s a whole field of music information retrieval with a number of published academic papers. Now, those are above my pay grade, and even five months ago I had no idea about the space; my team has had to do a lot to educate me. But I’ll say that ChatGPT and your LLMs are really good at summarizing articles in layman’s terms, and that can be a really great at-home education tool for anybody.

Upload a published paper in this space on a particular subtopic that interests you, whether it’s optical music recognition or audio-to-sheet-music transcription. The prompt that I give, because my brain is probably at that level, is generally, “Please describe this for a four-year-old.”

And it does quite a good job there. And if that’s not enough or folks would prefer more of a live conversation, that’s something that we talk about every day. And I love these conversations.
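As one concrete version of that workflow, here is a minimal sketch using the OpenAI Python client; the model name and input file are hypothetical placeholders, and any LLM chat API would work the same way, assuming an API key is configured.

```python
# Minimal sketch: ask an LLM to explain an academic paper in plain terms.
# Assumes `pip install openai`, an OPENAI_API_KEY in the environment, and
# a hypothetical text file containing the paper's extracted text.
from openai import OpenAI

client = OpenAI()
paper_text = open("ismir_paper.txt").read()  # hypothetical file

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model works
    messages=[{
        "role": "user",
        "content": "Please describe this paper for a four-year-old:\n\n"
                   + paper_text,
    }],
)
print(response.choices[0].message.content)
```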

Well, thank you for having the conversation with me.

It’s been very insightful to hear what’s going on. And I appreciate the openness and the thoughtfulness that you’ve put into this. Where can our listeners find Songscription?

For sure.

And thanks for having me, Garrett. Come visit our website at songscription.ai, and reach out either at andrew@songscription.ai or songscription@songscription.ai. We’re happy to hear from all y’all.

Thank you.

Well, thanks again. And we’re excited to see how this all develops.

Cheers.