Ep. 66: ScoreCloud: Bridging the Gap Between A.I. and Live Performance, with Johan Rönström

Episode Description:

Johan Ronström is the project manager of ScoreCloud, a company which has been described as Google Translate for Music.

Their unique approach to AI music transcription is focused on human performance, and their ScoreCloud studio lets you write and arrange music by playing into notation.

The new product ScoreCloud Songwriter can separate vocals from instruments in a single recording to create automatic lead sheets with melody, lyrics, and chords.

It was great to get another perspective on the ways artificial intelligence is changing the music industry and I think you’ll get a lot out of this conversation.

Featured On This Episode:
Johan Ronström

Johan Ronström is the project manager of ScoreCloud.

Episode Transcript:

*Episode transcripts are automatically generated and have NOT been proofread.*

Johan Ronström is the project manager of ScoreCloud, a company which has been described as Google Translate for Music.

Their unique approach to AI music transcription is focused on human performance, and their ScoreCloud studio lets you write and arrange music by playing into notation.

The new product ScoreCloud Songwriter can separate vocals from instruments in a single recording to create automatic lead sheets with melody, lyrics, and chords.

It was great to get another perspective on the ways artificial intelligence is changing the music industry and I think you’ll get a lot out of this conversation. Johan Ronström, welcome to the show. How are you doing?

Thanks.

I’m good.

Could you please tell our listeners just a little bit about yourself?

Yeah, sure. So I’m the Product Manager of ScoreCloud, and I went to the Royal Music Academy in Stockholm, Sweden, where I played the mandolin and studied traditional Swedish folk music.

And there I was recruited by one of the teachers, who is a music theory teacher, to work at ScoreCloud, which had just started then. That was seven years ago, so I’ve worked here for seven years.

So yeah, most of the people at the company have a practicing musician background, so it’s a really interesting place to work.

And I know ScoreCloud has a couple of different products, but if you were to summarize, what is the main thing that the company is trying to do? What would that be?

2:48

Core Mission & Products

Yeah, so the main thing is that we interpret audio and MIDI and generate music notation automatically from that.

So the core technology is that we listen to what you play, and then we try to figure out which notes these were, what key it is, what time signature it is, and where the downbeats are, and then we can generate written music from your played

What are the different products that ScoreCloud offers?

So among the ScoreCloud products, there is our big program, ScoreCloud Studio, which is the desktop music editor.

It is the more powerful editor. And there you can play your instrument or sing a song, one instrument at a time, and we transcribe it, and you can overdub yourself with audio.

So you can sing your song, first the melody, and then the second voice, and third voice, and then you have the sheet music finished. And then there you can also edit, like styling and margins and all of those visual tools.

And then from that, we have made a version of this that’s called ScoreCloud Songwriter, which focuses on vocals and instrument at the same time.

And this is for singer-songwriter type musicians, where you sing something and you play one or two instruments, maybe a guitar, a piano, and a bass, and then vocals.

And then there we can separate the audio of the vocals from the backing instruments, and we can transcribe into a lead sheet or a chord sheet instantly.

So you can input either your previous recordings or just record into the program, and you get a lead sheet just from the recording. And then we also have the app for iOS, which is ScoreCloud Express, which is more of a notepad style.

You record your singing mostly or whistling or a single instrument, and you will get just the melody line written down.

And all these programs sync, so if you record something in the app, it will just appear in the desktop program, and you can keep editing and overdub and stuff on Windows or Mac OS.

And then we have a web interface where you can share and play the results of it. If you want to send a link, you will get the sheet with the audio and the synced playback of the sheet music and the audio in the browser.

5:26

AI Technology Explained

How does the technology work?

Is it an AI-based program? Is it some other system?

Yeah. So it is a combination of different types of AI. The core technology of the first versions of the company, like 10 years ago, was way before the large language models and the trained neural network AI models.

So this was a music cognition-based AI, which is rule-based. So we went through, like what are the notes in the song and how would a human interpret this?

So it was not like one of these new trained AIs. It’s just rule-based on how a human would hear the music, and this is a system developed by our founder Sven Ahlbeck, who studied this for many, many years at the Royal Academy of Music.

It started as a research project on how a human can understand the music we hear. How do we know where the downbeat is? And after just five or eight notes, we can say like this was the root note of the music.

How does that work? So he wrote a book about this and then created a demo computer program to illustrate this.

And the book was very thick and very hard to read, but with the music program, all the musician friends were like, can we get this program? So then the company was started to develop this program further.

And then in addition to that, we have added new neural AI models, mainly to transcribe audio into MIDI, so that our model can go through the notes in the MIDI and make decisions about the musical context.

But so now in our current program, there is an AI model that separates audio of vocals from the audio of instruments. And then we have two different AI models that can decide or transcribe the notes of the melody and the notes of the accompaniment.

And then we have a third that can transcribe lyrics, and then everything gets fed into our music understanding, the music cognition model that produces the sheet music from this. So it’s a combination of many different types of AI.

And then on top of that, we have the editor with all the editing tools, but that’s more mechanical, more manual.
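
For readers who think in code, the chain of models described above can be sketched roughly as follows. This is only an illustration under loud assumptions: every name here (`separate_sources`, `transcribe_melody`, `build_lead_sheet`, and so on) is hypothetical, the trained models are replaced by stubs that return fixed data, and none of it is ScoreCloud's actual code or API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Note:
    pitch: int       # MIDI note number (60 = middle C)
    onset: float     # start time in seconds
    duration: float  # length in seconds


@dataclass
class LeadSheet:
    melody: List[Note]
    accompaniment: List[Note]
    lyrics: List[str]


def separate_sources(audio: List[float]) -> Dict[str, List[float]]:
    # Stub for a trained separation model. A real one would return two
    # different signals; here both stems are plain copies of the input.
    return {"vocals": list(audio), "backing": list(audio)}


def transcribe_melody(vocals: List[float]) -> List[Note]:
    # Stub for a trained audio-to-MIDI model for the vocal line.
    return [Note(60, 0.0, 0.5), Note(64, 0.5, 0.5)]


def transcribe_accompaniment(backing: List[float]) -> List[Note]:
    # Stub for a second trained model for the backing instruments.
    return [Note(48, 0.0, 1.0)]


def transcribe_lyrics(vocals: List[float]) -> List[str]:
    # Stub for a trained lyrics (speech-to-text) model.
    return ["la", "la"]


def build_lead_sheet(melody: List[Note], accompaniment: List[Note],
                     lyrics: List[str]) -> LeadSheet:
    # Stand-in for the hand-written, rule-based step. The real step would
    # decide key, meter, and downbeats; this one only orders notes by onset.
    by_onset = lambda note: note.onset
    return LeadSheet(sorted(melody, key=by_onset),
                     sorted(accompaniment, key=by_onset),
                     lyrics)


def run_pipeline(audio: List[float]) -> LeadSheet:
    stems = separate_sources(audio)
    melody = transcribe_melody(stems["vocals"])
    accompaniment = transcribe_accompaniment(stems["backing"])
    lyrics = transcribe_lyrics(stems["vocals"])
    return build_lead_sheet(melody, accompaniment, lyrics)


sheet = run_pipeline([0.0] * 8)
print(len(sheet.melody), len(sheet.accompaniment), sheet.lyrics)
```

The point of the shape is that the trained models only produce notes and words, and a separate, inspectable step turns those into notation.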

8:04

AI Distinction & Impact

So one of the things that I’m trying to figure out is, like you said, ScoreCloud’s been around for 10 plus years already.

I actually first tried it 10 years ago. I had to go look up my password to log in and mess with it some more before I talked to you.

But we hear all this stuff about ChatGPT and all of the AI advancements, and I feel like there’s new companies all the time that are popping up around AI and music technology.

And I don’t know that it really matters, but I guess I’ll ask you, do you think the distinction between this is an AI music tech and this is a music tech, do you think that’s an important one? Like, what does that do?

Yes, I think there is a distinction. And for us, I think that we have tried to stay away from the generative AI part of this. Like, we don’t have functions that write music for you.

We try to be a tool that transcribes the ideas you already have, to get them onto paper faster and more efficiently. And then we do have tools to, like, get automatic chord symbols for your idea, or, like, randomize different styles.

But for me, the biggest difference is that the music comes from you. And this is a tool that replaces you writing it down with a pen. And it doesn’t replace you coming up with the music.

I struggle a bit with how I feel about the generative AI creating music. And I feel that we are far enough away from that to feel good about our program.

I would say that I share a lot of the same feelings around AI. And especially when it comes to music creation. I have no problems with AI as a tool.

But wouldn’t you say that this kind of technology only makes that future more certain ultimately?

Because the better MIDI and audio technology develops, the easier it’s going to be for these AI companies to start using that technology to create its own music.

I am not sure in what way that would be. So for me, the transcription tools we have are not much different from hiring a transcriber, which is a job that we might take away from people. Like, we might be a part of removing jobs there.

But for me, there is still so much material of like midis online to train data models on.

And we are actually struggling to find enough music online that is transcribed correctly, so most of the tools that try to generate music have to rely on how the music sounds.

As you can see with like Shazam, for example, which doesn’t know anything about the content of the music. It just matches frequencies.

Whereas we detect like the actual building blocks of the music. And for the AI models that generate music, I don’t think that we have a big part in changing the environment around that, because they are training on big sets of audio and big sets of midi, and those already exist.

And if you’re a musician trying to get your things onto paper, we might help generate a bunch more midis, but our customers who want to get an idea into midi or onto paper would have done that anyway, just using some other tool, and the midi file would exist, and the audio file would exist.

So it’s just like the steps from having your idea in your head to getting it onto paper or getting a demo audio file that you can send to, like, adding a guitar track to your idea. So I’m not sure it makes that big a difference in that.

Yeah, well, and like I said, part of what I’m trying to do with this interview and with other people I’m talking to is just figure out how all of these different technologies are going to change things and how they’re going to work together.

We definitely live in that landscape.

And when we talk to, I don’t know, customers and schools and investors and stuff, when we mention AI, it’s either a spark of interest or a spark of hatred, almost. There are a lot of opinions about AI, and we are in that world. But for me personally, I don’t feel conflicted about what we do the same way I do about the generative part, which I still think is interesting, but I’m undecided about that part of AI still.

So if all of your research dreams came true and the technology worked exactly the way you hoped it would, what would that look like and how would that change the music industry?

So for me personally and also for the company, the big dream and what we’re working towards is that many people can read music enough to play their own instrument, but from that to write music, the step is so big.

If you play the trumpet and then you get to sit down with an empty sheet of sheet music paper. Just the first steps of starting out is so hard. You need to understand all the symbols and all the…

But you can still have the… You will still have the idea of the musical idea.

So for me, helping those people, which is also me, I’m a traditional musician who plays by ear, to have an idea and then getting it correctly into sheet music and coming…

Like moving past the step of having to understand all the nuances of the sheet music in the first step. I think it’s interesting to learn that.

But I try to think of our things as MS Paint or drawing crayons, where the typesetting music notation programs from the past are like the Photoshop of this. There are so many things you can do.

And just having an idea and getting it, like being able to share it as written music without all the hurdles of setting up things and needing to understand what key this will be in before you enter the…

Like I just want to play my things and have it on sheet music. I don’t want in that, when I’m in the flow of composing, I’m not in the mode of like, this is the flat.

And like, I will figure that out later, but when I play the music, I just want it to be written down. So that’s my dream to be able to…

That we can transcribe correctly enough for all of the people who have musical ideas to get them on to paper without having to go to the Music Academy first. They can… You can do that later if you…

If you… Yeah.

Well, and if I am understanding just sort of the environment correctly, the thing that makes ScoreCloud different from other companies working on AI transcription is your focus is on individual performance, not taking recordings and analyzing recordings. Is that correct?

Yes, that is correct. We do have… We do have a model that can work for some, like, existing recordings.

If you try to import, like, a Beatles song or, like, Adele or something that is clear audio, like, not a lot of effects and, like, usually one or a few vocals and one or a few acoustic instruments, that works fine with our ScoreCloud Songwriter program. And you will get the lyrics and the melody line, and you will get an accompaniment piano with chord symbols. But that has not been the focus. It’s to help, like, musicians to transcribe their own things, mainly.

It reminds me of when I was studying composition.

I had one professor who would always go off about notation software and how it was terrible for composing because it pushed you in a certain direction. He would say, you always need to use pencil and paper.

That way you’re controlling the music and not the other way around, right? It’s interesting to hear you talk about this because at a certain point, maybe ScoreCloud becomes the answer to that, right?

Where you don’t have to think about the notation and you can just compose, however it is that you do that, right? I think about a lot of composers are probably like me and have a voice memo app full of random ideas.

And it all sounds like, you know, it just sounds like nonsense. And if you don’t go back to it soon and actually write it down, you forget what it’s for, right?

I actually went through last week and I had a bunch of song ideas from like three years ago. And I’m like, I have no idea why I even recorded this. This is not any good.

So I think ScoreCloud could potentially be a great solution to that.

But I will say when I was trying it out, when I was experimenting with the app, it had a much harder time with me singing stuff than when I was playing my instrument, right? I picked up my trombone and I played some things.

And I think because, I mean, I’m a much better trombone player than I am a singer, right?

But because I think the pitches were more defined and the rhythms were more defined, it did better with that than with me just sort of mumbling, you know, dooby doop bop, doop bop bop bop or things like that.

Yes, definitely. Yeah. Yeah.

And there can be a learning curve, and it’s also a dream that we would be able to transcribe all of that correctly. But we usually say, like, if you sing this to another human, would they be able to transcribe it correctly?

And if not, we would probably not either.

That’s fair, that’s fair.

And one of the tips from the human, like the cognition research that we have done, is that obviously more precise pitches and less sliding, not doing ornaments and stuff helps, but also to record a longer thing.

If you record like two bars, it’s very hard to definitely say which time signature this is, or which downbeat this is.

But if you sing eight or 16 bars, it becomes exponentially easier for us to see, like this is a pattern that comes back, which means that this is probably the downbeat, and this is probably the time signature.

If you sing it into the app, as you would sing it for like teaching someone the song, perhaps a little bit slower, perhaps a little less ornaments, and sing it maybe two or three times, it’s much, much easier for the app to understand where the downbeat is and what the key signature is.

And since the app also wants to be agnostic to musical styles, we don’t want to say like, this is probably four, four time, because most music is four, four time.

We want to be able to also detect if it is seven eighths or 11 sixteenths. So if you sing a very short thing, it can be quite hard to transcribe the rhythm correctly.
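
One way to see why a longer recording helps is a toy repetition check. The sketch below is an assumption made for illustration, not the method ScoreCloud uses: it takes a list of accent strengths on an eighth-note grid and picks the lag at which the pattern best matches a shifted copy of itself (a plain autocorrelation). The function names and the accent values are hypothetical.

```python
from typing import Iterable, List


def autocorrelation(accents: List[float], lag: int) -> float:
    # Average product of each grid slot with the slot `lag` steps later.
    pairs = list(zip(accents, accents[lag:]))
    return sum(a * b for a, b in pairs) / len(pairs)


def best_period(accents: List[float],
                candidates: Iterable[int] = range(2, 13)) -> int:
    # Only lags shorter than the recording can be tested at all.
    usable = [lag for lag in candidates if lag < len(accents)]
    return max(usable, key=lambda lag: autocorrelation(accents, lag))


# One bar of 7/8 grouped 2+2+3: a strong accent, then two weaker ones.
bar = [2, 0, 1, 0, 1, 0, 0]

print(best_period(bar * 4))  # four bars: the 7-slot repeat stands out
print(best_period(bar))      # one bar: a period of 7 cannot even be tested
```

With four bars the seven-slot repetition is the clear winner. With a single bar there is nothing to compare the pattern against, so the guess falls back to some shorter grouping, which mirrors the point that a two-bar snippet is hard to place in a meter.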

So would you say that the difference between ScoreCloud and other AI models, by the way, does your AI model have a name?

20:55

Handwritten AI Model

No, we generally don’t talk about it externally because it’s so boring.

You’re not anthropomorphizing your AI model.

No, no, we try to be more…

It’s not Jessica in the back transcribing music.

No, we try to be more human.

We call the… Like in the program, we have what we call a listener, which is like you sing and play into the listener, which listens to you and then generates an output of what it thinks.

So in the beginning, we were talking about our musical assistant and more human words like that. But yeah, since our user base is mainly musicians rather than technology people, we try to be more… I don’t know, not as technical.

Sure.

I think you said this earlier, but is the approach to teach the AI to understand the music or is it just… How do I want to say this? Take, for example, in music notation, right?

There has been, especially in recent years, you see the difference between notation software that understands what the music is versus something like Finale, which was essentially a graphic software tool, right?

Is a similar thing developing in AI where some models are understanding the music itself and what the music actually means, whereas others are just focused on this audio, translates to this MIDI note?

So I like that there has been, regarding a thing you also talked about earlier with your composing teacher, that there is definitely a market and a use case for people writing by hand, but into a computer, like the Dorico and the Finales of the…

And that is a composition style that is valid. Of course, people will keep doing that.

But for me and for many people I know where you are a musician first and a composer second, just playing your instrument or singing your thing comes before writing it down.

And regarding understanding the contents of music, there aren’t that many models that actually understand what the music does or means.

Many of the AI models, both the generative and the transcription models, they generally use like audio frequencies, and they match that to a known database of frequencies. And they can see like this was pretty close to this other thing I trained on.

And therefore, it probably should be written down something like this. But that is not at all what we do. We use a…

Like our approach is to try to teach the computer to hear the music as a human would hear the music, and then try to write it down as a human would write it down.

Which is a different technical approach, because we need to get the notes, we need to detect all the onsets of the notes and how long they are and what pitch they are, to then use the cognition model we have, so that we can see like if a human would hear a C followed by an E, then we can say like the C is probably the root compared to the E, because of those kinds of human listening types of learning. So we don’t compare to existing songs on the internet, as most of the other models do.

Which also means that we can detect like this is an intro, and this is an A part, and this is a B part, and this is the B part again, although the melody was different and the chords were different.

We can still see that this is like the form is close enough that we can draw conclusions that many of the other models that are audio-based can’t.

So does that mean you’re training more with live musicians, or is it still recording based?

No, so the audio-to-midi conversion step is trained on live musicians, where a neural network is trained on audio files where a human has said, like, here is a note. Someone has annotated the audio file, and then we can train on that.

But the actual music understanding engine is not trained at all. It’s written by hand based on the music cognition research. So that’s not a trained model at all.

That’s purely research-based, like a handwritten AI.

There’s a handwritten AI?

Yeah, yeah. So yeah. And if you’re interested in like how that works, Sven Ahlbeck’s book is the research project that the company is, like, based on.

That is a book you can buy to try to dive into, which is, it’s just like, how do humans understand music? The book. And that is what the algorithms are based on.

And then we have many others like smaller, like for example, when you have transcribed a melody and accompaniment, we try to analyze like maybe part by part to get chord symbols from that. Like, is this a seventh chord or is this diminished or…

So there are many combinations of different AI models, but the main one, the music understanding one, is not a trained model at all. It’s a bespoke AI engine.

When you say an AI engine or a neural network or all these other words, I don’t really understand. Are you talking about a code that the software is based on or is it something else? Assume that I know nothing about technology and explain it to me.

The AI that is big right now, the generative AI and the large language models and neural network and all of that, that is basically like you give the computer a lot of data and then you tell it to draw conclusions and it goes through the data and you see if it makes the correct conclusions or not and then you train again and you train again. And the result of that is like a black box where you send in audio.

In our case, we send in audio and we get out like a list of midi notes that the black box thought was in the audio. But what happens inside the box, no one can see what happened. Like we don’t know why the computer drew those conclusions.

We can just see if it was more or less correct than the last time it did it. And that is how all the audio to text, like the Siris and all of those work. They have just trained on loads of speaking and text data.

And that is very good for some things. For example, transcribing audio to midi. But for getting the midi notes into sheet music, where you have to understand context and cultural norms for how music is generally written and all of those things, those kinds of black box models are not good at all, because if the computer draws the wrong conclusion, we have no idea how to fix it. You can’t go in and tell the computer to just do this thing better.

It’s trained on all the data, which is millions and millions of files. In that area, our handwritten model is much better at drawing the correct conclusion. And if it doesn’t, we can just go in to the part where it looks at what should this clef be.

And we can see, here are the notes the model saw, and here’s why it decided that it should be an F clef or whatever. And we can just tweak the model to say, like, no, if it looks like this, a human would probably use an Octava G clef or something.

I don’t know. And then we can change that. And that is much more suited for this kind of more human-like understanding.
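
The clef example shows what it means for a hand-written rule to be inspectable. In the sketch below, each decision comes back with the reason for it, so a wrong result points directly at the rule to adjust. The function name `choose_clef`, the use of the median pitch, and the thresholds 52 and 60 are all assumptions made for illustration, not ScoreCloud's rules.

```python
from statistics import median
from typing import List, Tuple

# Hypothetical thresholds in MIDI note numbers (60 = middle C).
# Changing one of these is the "tweak" a hand-written rule allows.
BASS_LIMIT = 52
OCTAVE_TREBLE_LIMIT = 60


def choose_clef(midi_pitches: List[int]) -> Tuple[str, str]:
    """Return (clef, reason) so every decision can be traced."""
    if not midi_pitches:
        raise ValueError("need at least one note to choose a clef")
    centre = median(midi_pitches)
    if centre < BASS_LIMIT:
        return "F clef", f"median pitch {centre} is below {BASS_LIMIT}"
    if centre < OCTAVE_TREBLE_LIMIT:
        return ("octave G clef",
                f"median pitch {centre} is between {BASS_LIMIT} "
                f"and {OCTAVE_TREBLE_LIMIT}")
    return "G clef", f"median pitch {centre} is {OCTAVE_TREBLE_LIMIT} or above"


for notes in ([40, 43, 47], [55, 57, 59], [67, 72, 76]):
    clef, reason = choose_clef(notes)
    print(clef, "because", reason)
```

A trained black-box model offers no equivalent of the `reason` string, which is the contrast being drawn here.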

And I think we have an advantage that we have that. And Sven, who wrote this book, did this research. There are not a lot of other companies that have this base research.

So in our case, it is 10 years’ worth of code that is just our code base, where we send in a bunch of MIDI files or a bunch of MIDI notes, and we can see how this generates sheet music.

30:30

Future Vision & Research

So to wrap things up, what would you say just taking a step back and looking at music technology research in general?

What would you say are the most important or interesting lines of research being looked at right now? What do you think is coming that maybe we haven’t thought of yet that’s going to change the music industry?

So for me personally, I mainly look at the research. So when I feel like the research is being made for people that write and play music, that’s when I get interested in the research.

There is a lot of research being done with AI and stuff that is just like, we generate music for video games. And sure, that can be super cool.

But for me, the things that help actual musicians and composers to express what they have already inside them, that’s my passion. So I get much more interested in that, and that’s also why I work here at ScoreCloud.

So like, I can’t think of anything right now that is revolutionizing this, but there are a lot of incremental small steps that I feel are making this more democratized. Like, if you have an idea, you should be able to make it into a thing and then put that out there and have people play it or sell it or however.

And getting your music onto paper, or like out there into the world, the step to get there has been so high historically. Like, you had to be, I don’t know, hired by a German king and getting the resources to be able to, like, learn how to write sheet music. And now we’re getting to a place where even if you don’t have a computer, or even if you don’t own an instrument, like, you should be able to create the thing you want to create.

And the Internet has been a blessing and a curse for that, because yeah, there’s so much stuff being done and you get drowned out.

But still, we try to enable people to be able to get their things out there.

And the research being done on that, like I’m looking forward to the day where handheld devices, for example, are good enough to run our core code, because right now we have to do it either in the cloud or on desktop, because it takes too much computing power. So like being able to record and transcribe and create sheet music and then send that and have an orchestra play it or sell it as sheet music.

Like the incremental steps to reducing the hurdles for each, that’s where I’m looking at solutions. And I don’t think that there is like a single or like there isn’t an innovation that’s going to solve that across the board.

There has to be like so many. We have to focus on the correct things and then solve iteratively, like how to get people to be able to do this.

And as an aside also, I feel like much of the technology development is focused so much on, of course, commercial interests, because it takes money.

But I’m also like, the more we can distribute these technologies to those who can’t afford to be part of the tech world or aren’t interested in tech at all. Like, it has to be easy enough and available enough so that anyone can use these tools.

Well, I think that’s a great place to end it. Thank you, Johan, for coming on the show. And everyone go check out ScoreCloud.

Thank you.

Thank you for listening to another episode of Selling Sheet Music.

If you like the show, please rate and review it wherever you listen to podcasts. You can read episode transcripts and get caught up on past episodes at sellingsheetmusic.com and you can find my music at garrettbreeze.com.

Selling Sheet Music is written, produced and hosted by me, Garrett Breeze. Post-production for this episode was done by Brandon Haney, and our theme music was written by myself and David Dykstra. I’ll see you next week.

Go write something.