Analyze One Year of Radio Station Songs Aired With SQL, Spark, Spotify, and Databricks

Note: This post was originally published on Medium

I listen to music whenever I drive or code, which is a lot. To find new songs, I listen to the radio or to Spotify’s Discover Weekly playlist, which made me like Mondays (a new playlist is released every Monday).

An old-school French institute called Mediamétrie analyzes radio stations’ songs. Ever since I saw one of their studies some years ago (I can’t find it anymore), I have been obsessed with creating my own.

This article presents the year 2016 for 4 major French radio stations through fun SQL queries. We will then connect each song to the Spotify API to build each radio station’s musical profile.

We will use the Databricks Community Edition to visualize our data. All SQL queries and all results are available in this notebook. It’s the “backstage” of this article, where the magic happens, so to speak.

Protip: don’t miss the bonuses at the end of the article

Radio stations introduction

We all have a favorite radio station. Mine is Radio Nova, for their diversity and their humor, and because, as a hip hop fan, I find it is the only national radio where we can hear listenable hip hop songs.

Radio Nova had 1.4% of the audience in September 2016 (PDF to download from Mediametrie).

To see how a radio becomes number 1, we are also going to analyze the number 1 music radio, NRJ, which has 10.8% of the audience, and 2 others: Virgin (5%), which, as we’ll see, sounds like NRJ, and Skyrock (6%). Don’t mind the name, it’s a rap radio… haha

The main question is: once we have compared these radios, should we give Radio Nova tips on how to become number one, based on our analysis of NRJ? What do you say, Nova? Learn from the best, right?!

Getting the radios’ song data

To extract the song lists (artist, song title and timestamp), we are going to parse each radio’s “What was this song?” HTML pages, except for Skyrock, which has a handy RESTful web service.

Every extracted song will be converted into this Song class so that we can query it easily with (Spark) SQL:

    Song(timestamp:Int, humanDate:Long, year:Int, month:Int, day:Int, hour:Int, minute: Int, artist:String, allArtists: String, title:String)
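As a rough illustration of how a raw timestamp can be split into the date fields above, here is a minimal Python sketch. It is not the code used for this article: the `Broadcast` class and the `to_broadcast` helper are hypothetical names, and it assumes the timestamp is in Unix epoch seconds read as UTC, which the article does not state.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Broadcast:
    """Hypothetical, reduced version of the Song class (illustration only)."""
    timestamp: int
    year: int
    month: int
    day: int
    hour: int
    minute: int
    artist: str
    title: str


def to_broadcast(timestamp: int, artist: str, title: str) -> Broadcast:
    # Assumption: the timestamp is in Unix epoch seconds, interpreted as UTC.
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return Broadcast(
        timestamp=timestamp,
        year=moment.year,
        month=moment.month,
        day=moment.day,
        hour=moment.hour,
        minute=moment.minute,
        artist=artist.strip(),  # scraped HTML often carries stray whitespace
        title=title.strip(),
    )


# 1451606400 is 2016-01-01 00:00:00 UTC
example = to_broadcast(1451606400, " Daft Punk ", "Around the World")
```

Precomputing year, month, day and hour once keeps the later GROUP BY queries short, at the cost of a few redundant columns.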

In 2016, about 300K broadcasts were collected:

  • Nova : 95K broadcasts of 5000 different songs
  • NRJ : 50K broadcasts of 800 different songs
  • Virgin: 60K broadcasts of 1200 different songs
  • Skyrock: 100K broadcasts of 1000 different songs

Every song is stored in Parquet format, so that the data is extracted only once (you’re welcome, radio servers :p) and SparkSQL queries run faster. Btw, if you are interested in the file, I can export it for you as CSV or Parquet.

Remember that if you have to use the same SQL table (or Dataset/DataFrame) again and again, the best way to speed up queries (often by more than 10x, according to the Spark doc) is to cache it in memory with the dataframe.cache() method (thanks Databricks for the 6 GB RAM server!).

Let’s dive into our analysis now!

How many songs by day?

Some days were not recorded by the radios’ history system, so the real numbers should be a bit higher.

It’s fun to see that the radio stations broadcast more songs during summer (if we ignore Radio Nova’s one-week bug in August, in blue). This is probably due to the summer holidays. They do a good job all year long, so it’s OK to take some days off, I guess!

We can see that Skyrock and Nova broadcast about the same number of songs each day, whereas NRJ and Virgin broadcast a bit fewer, probably due to more talk shows or untracked DJ night shows.

How many different songs by day?

The real difference comes from the number of different songs played. See for yourself the number of different tracks per day:

More mainstream radios such as NRJ, Virgin and Skyrock top out at 100 to 120 different songs a day, whereas Nova plays about 280. If you want to discover more songs, Nova is clearly the place.
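The count behind this kind of chart can be sketched in plain SQL. The snippet below runs on Python’s built-in sqlite3 as a stand-in for Spark SQL; the `songs` table, its `radio` column and the rows are made up for illustration and are not the article’s data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE songs (radio TEXT, year INT, month INT, day INT, artist TEXT, title TEXT)"
)
# Toy rows: one day, two radios, with repeated broadcasts of the same song
rows = [
    ("nova", 2016, 1, 1, "Artist A", "Song 1"),
    ("nova", 2016, 1, 1, "Artist A", "Song 1"),
    ("nova", 2016, 1, 1, "Artist B", "Song 2"),
    ("nrj", 2016, 1, 1, "Artist C", "Song 3"),
    ("nrj", 2016, 1, 1, "Artist C", "Song 3"),
    ("nrj", 2016, 1, 1, "Artist C", "Song 3"),
]
conn.executemany("INSERT INTO songs VALUES (?, ?, ?, ?, ?, ?)", rows)

# A song is identified by its (artist, title) pair, so we count distinct pairs.
query = """
SELECT radio, year, month, day,
       COUNT(*) AS broadcasts,
       COUNT(DISTINCT artist || ' - ' || title) AS different_songs
FROM songs
GROUP BY radio, year, month, day
ORDER BY radio
"""
per_day = conn.execute(query).fetchall()
```

With these toy rows, both radios have 3 broadcasts, but only the first one plays 2 different songs. Dropping `day` from the SELECT and the GROUP BY gives the monthly version of the same count.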

How many different songs by month?

If we look at the number of different songs per month, the gap between radios is even bigger.

Top 10 played titles by each radio station

It’s interesting to see how “hits” are played through the year.

We can notice summer hits:

We can tell that the most broadcast songs are mostly aired during summer. So artists, play it smart and release your songs between February and June to have a better chance of becoming number one, or of having more people hate your music because they heard it too many times?

Nova

NRJ

Skyrock

Virgin

Percentage of music by day

If we take the average number of songs broadcast per day and the mean duration of a song, 3.30 minutes, we can estimate the percentage of music per day. The remainder is likely to be talk shows, advertising or untracked songs.

To better understand these percentages, we should see what a normal day looks like for our radio stations.
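The estimate boils down to a single ratio: music minutes divided by the 1440 minutes in a day. Here is a minimal sketch of that arithmetic. The function name is hypothetical, the input of 288 songs a day is a made-up figure, and reading “3.30 minutes” as 3 minutes 30 seconds (3.5 minutes) is an assumption.

```python
MINUTES_PER_DAY = 24 * 60  # 1440


def music_share(songs_per_day: float, mean_song_minutes: float = 3.5) -> float:
    """Estimated percentage of the day filled with music.

    Assumption: "3.30 minutes" in the text means 3 min 30 s, hence 3.5.
    """
    return round(100 * songs_per_day * mean_song_minutes / MINUTES_PER_DAY, 1)


# Made-up input: 288 songs a day is 1008 minutes of music out of 1440
share = music_share(288)
```

With these inputs, the share comes out at 70%, leaving about 30% of the day for talk, advertising and untracked songs.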

What is a typical Monday for our radio stations?

Let’s look at the average number of songs on Mondays for all radio stations.

We can distinguish 2 gaps, during the morning and evening shows, for every radio station. Amazing. More seriously, there is no discovery here: it’s a known fact that most radios have morning and evening shows during which there is less music and more talk.

Advertising time

If we recalculate the average percentage of music at noon, when no radio station has a show, we can estimate the advertising time per hour for each radio. We assume that the radio hosts speak for 5 minutes during the whole hour. Note that radios may advertise more during prime time, when they have a larger audience.

For 60 minutes, we get from 7 minutes of advertising time, for Skyrock, to 15 minutes, for Virgin. In detail, we have this table:
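Put as arithmetic, the advertising estimate is what is left of the hour once music and host talk are subtracted. The sketch below is only an illustration: the function name and the input of 12 songs per hour are made up, and the 3.5-minute song length assumes that “3.30 minutes” means 3 minutes 30 seconds.

```python
def ad_minutes_per_hour(
    songs_per_hour: float,
    mean_song_minutes: float = 3.5,  # assumption: 3 min 30 s per song
    talk_minutes: float = 5.0,       # the article's estimate of host talk time
) -> float:
    """Minutes of the hour not explained by music or host talk."""
    music_minutes = songs_per_hour * mean_song_minutes
    # Clamp at zero so an overestimated song length never yields negative ads
    return max(0.0, 60 - music_minutes - talk_minutes)


# Made-up input: 12 songs at noon means 42 minutes of music, so 60 - 42 - 5 = 13
estimate = ad_minutes_per_hour(12)
```

The result is very sensitive to the assumed song length: at 12 songs per hour, half a minute more or less per song moves the estimate by 6 minutes.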

Radios brainwashing?

An annoying feeling we sometimes have with radios is that we keep hearing the same songs over and over. As we are men and women who believe in science and not in our instinct, we are going to use basic statistics to check this weird feeling.

How many times is the same song aired on the same day?

The pie charts below tell us a lot about radio stations’ habits: more mainstream radios such as Virgin, NRJ or Skyrock are more likely to broadcast the same songs multiple times a day.

When is the next time we will listen to the same song during the same day?

Again, the most mainstream radios, NRJ, Skyrock and Virgin, most often replay a song 2 to 3 hours after it was first aired. For Nova, it is more like 7 to 8 hours.

While the distributions differ, the average for our 4 radios is between 7 and 8 hours.
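One way to measure these delays is to group the broadcasts by day and song, sort the airing times, and take the difference between consecutive airings. The sketch below is a plain Python illustration with made-up data. The function name is hypothetical, and measuring consecutive airings (rather than only the first replay) is a choice made here, not necessarily the one behind the charts.

```python
from collections import defaultdict


def same_day_gaps_hours(broadcasts):
    """Gaps, in hours, between consecutive airings of a song on the same day.

    `broadcasts` is an iterable of (day, song, hour_of_day) tuples.
    """
    airings = defaultdict(list)
    for day, song, hour in broadcasts:
        airings[(day, song)].append(hour)

    gaps = []
    for hours in airings.values():
        hours.sort()
        # Pair each airing with the next one on the same day
        gaps.extend(later - earlier for earlier, later in zip(hours, hours[1:]))
    return gaps


# Made-up data: "Song 1" airs 3 times on the same day, "Song 2" once per day
toy = [
    ("2016-01-01", "Song 1", 8.0),
    ("2016-01-01", "Song 1", 10.5),
    ("2016-01-01", "Song 1", 13.5),
    ("2016-01-01", "Song 2", 9.0),
    ("2016-01-02", "Song 2", 9.0),
]
gaps = same_day_gaps_hours(toy)
```

Songs aired only once a day produce no gap, which is why comparing the distribution and the average, as above, matters: a few long gaps can pull the mean far from the most common value.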

How many new songs* are added and when?

*“New songs” means songs that had not yet been broadcast in 2016.

If we look at the average after April 2016, we see that Nova is ahead, but don’t forget that Nova plays 2500 different songs each month, so this is statistically expected.

New songs are spread evenly across the week for all radios.
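With this definition, a song is “new” in the month of its first broadcast, which maps to a MIN inside a GROUP BY. The sketch below runs on Python’s built-in sqlite3 as a stand-in for Spark SQL; the table layout and the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (radio TEXT, month INT, artist TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?, ?)",
    [
        ("nova", 1, "Artist A", "Song 1"),
        ("nova", 2, "Artist A", "Song 1"),  # replay, not new in February
        ("nova", 2, "Artist B", "Song 2"),
        ("nova", 2, "Artist D", "Song 4"),
        ("nova", 3, "Artist C", "Song 3"),
        ("nova", 3, "Artist B", "Song 2"),  # replay, not new in March
    ],
)

# Inner query: first month each song was aired. Outer query: count per month.
query = """
SELECT radio, first_month, COUNT(*) AS new_songs
FROM (
    SELECT radio, artist, title, MIN(month) AS first_month
    FROM songs
    GROUP BY radio, artist, title
) AS first_airings
GROUP BY radio, first_month
ORDER BY radio, first_month
"""
new_songs = conn.execute(query).fetchall()
```

One caveat of this definition: every song aired in January counts as new, which is presumably why the average above is only taken after April.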

Common songs between radio stations

On the table below, we can see NRJ has 25% of common songs with Virgin and 12% with Skyrock.

Virgin has 18% with NRJ while Skyrock has 9% of common songs with NRJ.
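The percentages are not symmetric because each one is relative to the size of a different catalog: the same shared songs weigh more in a small catalog than in a large one. A minimal sketch with Python sets, using made-up song ids and a hypothetical function name:

```python
def shared_percentage(songs_a: set, songs_b: set) -> float:
    """Percentage of radio A's different songs that radio B also played."""
    if not songs_a:
        return 0.0
    return round(100 * len(songs_a & songs_b) / len(songs_a), 1)


# Made-up catalogs: 2 shared songs, but one catalog is twice as large
small_catalog = {"s1", "s2", "s3", "s4"}
large_catalog = {"s3", "s4", "s5", "s6", "s7", "s8", "s9", "s10"}

small_to_large = shared_percentage(small_catalog, large_catalog)  # 2 of 4
large_to_small = shared_percentage(large_catalog, small_catalog)  # 2 of 8
```

The same effect applies to the figures above: NRJ played about 800 different songs and Virgin about 1200, so a similar number of shared songs is a larger share of NRJ’s catalog than of Virgin’s.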

Nova has a few songs in common with the other radios, mostly by legendary artists such as Bob Marley, Daft Punk, Aloe Blacc, Kavinsky, Beyoncé… If you are interested in the full list, look for the “Similar songs between radios” cell in the “backstage”, AKA the blog article’s Databricks notebook.

Our 4 radio stations are different, for sure, but do they have songs in common? Surprisingly, the answer is yes.

I would classify these as songs that everybody likes: you can play them at your party without any fear of being booed.

If we visualize the previous table, it looks like this: the blue bar is the number of songs in common, while the orange and green bars are the total numbers of different songs.

What are the secrets to be #1?

We have analyzed 4 radio stations based on the artist name, the title and the day and time the songs were broadcast. Besides being letters and numbers, these 3 values mean nothing. If we want a deeper analysis, we have to learn more about the songs played: How popular is the song right now? What is the genre of the song? How many followers does the artist have?

Fortunately, by connecting each song to the Spotify API, we get a lot of data we can play with:

https://api.spotify.com/v1/search?q=ARTISTTITLE&type=track&limit=1

In 2016, we collected 8000 different songs from the radios, so to get the artist, the track and the track’s audio features from the Spotify API, we have to make:

Number of songs x (artist + track + audio features) = 8000 x 3 = 24K requests

That’s a lot. Plus, Spotify rate-limits requests, so we have to go slowly: 20 requests every 2 seconds, why not, you know.
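To get a feel for what this pace means, here is a small sketch that builds the search URL shown above and estimates the total duration. It performs no network call. The helper names are hypothetical, joining artist and title with a space in `q` is an assumption, and the real API also expects authentication, which is left out here.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.spotify.com/v1/search"


def search_url(artist: str, title: str) -> str:
    # Assumption: artist and title are joined with a space in the q parameter
    query = urlencode({"q": f"{artist} {title}", "type": "track", "limit": 1})
    return f"{SEARCH_ENDPOINT}?{query}"


def batches(items, size=20):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def total_minutes(request_count: int, batch_size: int = 20, pause_seconds: float = 2.0) -> float:
    """Time needed when sending one batch every `pause_seconds`."""
    batch_count = -(-request_count // batch_size)  # ceiling division
    return batch_count * pause_seconds / 60


url = search_url("Daft Punk", "Around the World")
duration = total_minutes(24_000)  # 1200 batches, 2 seconds apart
```

At 20 requests every 2 seconds, the 24K requests take about 40 minutes, which is long enough for live values such as follower counts to change between two calls.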

BUT, one thing I didn’t plan for with this slow rate is that the number of followers could change between two requests for the same artist. As most artists had multiple songs broadcast, the artist information was requested from 2 to 10 times. No problemo, right? No… This messes up our SQL join between artist and track data later, because the DISTINCT on the artist information no longer removed duplicates: the rows differed by followers.total.

I have to say this drove me crazy, because I had more songs after my join than before haha
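The effect is easy to reproduce on a toy example. The sketch below uses Python’s built-in sqlite3 as a stand-in for Spark SQL, with made-up tables and rows. Collapsing each artist to a single row before the join (here by keeping the highest follower count) is one possible fix, not necessarily the one used for this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tracks (artist TEXT, title TEXT);
CREATE TABLE artists (artist TEXT, followers INT);
INSERT INTO tracks VALUES ('Artist A', 'Song 1'), ('Artist A', 'Song 2');
-- Same artist fetched twice, a few followers apart
INSERT INTO artists VALUES ('Artist A', 1000), ('Artist A', 1003);
""")

# DISTINCT keeps both artist rows because the follower counts differ,
# so every track is matched twice.
naive = conn.execute("""
SELECT t.artist, t.title, a.followers
FROM tracks t
JOIN (SELECT DISTINCT artist, followers FROM artists) a
  ON a.artist = t.artist
""").fetchall()

# Grouping by artist guarantees exactly one row per artist before the join.
fixed = conn.execute("""
SELECT t.artist, t.title, a.followers
FROM tracks t
JOIN (SELECT artist, MAX(followers) AS followers FROM artists GROUP BY artist) a
  ON a.artist = t.artist
ORDER BY t.title
""").fetchall()
```

Here the naive join returns 4 rows for 2 tracks, while the grouped version returns 2. Comparing row counts before and after a join is a cheap way to catch this kind of duplication early.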

Songs Popularity By Radio

Definition by Spotify

The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are.

No surprises here, mainstream radios NRJ, Virgin or Skyrock, tend to play more popular songs, that’s why I use the term mainstream, clever, right?

But the real question is: was the song popular before it was broadcast on the radio?

Audio features

The Spotify API gives audio features extracted from the song’s sound waves. Thanks to these, we can display a musical profile of each radio:

In my opinion, the most meaningful audio features are:

  • danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.
  • energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
  • valence describes the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Let’s see their average and their distribution among the radios’ tracks, as an average alone can sometimes be misleading. As Nova has more different songs than the others, we are going to use percentages to compare our radios and add more context to our stats.

If you have read my Facebook Interview Journey, you know this is where I failed during my SQL interview, this code is specially for you, dear Mr. Interviewer, no hard feelings though :p

    -- Share of each radio's songs per popularity bucket (0, 10, 20, ...)
    SELECT ROUND( (COUNT(t.*) / subTotal.total_radio * 100),2) AS percentage_of_songs,
          subTotal.total_radio,
          ROUND(popularity / 10) * 10 AS popularity,
          t.radio
    FROM AudioFeatureArtistTrackRadios t
    JOIN (
        -- Total number of songs per radio, used as the denominator
        SELECT COUNT(*) AS total_radio, radio
        FROM AudioFeatureArtistTrackRadios
        GROUP BY radio
    ) AS subTotal
    ON subTotal.radio = t.radio
    GROUP BY subTotal.total_radio, ROUND(popularity / 10) * 10, t.radio
    ORDER BY popularity

Energy

Mainstream radios tend to play more energetic songs; I guess they are easier to listen to? Some examples of songs with a lot of energy are We Are Your Friends - JUSTICE, Steppin’ stone - Davy Jones, and of course, the classic of classics Jerk It Out – Caesars, which I first heard while playing SSX3 on GameCube 8)

Danceability

This chart tells us that the radios broadcast the same kind of danceable songs. Some examples of danceable songs are Trick Me – Kelis, Around the world – Daft Punk or Anaconda – Nicki Minaj

Valence / Positiveness

As with danceability, the radios broadcast the same kind of positive tracks. Some examples: September – Earth Wind & Fire, Ska-Boo-Da-Ba – The Skatalites or Hey Ya! – OutKast

2 other interesting values, which are not Spotify (Echo Nest) specific, are the BPM (beats per minute) and the song duration.

Tempo / Beats per minute

Duration

Nova seems to be a bit different from the other radios by playing shorter or longer tracks. Virgin, NRJ and Skyrock are really into 3-minute tracks.

When I first saw this graph, I couldn’t help thinking about the Hocus Pocus song called “Voyage immobile” (motionless journey) and this line about our undiversified musical environment:

“Je ne voyais que blocs longs de 3 minutes taillé dans le roc et dans le même but”

“I could only see 3-minute blocks from the same base with the same goal”

Music genres

Spotify has some pretty weird music genres. Have you noticed “post-teen pop” or “pop christmas”? Pop songs you listen to during Christmas, I guess? haha

We can clearly see that NRJ and Virgin, which are very alike, are more about pop/dance/electro music: their top 3 genres are pop, dance pop and tropical house. Nova is about soul, funk and indie music, and Skyrock is more about rap, dance and pop.

Hip hop genres

Skyrock is famous for its motto “1st on Rap”. Let’s compare its Hip hop/Rap genres (genres with “rap”, “hip” or “hop” in the name) with the other radios’.
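The genre filter described above is a simple substring test. Here is a minimal sketch; the function name is hypothetical and the genre list is only an example, not the full list returned by Spotify.

```python
HIP_HOP_MARKERS = ("rap", "hip", "hop")


def is_hip_hop_genre(genre: str) -> bool:
    """True when the genre name contains "rap", "hip" or "hop"."""
    name = genre.lower()
    return any(marker in name for marker in HIP_HOP_MARKERS)


# Example genre names, for illustration only
genres = ["french hip hop", "trap francais", "underground rap", "dance pop", "tropical house"]
hip_hop_genres = [genre for genre in genres if is_hip_hop_genre(genre)]
```

A substring test is handy because it also catches “trap” (which contains “rap”), but it can produce false positives: a genre such as “worship” contains “hip”, so the resulting list is worth a quick manual check.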

OK, that’s a close match between Skyrock and Nova, let’s compare the internal hip hop genres now.

I don’t really care about genres, but there is a lot of confusion between hip hop, which is a culture, and rap, which is the actual act of rapping. If you want to learn more, check this Wikipedia chapter. I also recommend the excellent Netflix documentary “Hip-Hop Evolution”.

Nova, in orange, is more about indie/alternative/underground hip hop music, and Skyrock, in blue, is really more into French rap/trap/hip hop and also popular rap. So let’s fix Skyrock’s motto to “1st on French rap” haha

Music classifier for Radios’ selection idea

In my last article, I explained how to create your own music recommendation system thanks to these audio features.

A fun project (the link is a tribute to the Scala guru Martin Odersky, who says a bit too often that his Scala exercises are fun when they are actually brain-melting haha) would be to create an algorithm that helps music selectors find songs that match a radio’s style.

Spotify recommendation system

Spotify’s system is not only based on the audio features we saw earlier. It also analyzes what other similar users listen to. This slide contains a nice diagram that explains their whole system.

What’s next?

Thanks to this project, I have built solid foundations to query the Spotify API in Scala, process the data with Spark SQL, and visualize it with Databricks. I think more projects are about to come; plus, Spotify just released (March 2017) a new “Recently Played Tracks” endpoint, and ideas are coming.

Databricks pros and cons

Pros

  • Free community edition with 6 GB RAM server
  • Awesome and easy-to-use Data Viz

Cons (or rather, what could be better)

  • Can only visualize a maximum of 10 elements when using a GROUP BY; the other elements go into one category called “Others”
  • Not possible to choose the color of an entity, so a radio can be blue on one graph and red on another, which can sometimes be confusing
  • Cannot export a graph as an iframe, so we have to export pictures from the interactive graphs
  • Cannot modify SQL on the Data Viz interface

Thanks

  • Databricks, for their awesome platform
  • Spotify, for their easy-to-use API and their human-readable documentation
  • Radio Nova for being a top music selecta, I would not be listening to the music I listen to today without you
  • Marc H’LIMI, Radio Nova’s advisor, for our exchanges
  • Pierre Trussart, engineer and DJ, Benjamin Thuillier, Scala rockstar, Nicolas Duforet, data science master, Justine Mouron, engineer
  • My friends for listening to me talk about this project too often

Bonus – Spotify Playlists

To thank you for reading, I created 4 playlists of the ~200 most broadcast songs, sorted by number of broadcasts, for: