The write-up below isn't exactly the same as the one found in the Data Sketches book. For the book we've tightened up our wording, added more explanations, extra images, separated out lessons and more.
When you say "music in December" to somebody from the Netherlands, there's a very likely chance that they'll think of the Top 2000: a yearly chosen list of songs that is played between Christmas and midnight on December 31st. I've actually played with this data before. It was perhaps my 2nd serious personal project and I was still very new to d3. Since I sometimes see artists revisiting artwork they've done in the past to see how their style evolved (which I always love to see), I thought it would be fitting to try and do the same myself (in the sketch section you'll see I've had this idea for a few months already). So, 2 years after my previous attempt, I'm going to look at the Top 2000 again and visualize the insights.
Not to worry if you're not Dutch: the Top 2000 is probably 90% English songs, with Queen often in 1st place, and many Beatles, U2 and Michael Jackson songs, so the list should seem familiar to most of you.
The Top 2000 website thankfully shares an Excel file of the 2000 songs containing the name, artist and year of release. This year it was released on December 19th. However, I needed another important variable: the highest rank ever reached in the normal weekly charts. There are a few of these in the Netherlands, but I chose to go with the Top 40, since it has been running non-stop since 1965 and its website seemed scrape-able. So I wrote a small scraper that would go through 50 years of chart lists and save the artist's name, song title, URL to the song's page and position. This data was then aggregated to make it unique per song (the URL of the song was the unique key) and to save some extra info, such as the highest position ever reached and the number of weeks in the Top 40.
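To make the aggregation step concrete, here is a minimal Python sketch. It is not the scraper's actual code: the row layout (`url`, `artist`, `title`, `position`), the `SongSummary` class and the function name are all assumptions for illustration, and it assumes one scraped row per song per weekly chart.

```python
from dataclasses import dataclass


@dataclass
class SongSummary:
    artist: str
    title: str
    highest_position: int  # best (lowest) chart position ever reached
    weeks_in_chart: int    # number of weekly charts the song appeared in


def aggregate_chart_rows(rows):
    """Collapse weekly chart rows into one summary per song URL.

    Assumes each row is a dict with hypothetical keys
    'url', 'artist', 'title' and 'position'.
    """
    songs = {}
    for row in rows:
        url = row["url"]  # the song page URL acts as the unique key
        if url not in songs:
            songs[url] = SongSummary(row["artist"], row["title"], row["position"], 1)
        else:
            summary = songs[url]
            # Position 1 is the best, so the "highest" rank is the minimum.
            summary.highest_position = min(summary.highest_position, row["position"])
            summary.weeks_in_chart += 1
    return songs
```

Keying on the URL rather than on artist and title sidesteps spelling differences between weekly charts, which is presumably why it worked well as the unique key here.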
Next came the tricky bit, matching those artists and songs to those in the Top 2000 list. Of course I first tried a merge on an exact match of artist and title. That connected about 60% of the songs. Browsing through the remaining songs I saw that sometimes one of the two lists was using more words than the other, such as John Lennon-Plastic Ono Band versus just John Lennon. So I also searched for partial matches between the lists, as long as all the words of one song and artist were contained in the other list. That helped match 10% more.
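The partial-match rule can be sketched in a few lines of Python. This is only an illustration of the idea (all words of one entry contained in the other), not the code used for the project; the function names, the lowercasing and the way words are split are assumptions.

```python
import re


def word_set(text):
    """Lowercase a string and split it into a set of words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def is_partial_match(a, b):
    """True if every word of one string also appears in the other."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a or not words_b:
        return False  # an empty string should never match anything
    return words_a <= words_b or words_b <= words_a


def songs_partially_match(song_a, song_b):
    """Compare two (artist, title) pairs; both fields must partially match."""
    artist_a, title_a = song_a
    artist_b, title_b = song_b
    return is_partial_match(artist_a, artist_b) and is_partial_match(title_a, title_b)


# Illustrative pair based on the John Lennon example above.
print(songs_partially_match(
    ("John Lennon-Plastic Ono Band", "Imagine"),
    ("John Lennon", "Imagine"),
))
```

Because short entries are subsets of many longer ones, a rule like this can produce false positives, so its matches still need a quick manual check.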
Then came the fuzzy part. Sometimes words are written slightly differently, such as Don't stop 'til you get enough versus Don't stop 'till you get enough. Using R's stringdist package, I applied the "Full Damerau-Levenshtein" distance to compare titles and artists (it counts the number of deletions, insertions, substitutions, and transpositions of adjacent characters necessary to turn b into a). However, I was quite strict and only allowed at most 2 changes on the title and at most 2 on the artist to create a match (otherwise the song Bad from U2 could be turned into any other 3-letter song and 2-3 letter artist). Sadly, that only gave me 2.5% more matches. As a side note, I did have to quickly check all the matched songs after each step to take out a few wrong matches.
Almost 600 songs were still not matched. However, that is not necessarily a bad thing, since not all songs made the Top 40. Take number 3, Stairway to Heaven by Led Zeppelin, for example: the band never released singles, only albums. But how to know which "unmatched" songs were meant to be unmatched, and which songs had actually appeared in the Top 40, but had failed to match? One final idea I applied had to do with the Tips of the week. Since ±1970 the music station airing the Top 40 also keeps a list of 20/30 tips: songs that aren't in the Top 40, but that the DJs think will or should be. I therefore scraped that list as well for all available years and performed exactly the same steps as with the Top 40 list. This gave me 8% more matched songs for which I knew for certain that they were never in the Top 40 (but they were tipped for it once).
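For readers without R at hand, the distance itself can be sketched in plain Python. The block below is a straightforward reimplementation of the unrestricted Damerau-Levenshtein distance, not the stringdist code; the function names, the lowercasing and the `max_edits` parameter are assumptions for illustration.

```python
def damerau_levenshtein(a, b):
    """Unrestricted Damerau-Levenshtein distance between strings a and b."""
    len_a, len_b = len(a), len(b)
    max_dist = len_a + len_b
    last_row = {}  # last row of the table in which each character of a occurred
    # The table has an extra border row and column filled with max_dist.
    d = [[max_dist] * (len_b + 2) for _ in range(len_a + 2)]
    for i in range(len_a + 1):
        d[i + 1][1] = i
    for j in range(len_b + 1):
        d[1][j + 1] = j
    for i in range(1, len_a + 1):
        last_col = 0  # last column in this row where the characters matched
        for j in range(1, len_b + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_col
            cost = 1
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,   # substitution (or match)
                d[i + 1][j] + 1,  # insertion
                d[i][j + 1] + 1,  # deletion
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),  # transposition
            )
        last_row[a[i - 1]] = i
    return d[len_a + 1][len_b + 1]


def is_fuzzy_match(title_a, artist_a, title_b, artist_b, max_edits=2):
    """Match only if title AND artist are each within max_edits changes."""
    return (
        damerau_levenshtein(title_a.lower(), title_b.lower()) <= max_edits
        and damerau_levenshtein(artist_a.lower(), artist_b.lower()) <= max_edits
    )
```

With a threshold of 2, a very short title such as Bad is within reach of many other short titles, which is why requiring the artist to match as well (and checking the results by hand) matters.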
For the remaining ±430 songs I just manually went through the list to look at artists or song titles that have long or odd names for which I thought the matching probably didn't work and therefore could be a Top 40 song, such as Andrea Bocelli & Sarah Brightman in the Top 2000 list versus Sarah Brightman & Andrea Bocelli in the Top 40. I still can't say for the remaining 380 songs how many of those did actually appear in the Top 40, but to be honest, after all the data processing and things I've learned and checked about the data along the way, I think it's less than 10%.
This spring I was in Juan Velasco's excellent workshop Information Graphics for Print and Online. Part of the workshop was to work out an infographic. And although my small team of 3 had written down about 40 possible ideas, we were all intrigued by the Top 2000 songs. From experience with my previous attempt at visualizing this data I already knew that there was a very interesting shift happening in the most loved decade (of release) over the past years. Therefore, we chose to make that the general concept around which to base the different parts of the infographic.
The most recent list of 2000 songs would take center stage, visualized with a beeswarm plot idea which would group them around their year of release. Each circle (i.e. a song) would be sized according to its highest position in the Top 40 and colored according to its rank in the Top 2000. Some of these songs would then be highlighted with annotations, such as highest newcomer.
Finally, in the bottom section there would be a few mini charts that would highlight the distribution of the chosen songs on year of release between the 1999, 2008 & 2016 editions of the Top 2000. This would highlight the fact that in 1999 the bulk of the songs were released in the 70's, but that this has slowly been moving to newer decades.
On the 2nd day of the workshop we also made a mobile version of this concept, this time resulting in a long scrollable beeswarm plot where you could theoretically listen to bits of each song and see extra information.
Although I won't have time to build the mobile version I still wanted to show our concept ^_^
So! For this topic I'm finally going to focus on making a static poster. Nevertheless, I'm still going to use d3 to build the center piece (the beeswarm). Afterwards I'll pull it into Illustrator. The smaller histograms at the bottom I'll take straight from R into Illustrator, so a whole combination of tools :)
Especially with d3v4 it's become much easier to create beeswarm-like plots, since we can now define forces that run along an x and/or y axis. In this case I needed a force along the horizontal axis that would cluster the songs based on their year of release. It still took me several iterations to figure out the right balance of forces in the x and y direction (plus an offset from the collision detection that prevents circles from overlapping) so the songs filled the region nicely around the year axis, without being moved too far away from their actual release year.
In my first attempts I had sized the circles according to their highest position reached in the Top 40 and colored them according to their position in the Top 2000. However, that created a lot of light grey circles of about the same size (see the image below). It just didn't look appealing.
So I switched those two scales (thus size was determined by the Top 2000 position and color was the Top 40 position) and that immediately gave a great improvement. I then started to mark out the circles (songs) that I wanted to annotate later. I wanted to keep the visual very black and white, inspired by the intense blackness of vinyl records, and only use red to mark songs that had something interesting about them and blue for the artist / band with most songs in the list. However, with David Bowie and Prince passing away this year I just had to make a note of that as well, so I added yellow and purple.
Since the top 10 songs from the list were the biggest circles, I thought it would look nice to mark these as small vinyl records (which is nothing more than a very small white circle on top of a small red circle).
Also, a small simple tip: you cannot do an outside stroke on SVGs. Thus when you stroke an element, the width of that stroke is centered on the outline of the element. However, in this case I wanted the grey circles to be visible for their entire radius, not having some part taken away by the stroke (which was very apparent in the small circles). So instead I plotted colored circles behind the grey circles that would be a few pixels bigger, thus mimicking a colored outside stroke. In the animated gif below, the version with the "bigger" looking circles is using the background circles so that the grey circles keep their true radius.
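The arithmetic behind that trick is tiny, but a sketch may help. The Python below just builds the two SVG circle elements as strings; it is not the d3 code from the project, and the function name, parameter names and colors are assumptions for illustration.

```python
def outside_stroke_circles(cx, cy, r, stroke_width, fill="#bbbbbb", stroke="#e01a25"):
    """Return SVG markup for a circle with a faked outside stroke.

    The colored background circle is drawn first (so it sits behind) with
    radius r + stroke_width; the grey circle on top keeps its true radius r.
    """
    background = f'<circle cx="{cx}" cy="{cy}" r="{r + stroke_width}" fill="{stroke}"/>'
    foreground = f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="{fill}"/>'
    return [background, foreground]


# A grey circle of radius 4 with a 2 pixel colored ring around it.
for element in outside_stroke_circles(10, 20, 4, 2):
    print(element)
```

A regular stroke of the same width would instead cover half its width of the grey fill, because the stroke is centered on the outline.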
With those relatively simple elements done and being sure I wouldn't change anything anymore, I used SVG Crowbar to save the beeswarm SVG and opened it up in Adobe Illustrator. There I turned it 25 degrees, just for a bit more interesting effect, and started placing annotations around it (based on an underlying grid to keep things nicely aligned in columns and rows). I used the data and the Top 2000 website to figure out some interesting facts, like Justin Timberlake having the highest ranking song from 2016.
After finishing the beeswarm / top part of the infographic I thought that maybe it would be nice to have a small interactive version as well, so you could hover over all the circles and see which song it was. So I put 2-3 hours into getting the beeswarm in a decent state, with a tooltip and legend.
I also wanted to touch on the fact that the distribution of the songs across release year has been changing towards the 90's & 00's. Since this was an added detail, a deeper insight into the data, I placed it below the main graphic. But it wasn't quite clear what visual form would convey the idea best. I already had the historic Top 2000 data from my previous visual on the topic from 2 years ago, so I appended the 2015 & 2016 data and started making some plots. That it should be a histogram-like approach was clear to me from the start, but should I smooth it down? How many years to show? Overlapping or in a small multiple fashion?
In the end I chose to go with a simple small multiple histogram of 4 editions across the past 18 years, but overplotted a smoothed density curve to make the general shape more easily comparable between the 4 charts. Below you can see what I took straight from R, created with the ggplot2 package. I played with the color to also encode the height, although eventually I made them all the same grey, since I didn't want the histograms to draw too much attention.
Another reason why I decided to create an infographic is because I just didn't have that much time this month; I only finished the previous visual for Data Sketches on December 7th, I had a small vacation to London planned (where I met Shirley again after 8 months YAY!) and we have the holidays at the end of December of course. And making a static visualization is always much much faster for me, even if it is partly based on something that I started in d3.
I think preparing and doing the data scraping & cleaning took about 20 hours, the ideation & sketching about 3 and the coding/creating about 20 - 30 hours (I keep telling myself to actually start keeping track, haha). After doing these static things I always remember how much I like making "printable" things.
When you say “music in December” to somebody from the Netherlands, there’s a very likely chance that they’ll think of the Top 2000. The Top 2000 is an annual list of 2,000 songs, chosen by listeners, that airs on Dutch Radio NPO2 and is played between Christmas and New Year’s Eve. I actually played with this data before in 2014, when I was still very new to D3.js.
Since artists sometimes revisit a past artwork to see how their style has evolved (which I always love to see), I thought it would be fitting to try and do the same myself. So, two years after my first attempt, I decided to look at the Top 2000 songs again and visualize which decade was the most popular in terms of the years the songs were released. Not to worry if you’re not Dutch; roughly 90% of the songs in the Top 2000 are sung in English (with Queen usually in first place) so the songs should seem familiar.