I still remember Shirley and me getting an email from Alberto Cairo while I was on vacation in Myanmar last October, asking if we'd be interested in doing a data sketches style project for Google News Lab. Of course! I was thinking of going freelance anyway in the new year, so this would be a super project to start off with :)
We were completely free to supply our own ideas for topics, as long as they revolved around Google data. So Shirley and I made a list of about 4 topics, within which each of us would have our own spin on the topic as usual. Simon Rogers, our main point of contact at Google, picked the topic of Culture, something a bit light and fun after all those months of politics in late 2016.
My angle for culture came from something I experience with Google quite often. Being a Dutch native speaker dealing in English about 95% of my working day, I sometimes need to find the translation of a word into English. I either type "Dutch word in het Engels" into Google itself or go to Google Translate. And I was really interested to know what different languages want to have translated into English, especially if it's just one word. Do German speakers search for the same kinds of words as the Spanish? Would it reveal something about their cultures?
I worked with Jennifer Lee from Google to get the translation data. At first we wanted to look for actual Google searches; people searching for "native word into English" (where "into English" would be in their native language as well). But getting the corresponding translation of that native word proved a bit difficult. Therefore, we turned to the Google Translate team. An expert of theirs warned us that the results would probably be quite mundane, but both Simon and I (thankfully) were quite intrigued by that idea.
Rick Genter, from the Google Translate team, wrote a query to get all the single word translations that happened on Google Translate between August 2016 and December 2016 for 10 chosen popular languages. That was about as far back as the translation queries are saved. Also, doing the query once took hours and extra resources, so we couldn't get multiple subsets (e.g. one a month). Nevertheless, I was super excited about this one dataset anyway, even if no time comparison was possible.
Even before getting the actual data I had decided to only look at nouns (e.g. dog, chair). I knew that Hello or I or Thank you would probably be the most often translated words, and I was interested in the more subtle differences. So, after having received the data, I wrote a script in R that could help with picking out the nouns by tagging each English translation (and if available, tagging the original word as well) with its grammatical form, using the NLP and openNLP packages.
Browsing through the results I saw three things. First, the adjectives, typically placed before a noun, such as young and intelligent, also held interesting results. Therefore, I decided to keep both the nouns and adjectives. Second, I saw that some languages use male and female words for the same English translation, such as hermosa and hermoso meaning beautiful in Spanish. Or synonyms for the same word, such as bonito again meaning beautiful in Spanish. Dutch however only has the word mooi that translates to beautiful. To get a fair comparison between rankings within languages I therefore combined the search frequencies for all terms that translated to the same English word (per language).
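The combining step can be sketched as a simple group-and-sum. This is only an illustration in Python (the project itself used R), and the rows and counts below are made up for the example; only the words come from the text above.

```python
from collections import defaultdict

# Hypothetical rows: (language, original word, English translation, count).
# The counts are invented purely for illustration.
rows = [
    ("Spanish", "hermosa", "beautiful", 120),
    ("Spanish", "hermoso", "beautiful", 80),
    ("Spanish", "bonito", "beautiful", 50),
    ("Dutch", "mooi", "beautiful", 200),
    ("Dutch", "hond", "dog", 90),
]


def combine_by_translation(rows):
    """Sum the counts of all original words that share an English
    translation, separately for each language."""
    totals = defaultdict(int)
    for language, _original, english, count in rows:
        totals[(language, english)] += count
    return dict(totals)


combined = combine_by_translation(rows)
```

With these made-up numbers the three Spanish words end up as one "beautiful" entry for Spanish, which can then be ranked against the single Dutch entry on equal footing.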
And third, as I'd expected, the automatically generated tagging results by R weren't perfect. For example, an English word can be either a verb or a noun, depending on the context. So to get a top 10 per language, I actually ended up looking up each original word that I wasn't sure about on Wiktionary to see what the most probable translation / grammatical form was. And that took a looooong time. But eventually I had a top 10 for each of the 10 languages that I was happy with.
For the overall language top 10 I looked at the rankings per language, not the sum of search queries, to compare relative popularity. I used a point based system where the number 1 word in a language would get the most points, the number 2 would get 1 point less, etc. The overall top 1 would then be the English translation that had the most points from all languages combined, and so on.
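A point system like this could look roughly as follows. The sketch assumes that rank 1 in a top 10 earns 10 points and rank 10 earns 1 point, and that ties are broken alphabetically; neither detail is stated above, and the function name and example lists are hypothetical.

```python
from collections import Counter


def overall_ranking(top_lists, list_size=10):
    """Rank English words by points summed over all languages.

    top_lists maps a language to its ranked list of English words.
    Assumption: rank 1 earns list_size points, rank 2 one point less, etc.
    """
    points = Counter()
    for words in top_lists.values():
        for rank, word in enumerate(words[:list_size], start=1):
            points[word] += list_size - rank + 1
    # Most points first; alphabetical order for ties (an assumption).
    return sorted(points.items(), key=lambda item: (-item[1], item[0]))


# Tiny made-up example with a top 3 instead of a top 10.
example = {
    "Dutch": ["beautiful", "good", "dog"],
    "Spanish": ["good", "beautiful", "love"],
}
ranking = overall_ranking(example, list_size=3)
```

Because only the ranks count, a language with far more translation requests does not dominate the combined list.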
The sketching part this month actually happened before the data section, because Simon and Alberto wanted to see possible designs of visualizations and the overall page beforehand. Since I was dealing with words and languages I was inspired by the idea of using word strings as much as possible. Initially I wanted each line in my final visual to be made up of the word that it represented.
Above you can see the design that I sent Simon and Alberto. I took a sort-of "start at the top and provide more details with each visual" approach. All the way at the top would be the most often translated word across all languages. Next, a visual with the top 1 per language, interlinked by a swirling word string consisting of the top 100 translated words overall. Taking another step deeper would be the top 10 per language in a tree ring like structure, comparing all the original words to the English translation. To see similarities between languages I drew a network where each link would be a string of words representing the word that both languages had in common between their top 10. Finally, there would be a bump chart comparing the top 10 across time. This last chart eventually wasn't possible because I only had that 1 dataset without a time component.
For the first time, I created a "mood-board" on Pinterest just for a project. Collecting things that inspired me resulted in many black-white silhouette like children's book images, art like pieces which consisted mostly of words and letters, and cut-out handlettering.
During the project itself I also sketched a lot. Mostly because I was figuring out the math. All of the visuals have some form of texts placed on curvy paths. And for all of these paths I had to figure out the custom SVG path formula. As an added difficulty, the paths had to be constructed so the words on the path wouldn't overlap (too much) and the text should be upright as much as possible.
For the top visual, the word string linking the top 1 of each language, I wanted something that swirled around, a bit like my sketch below.
But I also had to make the page responsive, and I just couldn't find a programmable / mathematical way to create a swirl that would work for many different screen sizes (and only later did I think of the fact that my original idea had crossing lines, which would result in unreadable text). Therefore, I started to work more with the idea of "beads on a string". Going down at an angle and having the word string follow the beads in half circles. It still took me a lot of sketches to figure out which layouts would work. And how things should look and function when either 2, 3 or 4 circles fitted in a row on the screen.
I had some other ideas in my head to make the circles more tightly packed, but only after even more sketches did I figure out that it wasn't possible mathematically (which you don't notice that well when sketching, since lines and circles aren't perfect).
For the second visualization of the "tree-ring" top 10 per language I didn't have to sketch that much, it was pretty straightforward. The third visual of the network on the other hand... Although the lines inside the network weren't too difficult, the problems arose when you click on an outer circle which makes it move towards the center. I'll spare you the details, but it had to do with the text always reappearing in an upright manner (not upside down) while the lines seemed to move with the language circle as well. Figuring out how that was supposed to work took even more pages in my little sketch book.
The final thing I looked into was the overall design of the page. I wanted to recreate that old children's book, black cut-out style, imagery. But I couldn't really find a way to incorporate that with the theme of translations (or a responsive page). So I eventually turned to handlettering and decided to create my own title and subtitles. Drawing them on my iPad Pro (with the Tayasui Sketches app, love that app). I started out trying to make the letters look smooth and flowing nicely. However, for an inexperienced hand like mine, that was very hard to do. Furthermore, I wasn't quite happy with the resulting style. So instead, I made them look a bit quirky and imperfect. Distinct from anything else I'd managed to find on Pinterest.
In terms of coding I started with the tree ring like visual that displayed the top 10 of a language. I'd worked with arcs often enough in the past (and especially the previous datasketches months) to set up the basic forms in an hour or so.
I then focused on how to animate the switching between the languages. Due to the space, especially on a mobile phone, I made one big circle in which you could read the top 10. The other languages would be smushed into tiny, almost flower-like, circles. In my first design, the tiny circle would move into the place of the big circle and simultaneously open up to reveal its top 10. However, when I had that working, the animation was extremely staggered. So much so that it was actually a bit hard to see that these two circles switched position (see the gif below).
So I set about trying a different approach. For the 2nd idea, the text in the circle would rotate out of view before rotating back in again in the new chosen language. When I finally overcame all the browser bugs to achieve this I was actually quite happy. I liked the result more than my previous idea of switching. However, it was still a bit jerky. I therefore looked into doing it in canvas, but there's no way to place text on a path in canvas (there is a hack that I used in my Occupations piece, but in the way I was using the text in this project that was going to be way too complicated). I even asked Dominikus Baur for help when we were both at the INCH conference. But the issue was not in the script but in the drawing on screen, so in the end I just had to accept that this was as good as it was going to get.
As you can see from comparing the two gifs above, I also changed the design of the big ring. Alberto Cairo suggested only making 1 word stand out, otherwise there was too much going on for your eyes to focus on. Sadly, you can't use different styles in 1 textPath element. So there are now 3 different textPaths there; one light grey on the left and right displaying the original word and a bigger black word (the English translation) in the middle. Figuring out how to place those 3 elements so it would look like it was 1 string of words (and never overlap) was another interesting puzzle to solve, hehe
Next up was the word string, or "word snake" as I started calling it actually. Because I couldn't really mathematically figure out the natural looking swirl, I first tried a different route; to create circles. The top 100 words overall would then be shown as the "strokes" of the circles. However, to prevent any clipped words, I couldn't use the entire circumference of each circle. And because the strings were now separated, I had to number each of the words, otherwise there was no way to know for sure where in the top 100 each of the half-circle sections belonged.
In short, I wasn't quite happy with the result. Therefore I revisited the swirl idea, but this time I tried an approach where I would see the circles as "beads on a string". In that set-up it was mathematically possible to make 1 string of words going around the circles representing the most often translated noun/adjective of each language. Of course, actually coding up the correct SVG path formula took some trial and error...
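To give a feel for what such a path formula involves, here is a heavily simplified sketch in Python that only builds the SVG path string. It assumes a single horizontal row of equally sized beads, with the string passing over one bead and under the next in half circles; the real layout (rows going down at an angle, several rows, responsive sizes) is more involved, and the function name and parameters are hypothetical.

```python
def snake_path(centers_x, cy, r):
    """Build an SVG path that snakes around a row of circles ("beads").

    centers_x: x-coordinates of the bead centers, left to right
    cy: shared y-coordinate of the centers
    r: radius of the half circle drawn around each bead
    """
    # Start at the left-most point of the first bead.
    parts = [f"M {centers_x[0] - r} {cy}"]
    for i, cx in enumerate(centers_x):
        if i > 0:
            # Straight connector to the left edge of the next bead.
            parts.append(f"L {cx - r} {cy}")
        # Sweep flag 1 draws the half circle over the top of the bead,
        # sweep flag 0 draws it underneath; alternate per bead.
        sweep = 1 if i % 2 == 0 else 0
        parts.append(f"A {r} {r} 0 0 {sweep} {cx + r} {cy}")
    return " ".join(parts)


path_d = snake_path([50, 150], 100, 40)
```

The resulting string could be used as the `d` attribute of an SVG `path`, which a `textPath` element can then reference to place the words along it.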
Due to screen size, I created several options for the "beads" to be positioned. The page calculates whether 2, 3 or 4 circles fit in a row and the rest updates automatically. Personally, I like the 2-beads per row version the most (the left most version in the image below), but that just left too much white space on the sides for wider screens.
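The "how many fit in a row" check can be sketched like this. The function name, the gap parameter and the example sizes are assumptions for illustration, not the values used on the actual page.

```python
def beads_per_row(width, diameter, gap, min_count=2, max_count=4):
    """Return how many circles fit next to each other, clamped to 2..4.

    n circles need n * diameter + (n - 1) * gap pixels, so the largest
    n that fits is floor((width + gap) / (diameter + gap)).
    """
    fit = int((width + gap) // (diameter + gap))
    return max(min_count, min(max_count, fit))


# Hypothetical sizes: 150px circles with a 20px gap.
narrow = beads_per_row(400, 150, 20)
medium = beads_per_row(560, 150, 20)
wide = beads_per_row(1200, 150, 20)
```

Clamping at the lower end means a very narrow screen still gets 2 beads per row, in which case the circles themselves would have to shrink; that part is left out here.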
In a call with Alberto and Simon they both told me that due to the word string now snaking around each of these languages, the language "circles" looked like they could be clicked, like they were buttons. So I thought about what interesting info I could add that would be displayed to reward the people that actually tried to click the languages.
Which brought me to Google Trends, where I looked up the trend of these most translated words and their related queries. And thankfully, all of the words had some interesting peaks and dips going on. So I made the most extensive tooltip I've ever made. On the top it shows the worldwide trend of the English word over the last 5 years. With some digging into the periods displaying the peaks and dips I found explanations for each (and annotated them using Susie Lu's awesome annotation library). Below the line chart is a word cloud about the most interesting related queries. Mama was definitely my favorite, with related queries such as maternal insult and yo mama.
The final visual revolved around the similarities between the 10 languages' top 10 words. Each language was a circle and each top 10 word that two languages have in common becomes a line. The highest similarity, 4 words in common, happens between Spanish, Portuguese and Italian, but also for Russian and Polish. So there does seem to be something cultural there defining the most searched words :)
I first had to set up a system in which two "nodes" in a network could be connected by multiple "links". Typically people just use a thicker line to represent something where two nodes are more highly linked. But because I wanted to eventually replace the lines with their actual words, I had to find something else. Using ever more curved lines seemed like an elegant solution.
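One way to draw several links between the same two nodes with increasing curvature is to push the control point of a quadratic Bézier curve further and further away from the straight line between them. The sketch below is an assumption about how this could be done, not the formula used in the project; the function name and the `step` parameter are hypothetical.

```python
import math


def curved_links(x1, y1, x2, y2, count, step=20.0):
    """Return `count` SVG path strings between two nodes, each one
    curving a bit further away from the straight line than the last."""
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("nodes must be at different positions")
    # Unit vector perpendicular to the line between the two nodes.
    nx, ny = -dy / length, dx / length
    # Midpoint of the straight line.
    mx, my = (x1 + x2) / 2, (y1 + y2) / 2
    paths = []
    for j in range(count):
        offset = (j + 1) * step  # each extra link bends further out
        cx, cy = mx + nx * offset, my + ny * offset
        paths.append(f"M {x1} {y1} Q {cx:.1f} {cy:.1f} {x2} {y2}")
    return paths


links = curved_links(0, 0, 100, 0, count=2)
```

Each returned string is a separate path, so each shared word can later get its own `textPath` along its own curve.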
And then I replaced the lines by the words themselves, in a similar design as the tree ring, and saw; chaos...
I totally forgot to think about the fact that many lines would overlap, making it impossible to read the words. But also that the words themselves would make the visual extremely full and cluttered. Therefore, I put the lines back in and only kept the words for those links attached to the central language circle. That already felt better, but Alberto advised to only keep the central dark (English) word to really get the focus, which is what you see in the final result.
As with the word string visual, this one also has a difference between desktop and mobile. To maximize the space available on a mobile screen, the network is no longer a circle, but a rectangle, giving more space to the lines and the words on the lines.
I already talked about my biggest "struggle" with this visual in the sketch section; making sure the transition between the languages looked natural while always having the words appear (mostly) upright. This was actually more of an understanding problem than a coding issue, but there is 1 small "hack"; right after clicking an outside circle the words disappear and then I immediately replace the lines by their new "versions" (but in the "old" state) after which the transition starts, making it all look smooth :) (and nobody noticing there was ever something complex going on)
And all those separate elements combined turned into the following final result. Starting out at the highest level of the most translated word overall and diving ever more deeply into the differences and similarities between languages. Have fun playing around with it and seeing some of the expected or rather odd words that people want translated through Google Translate!
It was really awesome to work together with Google and Alberto Cairo (next to Shirley of course) and create an extensive visualization based on data that we could request from a source as vast as Google; data that is impossible to find online!
When Nadieh and I got the email from Alberto and Simon to work with Google News Lab, I was ecstatic and beyond intimidated. After all, it was Google, it was Simon Rogers and Alberto Cairo, and they had search data back to 2004. They had already published projects from Accurat and Truth & Beauty, and I wasn't sure if I could live up to them.
But I was determined to try my best.
Nadieh and I explored Google Trends and came up with several proposals, and Simon ended up choosing our Culture proposal. Nadieh was to look into language, the most common words a country searched for to translate into English. I wanted to look into travel, at what places a country searched for in another.
Since Simon preferred having the data we displayed live instead of being a snapshot in time, I subcontracted my friend Charles to build a web app with a database to serve up the data we queried from Google Trends. (Let me repeat, I subcontracted someone. For the first time. I feel so adult.) Since I usually work by myself on my data sketches, it was a great feeling to have another person work on this project with me; not only did he get and clean all the data so I didn't have to, he was great to bounce ideas off of since he was so intimately familiar with the data.
When I started, I wanted to know, given a country, which countries were searching for that country the most. I wanted to know, were they looking for cities in that country? Museums? Specific landmarks? And as I started digging into the data with Google Trends' Explore function, I also started to wonder if countries looked for places geographically closer to them; this thought came to mind when I saw that Australia, when looking at the U.S., searched primarily for places on the West Coast.
The way Google Trends works, we can put in a set of search terms (up to five) and get back their search interest, all the originating regions, as well as related and top topics over a specific time period. We can also specify a specific originating region, and a category to filter by. To get the data we needed, Jennifer Lee at the News Lab suggested we search for every country (with Google's list of country IDs) filtered by Tourist Destinations and for all time. Since Google Trends returns the search interest back to us as a relative value out of 100, she also suggested that we go through all the countries but leave one of the countries in as a baseline. That way, we can accurately get the top 20 countries by travel search interest.
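The baseline idea can be sketched as follows: because every batch of up to five terms is scaled to its own maximum of 100, keeping one country in every batch lets us rescale all batches to a common scale. This is an illustration only; the numbers are made up, the choice of Brazil as the baseline is arbitrary for the example, and the function is hypothetical rather than the code Charles wrote.

```python
def rescale_batches(batches, baseline):
    """Put several Google Trends style batches on one common scale.

    batches: list of dicts mapping a term to its relative interest (0-100)
    baseline: the term that appears in every batch
    """
    reference = batches[0][baseline]
    combined = {}
    for batch in batches:
        if batch[baseline] == 0:
            raise ValueError("baseline interest must be non-zero in every batch")
        # Scale this batch so its baseline matches the first batch.
        factor = reference / batch[baseline]
        for term, value in batch.items():
            combined[term] = value * factor
    return combined


# Made-up values: in the second batch Japan dwarfs the baseline.
batches = [
    {"Brazil": 100, "France": 50},
    {"Brazil": 40, "Japan": 100},
]
scaled = rescale_batches(batches, baseline="Brazil")
```

After rescaling, the values are no longer capped at 100, but they can be compared across batches and sorted to find a top 20.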
After getting the top 20 countries, we got the top regions that searched the most for each of those countries every quarter starting from 2004, and then the top topics those regions searched for. It sounded reasonable when we first came up with the queries, but when we got back the data, it was overwhelmingly vast. I meandered for weeks trying to make sense of the data, creating visuals to dig through the data, trying to figure out if there was an interesting story buried in there. At one point, we decided to get travel topics for all the countries around the world and not just the top 20, and we ended up with thousands of topics with hundreds of categories.
In order to make sense of the data beyond simple frequencies of countries searching for each other, we needed a way to categorize the data. We wanted categories that are specific enough to be meaningful but broad enough so that the viewer isn’t confused by the sheer number of categories. To determine which one of our 8 categories each topic belongs to, Charles pulled the topic details off of Google's Knowledge Graph Search, which includes images, descriptions, and an array of types for the topic. For each topic, its types in the Knowledge Graph correspond to a list of “tags” picked out of a predefined set. We immediately settled on cities, people, and nature being three of the categories. The remainder were more difficult to define -- among the types given by the Knowledge Graph, we wanted to pick the type of the proper specificity -- not too broad like “Thing” or “Place”, but also not too narrow like “LodgingBusiness” or “MovieTheater”. Charles produced a mapping of Knowledge Graph types to our chosen categories; for the 45 topics that either fell into multiple categories or no category, we manually assigned the category.
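A type-to-category mapping with a manual fallback could be sketched like this. The mapping entries below are hypothetical examples (only cities, people and nature are named above as categories), and the function is an illustration rather than Charles's actual implementation.

```python
# Hypothetical excerpt of a mapping from Knowledge Graph types to categories.
# Very broad types such as "Thing" and "Place" are deliberately left out.
TYPE_TO_CATEGORY = {
    "City": "cities",
    "Person": "people",
    "Mountain": "nature",
    "BodyOfWater": "nature",
}


def categorize(types):
    """Return the single category matching a topic's types, or None
    when the topic matches no category or several (manual review)."""
    matches = {TYPE_TO_CATEGORY[t] for t in types if t in TYPE_TO_CATEGORY}
    return matches.pop() if len(matches) == 1 else None


city = categorize(["Thing", "Place", "City"])
unknown = categorize(["Thing"])
ambiguous = categorize(["City", "Person"])
```

Topics that come back as `None` are the ones that would need a category assigned by hand.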
For my sketch and code sections, I really meandered trying to figure out my visuals and narrative. I started by exploring for Brazil (the top searched country for travel) which countries had searched for it in a given year, defaulting to 2016.
And this was really cool because I could see that the countries closer to Brazil did indeed search for it more, but I also wanted to see at a glance all the searches across the years. I thought up these pie charts placed above the centroid of each country, with the radius being the amount of search interest, and the colors being the years. I liked that it showed me that some countries were searching for Brazil since 2004, but others only started searching recently - this was interesting, but potentially misleading; we couldn't tell from our data if those countries only started becoming interested recently, or only started using Google recently. Alberto - who was responsible for our art direction - also advised me against the bubble chart/pie chart hybrid, since it could potentially confuse readers.
So I went back to the drawing board. This was around when I took my first stab at categorizing the topics (Charles did the more sophisticated version later on), and I wondered if instead of focusing on just one country at a time, I could show all the countries and their topics from the get go. The idea was that each topic was a block colored by their category and grouped by their year, in a circle around an outline of the country it belonged to: Though this version was certainly pretty, the circle format made it hard to compare across the years and across the countries (though it did give a good at-a-glance summary of the countries), and Alberto urged me to try a normal bar graph instead.
And since I was going to turn it into a bar chart, I thought maybe I could play around with the length of the blocks (which I couldn't before). I wondered if, on top of category and year, I could also encode the popularity of a topic as its width. This was an unfortunate mistake, and I call this piece The Plunger.
I did like my idea of trying to show the popularity of a topic, so this time around, I tried to do it with the radius of the circle. So each circle represented the topic searched by an originating region, and the more overlap there was for a set of circles (and thus darker) the more originating regions had searched for that topic for that country. I liked this much more than all my previous attempts, and iterated on it a bit more, including a section on the right that expanded all the topics in a year and lined them up by originating region. The originating regions were positioned by their proximity to the country being searched.
The expanded topic view didn't give as much insight as I was hoping for (especially since the x-axis was arbitrary), so I tried a different approach of showing just the selected topic over the years. To try to fix the overlap of the originating region names, I spaced them out evenly and also experimented with a heatmap to see if it would look cleaner.
I liked this last version enough, and got to brainstorming how I wanted to introduce the visuals and provide a key to reading them. I decided I would find the most searched for topic (Pattaya, Thailand) and write a story around it that would also introduce the bubble graph and heatmap. After digging through the two visualizations for any interesting insight into Pattaya and banging my head on the desk for an afternoon, I had to face the truth my friend was pointing out: perhaps I should rethink my visualizations, because though they were pretty enough, digging through them wasn't getting me anything useful.
So I went back to brainstorming and asking myself what it was that I wanted to learn about the data. I remembered in my digging for Pattaya the seasonal nature of some of these topics' search interests, and wondered if there was something interesting there - were certain continents searched for more in summer as opposed to winter or vice versa?
Here, the top section is Spring, and the bottom is Summer. Every time there is a tall set of blue blocks, it indicates the start of a continent, and the continents are ordered by closest to furthest from U.S. Each block is a topic the U.S. has searched for. And yet again the visual didn't go the way I was hoping for; it turns out for each continent, the number of topics searched for is always the same across the seasons. I was so bummed that I left the cafe I was working at, but on my drive home realized that I should try to size the height of each topic by their search interest. This actually gave interesting results for which I was really excited.
For each topic, I was curious about the rise and fall of it across the years. The idea of it was that the x-axis was the search value out of 100, and each circle was a year. An arc above meant that searches had increased across a year, whereas below indicated a decrease across a year. I learned quite a bit from this visual: for example, a lot of topics actually peaked in 2004 and have been declining since, with a lot of them dipping the most between 2008 and 2011 (the years of the financial market crashes). But as useful as those insights were, I had to admit that it took a lot of effort to get those insights from the visualization.
I put aside the visualizations for the topic details for a bit to work on the story. In mid-February, I had taken a Web Animation Workshop with Sarah Drasner and Val Head, where I learned the basics of how to animate with Greensock. With that knowledge, I wanted to create "scenes" that explained the visuals in detail, and my first pass was with scrollytelling. However, I was unsatisfied with how much vertical space I was taking to show as simple of a concept as topics and categories. I went back to working on the topic details, and went with Alberto's suggestion of a line chart, as well as the world map with bubbles over each originating region: With these views, I was able to explore and find an interesting story about the seasonality between Qin Shi Huang and his Terracotta soldiers, and also the stories to introduce topics, categories, and my originating regions visualization. I was especially interested in figuring out how to introduce those visualizations in a space-efficient way, and remembered the discussion between scrollytelling and stoppers a while back. I decided to try animating with the stoppers, and I like the way it turned out; with Greensock I was able to have smooth transitions between each step and it's probably the best looking set of animations I've ever made.
It's been quite an arduous journey, where I changed my visualizations almost every week for a month (and I felt super bad every time Simon and Alberto gave feedback and I came back with a completely different visual the next week), didn't know what to do with my data, and didn't know if I was going to find anything interesting to tell. I think a lot of the flip-flop really came from two things: the vastness of the data (and my lack of knowledge on how to sift through it efficiently for insights), and my not sticking to one question all the way through. I jumped around asking a bunch of different questions, but ended up with the same questions as at the beginning: what does a country search about other countries, and what do other countries search about it?
Despite all of that, I'm glad I was able to end up with something I can be proud of. I like the stories I found, and I like the exploration section I ended up with (though I wonder if I've used too many colors). And I'm really grateful to Simon and Alberto for giving me this opportunity and for their guidance, and of course for Nadieh's constant encouragement.