The write-up below isn't exactly the same as the one found in the Data Sketches book. For the book we've tightened up our wording, added more explanations, extra images, separated out lessons and more.
Hi, let me start with some context: Nadieh and I originally came up with the topic back in March 2017, because the first ever data visualization community survey had come out (there have been two more since), and I wanted to do something with the results. I started on the project right after OpenVis Conf in Boston (where Nadieh and I both got to present about data sketches!), and it took until late September to complete the visualization and accompanying blog post. And honestly it only took that long because I'm really slow and abhor writing, and then afterwards I was so done with writing about the project that I put off the process write-up until now...June 2019.
It's been quite a while since I've worked on the project (and a lot has happened since), so I'm going to try my best to remember what I did and what my motivations were. I feel like I'm playing detective: all I have to go off of are a bunch of screenshots, GitHub commits, and sketchbook entries, and I just have to piece it all together.
So here I go~
This was probably one of the best months for data: the survey had already happened, and Elijah (who had instigated the survey) had already been so awesome as to clean all the survey responses. It had 45 questions and 981 responses, all in a nicely formatted - and let me repeat, CLEANED UP - CSV file.
Honestly, I don't think I've ever had it easier. That's probably why I wanted to do the project in the first place 😜
So with my data collecting and cleaning already done for me (hehe), I went on to explore the data. The very first thing I did to try and understand my data was to list out all the questions I found interesting/relevant:
Since the premise of the survey was Elijah's claim that dataviz practitioners were leaving because of the current state of the industry, my primary question was: why might people leave the field? But because there's no direct way to measure that with the survey responses, I decided to go with a proxy: "Do you want to spend more time or less time visualizing data in the future?" I then organized the relevant questions into categories: how they got to their dataviz jobs, the aspects of their role that might affect their job satisfaction, and finally, their frustrations.
One of the talks at OpenVis Conf had been about Vega-Lite, a charting library for quickly composing interactive graphics, and I decided to give it a try for my data exploration. Below is a set of histograms, each of which answers the question "Percent of your day focused on X":
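For anyone curious what that looks like in practice, here's a rough sketch of a Vega-Lite spec that repeats one histogram across several "percent of your day" columns. It's not the spec I actually wrote back then: the column names and the `survey.csv` path are made up for illustration, so swap in the real CSV headers.

```typescript
// Hypothetical column names: the survey's actual CSV headers may differ.
const focusColumns = [
  "percentDataViz",
  "percentDataPrep",
  "percentDataEngineering",
  "percentDataScience",
  "percentDesign",
];

// A Vega-Lite spec is plain JSON, so an object literal is enough.
const histogramSpec = {
  data: { url: "survey.csv" }, // assumed file location
  // Draw the same chart once per column, laid out side by side.
  repeat: { column: focusColumns },
  spec: {
    mark: "bar",
    encoding: {
      // Bin the repeated field along x...
      x: { field: { repeat: "column" }, type: "quantitative", bin: true },
      // ...and count the responses that fall in each bin.
      y: { aggregate: "count", type: "quantitative" },
    },
  },
};
```

A spec like this would then be handed to a renderer (vega-embed is the usual one), which is what makes it so quick for exploration: changing the question is just changing a field name.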
Which taught me that most people don't get to focus on any one part of the dataviz creation process, and probably have to juggle most or all of the parts. I then tried to put some of the qualitative questions into bar charts:
And that taught me not to try visualizing any of the open-ended qualitative questions (because I'm not sure if I got anything useful from those bar charts), but rather to stick to those questions that had quantitative or multiple-choice answers.
When thinking about why people might want to leave the field, the very first question I had was: are there any correlations between how much of their day is spent on creating dataviz, and whether they want to do more or less of it in the future?
So I used a stacked bar chart, where the y-positions were the percent of day spent on data visualization, with the top bars being 0-10% and the bottommost being 90-100%. Colors represent whether they want to do more/less dataviz, with bars to the left of the gap being "much less" or "less", and to the right being "same", "more", or "much more".
But it turned out that the majority of the respondents wanted to do the same amount or more dataviz creation - which might actually be more a bias of the survey, since the people filling out the survey would probably want to stay in the field going forward. So I wondered if that would change if I added the other parts of dataviz creation into it (data prep, data engineering, data science, design):
I also wondered if I could identify what part of the process people were most frustrated with, by seeing how much of their day was dedicated to that task, and whether they wanted to do more or less of it. Unfortunately, this is what it looked like with the data in:
It's definitely another case of good intentions, didn't explore the data enough. Also, I was only thinking about how to mash those three questions together and wasn't thinking about readability or how easy it'd be to understand (zero sympathy for the end user...who's usually me).
But the biggest problem with it was that it didn't tell me much: I was showing absolute numbers instead of percentages, so I couldn't compare between the bars. At the same time, I still wanted to show the absolute numbers, because I wanted the user to be able to hover over (or brush and filter) and see the individual responses. And then it occurred to me: beeswarm plots.
For the second iteration, I decided to make the individual responses more obvious by representing them as dots in a beeswarm plot (which has the nice added benefit of being more compact and thus easier to glance through than the weird bar chart I had before). I also decided to not concentrate on the different dataviz functions, and concentrate on a different question: whether dataviz is a focus of their work. The dots were split vertically into the answers they gave, with top being "primary", middle being "secondary", and bottom being "one of several". And instead of using whether the respondent wanted to do more or less dataviz as the proxy (there were too many that answered "more", and not enough with "less" to make it an even split), I decided to use whether they answered with frustrations as the proxy instead. I placed those that answered with frustrations on the left of the split, and those without on the right. The color was their years of experience doing data visualization, and a filled circle meant that they intended to go into dataviz.
I liked this first prototype, but was bothered by how hard it was to compare those that did or did not answer with a frustration, so I put them on top of each other. I also realized that whether dataviz was their primary focus (or other questions like whether they had dataviz leadership) probably wouldn't have much relation with how many years they had been doing dataviz, so I updated the metric to percent of day focused on dataviz and colored by that. And finally, to make it easy to compare across answers - or even whether the respondent had frustrations or not - I added a box-and-whisker plot on top of the beeswarms.
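If you haven't built a box-and-whisker plot by hand before, the numbers behind it are just a five-number summary per group. Here's a minimal sketch of that calculation, assuming the whiskers run all the way to the minimum and maximum (one common convention, and not necessarily the one I used) and that quartiles are linearly interpolated. The `boxStats` and `quantile` names are made up for this example.

```typescript
interface BoxStats {
  min: number;
  q1: number;
  median: number;
  q3: number;
  max: number;
}

// Linearly interpolated quantile of an ascending-sorted array, p in [0, 1].
function quantile(sorted: number[], p: number): number {
  const pos = (sorted.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Five-number summary for one group of responses
// (e.g. percent of day focused on dataviz).
function boxStats(values: number[]): BoxStats {
  if (values.length === 0) {
    throw new Error("boxStats needs at least one value");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = values.slice().sort((a, b) => a - b);
  return {
    min: sorted[0],
    q1: quantile(sorted, 0.25),
    median: quantile(sorted, 0.5),
    q3: quantile(sorted, 0.75),
    max: sorted[sorted.length - 1],
  };
}
```

Computing one summary for the respondents with frustrations and one for those without gives two boxes that can be compared at a glance, which is harder to do with the dots alone.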
I wasn't quite happy with the look of the box plot, and when I showed it to my friend RJ, he immediately suggested putting the box plot in the middle, with those without frustrations above it and those with frustrations below. He explained to me the importance of visual metaphors: those with frustrations should drip down (like they're weighed down) and those without frustrations should rise up.
I implemented the split beeswarm by using D3's force layout (and in particular, the positioning and collision forces) to calculate the position of each dot, but at each tick I made sure the dots didn't go past a certain vertical position (a bounded force layout):
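To make the "bounded" part concrete, here's a dependency-free sketch of the idea. It is not my original code and it doesn't use D3: the hand-rolled `tick` function below is a simplified stand-in for what D3's `forceX`, `forceY`, and `forceCollide` would do, and the `Dot` shape, `strength` value, and `baseline` parameter are all assumptions for illustration. The part that matters is the last loop, which clamps each dot to its side of the baseline after every tick.

```typescript
interface Dot {
  x: number;
  y: number;
  targetX: number; // where the data value sits on the x-axis
  radius: number;
  frustrated: boolean; // frustrated dots sink below the baseline
}

// One simulation step. Screen coordinates: larger y is further down.
function tick(dots: Dot[], baseline: number, strength = 0.1): void {
  // Positioning: ease every dot toward its x target and the baseline.
  for (const d of dots) {
    d.x += (d.targetX - d.x) * strength;
    d.y += (baseline - d.y) * strength;
  }
  // Collision: push each overlapping pair apart by half the overlap.
  for (let i = 0; i < dots.length; i++) {
    for (let j = i + 1; j < dots.length; j++) {
      const a = dots[i];
      const b = dots[j];
      let dx = b.x - a.x;
      let dy = b.y - a.y;
      let dist = Math.sqrt(dx * dx + dy * dy);
      const minDist = a.radius + b.radius;
      if (dist >= minDist) continue;
      if (dist === 0) {
        // Coincident dots: pick a vertical direction to separate them.
        dx = 0;
        dy = 1;
        dist = 1;
      }
      const shift = (minDist - dist) / 2;
      a.x -= (dx / dist) * shift;
      a.y -= (dy / dist) * shift;
      b.x += (dx / dist) * shift;
      b.y += (dy / dist) * shift;
    }
  }
  // Bounding: no dot may cross the baseline into the other group's half.
  for (const d of dots) {
    d.y = d.frustrated
      ? Math.max(d.y, baseline + d.radius)
      : Math.min(d.y, baseline - d.radius);
  }
}
```

Running `tick` a couple hundred times lets the dots settle into two swarms that grow away from the baseline in opposite directions. With an actual D3 simulation, the same clamp would live in the tick handler.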
I then added the box plot in the middle (though it took quite a few tweaks to get it to a satisfying point), and I had the visual for this month:
After that, I compiled the list of relevant questions:
I put two of the visualizations side-by-side, with the ability to change the input questions with a dropdown and filter the responses with a brush. I used React.js to help me manage all the state changes (though it was definitely a lot easier than when I had to keep track of the filters for Hamilton), and in retrospect, I'd definitely update the box plot on filter too.
I wanted to have at least two of the questions visualized at any given time because I wanted to compare between them and see if I could find any correlations. I knew early on that I'd want to use it for analysis, and to write a blog post with my learnings (I was on the hook to publish something for Visualizing, The Field that year). So with my exploratory tool completed, I went on to look at each question and jot down any interesting things I noticed:
To try and figure out what might cause people to leave the field, I decided to define a "successful" dataviz role as one that got to spend a large percent of their time on dataviz, with a higher perceived salary. I used those two metrics, and looked through the answers for what might correlate with "unsuccessful" dataviz roles, and found that those:
correlated the most with lower perceived salaries, and less time on dataviz. I then filtered by respondents that fell in those categories, and collected all of their frustrations:
I tried my best to categorize their frustrations, and with what I learned (most frustrations seemed to fall under two primary categories: those stemming from others in the organization, and those from the technology) I wrote about what we could do to potentially alleviate these frustrations. The blog post can be found here.
And this is my final exploratory tool:
Even though I really, really abhorred all the writing, I'm really happy about the content of my blog post. And though there are always things I want to improve, I'm also satisfied with my final visualization. I was able to take away two important lessons: to use Vega-Lite or other similar charting libraries to quickly explore the data, and to use visual metaphors to better communicate the nature of the dataset (and more often than not, make the visualization more interesting). Most importantly, I'm glad of all that I learned analyzing the survey data and writing about the community's frustrations. It has informed a lot of what I do inside and outside of my work: to prototype ideas and designs with my clients while doing my best to educate them on different charts and their uses, to get better at designing effective visualizations, and to create workshops and talks geared towards frontend developers to make D3.js and dataviz more approachable.
In February 2017, our friend Elijah Meeks made a bold claim: most people in data visualization end up leaving because there’s something wrong with the current state of the field. It stirred up quite a bit of conversation, and resulted in a community survey with 45 questions and 981 responses. By mid-March, Elijah had cleaned, anonymized, and uploaded all the data onto GitHub. And I knew I had to do something with that data.