Building Data Visualizations on the Web

@phil_renaud

Slides: acadia.riot.industries/

Github Repo: github.com/philrenaud/datavis-workshop

1. Why do we care about data visualization?

We're pretty good at recording data.

Too good, probably.

On an average day, Twitter generates about 500 million tweets. That's about 15 times the number of books in the Library of Congress.
In August 2010, Eric Schmidt, former CEO of Google, said that between the beginning of time and 2003, humanity generated about 5 exabytes (5 x 10¹⁸ bytes) of data.
Now, we generate that much data roughly every 2 days.

These mountains of data add up. We record much more data than we can possibly analyze 1:1 in a lifetime.

Even when we aggregate data, it's not always clear what's going on.

And this is sort of a bummer, because

Data is only valuable when it's understood.

Even if you ignore the problems that truly big data presents for our understanding, we still have a very strong case for the importance of data visualization:

A: Some of our questions have conflicting answers.

It is a foregone conclusion that someone selling you something is biased toward their product.
The climate of most commonly-shared news sources online is one of commodified viewership. Your attention is what's being bought.
In the name of persuasion, some sources are simply unreliable -- they require analysis.
Data lets us explore, lets us fact-check. Visualization of this data lets us do that in an accessible way.

Data makes critical thinkers of us.

B: Sometimes we ask the wrong questions.

We might ask "How many votes did the candidate receive?"
We probably mean "How many votes did the candidate receive, relative to Other Candidate?"
Or, if we're trend-watching, "relative to their total from the last election", etc.

Data answers the questions we didn't know we had.

C: Sometimes the answer to a simple question is unsatisfying.

Q: How many Californians relocated from outside the USA?
A: About 28%
Okay but...
Q: Where do the other 72% come from?
Q: What about Californians who leave? Where do they go?
Q: How does this compare to last year? to 100 years ago?
Q: How does this compare to the US as a whole? To any given state?
nytimes.com/interactive/2014/08/13/upshot/where-people-in-each-state-were-born.html

D: Sometimes there are many distinct narratives to explore.

projects.fivethirtyeight.com/trump-approval-ratings

Data makes critical thinkers of us all

Anscombe's Quartet

4 Datasets with nearly-identical statistical properties

Anscombe's Quartet

Upon visual analysis, they could not be more distinct from one another.
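If you'd like to check the "nearly-identical" claim yourself, here's a minimal sketch in plain JavaScript (it runs in the Chrome console or in Node). The data values are the ones published in Anscombe's 1973 paper; the helper names (mean, correlation, summary) are made up for this example.

```javascript
// Anscombe's quartet. Datasets I-III share the same x values.
const x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5];
const quartet = {
  I:   { x: x123, y: [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68] },
  II:  { x: x123, y: [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74] },
  III: { x: x123, y: [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73] },
  IV:  { x: [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
         y: [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89] }
};

// Arithmetic mean of an array of numbers
const mean = (values) => values.reduce((sum, v) => sum + v, 0) / values.length;

// Pearson correlation coefficient between two equal-length arrays
function correlation(xs, ys) {
  const mx = mean(xs), my = mean(ys);
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < xs.length; i++) {
    sxy += (xs[i] - mx) * (ys[i] - my);
    sxx += (xs[i] - mx) ** 2;
    syy += (ys[i] - my) ** 2;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// One summary row per dataset, rounded for display
const summary = Object.entries(quartet).map(([name, d]) => ({
  dataset: name,
  meanX: mean(d.x).toFixed(2),
  meanY: mean(d.y).toFixed(2),
  correlation: correlation(d.x, d.y).toFixed(2)
}));

console.table(summary);
```

Every row should show a mean x of 9.00, a mean y of about 7.50, and a correlation of about 0.82. The summary table can't tell the four datasets apart; a plot can.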

Statistical Detection of ... ?

source

Statistical Detection of Election Fraud

source

Visualization lends meaning to raw data.

2. Different tools for different jobs

In terms of static data visualization, we've been working at this for a long time. From 1910:

Good data design adheres to the properties of the visual encodings it uses

Certain types of visualizations are well-suited to particular data types:

Bar charts are boring, but well-suited to allow us to quickly determine differences in quantity:

Sometimes pie graphs are okay for this too, but be warned: at a glance, people don't see area, they see height.

Treemaps show quantity, plus a sense of hierarchy

Maps provide several ways to show physical location.

Maps are particularly versatile because of our common understanding of their expected shape; differences from our expectations let us see a different kind of story:

This cartogram shows expected population by 2300. It only works because it distorts a shape of the world that we already know.

Time-Series charts show how quantitative values change over time:

Scatter plots show us the differences between things in two dimensions

Network graphs show us the relationships between things - for example, this map of all of the followers of @CollideHalifax

3. Some common problems to avoid

So, when I see a Visualization, can I be sure that the Data Story I'm being told is correct?

Not always! Be critical.
The 3 kinds of lies, as popularized by Mark Twain:

"Lies, Damned Lies, and Statistics".

Avoid these common deceptions when designing data stories:

A: Maps as population indicators

NYC has ~ the same population as the shaded area

Slate's Equal Population Mapper

An aside: "The Map is not the Territory" -Alfred Korzybski

Geographic maps are often particularly tricky to parse correctly, because we have an innate desire to see territory when we want to see attributes of that territory.
Consider the many different ways one can represent the same map:
nymag.com/intelligencer/2018/03/a-new-2016-election-voting-map-promotes-subtlety.html
nytimes.com/interactive/2018/11/06/us/elections/results-house-elections.html
xkcd.com/1939/

B: Cumulative Time-Series Charts

C: Circles in general.

We're meant to see area; we commonly just see height.
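To put numbers on that, here's a rough sketch in plain JavaScript. The function names and the sample values (50 and 100, with a 40px maximum radius) are invented for illustration. It compares a circle sized so its area tracks the value against one whose radius tracks the value.

```javascript
// Radius for a circle whose AREA is proportional to the value
function radiusForArea(value, maxValue, maxRadius) {
  return maxRadius * Math.sqrt(value / maxValue);
}

// Radius that scales directly with the value (area then grows with the square)
function radiusLinear(value, maxValue, maxRadius) {
  return maxRadius * (value / maxValue);
}

const circleArea = (r) => Math.PI * r * r;

// Hypothetical values: one thing is half the size of another
const half = 50, full = 100, maxRadius = 40;

const areaScaled = radiusForArea(half, full, maxRadius);
const radiusScaled = radiusLinear(half, full, maxRadius);

// How big the "half" circle's area is, relative to the "full" circle
console.log((circleArea(areaScaled) / circleArea(maxRadius)).toFixed(2));   // 0.50
console.log((circleArea(radiusScaled) / circleArea(maxRadius)).toFixed(2)); // 0.25

// How tall the area-scaled "half" circle is, relative to the "full" circle
console.log((areaScaled / maxRadius).toFixed(2)); // 0.71
```

Scale the radius and a half-sized value gets a quarter of the ink. Scale the area and the circle is honest, but it still stands 71% as tall as the full one, so a reader judging by height will overestimate it.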

D: Pie charts when dealing with close values

(Bar charts are boring but we're really really good at reading them)

E: Bar charts with arbitrary axis points
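A quick way to see how much a truncated axis exaggerates: compute the drawn bar heights yourself. This is a sketch with made-up numbers (values of 95 and 100 on a 200px-tall chart), and barHeight is a hypothetical helper, not part of any library.

```javascript
// Drawn height (in pixels) of a bar, given where the axis starts and ends
function barHeight(value, axisMin, axisMax, chartHeight) {
  return (value - axisMin) / (axisMax - axisMin) * chartHeight;
}

// Hypothetical values that differ by about 5%
const smaller = 95, larger = 100, chartHeight = 200;

// Axis starting at zero: bars differ by about 5%, like the data
const zeroBased = barHeight(larger, 0, 100, chartHeight) / barHeight(smaller, 0, 100, chartHeight);

// Axis starting at 90: the same data now looks like a 2x difference
const truncated = barHeight(larger, 90, 100, chartHeight) / barHeight(smaller, 90, 100, chartHeight);

console.log(zeroBased.toFixed(2)); // 1.05
console.log(truncated.toFixed(2)); // 2.00
```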

F: Seeing correlations when they're coincidental

source

4. Notable Interactive Visualizations and what makes them Great

The State of the Union in Context

source
source2

Bar Charts, Interactive

Casting Shakespeare

source

Scatter plot/ Time series, Scrolling narrative.

BBC: Scotland Decides

source

Normalized bar chart, Map. Allows user to sort data.

NPR - When are people working?

source

Area Graph. Allows user to visually compare.

Musicians' Deaths by Age

source

Time Series. Allows user to filter based on category.

Bachelors Degrees by Demographics

source

Visualizing MBTA Data

source

Parallel Coordinates, Maps, Bar and Area. Allows user to explore relative and absolute time, and compare stations/trains.

NY Times: How the Recession Reshaped the Economy, in 255 Charts

source

Loose interaction; the act of scrolling shows comparisons between industries. Sparklines. At the bottom, Small Multiples show the fuller picture.

The primary theme among the interactions presented here is that they allow users to complete a few common, basic actions:
Sort: Representing the data in different orders can reveal trends otherwise unseen
Filter: Only show the data that meets certain user-defined criteria
Compare: Allow the user to take pieces of the whole and see how they relate.
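In code, these three actions are usually only a line or two each. A minimal sketch in plain JavaScript, using a tiny made-up dataset (the product names and numbers are invented for illustration):

```javascript
// Hypothetical data, invented for this example
const products = [
  { name: 'Fuel',     imports2008: 120, imports2009: 80 },
  { name: 'Pharma',   imports2008: 40,  imports2009: 55 },
  { name: 'Textiles', imports2008: 60,  imports2009: 45 }
];

// Sort: largest 2009 imports first (copy first, so the original order survives)
const sorted = [...products].sort((a, b) => b.imports2009 - a.imports2009);

// Filter: only the products that grew from 2008 to 2009
const grew = products.filter((p) => p.imports2009 > p.imports2008);

// Compare: each product's share of the 2009 total
const total2009 = products.reduce((sum, p) => sum + p.imports2009, 0);
const shares = products.map((p) => ({
  name: p.name,
  share: (p.imports2009 / total2009 * 100).toFixed(1) + '%'
}));

console.table(sorted);
console.table(grew);
console.table(shares);
```

The interactive part is letting the user pick the sort key, the filter condition, and the things being compared.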

Why do these visuals feel so compelling, compared to traditional static media, or data in chart format?
The general pathway to understanding:
- Unstructured Information
- Data
- Structured Information
- Knowledge
- Wisdom
Visualization helps bridge the gap between Knowledge and Wisdom. Interactivity prompts exploration, and exploration leads to understanding.
And smarter users = good for everyone.

<Intermission>

(When we come back: D3, Observable, and Gephi)

</Intermission>

5. Little Data Lab

What's your preferred code editor?

Sublime Text?
Visual Studio?
Eclipse?
Vim?
R Studio?
What about Chrome?

A skill that data scientists often underdevelop is the ability to derive meaning from data on the fly.

Sometimes you don't have your ideal tools available
Even if you did, sometimes the person to whom you're telling a data story doesn't understand your raw data.
We're going to use one of the most maligned programming languages, JavaScript, to build an on-the-fly data visualization toolkit.
Time to open some tabs!

Head over to
tinyurl.com/acadia-import
Copy the import code

When you have that copied, head over to
tinyurl.com/acadia-trade
Open up your Chrome dev tools console - CMD + ALT + i
(or right-click and select "Inspect Element", then hit ESC)
Paste in your import code from tinyurl.com/acadia-import and hit enter
Congratulations! You now have access to a very advanced data visualization toolkit.

Aside: making Chrome Dev Tools more comfortable

You can clear your console with cmd + k
You can revisit previous history with ↑
You can access a control panel with cmd + shift + p
(try typing theme)

Aside: making Chrome Dev Tools more comfortable

Try declaring a simple array:


thing = [5, 10, 15];

Use console.table(thing) to see a manipulable interface:


console.table(thing);

You can do the same with an object, or an array of objects, or... mostly anything!
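For instance, here's a small sketch with an array of objects. The team names and numbers are placeholders made up for this example. console.table also accepts an optional second argument listing which columns to show.

```javascript
// Hypothetical records, invented for illustration
const teams = [
  { team: 'Team A', wins: 10, losses: 5 },
  { team: 'Team B', wins: 7,  losses: 8 },
  { team: 'Team C', wins: 12, losses: 3 }
];

// One row per object, one column per property
console.table(teams);

// Only show the columns you care about
console.table(teams, ['team', 'wins']);
```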

Okay, back to it.

Here's what you've just added to this page:

jQuery: document traversal and manipulation. "What's the parent element of the first table?"
Underscore.js: Data manipulation and parsing. "What's the total of the values in this array?"
D3.js: Data-as-a-document and visualization. "Plot the min and max values as colors on the red-to-green scale"

Let's try manipulating our data table.

A sample exercise: Use jQuery to find a given cell and change its background color


$('td').eq(150).css('background-color', 'yellow')

"Which year had the highest total imports?"

jQuery: Show me all the cells in the last row:


$('tr:last-child td')

Okay! Underscore: what's the lowest value among them?


lowVal = _.min(
  $('tr:last-child td'), function(obj,iter){
    // Strip thousands separators and parse, so cells compare as numbers rather than text
    return parseFloat($(obj).text().replace(/,/g,''))
  }
)
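One thing to watch for here: table cells hold text, and text compares character by character, not by size. A quick sketch in plain JavaScript shows the trap and the fix (the cell contents are made up, and toNumber is a hypothetical helper):

```javascript
// Hypothetical cell contents, as they'd come out of .text()
const cellText = ['1,234', '999', '12,000'];

// Compared as text, "1234" sorts before "999" because "1" comes before "9"
console.log('1234' < '999'); // true

// Strip the thousands separators and parse before comparing
const toNumber = (text) => parseFloat(text.replace(/,/g, ''));
const lowest = Math.min(...cellText.map(toNumber));

console.log(lowest); // 999
```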

jQuery, highlight it:


$(lowVal).css('background-color','yellow')

Pre-made snippet for you:

tinyurl.com/acadia-gist1

Let's build on this. Let's highlight the import trends of each product from year to year.

Let's iterate (map) over each row (product) and:

1. figure out the min and max values
2. use D3 to paint the min values red, max values green, and scale the in-between accordingly
3. Iterate (map) over each cell in the row and paint its background according to where it sits in the d3 domain


$('tr').each(function(iter,row){
  var cells = $(row).children('td')
  // Parse each cell ("1,234" -> 1234) so values compare as numbers, and skip non-numeric cells
  var values = cells.map(function(iter,cell){
    return parseFloat($(cell).text().replace(/,/g,''))
  }).get().filter(function(value){ return !isNaN(value) })
  // d3.scale.linear is the D3 v3 API; in D3 v4 and later it is d3.scaleLinear
  var color = d3.scale.linear()
    .domain([_.min(values), _.max(values)])
    .range(["#c33", "#0fc"]);
  cells.each(function(iter,cell){
    var value = parseFloat($(cell).text().replace(/,/g,''))
    if (!isNaN(value)) $(cell).css('background-color', color(value))
  })
})
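If you're curious what a linear color scale is doing, here's a rough stand-in written in plain JavaScript. This is a sketch of the idea, not D3's actual implementation: linearColorScale and hexToRgb are hypothetical helpers, the [0, 100] domain is made up, and the colors are written as six-digit hex to keep the parsing simple.

```javascript
// Parse a six-digit hex color like "#cc3333" into [r, g, b]
function hexToRgb(hex) {
  const n = parseInt(hex.slice(1), 16);
  return [(n >> 16) & 255, (n >> 8) & 255, n & 255];
}

// Returns a function mapping a number in [min, max] to a color between the two range colors
function linearColorScale(domain, range) {
  const [min, max] = domain;
  const [from, to] = range.map(hexToRgb);
  return function (value) {
    // t is 0 at min and 1 at max; guard against a zero-width domain
    const t = max === min ? 0 : (value - min) / (max - min);
    const channels = from.map((c, i) => Math.round(c + (to[i] - c) * t));
    return 'rgb(' + channels.join(', ') + ')';
  };
}

// Hypothetical domain: lowest value 0, highest value 100
const colorFor = linearColorScale([0, 100], ['#cc3333', '#00ffcc']);

console.log(colorFor(0));   // rgb(204, 51, 51)
console.log(colorFor(50));  // rgb(102, 153, 128)
console.log(colorFor(100)); // rgb(0, 255, 204)
```

Each cell's value is turned into a position between the row's min and max, and that position picks a point on the line between the two colors.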

Pre-made snippet for you:

tinyurl.com/acadia-gist2

Other countries to experiment with:

tinyurl.com/acadia-trade-all

Some observations:

2009: obvious downturn across almost all industries
Looking for a common exception? ctrl+f "Pharma" in almost every major country
Canada/Mexico: Food up, household goods down.
China: almost everything up
Iraq: the year before the US unilaterally pulled out of the war, almost all industries had an uptick
Haiti: notable uptick in textiles (cotton, wool)
Syria: downtick across the board
Egypt: Fuel and Natural Gas dropped off the face of the earth
Belize: Fruits are down, vegetables are up (Thanks, paleo diet!)

Other visualizations, other data types

Let's build a bar chart over columnar data

tinyurl.com/acadia-baseball

Open up your Chrome dev tools console - CMD + ALT + i
(or right-click and select "Inspect Element", then hit ESC)
Paste in your import code from
tinyurl.com/acadia-import
and hit enter

Let's highlight the Batting Average column:


$('#teams_standard_batting td:nth-child(19)').css('background-color','yellow')

Now, use our iterative method from before to turn that column into a heatmap:

var column = $('#teams_standard_batting td:nth-child(19)')
// Read each cell as a number, then pull out a plain array with .get()
var values = column.map(function(){ return parseFloat($(this).text()) }).get()
var max = _.max(values)
var min = _.min(values)

// Red for the lowest value, green for the highest
var color = d3.scale.linear()
  .domain([min, max])
  .range(["#c33", "#0fc"]);

column.each(function(iter,cell){
  $(cell).css('background-color', color(parseFloat($(cell).text())))
})

Pre-made snippet for you:

tinyurl.com/acadia-gist3

Let's make a bar chart!

Since we already know the "max" value, we can assign that a max-width bar.
Every other value will be its ratio of that max bar's width

var column = $('#teams_standard_batting td:nth-child(13)')
var values = column.map(function(){ return parseFloat($(this).text()) }).get()
var max = _.max(values)
var min = _.min(values)
var barwidth = 250

// Build the color scale once, outside the loop
var color = d3.scale.linear()
  .domain([min, max])
  .range(["#c33", "#0fc"]);

column.each(function(iter,cell){
  var value = parseFloat($(cell).text())
  // Prepend an empty element to act as the bar (a div is assumed here; any element with class "bar" works)
  $(cell).prepend('<div class="bar"></div>');
  $(cell).children('.bar')
    .css('width', value / max * barwidth)
    .css('background-color', color(value))
})

$(column).css('text-align', 'left')
$('.bar').css('height', '10px').css('float', 'left')
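The width calculation on its own, as a minimal sketch in plain JavaScript. The sample values are invented, and the names (sampleValues, largest, maxBarWidth) are chosen so they won't collide with anything already in your console.

```javascript
// Hypothetical column values, invented for illustration
const sampleValues = [120, 150, 90];
const maxBarWidth = 250; // width in pixels of the longest bar

// The largest value gets the full width; everything else gets its share of it
const largest = Math.max(...sampleValues);
const widths = sampleValues.map((v) => Math.round(v / largest * maxBarWidth));

console.log(widths); // [ 200, 250, 150 ]
```

Because every bar is measured against the same maximum, the bars start from a common zero baseline, which is what makes them easy to compare.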

Pre-made snippet for you:

tinyurl.com/acadia-gist4

6. Notebooks as a Computational Essay

Head over to beta.observablehq.com/@philrenaud and follow along with the notebook

7. Introduction to Network Science

Head over to ncase.me/crowds and follow along with the interactive

8. A practical introduction to Gephi

If you haven't already done so, head over to gephi.org and download a free copy

Gephi is a tool to visualize and explore network graph data.

Lots of things can be networks! Any list of things ("nodes") with relationships to one another ("edges") will do: for example, people who follow one another on Twitter, or words adjacent to one another in a book.
Let's grab a list of the characters in Game of Thrones who share a scene: tinyurl.com/acadia-got. Open the raw file for got-network.graphml and download it somewhere you can access it.
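To make "nodes" and "edges" concrete, here's a small sketch in plain JavaScript that turns the adjacent-words idea into a network. The sentence is made up, and the source/target/weight field names are just one common way to describe an edge list, not something Gephi requires in this exact form.

```javascript
// Hypothetical text, invented for illustration
const sentence = 'the cat saw the dog and the dog saw the cat';
const words = sentence.split(' ');

// Nodes: every distinct word
const nodes = [...new Set(words)];

// Edges: every pair of adjacent words, counted regardless of order (undirected)
const edgeWeights = new Map();
for (let i = 0; i < words.length - 1; i++) {
  const pair = [words[i], words[i + 1]].sort().join('|');
  edgeWeights.set(pair, (edgeWeights.get(pair) || 0) + 1);
}

// Turn the counts into an edge list
const edges = [...edgeWeights].map(([pair, weight]) => {
  const [source, target] = pair.split('|');
  return { source, target, weight };
});

console.log(nodes);
console.table(edges);
```

The weight column counts how often two words sit next to each other. In the Game of Thrones file, the same idea applies to characters who share a scene.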

Open Gephi and Open got-network.graphml
Options for Undirected, Directed, or Mixed graphs
Number of Edges and Nodes listed

Observe the 3 main views of Gephi: Overview, Data Laboratory, and Preview
In Overview, let's open our Statistics panel and segment our nodes with Modularity
Next, we'll change the appearance of our nodes and edges under the Appearance panel
Finally, let's explore some of the built-in layouts with the Layouts panel; starting with Force Atlas 2, and Yifan Hu

Thanks!

@phil_renaud

Slides: acadia.riot.industries

Github Repo: github.com/philrenaud/datavis-workshop
