Building Data Visualizations on the Web

@phil_renaud

Slides: acadia.riot.industries/

Github Repo: github.com/philrenaud/datavis-workshop

1. Why do we care about data visualization?

We're pretty good at recording data.

Too good, probably.

  • On an average day, Twitter generates about 500 million tweets. That's about 15 times the number of books in the Library of Congress.
  • In August 2010, Eric Schmidt, former CEO of Google, said that between the beginning of time and 2003, humanity generated about 5 exabytes (5 × 10^18 bytes) of data.
  • Now, we generate that much data roughly every 2 days.

These mountains of data add up. We record much more data than we can possibly analyze 1:1 in a lifetime.

Even when we aggregate data, it's not always clear what's going on.

And this is sort of a bummer, because

Data is only valuable when it's understood.

Even if you ignore the problems that truly big data presents for our understanding, we still have a very strong case for the importance of data visualization:

A: Some of our questions have conflicting answers.

  • It is a foregone conclusion that someone selling you something is biased toward their product.
  • The climate of most commonly-shared news sources online is one of commodified viewership. Your attention is what's being bought.
  • In the name of persuasion, some sources are simply unreliable -- they require analysis.
  • Data lets us explore, lets us fact-check. Visualization of this data lets us do that in an accessible way.

Data makes critical thinkers of us.

B: Sometimes we ask the wrong questions.

  • We might ask "How many votes did the candidate receive?"
  • We probably mean "How many votes did the candidate receive, relative to Other Candidate"
  • Or, if we're trend-watching, "relative to their total from the last election", etc.

Data answers the questions we didn't know we had.

C: Sometimes the answer to a simple question is unsatisfying.

D: Sometimes there are many distinct narratives to explore.


Data makes critical thinkers of us all

Anscombe's Quartet

  • 4 datasets with nearly identical statistical properties
  • [Figure: Anscombe's Quartet]
  • Upon visual analysis, they could not be more distinct from one another.
  • [Figure: Anscombe's Quartet]
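
To make "nearly identical statistical properties" concrete, here is a minimal sketch in plain JavaScript that computes the mean of x, the mean of y, and the Pearson correlation for each of the four datasets. The values are the published Anscombe (1973) data; the helper names (`mean`, `pearson`, `summarize`) are illustrative and not part of the workshop toolkit.

```javascript
// Anscombe's quartet: datasets I to III share the same x values.
const x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5];
const quartet = {
  I:   { x: x123, y: [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68] },
  II:  { x: x123, y: [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74] },
  III: { x: x123, y: [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73] },
  IV:  { x: [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
         y: [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89] }
};

// Arithmetic mean of an array of numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Pearson correlation coefficient between two equal-length arrays.
function pearson(xs, ys) {
  const mx = mean(xs), my = mean(ys);
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < xs.length; i++) {
    sxy += (xs[i] - mx) * (ys[i] - my);
    sxx += (xs[i] - mx) ** 2;
    syy += (ys[i] - my) ** 2;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Round to a fixed number of decimals so the similarity is easy to see.
function round(value, decimals) {
  const factor = 10 ** decimals;
  return Math.round(value * factor) / factor;
}

function summarize(dataset) {
  return {
    meanX: round(mean(dataset.x), 2),
    meanY: round(mean(dataset.y), 2),
    r: round(pearson(dataset.x, dataset.y), 3)
  };
}

const summary = Object.keys(quartet).map(name =>
  Object.assign({ name: name }, summarize(quartet[name]))
);
console.table(summary);
```

The table that comes out looks almost the same on every row, which is exactly why the plots are needed to tell the four datasets apart.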

    Statistical Detection of Election Fraud

    Elections data

    source

    Visualization lends meaning to raw data.

    2. Different tools for different jobs

  • In terms of static data visualization, we've been working at this for a long time. From 1910:
  • Properties of Visual Encodings
  • Good data design adheres to the properties of the visual encodings it uses
  • Properties of Visual Encodings
    • Certain types of visualizations are well-suited to particular data types:
    • Bar charts are boring, but well-suited to allow us to quickly determine differences in quantity:
    • Bar Chart
    • Sometimes pie graphs are okay for this too, but be warned: at a glance, people don't see area, they see height.
    • Treemaps show quantity, plus a sense of hierarchy
    • Tree Chart
    • Maps provide several ways to show physical location.
    • Heatmap Pizza in Halifax
    • Maps are particularly versatile because of our common understanding of their expected shape; differences from our expectations let us see a different kind of story:
    • cartogram
    • This cartogram shows expected population by 2300. It has an effect only because it departs from the shape of the world we already know.
    • Time-Series charts show how quantitative values change over time:
    • Timeseries
    • Scatter plots show us the differences between things in two dimensions
    • scatter plot
    • Network graphs show us the relationships between things - for example, this map of all of the followers of @CollideHalifax
    • CollideHFX Network Graph

    3. Some common problems to avoid

    So, when I see a Visualization, can I be sure that the Data Story I'm being told is correct?

    • Not always! Be critical.
    • Mark Twain's 3 kinds of lies:

    "Lies, Damned Lies, and Statistics".


    Avoid these common deceptions when designing data stories:

    A: Maps as population indicators

    Populations

    NYC has ~ the same population as the shaded area

    Slate's Equal Population Mapper

    An aside: "The Map is not the Territory" -Alfred Korzybski

    B: Cumulative Time-Series Charts
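
A cumulative chart can only go up as long as no period is negative, so it can hide a decline in the underlying per-period numbers. A minimal sketch, using made-up quarterly figures and a hypothetical `cumulative` helper:

```javascript
// Hypothetical per-quarter sales that are clearly falling.
const perQuarter = [100, 80, 60, 40];

// Running total: each entry is the sum of everything up to that point.
function cumulative(values) {
  let total = 0;
  return values.map(function (v) {
    total += v;
    return total;
  });
}

const running = cumulative(perQuarter);
console.log(perQuarter); // [100, 80, 60, 40]    falling every quarter
console.log(running);    // [100, 180, 240, 280] still rising every quarter
```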


    C: Circles in general.

    We're meant to see area; we commonly just see height.
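
The arithmetic behind this is worth seeing once. The sketch below compares two ways of sizing a circle for the same value: scaling the radius linearly, and scaling it by the square root so that area tracks the value. Function and parameter names are hypothetical.

```javascript
// Radius grows linearly with the value: height tracks the data, area does not.
function radiusLinear(value, maxValue, maxRadius) {
  return maxRadius * (value / maxValue);
}

// Radius grows with the square root of the value: area tracks the data.
function radiusByArea(value, maxValue, maxRadius) {
  return maxRadius * Math.sqrt(value / maxValue);
}

function circleArea(radius) {
  return Math.PI * radius * radius;
}

// A value that is one quarter of the maximum (25 out of 100), max radius 40px.
const linear = radiusLinear(25, 100, 40);  // 10
const byArea = radiusByArea(25, 100, 40);  // 20

// Share of the largest circle's area that each encoding ends up drawing.
const linearAreaShare = circleArea(linear) / circleArea(40); // about 0.0625
const byAreaShare = circleArea(byArea) / circleArea(40);     // about 0.25
console.log({ linear, byArea, linearAreaShare, byAreaShare });
```

The same data point is drawn at half the height in one encoding and a quarter of the height in the other, which is part of why circles are so easy to misread.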


    D: Pie charts when dealing with close values

    (Bar charts are boring but we're really really good at reading them)


    E: Bar charts with arbitrary axis points
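
Assuming "arbitrary axis points" refers to an axis that does not start at zero, this sketch shows how much a truncated axis can exaggerate a small difference. The values and the `barHeight` helper are made up for illustration.

```javascript
// Drawn height (in pixels) of a bar, given where the value axis starts and ends.
function barHeight(value, axisMin, axisMax, chartHeight) {
  return (value - axisMin) / (axisMax - axisMin) * chartHeight;
}

// Hypothetical values: b is 10% larger than a.
const a = 50;
const b = 55;

// Axis from 0 to 60: the bars differ by about 10%, like the data.
const honestRatio = barHeight(b, 0, 60, 300) / barHeight(a, 0, 60, 300);

// Axis from 45 to 60: the second bar is drawn about twice as tall.
const truncatedRatio = barHeight(b, 45, 60, 300) / barHeight(a, 45, 60, 300);

console.log(honestRatio.toFixed(2), truncatedRatio.toFixed(2));
```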


    F: Seeing correlations when they're coincidental

    Spurious Correlations

    source

    4. Notable Interactive Visualizations and what makes them Great

    The State of the Union in Context

    SOTU
    source

    source2

    Bar Charts, Interactive

    Casting Shakespeare

    Shakespeare
    source

    Scatter plot/ Time series, Scrolling narrative.

    BBC: Scotland Decides

    Scotland Decides
    source

    Normalized bar chart, Map. Allows user to sort data.

    NPR - When are people working?

    When are people working?
    source

    Area Graph. Allows user to visually compare.

    Musicians' Deaths by Age

    When do Musicians Die?
    source

    Time Series. Allows user to filter based on category.

    Bachelors Degrees by Demographics

    Bachelors degrees
    source

    Visualizing MBTA Data

    Visualizing MBTA Data
    source

    Parallel Coordinates, Maps, Bar and Area. Allows user to explore relative and absolute time, and compare stations/trains.

    NY Times: How the Recession Reshaped the Economy, in 255 Charts

    Recession vs Economy
    source

    Loose interaction; the act of scrolling shows comparisons between industries. Sparklines. At the bottom, Small Multiples show the fuller picture.

    • The primary theme among the interactions presented here is that they allow users to complete a few common, basic actions:
    • Sort: Representing the data in different orders can reveal trends otherwise unseen
    • Filter: Only show the data that meets certain user-defined criteria
    • Compare: Allow the user to take pieces of the whole and see how they relate.
    • Why do these visuals feel so compelling, compared to traditional static media, or data in chart format?
    • The general pathway to understanding:
      • Unstructured Information
      • Data
      • Structured Information
      • Knowledge
      • Wisdom
    • Visualization helps bridge the gap between Knowledge and Wisdom. Interactivity prompts exploration, and exploration leads to understanding.
    • And smarter users = good for everyone.
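
The sort, filter, and compare actions listed above map onto very small pieces of code. A minimal sketch in plain JavaScript, using made-up records (the team names, leagues, and win counts are hypothetical):

```javascript
// Hypothetical records, invented for illustration.
const teams = [
  { name: 'A', league: 'East', wins: 90 },
  { name: 'B', league: 'West', wins: 72 },
  { name: 'C', league: 'East', wins: 81 },
  { name: 'D', league: 'West', wins: 95 }
];

// Sort: copy first so the original order is preserved, then order by wins (high to low).
const byWins = teams.slice().sort(function (x, y) { return y.wins - x.wins; });

// Filter: keep only the records that meet a user-defined criterion.
const east = teams.filter(function (t) { return t.league === 'East'; });

// Compare: express each team relative to the average of the whole group.
const average = teams.reduce(function (sum, t) { return sum + t.wins; }, 0) / teams.length;
const versusAverage = teams.map(function (t) {
  return { name: t.name, difference: t.wins - average };
});

console.table(byWins);
console.table(east);
console.table(versusAverage);
```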

    <Intermission>

    (When we come back: D3, Observable, and Gephi)

    </Intermission>

    5. Little Data Lab

    What's your preferred code editor?

    • Sublime Text?
    • Visual Studio?
    • Eclipse?
    • Vim?
    • R Studio?
    • What about Chrome?

    A skill that data scientists often underdevelop is the ability to derive meaning from data on the fly.

    • Sometimes you don't have your ideal tools available
    • Even if you did, sometimes the person to whom you're telling a data story doesn't understand your raw data.
    • We're going to use one of the most maligned programming languages, JavaScript, to build an on-the-fly data visualization toolkit.
    • Time to open some tabs!
    • First, copy the import code from tinyurl.com/acadia-import. When you have that copied, head over to

      tinyurl.com/acadia-trade

    • Open up your Chrome dev tools console - CMD + ALT + i
    • (or right-click and select "Inspect Element", then hit ESC)
    • Paste in your import code from tinyurl.com/acadia-import and hit enter
    • Congratulations! You now have access to a very advanced data visualization toolkit.

    Aside: making Chrome Dev Tools more comfortable

    • You can clear your console with cmd + k
    • You can revisit previous history with the up arrow key
    • You can access a control panel with cmd + shift + p
      (try typing theme)


    • Try declaring a simple array:

      thing = [5, 10, 15];

    • Use console.table(thing) to see a manipulable interface:

      console.table(thing);

    • You can do the same with an object, or an array of objects, or... mostly anything!
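
For instance, with an array of objects each object becomes a row and each key becomes a column. A small sketch with made-up rows; the optional second argument to `console.table` restricts which columns are shown:

```javascript
// Hypothetical rows, invented for illustration.
const rows = [
  { name: 'a', value: 5 },
  { name: 'b', value: 10 },
  { name: 'c', value: 15 }
];

// One table row per object, one column per key.
console.table(rows);

// Show only the "value" column.
console.table(rows, ['value']);
```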

    Okay, back to it.

    Here's what you've just added to this page:

    • jQuery: document traversal and manipulation. “What’s the parent element of the first table?"
    • Underscore.js: Data manipulation and parsing. “What’s the total of the values in this array?"
    • D3.js: Data-as-a-document and visualization. “Plot the min and max values as colors on the red-to-green scale"

    Let's try manipulating our data table.

    A sample exercise: Use jQuery to find a given cell and change its background color

    $('td').eq(150).css('background-color', 'yellow')


    "Which year had the highest total imports?"

    jQuery: Show me all the cells in the last row:

    $('tr:last-child td')


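    Okay! Underscore: what's the lowest value among them?

    // Find the cell in the last row with the lowest numeric value.
    // Strip thousands separators and parse, so values compare as numbers rather than strings.
    lowVal = _.min(
      $('tr:last-child td'), function(obj,iter){
        return parseFloat($(obj).text().replace(/,/g,''))
      }
    )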
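
Table cell text is a string, and strings compare character by character rather than by numeric value. The following sketch shows why it is worth converting cell text to a number before looking for a minimum. The `parseCell` helper and the sample values are hypothetical.

```javascript
// Turn cell text such as "9,500" into the number 9500.
function parseCell(text) {
  return parseFloat(text.replace(/,/g, ''));
}

// Hypothetical cell contents.
const cells = ['9,500', '10,200', '870'];

// Commas stripped but still strings: compared alphabetically.
const asStrings = cells.map(function (t) { return t.replace(/,/g, ''); });
const minString = asStrings.reduce(function (a, b) { return b < a ? b : a; });

// Parsed into numbers: compared numerically.
const asNumbers = cells.map(parseCell);
const minNumber = Math.min.apply(null, asNumbers);

console.log(minString); // "10200", because "1" sorts before "8" and "9"
console.log(minNumber); // 870
```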
    		      

    jQuery, highlight it:

    
    						$(lowVal).css('background-color','yellow')
    		      

    Pre-made snippet for you:

    tinyurl.com/acadia-gist1

    Let's build on this. Let's highlight the import trends of each product from year to year.

    Let's iterate (map) over each row (product) and:

    • 1. figure out the min and max values
    • 2. use D3 to paint the min values red, max values green, and scale the in-between accordingly
    • 3. Iterate (map) over each cell in the row and paint its background according to where it sits in the d3 domain
    
    // d3.scale.linear() is the D3 v3 API; in D3 v4 and later it is d3.scaleLinear()
    $('tr').each(function(iter,row){
      // Parse each cell ("1,234" -> 1234) so min/max compare numbers, not strings
      var values = $(row).children('td').map(function(iter,cell){
        return parseFloat($(cell).text().replace(/,/g,''))
      }).get()
      var max = _.max(values)
      var min = _.min(values)
      // Map the row's min to red and its max to green
      var color = d3.scale.linear()
        .domain([min, max])
        .range(["#c33", "#0fc"]);
      $(row).children('td').each(function(iter,cell){
        $(cell).css('background-color', color(parseFloat($(cell).text().replace(/,/g,''))))
      })
    })
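
To see what a linear color scale is doing, here is a simplified stand-in written in plain JavaScript. It is not D3's implementation (D3 handles many more color formats and interpolators); `hexToRgb` and `linearColorScale` are hypothetical names. It maps a value's position between the domain's min and max to the same position between two RGB colors.

```javascript
// Convert "#c33" or "#cc3333" into [red, green, blue] numbers.
function hexToRgb(hex) {
  let h = hex.replace('#', '');
  if (h.length === 3) {
    h = h.split('').map(function (c) { return c + c; }).join('');
  }
  return [0, 2, 4].map(function (i) { return parseInt(h.slice(i, i + 2), 16); });
}

// Build a function that maps a number in [min, max] to a color between two endpoints.
function linearColorScale(domain, range) {
  const min = domain[0];
  const max = domain[1];
  const from = hexToRgb(range[0]);
  const to = hexToRgb(range[1]);
  return function (value) {
    // t is 0 at the domain minimum and 1 at the domain maximum.
    const t = max === min ? 0 : (value - min) / (max - min);
    const rgb = from.map(function (c, i) { return Math.round(c + (to[i] - c) * t); });
    return 'rgb(' + rgb.join(', ') + ')';
  };
}

const color = linearColorScale([0, 100], ['#c33', '#0fc']);
console.log(color(0));   // rgb(204, 51, 51)
console.log(color(50));  // rgb(102, 153, 128)
console.log(color(100)); // rgb(0, 255, 204)
```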
    					

    Pre-made snippet for you:

    tinyurl.com/acadia-gist2

    Other countries to experiment with:

    tinyurl.com/acadia-trade-all

    Some observations:

    • 2009: obvious downturn across almost all industries
    • Looking for a common exception? ctrl+f "Pharma" in almost every major country
    • Canada/Mexico: Food up, household goods down.
    • China: almost everything up
    • Iraq: the year before the US unilaterally pulled out of the war, almost all industries had an uptick
    • Haiti: notable uptick in textiles (cotton, wool)
    • Syria: downtick across the board
    • Egypt: Fuel and Natural Gas dropped off the face of the earth
    • Belize: Fruits are down, vegetables are up (Thanks, paleo diet!)

    Other visualizations, other data types

    Let's build a bar chart over columnar data

    tinyurl.com/acadia-baseball

    • Open up your Chrome dev tools console - CMD + ALT + i
    • (or right-click and select "Inspect Element", then hit ESC)
    • Paste in your import code from

      tinyurl.com/acadia-import

      and hit enter

    Let's highlight the Batting Average column:

    
    $('#teams_standard_batting td:nth-child(19)').css('background-color','yellow')
    					

    Now, use our iterative method from before to turn that column into a heatmap:

    // Grab every cell in the Batting Average column
    var column = $('#teams_standard_batting td:nth-child(19)')
    // Parse the cell text into a plain array of numbers
    var values = column.map(function(){ return parseFloat($(this).text()) }).get()
    max = _.max(values)
    min = _.min(values)

    var color = d3.scale.linear()
      .domain([min, max])
      .range(["#c33", "#0fc"]);

    // Paint each cell according to where its value sits between min and max
    column.each(function(iter,cell){
      $(cell).css('background-color', color(parseFloat($(cell).text())))
    })
    					

    Pre-made snippet for you:

    tinyurl.com/acadia-gist3

    Let's make a bar chart!

    • Since we already know the "max" value, we can assign that a max-width bar.
    • Every other bar's width will be its value's share of the max, times that max width.

    var column = $('#teams_standard_batting td:nth-child(13)')
    max = _.max(column.map(function(){ return parseFloat($(this).text()) }))
    min = _.min(column.map(function(){ return parseFloat($(this).text()) }))
    barwidth = 250

    var color = d3.scale.linear()
      .domain([min, max])
      .range(["#c33", "#0fc"]);

    column.map(function(iter,cell){
      // Assumed markup: an empty div with class "bar", matching the .bar selectors below
      $(cell).prepend('<div class="bar"></div>');
      $(cell).children('.bar').css('width', parseFloat($(cell).text()) / max * barwidth )
      $(cell).children('.bar').css('background-color', color(parseFloat($(cell).text())))
    })
    $(column).css('text-align', 'left')
    $('.bar').css('height', '10px').css('float', 'left')
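
The width calculation is the heart of the bar chart, so here it is on its own, outside the page. A sketch with made-up values and a hypothetical `barWidths` helper: the largest value gets the full width, and every other value gets its proportional share.

```javascript
// Scale each value to a pixel width, with the largest value filling maxWidth.
function barWidths(values, maxWidth) {
  const max = Math.max.apply(null, values);
  return values.map(function (v) { return v / max * maxWidth; });
}

// Hypothetical column values.
const values = [50, 100, 25];
const widths = barWidths(values, 250);
console.log(widths); // [125, 250, 62.5]

// A rough text rendering for the console: one "#" per 25 pixels of width.
values.forEach(function (v, i) {
  console.log('#'.repeat(Math.round(widths[i] / 25)) + ' ' + v);
});
```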

    Pre-made snippet for you:

    tinyurl.com/acadia-gist4

    6. Notebooks as a Computational Essay

    Head over to beta.observablehq.com/@philrenaud and follow along with the notebook

    7. Introduction to Network Science

    Head over to ncase.me/crowds and follow along with the interactive

    8. A practical introduction to Gephi

    If you haven't already done so, head over to gephi.org and download a free copy

    Gephi is a tool to visualize and explore network graph data.

    • Lots of things can be networks! Any list of things ("nodes") with a relationship to one another (for example, people who follow one another on Twitter; words adjacent to one another in a book)
    • Let's grab a list of the characters in Game of Thrones who share a scene: tinyurl.com/acadia-got. Open the raw file for got-network.graphml and download it somewhere you can access it.
    • Open Gephi and open got-network.graphml
    • Options for Undirected, Directed, or Mixed graphs
    • Number of Edges and Nodes listed
    • Observe the 3 main views of Gephi: Overview, Data Laboratory, and Preview
    • In Overview, let's open our Statistics panel and segment our nodes with Modularity
    • Next, we'll change the appearance of our nodes and edges under the Appearance panel
    • Finally, let's explore some of the built-in layouts with the Layouts panel; starting with Force Atlas 2, and Yifan Hu
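
To make the "nodes and relationships" idea from the first bullet concrete, here is a small sketch that turns a sentence into a network of adjacent words and counts how many edges touch each node (its degree). The function names and the sample sentence are hypothetical, and the result is a plain JavaScript structure rather than a file Gephi can open.

```javascript
// Each pair of neighboring words becomes one undirected edge.
function adjacentWordEdges(text) {
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  const edges = [];
  for (let i = 0; i < words.length - 1; i++) {
    edges.push([words[i], words[i + 1]]);
  }
  return edges;
}

// Degree: the number of edges touching each node.
function degrees(edges) {
  const degree = {};
  edges.forEach(function (edge) {
    degree[edge[0]] = (degree[edge[0]] || 0) + 1;
    degree[edge[1]] = (degree[edge[1]] || 0) + 1;
  });
  return degree;
}

const edges = adjacentWordEdges('The cat saw the dog');
const degree = degrees(edges);
console.log(edges);  // [["the","cat"],["cat","saw"],["saw","the"],["the","dog"]]
console.log(degree); // { the: 3, cat: 2, saw: 2, dog: 1 }
```

Here "the" is the best-connected node, the kind of pattern that Gephi's statistics and layouts surface on much larger graphs.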

    Thanks!

    @phil_renaud

    Slides: acadia.riot.industries

    Github Repo: github.com/philrenaud/datavis-workshop