Data APIs and Visualization: The Sum is Greater Than the Parts

U.S. Government data has always been available for free to the public, but it can be tricky to find data curated to the point that it is easily consumed by all of our fancy visualization tools.  The folks over at datausa.io have done some interesting work building a data API that works across a variety of common and useful government statistics.  I was curious to see what the potential might be when we take a simple web data API and feed its content to a simple visualization language…

The mechanics turned out to be easy since the datausa.io API allows results formatted as CSV–which is what Brunel Visualization can consume.  So the data query essentially boils down to a URL placed inside a Brunel data() statement.  Visualizations can even be immediately created by pasting these URLs into the “data” section of the Brunel Visualization online app.
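
For instance, you can sanity-check any of these feeds with pandas before pasting the URL into Brunel (a minimal sketch; the URL is the one used in the first example below):

# Confirm the datausa.io endpoint returns plain CSV; the same URL can then
# be dropped, unchanged, into a Brunel data() statement.
import pandas as pd

url = "http://api.datausa.io/api/csv?show=soc&sumlevel=3"
df = pd.read_csv(url)
print(df.columns.tolist())  # expect fields such as avg_wage_ft and avg_hrs_ft
print(df.head())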

So, on to some examples.  This first one uses workforce data from the ACS PUMS data provided by the US Census Bureau.  The top graph shows a heatmap of wages by hours worked per week (binned) and colored by age for full-time employees.  The age value is the median of the average ages of the occupations in the bin.  Note: it would probably be better here to calculate a weighted average of age using the field containing the number of people within each occupation; see the sketch below.  Click on a cell to see the occupations within it below on the bubble chart.  The size of the bubble represents the number of people in the occupation and the color corresponds to the Gini coefficient.  Higher (darker) Gini values indicate greater wage inequality for the occupation.
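
As a sketch of that weighted-average idea (hypothetical pandas code; the field names are the ones used in the Brunel script below):

# Weight each occupation's average age by its headcount (num_ppl_ft)
# instead of taking a plain median of the per-occupation averages.
import pandas as pd

df = pd.read_csv("http://api.datausa.io/api/csv?show=soc&sumlevel=3")
weighted_age = (df["avg_age_ft"] * df["num_ppl_ft"]).sum() / df["num_ppl_ft"].sum()
print(round(weighted_age, 1))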

It’s interesting to poke around with some of the outlying cells to find the occupations with the highest wages and shortest hours, or vice versa.  The occupations in these outlying cells also seem to have the most consistent wages.

The full code (including retrieving the data) for the above example is:

data('http://api.datausa.io/api/csv?show=soc&sumlevel=3') 
x(avg_wage_ft) y(avg_hrs_ft) color(avg_age_ft:red) median(avg_age_ft)  
    bin(avg_wage_ft, avg_hrs_ft)  interaction(select) | 
bubble x(soc_name) color(gini_ft:blue) size(num_ppl_ft) label(soc_name) 
    sum(num_ppl_ft) tooltip(#all) interaction(filter)

As expected, a lot of government data is summarized geographically.  This next example uses health metrics aggregated at the state level from the University of Wisconsin’s County Health Rankings.  The histograms show the distributions of four of these metrics across all 50 states.  Roll the mouse over a histogram bar to highlight (brush) the states that correspond to those values–or, click on your state (if you reside in the US) to see where its values land for each metric.
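
As an aside, the same URL-in-pandas check works here too, and it is an easy way to see what one of the histograms is counting (a sketch; the 10 bins are chosen arbitrarily):

# Distribution of one health metric across the states, roughly what the
# bin(adult_obesity) + y(#count) histogram displays.
import pandas as pd

url = ("http://api.datausa.io/api/csv/?show=geo&sumlevel=state"
       "&required=adult_obesity,health_care_costs,diabetes,excessive_drinking&year=latest")
df = pd.read_csv(url)
print(pd.cut(df["adult_obesity"], bins=10).value_counts().sort_index())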

Again, the full source code is:

data('http://api.datausa.io/api/csv/?show=geo&sumlevel=state&required=adult_obesity,health_care_costs,diabetes,excessive_drinking&year=latest') 
map key(geo_name) opacity(#selection) tooltip(geo_name,adult_obesity,health_care_costs,diabetes,
    excessive_drinking) at(0,0,100,50) interaction(select) |  
bar x(adult_obesity) axes(x) y(#count) bin(adult_obesity) opacity(#selection) stack 
    interaction(select:mouseover) at(0,50,50,75)  |  
bar x(health_care_costs) axes(x) y(#count) bin(health_care_costs) opacity(#selection) stack 
    interaction(select:mouseover) at(0,75,50,100)  |  
bar x(diabetes) axes(x) y(#count) bin(diabetes) opacity(#selection) stack  
    interaction(select:mouseover) at(50,50,100,75) | 
bar x(excessive_drinking) axes(x) y(#count) bin(excessive_drinking) opacity(#selection) stack 
    interaction(select:mouseover) at(50,75,100,100)

Having served in a government data agency in the past, I am well aware that a major concern with this type of flexibility is the potential for misuse when the fine print about what the data is and what it represents goes unread.  Nonetheless, powerful data APIs combined with flexible, rapid visualization design provide significant and interesting learning opportunities.

Snowzilla!

After shoveling the driveway several times and burning through the Netflix queue, one way to counteract cabin fever is to hunt down some snowfall data and play around with it.  So, I found some data over at the National Weather Service that contains snowfall depth measurements collected from a variety of sources around the region at various time points during the storm.

The map shows the maximum snowfall depth at any given location recorded from Friday until Sunday.  The deepest measurements are labeled.  The area near West Virginia clearly bore the brunt of the storm, but some areas closer to DC were not far behind.  Everyone pretty much got a lot of snow.
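
The aggregation behind the map is just a maximum per location; a rough pandas equivalent (assuming a hypothetical local extract of the NWS measurements with the same column names the Brunel code below uses) would be:

# Maximum snowfall depth recorded at each location over the storm,
# mirroring the map's max(Snowfall) aggregation.
import pandas as pd

df = pd.read_csv("snowfall.csv")  # hypothetical local extract of the NWS data
peak = df.groupby(["City", "Lat", "Lon"], as_index=False)["Snowfall"].max()
print(peak.nlargest(10, "Snowfall"))  # the deepest measurements, which the map labels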

Brunel Code:

map('usa') + x(Lat) y(Lon) max(Snowfall) color(Snowfall:blues) tooltip(City,Snowfall) style("stroke-width:0;opacity:.4;size:15px") + 
map('labels')  + x(Lat) y(Lon) max(Snowfall) top(Snowfall:10)  label(Snowfall, '"') text style("font-family:Impact;fill:darkblue") tooltip(City,Snowfall)

Apparently there was a bit of controversy regarding the exact snowfall measurement at Washington National Airport.  To look into this, I added a timeline graph that is linked to the map so I could see the snowfall amounts at different time points.  The data are binned to roughly 2 hours and these bins are colored by the number of measurements taken within the time range.  Clicking on a bin shows the measurements on the map–and I zoomed in to the airport.  I do not appear to have all the data showing the issue, but I can see measurements in nearby areas and who took them.  Perhaps the upshot was that the difference is significant for historical and business reasons–but it probably won’t make your back feel much better.
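
For a sense of what the binned timeline encodes, here is a rough pandas sketch of the two-hour binning (again assuming a hypothetical local extract with a parseable Time column):

# Count measurements falling in roughly two-hour windows; the timeline
# colors its bins by this count (opacity(#count)).
import pandas as pd

df = pd.read_csv("snowfall.csv", parse_dates=["Time"])
counts = df.set_index("Time").resample("2h")["Snowfall"].count()
print(counts)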

Brunel Code:

map('usa') at(0,0,100,75) + x(Lat) y(Lon) size(Snowfall:200%) max(Snowfall) color(Source) label(Snowfall,City) 
    tooltip(Snowfall, City, Source) interaction(filter) at(0,0,100,75) + map(labels) at(0,0,100,75) | 
x(Time) bin(Time:20) color(#selection) opacity(#count) interaction(select) tooltip(Time) at(0,85,100,100)

Since the storm lasted nearly 36 hours, it can also be interesting to look at the depths over time.  The variable sized paths below show that the snow generally started to really pile up Friday night and also that Maryland and West Virginia seemed to reach their peak a little bit sooner than Virginia.  The boxes that are overlaid on the paths show the number of measurements that were taken at binned time intervals.  More measurements were taken in Virginia and Maryland–and most measurements seem to have been taken towards the end of the major accumulation.

Brunel Code:

path x(Time) y(State) color(State) bin(Time:20) size(Snowfall) max(Snowfall) legends(none) + 
    x(Time) y(State) color(#count) bin(Time:30) style('height:20px')

Lastly, if you are familiar with the area, you’ll quickly recognize the county names.  Below is a cloud with county names sized by the max snowfall depths and colored by their state.  Counties with larger snowfall amounts appear more towards the center.  Most names are nearly the same size because everyone got a lot of snow!

Brunel Code:

cloud color(State) size(Snowfall:150%) label(County) max(Snowfall) sort(Snowfall)  style('.element {font-family:Impact;}')

Maps Preview

A short update today: we have been working on intelligent mapping for Brunel 1.0 (due in January), and since it’s a subject many people are interested in, we thought we’d put up a “work in progress” video showing where things stand. It’s a rough video, so you get to see my inability to type accurately as well as some rough transitions. Showing the video at full resolution is recommended.

Usual disclaimers apply: this is planned for v1.0 in January; we expect it to work as described, but no guarantees — Enjoy!

Blogging with Brunel: Part 2

Graham’s earlier post demonstrated how you can use the online Brunel app (http://brunelvis.org/try) to try out Brunel and include live visualizations that show your data within blog posts or other online content.

Partly because it is somewhat self-serving (we will use this very blog for this exact purpose), but mostly because we feel this is something Brunel can offer, we have updated the app with several new features and a new look.

The video demonstrates all of the new features, but the major ones are:

  • The Gallery and Cookbook are now integrated directly into the app
  • More editing features (titles, descriptions and resizing)
  • Uploaded or referenced data will show the individual data fields–selecting any one will show a quick visualization of that field
  • More deployment options.  Specifically, there are two new options that produce self-contained visualizations that do not require our service for deployment.

Feel free to give it a spin at http://brunelvis.org/try.  Thoughts?  Ideas?  Feature requests?  Issues?  Let us know on GitHub.

Using Brunel in IPython/Jupyter Notebooks

Analytics and visualization often go hand-in-hand.  One of the great things about notebooks such as IPython/Jupyter is that they provide a single interface to numerous data analysis technologies that often can be used together.  So, using Brunel within notebooks is a very natural fit.  For example, I can use a wide variety of python libraries to cleanse, shape and analyze data–and then use Brunel to visualize those results.

Additionally, coming up with a good visualization is a highly iterative process:  Try something, look at the results and refine until done.   So, again, the notebook metaphor of having live code execution near the results is extremely convenient.  Lastly, since notebook cells containing output can themselves be interactive, direct manipulation techniques such as brushing/linking, filtering and selection are also available.

To try this out, we have provided an integration of Brunel for Python 3 that runs in IPython/Jupyter notebooks.  Details on how to install and get started are on the PyPI site.  The video above gives a very small taste of the kinds of things that are possible.
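
As a minimal sketch of what a notebook cell might look like (assuming the brunel package from PyPI and its IPython magic; see the PyPI page for the authoritative install and usage instructions):

# In a Jupyter notebook cell, after `pip install brunel`:
import pandas as pd
import brunel  # registers the %brunel magic

df = pd.read_csv("my_data.csv")  # hypothetical file: cleanse/shape with pandas first
# ...then hand the DataFrame to Brunel by name:
%brunel data('df') x(category) y(value) color(group) tooltip(#all)

The column names here (category, value, group) are placeholders; any fields from your DataFrame work the same way.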

Villains of Doctor Who

I’ve always been a big Doctor Who fan; growing up with the BBC and seeing many incarnations of the Doctor striding across the TV screens, defeating his enemies armed with intelligence, loquaciousness, and a small (admittedly sonic) screwdriver. In particular I recall being terrified of the villains in “The Talons of Weng-Chiang”, which, nowadays, do not seem particularly scary. But the villains of Who have always been magical!

So I was excited to find a data set of Doctor Who villains through 2013 (courtesy of The Guardian) and used it for a short Brunel demo video. I left it as-is for the video, but it was clear the data set needed a bit of cleaning. The column names were more like descriptions than titles, which was annoying, but the biggest issue was the Motivation column, which was more like a description than a categorization. So I edited the data a little — changing the column titles and then providing a manual clustering of motivation into smaller categories, creating three motivation columns: Motivation_Long (the original) plus Motivation_Med and Motivation_Short (my groupings of those original categories). With these changes, I saved the resulting CSV file as DoctorWhoVillains.csv. You can check out an overview of the motivation columns in the Brunel Builder.
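
The clustering itself was just a manual mapping; here is a hypothetical pandas sketch of the kind of edit involved (the original column title and the long-to-short mapping shown here are illustrative, not the real values):

# Collapse the long, description-like motivations into short categories.
import pandas as pd

df = pd.read_csv("villains_raw.csv")  # hypothetical export of the Guardian data
df = df.rename(columns={"What was their motivation?": "Motivation_Long"})  # illustrative title
groups = {
    "Wants to wipe out all other life": "Extermination",  # illustrative long values
    "Seeks control of the universe": "Domination",
}
df["Motivation_Short"] = df["Motivation_Long"].map(groups).fillna("Other")
df.to_csv("DoctorWhoVillains.csv", index=False)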

As usual with data analysis, it took way longer to do the data prep work than to use the results! I quite like this summary visualization, which is simply three shorter ones joined together with Brunel’s ‘|’ operator:

Doctor Who Villains through 2013

The bubble chart and word cloud show pretty much the same information — the cloud scales the size by the Log of the number of stories (otherwise the Daleks tend to exterminate any ability to see lesser villains) and is limited to the top 80-ish villains by appearance count. The bottom chart shows when villains first appeared and their motivation. The label in each cell is a representative villain from that cell, so the Sensorites are a representative dominating villain from the 1960-1965 era. The years have been binned into half decades. At a glance, it looks like extermination and domination are common themes early on, whereas self interest is more of a New Who (post-2000) thing. Serving Big Bad is evenly spread out over time.
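
The half-decade binning is easy to emulate outside Brunel too; a quick sketch against the cleaned file (column-name capitalization assumed from the script below):

# Bin first-appearance years into half decades, e.g. 1963 -> 1960.
import pandas as pd

df = pd.read_csv("http://brunelvis.org/data/DoctorWhoVillains.csv")
df["HalfDecade"] = (df["First"] // 5) * 5
print(df.groupby(["HalfDecade", "Motivation_Short"])["Episodes"].sum())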

The Brunel script for this is quite long, as I wanted to place stuff carefully and add styling:

data('http://brunelvis.org/data/DoctorWhoVillains.csv') bubble color(Motivation_Short) size(Episodes) sort(First:ascending) 
    label(Villain) tooltip(Villain, motivation_long, titles) at(0, 0, 60, 60) style('* {font-size: 7pt}') | 
data('http://brunelvis.org/data/DoctorWhoVillains.csv') cloud x(Villain) color(motivation_short) size(LogStories) 
    sort(first:ascending) top(episodes:80) at(40, 0, 100, 55) style(':nth-child(odd) { font-family:Impact;font-style:normal') 
    style(':nth-child(even) { font-family:Times; font-style:italic') style('font-size:100px') | 
data('http://brunelvis.org/data/DoctorWhoVillains.csv') x(motivation_short) y(first) color(episodes:gray) sort(episodes) 
    label(villain) tooltip(titles) bin(first:10) sum(episodes) mode(villain) list(titles) legends(none) 
    at(0, 60, 100, 100) style('label-location:box')

Without the data and decoration statements, this is what it looks like — three charts concatenated together with the ‘|’ operator to make a visualization system:

bubble color(Motivation_Short) size(Episodes) sort(First:ascending) label(Villain) tooltip(Villain, motivation_long, titles) 
| cloud x(Villain) color(motivation_short) size(LogStories) sort(first:ascending) top(episodes:80)
| x(motivation_short) y(first) color(episodes:gray) sort(episodes) label(villain) tooltip(titles) bin(first:10) sum(episodes) mode(villain) list(titles) legends(none)

I was curious about when villains first appeared, so I came up with this chart — stacking villains in their year of first appearance (click on it for the live version):

And here are a couple of additional samples I made along the way …

Blogging With Brunel

Here’s a short video showing how to create a blog entry using Brunel in a few minutes. The data comes from The Guardian and is completely unmodified, as you can see in the video from the fairly odd column names! I’m making a cleaner version of the data and hope to have some samples of that up in a  few days.

The video is high resolution (1920 x 1080) and about 60M. It’s probably best viewed expanded out to full screen.

Brunel: Open Source Visualization Language

BRUNEL is a high-level language that describes visualizations in terms of composable actions. It drives a visualization engine (D3) that performs the actual rendering and interactivity. The language aims to be as simple as possible while describing a wide variety of potential charts, and to allow them to be used in Java, JavaScript, Python and R systems that want to deliver web-based interactive visualizations.


At the end of the article is a list of resources, but first, some examples. The dataset I am using for these is data taken from BoardGameGeek, which I processed to create a data set describing the top 2000 games listed as of Spring 2015. Each chart below is a fully interactive visualization running in its own frame. I’ve added the Brunel description for each chart below each image as a caption, so you can go to the Builder anytime and copy the command into the edit box to try out new things.

data('sample:BGG Top 2000 Games.csv') bubble color(rating) size(voters) sort(rating) label(title) 
    tooltip(title, #all) legends(none) style('* {font-size: 7pt}') top(rating:100)

This shows the top 100 games, with a tooltip view for details on the games. They are packed together in a layout where the location has no strong meaning — the goal is to show as much data in as small a space as possible! In the Builder, you can change the number in top(rating:100) to show the top 1000, 2000 … or show the bottom 100. You could also add x(numplayers) to divide up the groups by recommended number of players.

data('sample:BGG Top 2000 Games.csv') line x(published) y(categories) color(categories) size(voters:200) 
    opacity(#selection) sort(categories) top(published:1900) sum(voters) legends(none) | 
data('sample:BGG Top 2000 Games.csv') bar y(voters) stack polar color(playerage) label(playerage) 
    sum(voters) legends(none) at(15, 60, 40, 90) interaction(select:mouseover)

This example shows some live interactive features; hover over the pie chart to update the main chart. The main chart shows the number of people voting for games in different categories over time, and the pie chart shows the recommended minimum age to enjoy a game. So when you hover over ‘6’, for example, you can see that there have been no good sci-fi games for younger players in the last 10 years. Use the mouse to pan and zoom the chart (drag to pan, double-click to zoom).

data('sample:BGG Top 2000 Games.csv') treemap x(designer, mechanics) color(rating) size(#count) label(published) 
    tooltip(#all, title) mean(rating) min(published) list(title:50) legends(none)

Head to the Builder Site to modify this. You could try:

  • change the list of fields in x(…) — reorder them or use fields like ‘numplayers’, ‘language’
  • remove the ‘legends(none)’ command to show a legend
  • change size to ‘voters’ — and add a ‘sum(voters)’ command to show the total number of voters rather than just counts for each treemap tile

Do you want to know more?

Follow the links below; gallery and cookbook examples will take you to the Brunel Builder Site where you can create your own visualizations and grab some JavaScript code to embed them in your web pages … which is exactly how I built the above examples!

A Quick Look at Lots of Songs …

Songs by year, rating and genre

iTunes information about my songs, showing year, genre and my ratings

A quick visualization of the songs in my iTunes database. I was curious to see if there were any sweet spots in my listening history. As always, showing correlation between three different variables is hard, and here I wanted one dot per song, so the density is quite high (clicking on the image to show it full size is recommended).

Perhaps the most interesting thing for me personally was that I thought I liked Alternative music more than I actually appear to. I noticed especially that the 2010-2015 bin for mid-value ratings is dominated by Alternative!

Appropriate Mappings

Vox Article on viral memes and charitable giving

First, a disclaimer: this is not a post about the actual issues the article raises, just about the presentation of those claims. The image from the article has appeared in numerous places and has been referenced by a number of news sources, as well as appearing in my Facebook and Twitter feeds.

And it’s a bad image.

One minor issue is that it is hard to work out which circle relates to which disease, as the name of the disease only appears in the legend, so you are constantly moving your eyes from the grey dot on the left, to the legend, to the grey dot on the right. It's hard to make much sense of it. The fact that the legend doesn’t seem to have any order to it doesn’t help either. If this were 20 diseases instead of eight, the chart would be doomed!

Kudos for picking appropriate colors, though: the natural mapping (pink <–> breast cancer; red <–> AIDS) helps a bit.

The more worrying issue is that it makes a classic distortion mistake; look at the right side and rapidly answer the question, using just the images, not the text: “How many more deaths are there due to the purple disease than the blue disease?” 

Using the image as a guide, your answer is likely to be in the range of 10 to 20 times as many, because the ratio of the areas is about that amount. When you look at the text, though, it’s actually only about four times. The numbers are not encoding the area, which is what we see; they are encoding the radius (or diameter), which we do not immediately perceive.

The result is a sensationalist chart. It takes a real difference, but sensationalizes it by exaggerating the difference dramatically. If you want to use circles, map the variable of interest to AREA, not RADIUS. That fits our perception much more truthfully. (It’s not actually perfect; we tend to see small circles as larger than they really are; but it’s much, much better.)
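
The arithmetic behind that distortion, and the fix, in a few lines of Python (just to make the point concrete):

# A 4x value mapped to radius is perceived as a 16x difference,
# because the eye compares areas and area grows with radius squared.
import math

value_ratio = 4                  # the actual deaths ratio from the text
area_ratio = value_ratio ** 2    # what the eye sees: 16x, i.e. "10 to 20 times"
print(area_ratio)

# Encoding the value as AREA instead means scaling radius by the square root:
def radius_for(value, scale=1.0):
    return scale * math.sqrt(value)

print(radius_for(4) / radius_for(1))  # 2x the radius -> 4x the area, matching the data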

So, here’s a reworking:

Where We Donate vs. Diseases That Kill

I tried to keep close to the original color mappings, as they are pretty good, but have used width to encode the variable of interest, keeping the height of the rectangle fixed. I also labeled the items on both sides so we can see much more easily that heart disease kills about 4x as many people as Chronic Obstructive Pulmonary Disease. 

I also added some links between the two disease rankings to help visually link the two and aid navigation. The result is, I believe, not only more truthful, but easier to use. In short, it works.