Puzzle #9 (Follow-up): Binary Sudoku and the Limits of Visualization

In Puzzle #9: Island of Games, I described an island where a thousand residents enjoy playing three games: Chess, Checkers, and speed-solving the Rubik’s Cube. Each islander’s skill at each game is assessed with a rating score between zero and one.

I provided the ratings data along with some traditional summary statistics and charts. Here is a scatter plot showing the ratings for two games:

The puzzle was to find structure in what seems to be random noise.

The Answer

Let’s take another look at the scatter plot, but this time with color to reveal something about the third game, Rubik’s Cube:

This chart uses green and red markers for Rubik’s ratings above 0.5, and blue and yellow markers for Rubik’s ratings below 0.5.

Here is a 3D view of the ratings with the same color-coding:

These are the thousand data points of Puzzle #9. Their x-y-z coordinates are the three game ratings for each islander. The points are framed by a large cube, divided into eight small cubes. Only four small cubes contain data points.

The pattern could also be described as 3D binary Sudoku, which would be an extension of this mock Sudoku puzzle from xkcd:

The 3D structure is invisible in the head-on views that 2D scatter plots provide: it never shows up when only two dimensions are considered at a time. This reminds me of the Borromean Rings:

Image credit: Jim Belk

No two rings are linked by themselves, but taken together the three rings cannot be separated.
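The same point can be checked numerically. The sketch below does not use the actual puzzle data. It relies on a hypothetical generator, `make_islanders`, which assumes that ratings are spread uniformly inside each occupied small cube and that the four occupied cubes are equally likely. It then counts how many quadrants each 2D view fills and how many of the eight small cubes are occupied in 3D.

```python
import random
from collections import Counter
from itertools import combinations

GAMES = ("Chess", "Checkers", "Rubik")

def make_islanders(n=1000, seed=0):
    """Synthetic stand-in for the puzzle data (an assumption, not the real file).

    Each islander is high (>= 0.5) in exactly one game or in all three,
    so the number of high ratings is always odd.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        high = [rng.random() < 0.5, rng.random() < 0.5]
        # Third flag is chosen so that the count of highs comes out odd.
        high.append(high[0] == high[1])
        rows.append(tuple(
            rng.uniform(0.5, 1.0) if h else rng.uniform(0.0, 0.5)
            for h in high
        ))
    return rows

def cell(point, axes):
    """Which half of each chosen axis the point falls in (True means high)."""
    return tuple(point[a] >= 0.5 for a in axes)

rows = make_islanders()

# Every pairwise view fills all four quadrants, like plain noise.
for a, b in combinations(range(3), 2):
    counts = Counter(cell(p, (a, b)) for p in rows)
    print(GAMES[a], "vs", GAMES[b], "quadrants filled:", len(counts),
          "counts:", sorted(counts.values()))

# The full 3D view fills only half of the eight small cubes.
octants = Counter(cell(p, (0, 1, 2)) for p in rows)
print("small cubes filled:", len(octants), "of 8")
```

Under these assumptions, each pair of games fills all four quadrants with roughly equal counts, while only four of the eight small cubes hold any points.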

Intention of the Puzzle

We are accustomed to thinking about even the largest data sets with the shorthand of single numbers, such as the unemployment rate, the Dow Jones index, or a batting average. The pitfalls of single numbers include the Flaw of Averages (you can drown in a river that is three feet deep, on average).

Visual representations, such as histograms and scatter plots, can help. An array of scatter plots can be even more informative. But no matter what kind of visualization we use, how do we know that we aren’t missing something important in the data?

You might object that a 3D scatter plot would have revealed the secret to Puzzle #9, and you might argue that we just need to portray more dimensions in our visualizations. There certainly are ways to portray several dimensions at one time (Edward Tufte found seven dimensions in this graphic summary of Napoleon’s campaign in Russia). But there must be some limit to how many dimensions you can portray at a time. Let’s say that you can portray up to n dimensions at a time. I could always make a data checkerboard in n+1 dimensions that would be invisible to you, even with an array of all possible n-dimensional visualizations.
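The n+1 argument can be checked at the level of cells rather than data points. The sketch below is an illustration with hypothetical helper names (`checkerboard_cells`, `projections_are_uniform`). It treats each axis as split into a low half (0) and a high half (1), keeps only the cells with an odd number of high coordinates, and confirms that dropping any one axis leaves every lower-dimensional cell covered exactly once.

```python
from collections import Counter
from itertools import combinations, product

def checkerboard_cells(dims):
    """Cells of the binary grid in `dims` dimensions with an odd number of highs."""
    return [c for c in product((0, 1), repeat=dims) if sum(c) % 2 == 1]

def projections_are_uniform(dims):
    """True if every projection onto dims-1 axes covers each cell exactly once."""
    cells = checkerboard_cells(dims)
    for axes in combinations(range(dims), dims - 1):
        shadow = Counter(tuple(c[a] for a in axes) for c in cells)
        if len(shadow) != 2 ** (dims - 1) or set(shadow.values()) != {1}:
            return False
    return True

for dims in range(2, 8):
    filled = len(checkerboard_cells(dims))
    print(f"{dims}D: {filled} of {2 ** dims} cells filled,",
          "uniform projections:", projections_are_uniform(dims))
```

For every dimension tested, half the cells are filled and every projection is perfectly even, so any view of fewer dimensions shows no trace of the checkerboard.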

My point is that neither standard statistical analysis nor visualization is guaranteed to reveal important features of the data. The puzzle is a reminder to be wary of blind spots in our analysis, in the spirit of Anscombe’s Quartet, a group of four visibly different data sets with nearly identical summary statistics.
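Here is a rough sketch of how standard statistics can miss this kind of structure. It again uses synthetic data rather than the puzzle file, generated by a hypothetical `make_islanders` that assumes uniform ratings within each occupied small cube. It compares the pairwise Pearson correlations with a three-way statistic: the average product of the three "above or below 0.5" signs.

```python
import random
from itertools import combinations
from math import sqrt

def make_islanders(n=1000, seed=1):
    """Hypothetical stand-in for the puzzle data: 1 or 3 high ratings per islander."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        chess_high = rng.random() < 0.5
        checkers_high = rng.random() < 0.5
        rubik_high = chess_high == checkers_high  # forces an odd count of highs
        rows.append(tuple(
            rng.uniform(0.5, 1.0) if high else rng.uniform(0.0, 0.5)
            for high in (chess_high, checkers_high, rubik_high)
        ))
    return rows

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def sign(rating):
    """+1 for a high rating (>= 0.5), -1 for a low one."""
    return 1 if rating >= 0.5 else -1

rows = make_islanders()
cols = list(zip(*rows))

# Pairwise correlations: what a standard summary would report.
pairwise = {(a, b): pearson(cols[a], cols[b])
            for a, b in combinations(range(3), 2)}
for pair, r in pairwise.items():
    print("correlation", pair, round(r, 3))

# Three-way statistic: average product of the three signs.
triple = sum(sign(x) * sign(y) * sign(z) for x, y, z in rows) / len(rows)
print("mean three-way sign product:", triple)
```

With this generator the pairwise correlations should come out close to zero, while the three-way product is exactly 1 for every islander, because an odd number of highs always leaves an even number of negative signs.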

I had hoped that someone would attack the puzzle using newer techniques such as CART (Classification and Regression Trees), as reviewed by my friend Kirk Monteverde in this presentation. I expect that this approach would have identified the full and empty regions of the data set. However, those who solved the puzzle simply inspected the data and noticed the pattern (each individual either had high ratings for all three games or for exactly one game).

See also: Melting Fractals

5 Responses to Puzzle #9 (Follow-up): Binary Sudoku and the Limits of Visualization

Pingback: Puzzle #9: Melting Fractals | The Well-Tempered Spreadsheet
Pingback: Puzzle #9: Island of Games | The Well-Tempered Spreadsheet
Ariel says:

April 28, 2014 at 7:20 am

It might be interesting to mention parallel coordinates, as a way to visualize arbitrarily high-dimensional data.
I don’t know how easily one would detect such a checkerboard using parallel coordinates, but I suspect it to be quite easily detectable.

- Win Smith says:
  
  April 28, 2014 at 9:36 am
  
  Ariel,
  
  Thank you for your comment. I suspect that parallel coordinates might not show much unless we coded the data into multiple colors, as in the post above. It’s an interesting idea and I will take a look at it when I get a chance.
  
  Best,
  
  Win
  
Pingback: Top 10 Capabilities for Exploring Complex Relationships in Data for Scientific Discovery - Dataconomy
