In Puzzle #9: Island of Games, I described an island where a thousand residents enjoy playing three games: Chess, Checkers, and speed-solving of Rubik’s Cube. Each islander’s skill at each game is assessed with a rating score between zero and one.
I provided the ratings data along with some traditional summary statistics and charts. Here is a scatter plot showing the ratings for two games:
The puzzle was to find structure in what seems to be random noise.
Let’s take another look at the scatter plot, but this time with color to reveal something about the third game, Rubik’s Cube:
This chart uses green and red markers for Rubik’s ratings above 0.5, and blue and yellow markers for Rubik’s ratings below 0.5.
Here is a 3D view of the ratings with the same color-coding:
These are the thousand data points of Puzzle #9. Their x-y-z coordinates are the three game ratings for each islander. The points are framed by a large cube, divided into eight small cubes. Only four small cubes contain data points.
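For readers who want to check this kind of pattern numerically, here is a minimal sketch. It does not use the actual puzzle data: `make_islanders` is a hypothetical helper that generates synthetic ratings with the same structure (an odd number of ratings above 0.5), and the 0.5 threshold is taken from the color-coding above. The sketch counts how many of the eight small cubes are occupied, then repeats the count for the four quadrants of each 2D view.

```python
import random
from collections import Counter
from itertools import combinations

def make_islanders(n=1000, seed=0):
    """Synthetic stand-in for the puzzle data (an assumption, not the real file).

    Each row is (chess, checkers, rubik) with ratings in [0, 1). Rows are kept
    only if an odd number of the three ratings is above 0.5.
    """
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        row = tuple(rng.random() for _ in range(3))
        if sum(r > 0.5 for r in row) % 2 == 1:
            rows.append(row)
    return rows

def octant_counts(rows):
    """Count points in each small cube, keyed by (high?, high?, high?)."""
    return Counter(tuple(r > 0.5 for r in row) for row in rows)

def quadrant_counts(rows, i, j):
    """Count points in each quadrant of the 2D view of games i and j."""
    return Counter((row[i] > 0.5, row[j] > 0.5) for row in rows)

rows = make_islanders()
occupied = octant_counts(rows)
print(len(occupied), "of 8 small cubes contain points")

for i, j in combinations(range(3), 2):
    quads = quadrant_counts(rows, i, j)
    print("games", (i, j), ":", len(quads), "of 4 quadrants contain points")
```

On data built this way, half of the small cubes should turn out empty, while every quadrant of every 2D view is populated. That is why the pairwise scatter plots look like noise.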
The pattern could also be described as 3D binary Sudoku, which would be an extension of this mock Sudoku puzzle from xkcd:
The 3D structure is invisible in the head-on views of 2D scatter plots: it does not appear when only two dimensions are considered at a time. This reminds me of the Borromean Rings:
No two rings are linked by themselves, but taken together the three rings cannot be separated.
Intention of the Puzzle
We are accustomed to thinking about even the largest data sets with the shorthand of single numbers, such as the unemployment rate, the Dow Jones index, or a batting average. The pitfalls of single numbers include the Flaw of Averages (you can drown in a river that is three feet deep, on average).
Visual representations, such as histograms and scatter plots, can help. An array of scatter plots can be even more informative. But no matter what kind of visualization we use, how do we know that we aren’t missing something important in the data?
You might object that a 3D scatter plot would have revealed the secret to Puzzle #9, and you might argue that we just need to portray more dimensions in our visualizations. There certainly are ways to portray several dimensions at one time (Edward Tufte found seven dimensions in this graphic summary of Napoleon’s campaign in Russia). But there must be some limit to how many dimensions you can portray at a time. Let’s say that you can portray up to n dimensions at a time. I could always make a data checkerboard in n+1 dimensions that would be invisible to you, even with an array of all possible n-dimensional visualizations.
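The checkerboard argument can be sketched in code. This is only an illustration under my own assumptions: the checkerboard is built by keeping uniformly random points whose count of coordinates above 0.5 is odd, and `checkerboard` and `occupied_cells` are hypothetical helper names. The example uses four dimensions and looks at every three-dimensional view.

```python
import random
from itertools import combinations

def checkerboard(dims, n_points, seed=0):
    """Random points in the unit hypercube, kept only when an odd number
    of coordinates is above 0.5 (one way to build a data checkerboard)."""
    rng = random.Random(seed)
    points = []
    while len(points) < n_points:
        p = tuple(rng.random() for _ in range(dims))
        if sum(x > 0.5 for x in p) % 2 == 1:
            points.append(p)
    return points

def occupied_cells(points, axes):
    """Number of distinct half-cells that contain at least one point
    when only the given axes are viewed."""
    return len({tuple(p[a] > 0.5 for a in axes) for p in points})

dims = 4
points = checkerboard(dims, 5000)

full = occupied_cells(points, range(dims))
print(f"{dims}-D view: {full} of {2 ** dims} cells occupied")

for axes in combinations(range(dims), dims - 1):
    seen = occupied_cells(points, axes)
    print(f"axes {axes}: {seen} of {2 ** (dims - 1)} cells occupied")
```

In the full space only half of the cells hold points, but every view that drops one axis has all of its cells occupied. Cell occupancy is a coarse check, yet it shows the idea: whatever coordinate is left out is exactly the one needed to see the gaps.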
My point is that neither standard statistical analysis nor visualizations are guaranteed to reveal important features of the data. The puzzle is a reminder to be wary of blind spots in our analysis, in the spirit of Anscombe’s Quartet, a group of four visibly different data sets with nearly identical summary statistics.
I had hoped that someone would attack the puzzle using newer techniques such as CART (Classification and Regression Trees), as reviewed by my friend Kirk Monteverde in this presentation. I expect that this approach would have identified the full and empty regions of the data set. However, those who solved the puzzle simply inspected the data and noticed the pattern (each individual had high ratings either for all three games or for exactly one game).
Copyright 2014. All Rights Reserved.
See also: Melting Fractals