Before I transferred to UIC, I was a graduate student at the University of Florida. While I was there studying topology with Alexander Dranishnikov, UF hired a new director of the honors program with a math background – Kevin Knudson. He had a lot of interest in algebraic topology, and specifically applied algebraic topology, which he shared with us. This is a really new field that has seen a lot of growth in the past decade, with researchers like Rob Ghrist at UPenn and Gunnar Carlsson at Stanford producing a lot of great ways to use algebraic topology to answer real-world problems.
In fact, the Stanford Computational Topology group has just spun off a company, Ayasdi, that deals specifically with exploratory data analysis using methods developed in pure math. They’ve published a collection of videos that make the power of their approach obvious, and they’ve also made some white papers available on the under-the-hood details. The main paper is Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. As a bit of a programming exercise, I thought I’d try to reproduce their Mapper tool, which looks to have developed into Ayasdi’s Iris.
They give a brief description of some possible applications of their approach, and I decided to try to work out one of these on my own. The relevant structure is a coarse version of the Reeb graph, a construction from Morse theory. This graph encodes the connectivity structure of the level sets of a function.
The classical application: given a height function, the Reeb graph encodes the number of connected pieces at a particular height. Here, and in all similar images in this post, the first arrow represents the function (eg height) and the squiggly arrow represents building the Reeb graph. For instance, in the image you can see that in the middle the shape has circular cross sections, so those cross sections have only one connected piece. Around the hoops, the cross sections are two circles, so there are two connected components and the Reeb graph at that height has two points.
So this might seem a little silly if you’re new to this. What’s the big deal? Well, there are a few things. First, the Reeb graph can be applied to shapes that exist in any dimension. For a shape that can be embedded in 3-space, the Reeb graph is pretty obvious. The problem is that most data sets measure very many different things (hundreds of dimensions!), so you can’t just look at them and see what the connections are. The Reeb graph provides a natural way of simplifying this data into a combinatorial structure.
Next, the function does not have to be a height function. In fact, I generally think that a height function is about the most boring thing you can choose. Instead, you can choose a function which represents something that you expect to be interesting about the data set. A natural candidate is a density estimator. In other words: are there a lot of points near a particular part of the space (high density), or are the points all far away (low density)? Here’s a sample data set (generated by x and y coordinates that are normally distributed) that illustrates the idea.
The idea is that points in the center have a high density and points near the periphery have a low density. If we had this data, probably one thing we’d really like to know is where to find the densest concentration of points – ie where are they clustering? Here’s the same data with the points colored according to their density (darker = denser):
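Here’s a minimal Python sketch of this step, assuming a Gaussian kernel density estimator of the form $f(x) = \sum_j e^{-\|x - x_j\|^2/\epsilon}$ (the same estimator discussed at the end of the post); the sample size and kernel width are illustrative choices, not tuned values:

```python
# A minimal sketch: normally distributed 2-d data, colored by a Gaussian
# kernel density estimate. The kernel width eps is an illustrative choice.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
points = rng.normal(size=(500, 2))         # x and y normally distributed

eps = 0.5                                   # kernel width, hand-picked
dists = squareform(pdist(points))           # all pairwise distances
density = np.exp(-dists**2 / eps).sum(axis=1)

# darker = denser
plt.scatter(points[:, 0], points[:, 1], c=density, cmap="Greys")
plt.show()
```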
The next step is to embed this 2-d data into 3-d space by placing each point at a height that corresponds with its density – more dense = higher. (You may need to click on the image to animate it.)
Ok so how do we apply the Reeb construction to this? Well, let’s start with something that looks sort of like our randomly generated data set and apply the Reeb construction to that.
This is really nice because applying the Reeb graph to the density estimation has automatically pulled out the densest cluster as the top vertex! This is generally how this construction will work – you’ll see a peak vertex in the Reeb graph for every cluster in the data. Note that these top-level clusters don’t need to be simple things like the data shown here; you can get clusters with interesting shapes.
As an aside, you might be thinking “couldn’t you just run some sort of clustering algorithm from the beginning?” Of course the answer is yes, but with some major caveats. First, clustering is hard work for computers! If you have 100,000,000 data points, there’s no chance of that working. Of course, what I’m doing here is also hard work for computers, so that’s a rather weak point. Second, this also lets you see how the clusters are related. Third, most clustering algorithms output just a list indicating which points belong to which clusters (or maybe a dendrogram or something, depending on the algorithm), which I doubt would generate much insight. This outputs a nice graph that lets you see what’s happening! Lastly, this reduces the data in a very natural way that preserves a lot of the structure you might be worried about. Once the data has been simplified in this way, I’ve read somewhere that machine learning techniques will perform better. Seems plausible.
There remains the question of how we go from the given data set to our topological estimation of it. The answer is that we fudge it. Because our data is finite and sampled, and so neither continuous nor precise, we can’t use the continuous Reeb graph directly. Instead, we build a finite, approximate Reeb graph by considering only a small number of subintervals. Then we chop up the data set and apply the Reeb idea to each whole subinterval, not just to a single value as above.
Still, I haven’t done anything with the data, so it doesn’t look like this has helped at all. However, consider the following: if you pick a random value for density, how many points have EXACTLY that density? The answer (depending on your notion of random) is going to be zero. On the other hand, if you pick a range of densities, then you have a good chance of capturing a bunch of points. That’s exactly the trick we use here.
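Here’s a sketch of the slicing step. The number of intervals and the overlap fraction are illustrative choices (the overlap matters later, when we build edges):

```python
# A sketch of slicing the density range into overlapping subintervals.
import numpy as np

def density_intervals(density, n_intervals=12, overlap=0.25):
    """Return (lo, hi) bounds covering [min, max] with fractional overlap."""
    lo, hi = density.min(), density.max()
    length = (hi - lo) / n_intervals
    pad = length * overlap                  # how far each interval spills over
    return [(lo + i * length - pad, lo + (i + 1) * length + pad)
            for i in range(n_intervals)]

def level_sets(density, intervals):
    """Each level set is the index set of points whose density is in range."""
    return [np.where((density >= a) & (density <= b))[0] for a, b in intervals]
```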
To see this in action, I took the data set from above and sliced the densities into 12 ranges. In the image below, I’m including the 2nd least dense, 5th least dense, 9th least dense, and most dense subsets. The least dense subset is hard to even notice since it’s only 2 points (IIRC) and they’re sort of close to the 5th least dense subset. Since I’ve excluded many of the density levels (eg 10 and 11), you get gaps between the data.
Ok so now is the time to apply some sort of clustering algorithm, even though I was dismissive of them before. However, we only apply the algorithm to each subset of points, which dramatically improves the performance of the algorithm. If you just look at the picture above, you might guess that each interval has one circular component. That is approximately correct – the least dense points are not connected enough to give the whole circle, just sub-arcs. If we apply a clustering algorithm to all of these above level 3, we do indeed get a single component at each level, giving us something like the following Reeb graph (note that the orientation of the graph is determined randomly when it’s drawn, and each vertex has a difficult-to-read label that indicates its density level).
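For the per-interval clustering, something like the following works, using scipy’s single-linkage implementation; the merge cutoff $\delta$ is the hand-tuned parameter discussed at the end of the post:

```python
# Single-linkage clustering of one level set. delta is the maximum
# distance at which two points are considered part of the same cluster.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_level_set(points, idx, delta):
    """Return a list of clusters (arrays of point indices) for one level set."""
    if len(idx) == 0:
        return []
    if len(idx) == 1:
        return [idx]
    Z = linkage(pdist(points[idx]), method="single")
    labels = fcluster(Z, t=delta, criterion="distance")
    return [idx[labels == k] for k in np.unique(labels)]
```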
Unfortunately, I didn’t save the exact data, so I resampled and this one might be a tiny bit different. It’s a bit tattered near the bottom and there are some low-density floaters, but that’s to be expected with a low-density data set – noise becomes more important! Nonetheless, you can immediately see at least one statistically important fact from this graph: the data is unimodal (ie like a bell curve – just one peak)! If you hadn’t seen the pictures I showed you earlier, you wouldn’t know that – for example, if the data were 9-dimensional. Although I guess this is only half accurate, since that cluster could have a significant amount of extra topological structure (it might be shaped weird) that you’re not seeing. That can be determined afterwards anyway. Here’s another data set and the accompanying graph:
You can see that the data has two peaks that are joined by a region of medium density. The graph has two ‘leaves’ that are joined a few vertices away. Again, the bottom of the graph is pretty messy and there are some floaters around but the overall structure is intact.
One thing that I still haven’t mentioned is where the edges come from. To see this, we should take a look at another data set, this time a very simple 1-d set.
Let’s group these points according to their densities (here we just have 4 levels) and notice the key fact about our intervals in this example – many of them overlap:
And now apply the clustering algorithm to each density interval. Wherever clusters in overlapping intervals share points, we include edges. Red boxes are clusters.
Now we just scrunch up all the clusters into single points to get the graph.
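Putting the pieces together, here’s a sketch of the whole construction using the helper functions from the earlier snippets: one graph node per cluster, with an edge whenever clusters from overlapping intervals share a point (networkx is just a convenient graph container here):

```python
# Sketch of the full pipeline: cluster each overlapping level set, then
# connect clusters from adjacent intervals whenever they share a point.
# With the overlap scheme above, only adjacent intervals can overlap.
import networkx as nx

def reeb_graph(points, density, n_intervals=12, overlap=0.25, delta=0.3):
    intervals = density_intervals(density, n_intervals, overlap)
    sets = level_sets(density, intervals)
    G = nx.Graph()
    clusters = []                            # (interval index, point set) per node
    for level, idx in enumerate(sets):
        for c in cluster_level_set(points, idx, delta):
            G.add_node(len(clusters), level=level)   # label vertex by density level
            clusters.append((level, set(c)))
    for i, (li, ci) in enumerate(clusters):
        for j in range(i + 1, len(clusters)):
            lj, cj = clusters[j]
            if lj == li + 1 and ci & cj:     # shared points define an edge
                G.add_edge(i, j)
    return G
```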
So now what do you do with this? Well, you can see how the data is shaped and how the clusters are related, qualitatively. You can run some sort of statistical test to understand what makes each cluster stand out, perhaps restricting your data set to just those high-density points (saving you mad flops). You can also take the data in each cluster and try to understand its shape, perhaps by running a similar process on it. You could also use some actual applied algebraic topology to do this by calculating the persistent homology of each cluster.
There is one thing that I think really makes this approach powerful. After you stratify your data set according to this method, it’s simple to filter the data according to the observed structure. Then you can apply either a battery of standard statistical tests to the simplified data (as noted above), or you can apply the Reeb construction a second (and third and fourth…) time using a different function for each iteration. What’s nice is that the major computational hurdles for this process are calculating the pairwise distances and building the graph, but you only need to calculate the distances once, so every following iteration is cheaper than the first. Since calculating the pairwise distances is $O(n^2)$, the savings are significant. Then you can figure out reasonably quickly which things are the defining characteristics of each cluster and so forth.
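In code terms (a sketch, reusing the scipy calls from above), the expensive step is the one-time distance computation:

```python
# The O(n^2) step: compute all pairwise distances once and reuse them.
from scipy.spatial.distance import pdist, squareform

condensed = pdist(points)        # one-time cost, quadratic in the number of points
dists = squareform(condensed)    # square form, reused by every later pass
# later iterations just evaluate a different filter function over `dists`
```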
I’ll finish with a few notes on this technique. First, the clustering algorithm I used is single-linkage. This suffers from a well-known flaw – two “obviously” distinct clusters can be identified as the same cluster if there’s a “bridge” between them. In other words, two large clusters can be turned into one cluster by connecting them with a narrow string of points because the algorithm only requires one nearby point to join things. I suspect that stratifying the data according to density goes a long way towards fixing this issue. Since SL is among the simplest clustering algorithms, overcoming this issue may make it the preferred choice in this context.
Second, here I constructed a Reeb graph, but the output need not be just a graph. In fact, the dimension of the output structure is effectively determined by the dimension of the range. For instance, if I had a function with range $\mathbb{R}^2$ then I would get a “Reeb 2-complex.” Aside from the increased complexity of working with higher dimensional structures, the process is unchanged.
Third, the code I wrote still requires some serious hand holding. Specifically, there are two coefficients that I produced ad hoc – one in the density method and one in the clustering method. The density estimator works by calculating $f(x) = \sum_j e^{-\|x - x_j\|^2/\epsilon}$, and the value of $\epsilon$ needs to be determined beforehand. I haven’t worked this out completely, but I suspect that $\epsilon$ should depend on the size of the data set and the dimension of the space, and that it can probably be determined experimentally with relative ease.
In single linkage clustering, a maximum merge distance, $\delta$, must be provided by the user. I figured this out by trial and error for each case, but my instinct is that it should depend only on the density of the given cluster and the dimension of the whole space. I haven’t spent any time trying to justify that claim. In any case, if $\delta$ proves difficult to determine, one can always use DBSCAN or OPTICS instead and make a determination of the cluster structure from the output of those algorithms.
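For example, swapping DBSCAN in for single linkage is nearly a one-liner with scikit-learn. This is a sketch mirroring the cluster_level_set helper above; eps and min_samples are DBSCAN’s own knobs, and the values here are illustrative:

```python
# Drop-in alternative to single linkage for the per-level clustering step.
from sklearn.cluster import DBSCAN

def cluster_level_set_dbscan(points, idx, eps=0.3, min_samples=3):
    # fit_predict labels each point; -1 marks noise, which we drop
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points[idx])
    return [idx[labels == k] for k in set(labels) if k != -1]
```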