Topological Data Analysis (TDA) is an approach that studies the `shape' or topological structure (loops, holes, and voids) of data in order to extract meaningful information. Because TDA identifies shapes despite certain deformations of the underlying space, it is robust to noise and can discover properties of data that are not discernible by conventional methods of data analysis. One of the principal methods for performing TDA is persistent homology. Persistent homology can be informally defined as a process for computing the topological features of data at increasing spatial resolutions; it gives a multiscale view of the topology of a space by capturing the evolution of that topology with increasing (or decreasing) resolution. The input to the computation is a point cloud. The output is a set of real-number pairs (birth and death times) recording the spatial resolution at which each topological feature first appears (birth) and the resolution at which it disappears (death). These pairs are usually plotted as a set of lines (a barcode), as a set of points in the 2D plane (a persistence diagram), or as a persistence landscape.

My work with TDA focuses primarily on the development of methods to accelerate the computation of persistent homology. Unfortunately, the worst-case time and space complexities of the computation are O(n^{3}) and O(n^{2}) respectively, where n is the total number of simplices in the filtration. Our work studies the development of partitioning, parallel, and approximate methods to accelerate the computation of persistent homology. Most recently, we have also initiated studies of persistent homology on streaming data.
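To make the birth/death pairing concrete, the following is a minimal sketch of the 0-dimensional case (connected components) only, using a union-find structure over the 1-skeleton of the Vietoris-Rips filtration. The function name `ph0` and this single-linkage formulation are illustrative, not part of our framework; full persistent homology in higher dimensions requires the boundary-matrix reduction with the complexities quoted above.

```python
import math
from itertools import combinations

def ph0(points):
    """0-dimensional persistence pairs (birth, death) for a Euclidean
    point cloud via single-linkage / union-find. Every component is
    born at scale 0; each merge at edge length d kills one component,
    recorded as the pair (0.0, d). One component never dies."""
    n = len(points)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # All pairwise edges, sorted by length: the Rips filtration's 1-skeleton.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    pairs = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj            # merge: a component dies at scale d
            pairs.append((0.0, d))
    pairs.append((0.0, math.inf))      # the surviving component persists
    return pairs
```

For two well-separated pairs of points, the output shows two short bars (the within-pair merges), one long bar (the between-pair merge), and one infinite bar, exactly the barcode picture described above.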
Each of these studies is explained in more detail below.
This work considers a method of partitioning the point cloud and parallelizing the computation of Persistent Homology (PH). The approach permits the computation of PH on much larger point clouds than previously possible. An outline of the general approach, and of how topological features in the original point cloud are correctly identified by it, is given below. In rare cases, a topological feature may be lost. The remainder of this proposal uses the term "identify a topological feature" to mean that the PH computation returns birth/death values that define a valid barcode for the topological feature in the point cloud.
A high-level outline of the steps in this method is as follows:
^{†} While the convex hull of a set P is generally defined as the boundary of the smallest convex set containing P, this proposal uses the term convex hull around a topological feature (a hole, void, or loop) to denote the set of points forming the boundary around the topological feature (in all dimensions of that void) precisely at the minimum ε distance at which the feature first appears.
This approach begins by transforming a point cloud P that is too large for computing persistent homology into another point cloud P' with fewer total points, so that persistent homology can be computed. Ideally, the transformation should be such that P and P' characterize the same topological structures. An example of a suitable mapping is illustrated in Figure 1 (point cloud P) and Figure 2 (point cloud P'). The general approach is to partition the original point cloud and use the geometric centers of the partitions as the points of P'. The choice of partitioning method can have a significant impact on feature preservation. In addition to preserving the topological structure, an ideal partitioning method must be efficiently computable on large, high-dimensional data sets. Partitioning is preferred over sampling methods because it provides a framework for regional computations of PH to locate smaller topological features that may be hidden in the mapping from P to P'.
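The centroid mapping from P to P' can be sketched as follows. This toy version uses a uniform spatial grid as the partitioner purely for illustration; as discussed below, the actual partitioning method (k-means++, geometric partitioning, LSH, etc.) is an open experimental question, and the function name `grid_centroids` is not part of our framework.

```python
from collections import defaultdict

def grid_centroids(points, cell_size):
    """Map a point cloud P to a reduced cloud P' by partitioning space
    into uniform cubes of side `cell_size` and replacing each occupied
    cell's points with their geometric center (centroid)."""
    cells = defaultdict(list)
    for p in points:
        # Integer cell coordinates identify the partition p falls into.
        key = tuple(int(c // cell_size) for c in p)
        cells[key].append(p)
    # One representative point (the centroid) per non-empty cell.
    return [
        tuple(sum(coord) / len(members) for coord in zip(*members))
        for members in cells.values()
    ]
```

A tight cluster of points collapses to a single centroid while an isolated point survives on its own, which is the kind of size reduction (e.g., 26K points to 300 in Figures 1 and 2) the method relies on.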
Fig. 1: Flamingo image at 26K points.
Fig. 2: Flamingo image at 300 points.
Our preliminary experiments suggest that spherical clustering methods such as k-means++ can produce a suitable partitioning of the data. However, considerably more experimental data must be collected on multiple alternatives, such as geometric partitioning, Locality Sensitive Hashing, and streaming clustering.
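The k-means++ seeding step that makes this clustering attractive can be sketched in a few lines. This is a generic textbook sketch, not our implementation; the function name `kmeanspp_seeds` and the fixed random seed are assumptions for reproducibility of the example.

```python
import math
import random

def kmeanspp_seeds(points, k, rng=None):
    """k-means++ seeding: pick the first center uniformly at random,
    then pick each subsequent center with probability proportional to
    its squared distance to the nearest center chosen so far. This
    spreads the centers out across the point cloud."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance of each point to its nearest chosen center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        total = sum(d2)
        if total == 0:               # all points coincide with centers
            centers.append(centers[0])
            continue
        r = rng.uniform(0, total)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:             # weighted sampling by d2
                centers.append(p)
                break
    return centers
```

Because far-away points are heavily favored as new centers, the seeds tend to land in distinct regions of the cloud, which is plausibly why spherical clustering preserves the coarse topological structure well in our preliminary experiments.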
There are limits to representing the point cloud with fewer and fewer partition centroids; at some point, the topologically significant features of the original data will fail to emerge. It is important to understand how and when these features disappear. Furthermore, in rare cases a topological feature may be lost or shifted to a lower dimension (when the centroid data does not form complexes surrounding the feature in all dimensions around the convex hull of the feature in the full data set). While this is possible, our preliminary experiments have not produced any such errors. Finally, it is also possible for a feature to shift into a higher dimension; the upscaling step resolves this issue and pushes the topological feature definition into the correct dimension.
We are building the software tools that implement the ideas described above. These tools are freely available on GitHub as the Lightweight Homology Framework.