# Topological Data Analysis (TDA)

Topological Data Analysis (TDA) is an approach that studies the "shape", or topological structure (loops, holes, and voids), of data in order to extract meaningful information. Because TDA identifies shapes despite deformations of the underlying space, it is robust to noise and can discover properties of data that are not discernible by conventional methods of data analysis.

One of the principal methods of TDA is persistent homology, which can be informally defined as a process for computing the topological features of data at increasing spatial resolutions. It gives a multiscale view of the topology of a space by capturing the evolution of the topology as the resolution increases (or decreases). The input to the computation of persistent homology is a point cloud. The output is a set of real-number pairs (birth and death times) recording the spatial resolution at which each topological feature first appears (birth) and the resolution at which it disappears (death). These pairs are usually plotted either as a set of intervals called a barcode, as a set of points in the plane called a persistence diagram, or as a persistence landscape.

My work with TDA focuses primarily on the development of methods to accelerate the computation of persistent homology. Unfortunately, the worst-case time and space complexities of the computation are O(n³) and O(n²) respectively, where n is the total number of simplices in the filtration. Our work studies the development of partitioning, parallel, and approximate methods to accelerate the computation of persistent homology. Most recently, we have also initiated studies of persistent homology on streaming data.
Each of these studies is explained in more detail below.
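
To make the birth/death pairing concrete, here is a minimal, self-contained sketch of the simplest case, 0-dimensional persistent homology (connected components), computed Kruskal-style with a union-find over edges sorted by length. The function name and data layout are illustrative only, not part of any tool described in this proposal.

```python
# Sketch of 0-dimensional persistent homology (H0) on a point cloud.
# Components are born at resolution 0; a component dies at the edge
# length that first merges it into another component (union-find).
import math
from itertools import combinations

def h0_barcodes(points):
    """Return (birth, death) pairs for the connected components of
    the growing distance filtration; one component never dies."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri            # merge: one component dies here
            bars.append((0.0, length))
    bars.append((0.0, math.inf))       # the surviving component
    return bars
```

For two well-separated pairs of points, this reports two short bars (within-pair merges), one long bar (the merge across the gap), and one infinite bar, which is exactly the multiscale summary described above.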

## Partitioning and Sampling to Compute Persistent Homology

This work considers a method of partitioning the point cloud and parallelizing the computation of Persistent Homology (PH). The approach permits the computation of PH on much larger point clouds than previously possible. The general approach, and how topological features in the original point cloud are correctly identified by it, are described below. In rare cases, a topological feature may be lost. The remainder of this proposal uses the phrase "identify a topological feature" to mean that the PH computation returns birth/death values that define a valid barcode for that feature in the point cloud.

A high level outline of the steps in this method is described as follows:

1. Partition the point cloud so that each point is assigned to one and only one partition.
2. CentroidPH: Use the geometric center (centroid) of each partition as its representative point to compute/estimate the PH of the original point cloud. This computation will identify the barcodes for the larger topological features in the point cloud; the smaller features will be identified by the regional PH computation described in step RegionalPH below.
3. Upscaling: For each topological feature identified in step CentroidPH above, upscale the data in the partitions on the boundary of the feature and recompute PH on the restored data from those partitions (this step can be performed in parallel for each feature). Upscaling can be performed iteratively (by repartitioning and upscaling) to grow the restored data in cases where a single upscaling step presents too many points for computing PH. In rare cases, the total number of points forming the convex hull of a feature may exceed our ability to compute PH on all of the boundary points; even then, upscaling still has the potential to tighten the error bounds of the computed homology.
4. RegionalPH: For each partition defined in step Partition above, compute PH on a region of points within and around the partition (these PH computations can occur in parallel with each other, as well as concurrently with steps CentroidPH and Upscaling above). The region size needed to ensure that all topological features in dimension 2 and above are properly identified is defined below.
5. Merging: Duplicate features found by the regional PH computations are eliminated by imposing an arbitrary total order on the points of the original data set: each regional computation reports barcodes only for topological features whose lowest-ordered convex-hull point is a member of the partition at the center of that regional computation. The final step removes duplicate barcodes from the centroid-based PH computation that are also discovered in a regional PH computation; the regionally computed barcode is preserved, as it generally gives a more precise estimate of the feature's birth/death times.

While the convex hull of a set P is generally defined as the boundary around the smallest convex set of points that contain P, this proposal uses the term convex hull around a topological feature (a hole, void, or loop) to denote the set of points forming the boundary around the topological feature (in all dimensions of that void) precisely at the minimum ε distance when the feature first appears.
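
The ownership rule in step Merging above can be sketched as follows. The data shapes and function name here are assumptions for illustration only, not the actual framework API; point ids themselves serve as the arbitrary total order.

```python
# Illustrative sketch of the de-duplication rule in the Merging step:
# a regional computation "owns" a feature only if the lowest-ordered
# point on the feature's convex hull lies in its central partition.

def owned_barcodes(regional_results, partition_of):
    """regional_results: list of (partition_id, features), where
    features is a list of (barcode, hull_point_ids).
    partition_of: maps a point id to its partition id."""
    kept = []
    for pid, features in regional_results:
        for barcode, hull in features:
            if partition_of[min(hull)] == pid:  # lowest-ordered hull point
                kept.append(barcode)
    return kept
```

A feature whose hull straddles two partitions is seen by both regional computations, but only the one whose central partition contains the lowest-ordered hull point reports it, so each barcode is emitted exactly once.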

This approach begins by transforming a point cloud P that is too large for computing persistent homology into another point cloud P' with fewer total points, so that persistent homology can be computed. Ideally, the transformation should be such that P and P' characterize the same topological structures. An example of a suitable mapping is illustrated in Figure 1 (point cloud P) and Figure 2 (point cloud P'). The general approach is to partition the original point cloud and use the geometric centers of the partitions as the points of P'. The choice of partitioning method can have a significant impact on feature preservation. In addition to preserving the topological structure, an ideal partitioning method must be efficiently computable on large, high-dimensional data sets. Partitioning is preferred over sampling methods because it provides a framework for regional PH computations to locate smaller topological features that can be hidden in the mapping from P to P'.

Fig. 1: Flamingo image at 26K points. Fig. 2: Flamingo image at 300 points.

Our preliminary experiments suggest that spherical clustering methods such as k-means++ can produce a suitable partitioning of the data. However, considerably more experimental data must be collected on alternatives such as geometric partitioning, Locality Sensitive Hashing, streaming clustering, and so on.
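
As one concrete candidate, the P to P' reduction with a k-means++-seeded partitioner might look roughly like the following. This is a sketch under the assumption of a plain Lloyd iteration, not the framework's actual implementation; the function name is illustrative.

```python
# Hedged sketch of the P -> P' reduction: partition the cloud with a
# k-means++-seeded Lloyd iteration and keep only the partition centroids.
import math
import random

def kmeanspp_centroids(points, k, iters=10, seed=0):
    rng = random.Random(seed)
    # k-means++ seeding: each new seed is drawn with probability
    # proportional to its squared distance from the nearest chosen seed.
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(list(rng.choices(points, weights=d2)[0]))
    for _ in range(iters):  # Lloyd iterations
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        for i, g in enumerate(groups):
            if g:  # move each centroid to the mean of its partition
                centers[i] = [sum(x) / len(g) for x in zip(*g)]
    return centers
```

The returned centroids form the reduced cloud P'; with k chosen well below |P|, PH on P' becomes feasible while the large-scale topology of P is (ideally) preserved.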

There are limits to representing the point cloud through fewer and fewer partition centroids; at some point, the topologically significant features of the original data will fail to emerge. It is important to understand how and when these features might disappear. Furthermore, in rare cases it is possible that a topological feature will be lost or shifted to a lower dimension (when the centroid data does not form complexes surrounding the feature in all dimensions around the convex hull of the feature in the full data set). While this is possible, our preliminary experiments have failed to produce any of these errors. Finally, it is also possible for a feature to shift into a higher dimension. The upscaling step will resolve this issue and push the topological feature definition into the correct dimension.

We are building the software tools to implement the ideas described above. These tools are freely available on GitHub as the Lightweight Homology Framework.

## Computing Persistent Homology on Streaming Data

This work is still very much in an embryonic state, and the details of our solution remain highly fluid. Therefore, I will refrain from elaborating in detail at this time. If you would like to work on this or have questions, please feel free to drop by my office so we can discuss it further.
- Online steps
  - Set up an optional streamingKmeans++ frontend
  - Maintain a sliding window of points
  - We currently have a working algorithm for incremental complex construction/destruction.
  - We are working through the various cases for the incremental update to the t-array.
- Offline steps
  - Upscaling (if needed).
  - Translate to persistence landscapes and compare to reference landscapes.
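
Since the design is still fluid, the following is only a minimal sketch of the sliding-window idea from the online steps above. The class and hook names are placeholders, not our actual interfaces; the hooks stand in for the incremental complex construction/destruction algorithm mentioned above.

```python
# Placeholder sketch of a sliding window over a point stream: when the
# window is full, the oldest point is evicted, and the (hypothetical)
# incremental-update hooks fire on every insertion and eviction.
from collections import deque

class SlidingWindow:
    def __init__(self, capacity, on_insert=None, on_evict=None):
        self.buf = deque()
        self.capacity = capacity
        self.on_insert = on_insert  # e.g. incremental complex construction
        self.on_evict = on_evict    # e.g. incremental complex destruction

    def push(self, point):
        if len(self.buf) == self.capacity:
            old = self.buf.popleft()
            if self.on_evict:
                self.on_evict(old)
        self.buf.append(point)
        if self.on_insert:
            self.on_insert(point)
```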