Spring 2019: Approximate Methods to Compute Persistent Homology

Preparation for Research in Topological Data Analysis (TDA)

The primary objective of this project is to advance the application of Topological Data Analysis (TDA) methods to machine learning and data mining on higher-dimensional, big data applications. While TDA shows great promise for discovering knowledge beyond the reach of current data mining techniques, its computational and memory requirements grow exponentially, which limits its general use to point clouds containing fewer than 25K points in ℝ³. My objective is to expand this limit by 3-5 orders of magnitude. This project will use data partitioning and parallelization techniques to attack the run time and memory requirements of locating the topological features in Big Data point clouds. The technical details of the partitioning and parallelization we plan to use are described elsewhere. The purpose of this document is to provide direction and pointers to materials that help the interested student develop a background understanding of the mathematical and computational underpinnings of TDA.


Organizing Your Studies

Topology is an interesting and unusual field of mathematics in which many of us do not have much background. Hopefully the materials I have highlighted in the following sections will help you get up to speed. Most students will find this material quite new and confusing. It is easy to begin looking up many details that end up being insignificant to our end goal. That is, for one reason or another, quite a bit of the detailed formal mathematics becomes unimportant for our purposes. For example, the algebraic topology materials will worry about the direction of the links between points; in reality this will become a non-issue. Thus, my recommendation is that you run through the preliminary materials fairly quickly to get a "big picture" perspective of the topic. This is especially true of the early videos: watch them quickly and work only to capture the principal ideas, concepts, and vocabulary of these different topics. Most of the theorems, lemmas, and proofs are interesting but not really all that critical. I recommend that you watch them with the goal of simply appreciating that the theories hold and the general approach used to prove them (some are pretty interesting arguments).

Finally, there is quite a bit of material and time involved in watching these videos. I strongly recommend that you play them at higher than normal speed. You might initially play them only at 1.25 speed until you get into the basic vocabulary and then move to higher speeds for subsequent videos. I will almost always play videos at 1.5 or 2.0 speed without difficulty (of course I am a native English speaker). In the end, do what works best for you.

Persistent Homology

The materials in this section are simply setting the context of TDA and persistent homology. Not strictly necessary, but nevertheless interesting.

Quick Overview of Persistent Homology

This short video gives an easy-to-digest introduction to the key ideas of persistent homology, which is the key method for computing topological features of a space. It also provides a quick, clean demonstration of barcodes and what they are.
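To make the idea of a barcode concrete, here is a minimal sketch that computes only the dimension-0 bars (connected components) of a point cloud. It is not taken from the video: the function name `h0_barcode` and the sample `cloud` are made up for illustration, and it assumes the convention that a bar dies at the length of the edge that merges two components (some texts use half that length).

```python
from itertools import combinations
from math import dist, inf


def h0_barcode(points):
    """Return the dimension-0 barcode of a point cloud as (birth, death) pairs."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the component representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Visit every pairwise distance in increasing order, as the scale grows.
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj              # two components merge, so one bar ends here
            bars.append((0.0, length))
    bars.append((0.0, inf))              # the last surviving component never dies
    return bars


# Hypothetical example: two well-separated pairs of points on a line.
cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
for birth, death in h0_barcode(cloud):
    print(f"[{birth}, {death})")
```

The two short bars end when each pair joins up, the long finite bar ends when the two clusters finally merge, and the infinite bar is the single component that remains. Long bars are the ones usually read as real features; short bars are usually read as noise.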

Discussions of TDA on point clouds and Robustness to Noise

Here are two more video lectures that are interesting. The first mostly restates the ideas described in the above video; the second discusses how to use statistical methods with TDA (persistent homology) to deal with noise.
  1. Introduction to TDA
  2. Statistical Techniques in TDA

Basic Background

Here are 18 YouTube video lessons titled "What is a Manifold". While I haven't yet finished all of them, the presentation is made from the perspective of point set topology. I strongly recommend them; so far I've thoroughly enjoyed watching them.

These next two sections are probably the most critical background materials; they provide the general foundations from which you can begin to understand what we're talking about on this topic. I strongly recommend that you plan to study these materials in multiple iterations. Specifically, I recommend that you watch Lectures 30 through 35, the "Introduction to Algebraic Homology" videos, at high speed to get a general sense of homology. Then I recommend a close study of the webpages of the second subsection. This should give you a good foundation from which to branch out your studies. In the third subsection, I have included links to the Napkin project, which includes introductory materials on a wide variety of mathematical topics.

Introductions to Topology and Homology

A great starting point to our work is captured in the 40 lecture video series on Algebraic Topology. If you want to dispense with the preliminaries and jump right in, then I strongly recommend you watch at least Lecture 30 and everything after it:

  1. Lecture 30: Intro to Algebraic Homology, Part I
  2. Lecture 31: Intro to Algebraic Homology, Part II

Overall I have found that all of the lectures are useful and I recommend you spend the time to watch them all. Do not get hung up on all the math and proofs. What you need is to understand the basic concepts and terms. Furthermore, do not get hung up on group theory. It is sufficient that you get the general gist of the ideas he is describing. As your knowledge in this area grows, you can backfill with other videos to complete your understanding of these additional topics. Finally, the 5 part series described next is the key study material that, for me, turned the flurry of concepts in this background material into reality.

Should you prefer a print medium for your studies, I have read good things about these texts:

  1. R. Ghrist, Elementary Applied Topology, ed. 1.0, Createspace, 2014.
  2. J. P. May, A Concise Course in Algebraic Topology, University of Chicago Press, Sept 1999.

Introductory Tutorials on Topological Data Analysis

Here is the best gentle, non-mathematical tutorial on Topological Data Analysis that I have found to date. This is a 5 part series with a link to the next page at the end of each part. There are numerous formatting errors and a few actual errors in the materials, but most of them are obvious. While the discussion introduces the two main analysis methods of TDA (namely persistent homology and mapper), it appears that these 5 parts address only persistent homology. A nice feature of these pages is that they also discuss the construction of Vietoris-Rips (VR) complexes. Lots of examples and code. In Part 5, the prose states that there is a separate set of pages for mapper, but I have not been able to locate them (if you do, please send me the link).

Here is another tutorial, but presented as a paper and not as a collection of web pages.

  1. Frédéric Chazal and Bertrand Michel, An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists, ArXiv e-prints arXiv:1710.04019, Oct 2017.

Tutorials on Various Topics in Mathematics

The Napkin Project is a collection of training materials on mathematics that is geared to the non-mathematician. If you want some background in some math concept, this is probably where you should begin your studies.


Current State-of-the-art Tools for Computing Persistent Homology

Read everything in this section!!

The following documents are nice overviews of the state of the art in the computation of persistent homology. They contain pointers to papers that demonstrate the utility of persistent homology in various fields and even describe some works that rely on accurate representation of minor topological features. They also provide a good overview of the steps that are generally followed in computing persistent homology.

  1. Nia Otter, Mason A. Porter, Ulrike Tillmann, Peter Grindrod and Heather A. Harrington, A Roadmap for the Computation of Persistent Homology, ArXiv e-prints arXiv:1506.08903, June 2017.
  2. Chi Seng Pun, Kelin Xia, and Si Xian Lee, Persistent-Homology-based Machine Learning and its Applications — A Survey, ArXiv e-prints arXiv:1811.00252v1, Nov 2018.
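As a companion to those overviews, the core algebraic step of the usual pipeline (build a filtered complex, reduce its boundary matrix, read off the intervals) can be sketched in a few lines. This is the textbook column reduction over Z/2, not the optimized variants found in production tools; the name `reduce_boundary` and the encoding of each column as a set of row indices are choices made here for illustration.

```python
def reduce_boundary(columns):
    """Reduce a Z/2 boundary matrix given as a list of sets of row indices.

    Column j holds the indices of the faces of simplex j, with simplices
    numbered in filtration order. Returns the persistence pairs
    (birth_index, death_index) and the indices left unpaired.
    """
    reduced = [set(col) for col in columns]
    owner = {}                       # lowest row index -> column that owns it
    pairs = []
    for j, col in enumerate(reduced):
        # Add earlier columns (mod 2) until the lowest entry is unique or the column is empty.
        while col and max(col) in owner:
            col ^= reduced[owner[max(col)]]
        if col:
            owner[max(col)] = j
            pairs.append((max(col), j))
    paired = {i for pair in pairs for i in pair}
    unpaired = [i for i in range(len(columns)) if i not in paired]
    return pairs, unpaired


# Hypothetical filtration of a filled triangle:
# 0, 1, 2 are vertices a, b, c; 3 = ab, 4 = bc, 5 = ac; 6 = abc.
triangle = [set(), set(), set(), {0, 1}, {1, 2}, {0, 2}, {3, 4, 5}]
print(reduce_boundary(triangle))
```

In this example edges 3 and 4 each kill a connected component, edge 5 closes a loop, and the triangle 6 fills that loop in, giving the pair (5, 6). Vertex 0 stays unpaired: it is the one component that lives forever. Replacing each index by the scale at which that simplex enters the filtration turns the pairs into barcode intervals.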

Simplicial Complex Construction

The following papers discuss the construction and optimization of VR complex representations.

  1. Afra Zomorodian, Fast Construction of the Vietoris-Rips Complex, Computers & Graphics, 34(3), 263-271, June 2010.
  2. Jean-Daniel Boissonnat, Karthik C. S., and Sébastien Tavenas, Building Efficient and Compact Data Structures for Simplicial Complexes, Algorithmica, 79(2), 530-567, Oct 2017. Also available as: Jean-Daniel Boissonnat, Karthik C. S., and Sébastien Tavenas, Building Efficient and Compact Data Structures for Simplicial Complexes, ArXiv e-prints arXiv:1503.07444v4, Nov 2016. Building on and optimizing the original: Jean-Daniel Boissonnat and Clément Maria, The Simplex Tree: An Efficient Data Structure for General Simplicial Complexes, Algorithmica, 70(3), 406-427, Nov 2014.
  3. Dominique Attali, André Lieutier, and David Salinas, Efficient Data Structure for Representing and Simplifying Simplicial Complexes in High Dimensions, Proceedings of the twenty-seventh annual symposium on Computational geometry (SoCG '11), 501-509, ACM, New York, NY, USA, 2011.
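
Before reading these papers, it may help to see what a VR complex is in code. The following is a brute-force sketch under simple assumptions (Euclidean distance, points as coordinate tuples, a hypothetical function name `vietoris_rips`); it is not the optimized construction or data structure from any of the papers above. It works in two phases: build the neighborhood graph at scale epsilon, then expand cliques of that graph into simplices.

```python
from itertools import combinations
from math import dist


def vietoris_rips(points, epsilon, max_dim=2):
    """Return the simplices (sorted vertex tuples) of the VR complex up to max_dim."""
    n = len(points)
    # Phase 1: neighborhood graph; an edge joins two points at distance <= epsilon.
    neighbors = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if dist(points[i], points[j]) <= epsilon:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # Phase 2: clique expansion; extend each simplex by a common neighbor of
    # all its vertices, using only larger indices so each simplex appears once.
    simplices = [(i,) for i in range(n)]
    frontier = simplices[:]
    for _ in range(max_dim):
        next_frontier = []
        for simplex in frontier:
            common = set.intersection(*(neighbors[v] for v in simplex))
            for w in sorted(common):
                if w > simplex[-1]:
                    next_frontier.append(simplex + (w,))
        simplices.extend(next_frontier)
        frontier = next_frontier
    return simplices


# Hypothetical example: the four corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(len(vietoris_rips(square, 1.0)))             # sides only, no diagonals
print(len(vietoris_rips(square, 1.5, max_dim=3)))  # every subset of the corners
```

At epsilon = 1.0 the square yields 4 vertices and 4 edges. At epsilon = 1.5 every pair is connected, so with max_dim=3 every non-empty subset of the 4 points becomes a simplex (2^4 - 1 = 15). That doubling with each added point is the exponential growth that makes the size of the complex the central problem.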

Computing Persistent Homology


Spring 2019 Breakout

More papers that have to be examined and classified.

1. Complexes

There are several types of complexes that are constructed to represent/capture the shape of a topological space. Some of the more common types are: cubical complexes, simplicial complexes, Δ complexes, Delaunay triangulations, and CW complexes. While both cubical complexes and simplicial complexes are frequently used for computing persistent homology, simplicial complexes are far more commonly used and they will be the focus of our studies. Cubical complexes are generally used only with data sets that are well represented by n-cube decompositions (sometimes used for image analysis). If you want to study cubical complexes, I suggest you begin with this paper:

2. Simplicial Complexes

3. Approximating/Sampling Simplicial Complexes

The following papers are closely related to our work to map a large point cloud to a smaller point cloud for computing Persistent Homology.

Witness complexes and graph induced complexes are interesting techniques to obtain smaller simplicial complexes to use for computing homology.
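
One common way to choose the smaller point cloud (for example, the landmark points of a witness complex) is greedy "maxmin" or farthest-point sampling. The sketch below is only an illustration of that general idea, not the method of any specific paper in this section or of our own partitioning approach; the function name `maxmin_landmarks`, the Euclidean metric, and the fixed starting index are all assumptions.

```python
from math import dist


def maxmin_landmarks(points, k, start=0):
    """Greedily pick k landmark indices, each as far as possible from those already chosen."""
    landmarks = [start]
    # gap[i] is the distance from point i to its nearest chosen landmark.
    gap = [dist(p, points[start]) for p in points]
    while len(landmarks) < min(k, len(points)):
        far = max(range(len(points)), key=gap.__getitem__)
        landmarks.append(far)
        gap = [min(g, dist(p, points[far])) for g, p in zip(gap, points)]
    return landmarks


# Hypothetical example: three clumps of points on a line.
cloud = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0), (10.0, 0.0)]
print(maxmin_landmarks(cloud, 3))
```

Here the three landmarks land in three different clumps, so the reduced cloud still covers the overall shape of the data. The trade-off is that fine features smaller than the spacing between landmarks can be lost.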

4. Optimizations to Simplicial Complexes

One of the basic challenges in applying persistent homology to large data sets is the size of the simplicial complexes. The sampling methods above and the witness complex attack this directly by removing some of the points used to construct the simplicial complex. In this section, we will examine techniques that reduce the constructed complex to a smaller, homotopy-equivalent form.

5. Parallel and Distributed Computing of PH

6. Output Analysis

Barcodes:

Persistence Diagrams:

Persistence Landscapes:

Persistence Images:

Applications of Output Types

7. Mapper

A decent introduction to Mapper is contained in Chazal's tutorial manuscript introducing topological data analysis. I suggest you begin there.

8. Applications of TDA

Sheaves

I haven't really looked at these yet....

  1. Algebraic geometry -- Sheaves
  2. Series of lectures; I can't get sound to work with this. Tutorial on Sheaves in Data Analytics

Additional materials that might be helpful

Related Online Course Materials

  1. OSU Course CSE 5559: Computational Topology and Data Analysis








Other Materials that I no longer recommend for background study

## Manifolds

Before delving too far into Morse theory, it is probably a good idea to first review material on manifolds and such. I found [this lecture to be pretty good](https://www.youtube.com/watch?v=WCwoCFdjUcE). This is the same dude that does the Algebraic Topology series above and he recommends Lectures 17-20 to review manifolds. I have not yet watched those, but probably will backfill with them later. In the meantime, this lecture from his Differential Geometry series should be sufficient for our needs.

## Lectures on Topological Data Analysis (not really; actually a set of lectures on the Ayasdi System)

So far I am really not impressed with these lectures. *Hope to find something else we can actually learn something from;* these are not it. The only one of them that has any bit of merit is Lecture 6. You will not learn much, but it is somewhat interesting.

I am not sure exactly how these [7 lectures come together](https://www.youtube.com/playlist?list=PLGIi7XCwrwYUQ_8nkhsFDiUnIzkPk7PKT), but some of them are interesting. Unfortunately they do not have much detail, so you are really just watching these to gain an idea of what is possible in TDA more than how to do TDA. My summary of these:

1. Lecture 1: useful, but the content is well summarized at the start of Lecture 3; skip
2. Lecture 2: absolutely horrible and useless; skip
3. Lecture 3: useful and largely subsumes Lecture 1; overview, lacks detail
4. [Lecture 4:](https://www.youtube.com/watch?v=gtFVdGb9Y8w&index=4) yet another restatement of Lectures 1 and 3; slightly more detail (more examples in 3); scan both, but **focus on this one**
5. Lecture 5: interesting demonstration of the use of R for TDA (points us at the "Dream to Learn" webpage); looks to me like it is a lecture on using R learning tools and graphing. Skipping (watched only 5 minutes at 2x)
6. [Lecture 6:](https://www.youtube.com/watch?v=kctyag2Xi8o&index=6) finally a useful lecture. **Watch this one**.
7. Lecture 7: again another version of Lectures 1, 3, and 4. That said, some of the examples are more interesting and more fully described than in the previous lectures. Probably worth a high speed scan.

## Interesting Videos

I really liked this video titled "[How much structure do we see in noise (a topological perspective)?](http://videolectures.net/solomon_skraba_structure_in_noise/)". It reviews the limits of persistence in noise and shows some limits that we might begin to use to filter out noise in TDA based data analysis. Related to this talk is the paper in this directory named *bobrowski-16.pdf.bz2*.

# Important Related Projects

* [ToMATo: a Topological Mode Analysis Tool](https://github.com/locklin/tomato). Evidently it was taken [from](http://geometrica.saclay.inria.fr/data/ToMATo/)

## Homology Preserving Reduction Techniques

### Tidy Set

* Afra Zomorodian, "The tidy set: A minimal simplicial set for computing homology of clique complexes," *Proc. ACM Symposium on Computational Geometry*, 2010.

### Morse Red

* Konstantin Mischaikow and Vidit Nanda, *Morse theory for filtrations and efficient computation of persistent homology*, Discrete & Computational Geometry, 50, pp. 330-353, 2013.

## A Quick Intro to Approximate Computations

As we get further into this, the ideas of short circuiting the data and computing approximate results, averages, and so on are increasingly going to come into play. We are already somewhat familiar with this from the ideas to build structures to speed search (skip lists) and/or partitioning (count-min sketch). So far we've avoided some other, more aggressive approximations. Here is an interesting video titled ["Awesome Big Data Algorithms"](https://www.youtube.com/watch?v=jKBwGlYb13w) on the types of approximate methods that will hopefully broaden your thinking about how approximate computations can work.

## Morse Theory

Still looking for good lectures on this. Tentatively, here are [three lecture hours on Discrete Morse Theory](http://videolectures.net/computationaltopology2013_benedetti_theory/?q=morse%20theory).