Spring 2019: Approximate Methods to Compute Persistent Homology

Preparation for Research in Topological Data Analysis (TDA)

The primary objective of this project is to advance the application of Topological Data Analysis (TDA) methods to machine learning and data mining on higher-dimensional, big data applications. While TDA shows great promise for discovering knowledge beyond the reach of current data mining techniques, its computational and memory requirements grow exponentially, limiting its general use to point clouds containing fewer than 25K points in ℝ³. My objective is to expand this limit by 3-5 orders of magnitude. This project will use data partitioning and parallelization techniques to attack the run time and memory requirements of locating the topological features in Big Data point clouds. The technical details of the partitioning and parallelization we plan to use are described elsewhere. The purpose of this document is to provide direction and pointers to materials that will help the interested student develop a background understanding of the mathematical and computational underpinnings of TDA.

Organizing Your Studies

Topology is an interesting and unusual field of mathematics in which many of us do not have much background. Hopefully the materials I have highlighted in the following sections will help you get up to speed. Most students will find this material quite new and confusing. It is easy to begin looking up many details that end up being insignificant to our end game. That is, for one reason or another, quite a bit of the detailed formal mathematics becomes unimportant for our purposes. For example, the algebraic topology materials will worry about the direction of the links between points; in reality this will become a non-issue. Thus, my recommendation is that you run through the preliminary materials fairly quickly to get a "big picture" perspective of the topic. This is especially true of the early videos: watch them quickly and work only to capture the principal ideas, concepts, and vocabulary of these different topics. Most of the theorems, lemmas, and proofs are interesting but not really all that critical. I recommend that you watch them with the goal of simply appreciating that the theories hold and the general approach used to prove them (some are pretty interesting arguments).

Finally, there is quite a bit of material and time involved in watching these videos. I strongly recommend that you play them at higher than normal speed. You might initially play them only at 1.25 speed until you get into the basic vocabulary and then move to higher speeds for subsequent videos. I will almost always play videos at 1.5 or 2.0 speed without difficulty (of course I am a native English speaker). In the end, do what works best for you.

Persistent Homology

The materials in this section are simply setting the context of TDA and persistent homology. Not strictly necessary, but nevertheless interesting.

Quick Overview of Persistent Homology

This short video gives a very easy-to-digest introduction to the key ideas of Persistent Homology, which is the principal method for computing topological features of a space. It also provides a quick, clean demonstration of barcodes and what they are.
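To make the idea of a barcode concrete, here is a minimal sketch that computes only the dimension-0 bars (connected components) of a tiny point cloud. It is an illustration under stated assumptions, not the method used in the video: the function name `h0_barcode` is hypothetical, every point is born at scale 0, and an edge is assumed to enter the filtration at a scale equal to its length (some texts use half the length instead).

```python
import math
from itertools import combinations


def h0_barcode(points):
    """Return the H0 bars (birth, death) of a Vietoris-Rips style filtration.

    Assumption: an edge appears at a scale equal to its Euclidean length.
    Each time an edge merges two components, one component dies.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process all edges from shortest to longest.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((0.0, length))  # a component dies at this scale
    bars.append((0.0, math.inf))  # the last component never dies
    return bars


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.0, 2.0)]
bars = h0_barcode(points)
for birth, death in bars:
    print(f"[{birth}, {death})")
```

The two short bars correspond to nearby points merging early, the bar of length 4 reflects the gap between the two clusters, and the infinite bar is the single component that survives at every scale.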

Discussions of TDA on point clouds and Robustness to Noise

Here are two more video lectures that are interesting. The first mostly restates the ideas described in the above video; the second discusses how to use statistical methods with TDA (persistent homology) to deal with noise.

Basic Background

Here are 18 YouTube video lessons titled "What is a Manifold". While I haven't yet finished all of them, the presentation is made from the perspective of point set topology. I strongly recommend them; so far I've thoroughly enjoyed watching them.

These next two sections are probably the most critical background materials; they provide the general foundations from which you can begin to understand what we're talking about on this topic. I strongly recommend that you plan to study these materials in multiple iterations. Specifically, I recommend that you watch Lectures 30 through 35 of the "Introduction to Algebraic Homology" videos at high speed to get a general sense of Homology. Then I recommend a close study of the webpages of the second subsection. This should give you a good foundation from which to branch out your studies. In the third subsection, I have included links to the Napkin project, which includes introductory materials on a wide variety of mathematical topics.

Introductions to Topology and Homology

A great starting point to our work is captured in the 40 lecture video series on Algebraic Topology. If you want to dispense with the preliminaries and jump right in, then I strongly recommend you watch at least Lecture 30 and everything after it:

Lecture 30: Intro to Algebraic Homology, Part I
Lecture 31: Intro to Algebraic Homology, Part II

Overall I have found that all of the lectures are useful and I recommend you spend the time to watch them all. Do not get hung up on all the math and proofs. What you need is to understand the basic concepts and terms. Furthermore, do not get hung up on group theory. It is sufficient that you get the general gist of the ideas he is describing. As your knowledge in this area grows, you can backfill with other videos to complete your understanding of these additional topics. Finally, the 5-part series in the next subsection is the key study material that, for me, turned the flurry of concepts in this background material into reality.

Should you prefer a print medium for your studies, I read good things about these texts:

R. Ghrist, Elementary Applied Topology, ed. 1.0, Createspace, 2014.
J. P. May, A Concise Course in Algebraic Topology, University of Chicago Press, Sept 1999.

Introductory Tutorials on Topological Data Analysis

Here is the best gentle, non-mathematical tutorial on Topological Data Analysis that I have found to date. This is a 5-part series, with a link to the next page at the end of each part. There are numerous formatting errors and a few actual errors in the materials; most of them are obvious. While the discussion introduces the two main analysis methods of TDA (namely, persistent homology and mapper), it appears that these 5 parts address only persistent homology. A nice feature of these pages is that they also discuss the construction of Vietoris-Rips (VR) complexes, with lots of examples and code. In Part 5, the prose states that there is a separate set of pages for mapper, but I have not been able to locate them (if you do, please send me the link).
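Since the tutorial spends time on VR complex construction, a small brute-force sketch may help fix the definition before you read it. This is not the tutorial's code; the function name `vietoris_rips` and its parameters are hypothetical, and it assumes the convention that a simplex is included when all of its vertices are pairwise within distance `epsilon` (other sources use `2 * epsilon`).

```python
import math
from itertools import combinations


def vietoris_rips(points, epsilon, max_dim=2):
    """Brute-force VR complex: every vertex subset with pairwise distance <= epsilon.

    Returns simplices as tuples of point indices, vertices first.
    Exponential in max_dim, so only suitable for tiny examples.
    """
    n = len(points)
    # Pairs of points that are close enough to be joined by an edge.
    close = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if math.dist(points[i], points[j]) <= epsilon
    }
    simplices = [(i,) for i in range(n)]
    for k in range(2, max_dim + 2):  # k vertices span a (k-1)-simplex
        for verts in combinations(range(n), k):
            if all(pair in close for pair in combinations(verts, 2)):
                simplices.append(verts)
    return simplices


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
hollow = vietoris_rips(square, epsilon=1.1)  # 4 vertices + 4 sides, a loop
filled = vietoris_rips(square, epsilon=1.5)  # diagonals and triangles appear
print(len(hollow), len(filled))
```

At the smaller scale the complex is a hollow square (one loop); once `epsilon` exceeds the diagonal length, the diagonals and all four triangles enter and the loop is filled in. Watching features appear and disappear as the scale grows is exactly what persistent homology tracks.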

Here is another tutorial, but presented as a paper and not as a collection of web pages.

Frédéric Chazal and Bertrand Michel, An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists, ArXiv e-prints arXiv:1710.04019, Oct 2017.

Tutorials on Various Topics in Mathematics

The Napkin Project is a collection of training materials on mathematics that is geared to the non-mathematician. If you want some background in some math concept, this is probably where you should begin your studies.

Current State-of-the-art Tools for Computing Persistent Homology

Read everything in this section!!

The following documents are nice overviews of the state of the art in the computation of persistent homology. They contain pointers to papers that demonstrate the utility of persistent homology in various fields, and they even describe some works that rely on accurate representation of minor topological features. They also provide a good overview of the steps generally followed in computing persistent homology.

Nia Otter, Mason A. Porter, Ulrike Tillmann, Peter Grindrod and Heather A. Harrington, A Roadmap for the Computation of Persistent Homology, ArXiv e-prints arXiv:1506.08903, June 2017.
Chi Seng Pun, Kelin Xia, and Si Xian Lee, Persistent-Homology-based Machine Learning and its Applications — A Survey, ArXiv e-prints arXiv:1811.00252v1, Nov 2018.

Simplicial Complex Construction

The following papers discuss the construction and optimization of VR complex representations.

Afra Zomorodian, Fast Construction of the Vietoris-Rips Complex, Computers & Graphics, 34(3), 263-271, June 2010.
Jean-Daniel Boissonnat, Karthik C. S., and Sébastien Tavenas, Building Efficient and Compact Data Structures for Simplicial Complexes, Algorithmica, 79(2), 530--567, Oct 2017. Also available as: Jean-Daniel Boissonnat, Karthik C. S., and Sébastien Tavenas, Building Efficient and Compact Data Structures for Simplicial Complexes, ArXiv e-prints arXiv:1503.07444v4, Nov 2016. Building/optimizing the original: Jean-Daniel Boissonnat and Clément Maria, The Simplex Tree: An Efficient Data Structure for General Simplicial Complexes, Algorithmica 70(3), 406-427, Nov 2014.
Dominique Attali, André Lieutier, and David Salinas, Efficient Data Structure for Representing and Simplifying Simplicial Complexes in High Dimensions, Proceedings of the twenty-seventh annual symposium on Computational geometry (SoCG '11), 501-509, ACM, New York, NY, USA, 2011.

Computing Persistent Homology

Afra Zomorodian and Gunnar Carlsson, Computing Persistent Homology, Discrete & Computational Geometry, 33(2), 249-274, February 2005.
Chao Chen and Michael Kerber, Persistent homology computation with a twist, Proceedings 27th European Workshop on Computational Geometry, 2011.
Vin de Silva, Dmitriy Morozov, and Mikael Vejdemo-Johansson, Persistent Cohomology and Circular Coordinates, Discrete & Computational Geometry, 45(4), 737-759, June 2011.
Vin de Silva, Dmitriy Morozov, and Mikael Vejdemo-Johansson, Dualities in Persistent (Co)Homology, Inverse Problems, 27(12), Nov 2011. Also available at ArXiv e-prints arXiv:1107.5665v1.
Ripser is a high performance engine to compute VR persistence barcodes.
Ulrich Bauer, Ripser: Efficient Computation of Vietoris-Rips Persistence Barcodes (video presentation), Special Hausdorff Program Applied and Computational Algebraic Topology, Hausdorff Research Institute for Mathematics, Bonn, May 2, 2017.
Ulrich Bauer, Ripser: Efficient Computation of Vietoris-Rips Persistence Barcodes (presentation slides), Technical Institute of Munich, Computational and Statistical Aspects of Topological Data Analysis, Alan Turing Institute, March 23, 2017.
Ulrich Bauer, Computation of Persistent Homology, Part 2: Efficient Computation of Vietoris-Rips Persistence (presentation slides), Technical Institute of Munich, Tutorial on Multiparameter Persistence, Computation, and Applications, Institute for Mathematics and Its Applications, August 14, 2018.
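Before reading the papers above, it may help to see the textbook column reduction of the boundary matrix over Z/2 that they all build on. The sketch below is a plain illustration of that standard reduction only; it includes none of the optimizations (twist, clearing, cohomology, implicit matrices) that the papers and Ripser describe. The function name `reduce_boundary_matrix` and the input format are my own assumptions.

```python
def reduce_boundary_matrix(filtration):
    """Standard persistence reduction over Z/2.

    filtration: list of simplices (tuples of sorted vertex ids) in filtration
    order, with every face listed before its cofaces (an assumption of this
    sketch). Returns (pairs, unpaired) as indices into the filtration:
    pairs holds (birth_index, death_index); unpaired holds essential classes.
    """
    index = {s: i for i, s in enumerate(filtration)}
    # Column j holds the indices of the codimension-1 faces of simplex j.
    columns = []
    for s in filtration:
        if len(s) == 1:
            columns.append(set())  # a vertex has empty boundary
        else:
            columns.append({index[s[:k] + s[k + 1:]] for k in range(len(s))})

    low_to_col = {}  # lowest nonzero row -> column that owns it
    pairs = []
    for j, col in enumerate(columns):
        # Add earlier columns (mod 2) until the lowest entry is unique.
        while col and max(col) in low_to_col:
            col ^= columns[low_to_col[max(col)]]
        if col:
            low_to_col[max(col)] = j
            pairs.append((max(col), j))

    paired = {i for pair in pairs for i in pair}
    unpaired = [i for i in range(len(filtration)) if i not in paired]
    return pairs, unpaired


# A filled triangle: three vertices, three edges, then the 2-simplex.
triangle = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 2), (0, 1, 2)]
pairs, unpaired = reduce_boundary_matrix(triangle)
print(pairs)     # [(1, 3), (2, 4), (5, 6)]
print(unpaired)  # [0]
```

Here vertices 1 and 2 are killed by the edges that connect them, the loop created by edge index 5 is killed by the triangle at index 6, and vertex 0 is the one connected component that persists forever. Mapping indices back to filtration values gives the barcode.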

Spring 2019 Breakout

More papers that have to be examined and classified.

Dominique Attali, André Lieutier, and David Salinas, Vietoris-Rips Complexes also Provide Topologically Correct Reconstructions of Sampled Shapes, Proceedings of the Twenty-Seventh Annual Symposium on Computational Geometry (SoCG '11), 491-500, ACM, New York, NY, USA, 2011.
Frédéric Chazal, David Cohen-Steiner, and Quentin Mérigot, Geometric Inference for Probability Measures, Foundations of Computational Mathematics 733-751, 11(6), December 2011.

1. Complexes

There are several types of complexes that are constructed to represent/capture the shape of a topological space. Some of the more common types are: cubical complexes, simplicial complexes, Δ complexes, Delaunay triangulations, and CW complexes. While both cubical complexes and simplicial complexes are frequently used for computing persistent homology, simplicial complexes are far more commonly used and they will be the focus of our studies. Cubical complexes are generally used only with data sets that are well represented by n-cube decompositions (they are sometimes used for image analysis). If you want to study cubical complexes, I suggest you begin with this paper:

Hubert Wagner, Chao Chen, Erald Vuçini, Efficient Computation of Persistent Homology for Cubical Data, Mathematics and Visualization, 91-106, Springer Berlin Heidelberg, 2012.

2. Simplicial Complexes

Nice overview of complexes supported in the GUDHI C++ library
Notes on Complexes from Dey's OSU class
Herbert Edelsbrunner, The Union of Balls and its Dual Shape, Discrete & Computational Geometry, 415-440, 13(3), June 1995.
An earlier, abbreviated version also appears here: Herbert Edelsbrunner, The Union of Balls and its Dual Shape, Proceedings of the Ninth Annual Symposium on Computational Geometry (SCG '93), 218-231, ACM, New York, NY, USA, 1993.

3. Approximating/Sampling Simplicial Complexes

The following papers are closely related to our work to map a large point cloud to a smaller point cloud for computing Persistent Homology.

Donald R. Sheehy, Linear-Size Approximations to the Vietoris-Rips Filtration, Proceedings of the Twenty-Eighth Annual Symposium on Computational Geometry (SoCG '12), 239-248, ACM, New York, NY, USA, 2012.
Andrew J. Blumberg, Itamar Gal, Michael A. Mandell, and Matthew Pancia, Robust Statistics, Hypothesis Testing, and Confidence Intervals for Persistent Homology on Metric Measure Spaces, Foundations of Computational Mathematics, 14(4), 745--789, May 2014.
Frédéric Chazal, Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, and Larry Wasserman, Stochastic Convergence of Persistence Landscapes and Silhouettes, Proceedings of the Thirtieth Annual Symposium on Computational Geometry (SoCG '14), 474-483, 2014.
presentation slides
Frédéric Chazal, Brittany Terese Fasy, Fabrizio Lecci, Bertrand Michel, Alessandro Rinaldo, and Larry Wasserman, Subsampling Methods for Persistent Homology, ArXiv e-prints arXiv:1406.1901, June 2014.
Anindya Moitra, Nickolas O. Malott, and Philip A. Wilsey, Cluster-based Data Reduction for Persistent Homology, 2018 IEEE International Conference on Big Data (Big Data), pp. 327-334, Seattle, WA, USA, 2018.

Witness complexes and graph induced complexes are interesting techniques to obtain smaller simplicial complexes to use for computing homology.

Vin De Silva and Gunnar Carlsson, Topological estimation using witness complexes, Proceedings of the First Eurographics conference on Point-Based Graphics (SPBG'04), Marc Alexa, Markus Gross, Hanspeter Pfister, and Szymon Rusinkiewicz (Eds.). Eurographics Association, Aire-la-Ville, Switzerland, 157-166, 2004. (alternate link).
Jean-Daniel Boissonnat, Leonidas J. Guibas, and Steve Y. Oudot, Manifold reconstruction in arbitrary dimensions using witness complexes, Proceedings of the Twenty-Third Annual Symposium on Computational Geometry (SCG '07), 194-203, ACM, New York, NY, USA, 2007.
Expanded version also available:
Jean-Daniel Boissonnat, Leonidas J. Guibas, and Steve Y. Oudot, Manifold Reconstruction in Arbitrary Dimensions Using Witness Complexes, Discrete & Computational Geometry, 42(1) 37-70, July 2009.
Tamal Krishna Dey, Fengtao Fan, and Yusu Wang, Graph Induced Complex on Point Data, Proceedings of the Twenty-Ninth Annual Symposium on Computational Geometry (SoCG '13), 107-116, ACM, New York, NY, USA, 2013.
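Witness complexes are built on a small set of landmark points chosen from the full point cloud. One common way to choose them is greedy maxmin selection: repeatedly pick the point farthest from the landmarks chosen so far. The sketch below illustrates that selection step only (not the witness complex itself); the function name `maxmin_landmarks` and the choice of starting point are assumptions made for illustration.

```python
import math


def maxmin_landmarks(points, k, first=0):
    """Greedy maxmin (farthest point) selection of k landmark indices.

    Assumption: `first` is an arbitrary seed index; a random seed is
    also common. Each step adds the point whose distance to the current
    landmark set is largest.
    """
    k = min(k, len(points))
    landmarks = [first]
    # Distance from every point to its nearest landmark so far.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(landmarks) < k:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        landmarks.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < nearest[i]:
                nearest[i] = d
    return landmarks


cloud = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.0, 0.1), (5.0, 5.0)]
landmarks = maxmin_landmarks(cloud, k=3)
print(landmarks)  # [0, 3, 4]
```

The selected landmarks spread out over the cloud (one per cluster here), which is why maxmin tends to cover the shape well with few points. It is also sensitive to outliers, since an outlier is by definition far from everything else.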

4. Optimizations to Simplicial Complexes

One of the basic challenges in applying persistent homology to large data sets is the size of the simplicial complexes. The sampling methods and the witness complex described above attack this directly by reducing the number of points used to construct the simplicial complex. In this section, we will examine techniques that reduce the constructed complex to a smaller, homotopy-equivalent form.
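As a concrete picture of what such a reduction looks like, here is a minimal sketch of repeated elementary collapses on a single (unfiltered) simplicial complex. A simplex that is a proper face of exactly one other simplex is a free face, and removing the pair does not change the homotopy type. The function names `closure` and `collapse` are hypothetical, and this sketch ignores filtration values entirely, which the papers below handle with much more care.

```python
from itertools import combinations


def closure(maximal_simplices):
    """All nonempty faces of the given simplices, as frozensets of vertices."""
    complex_ = set()
    for s in maximal_simplices:
        for k in range(1, len(s) + 1):
            complex_.update(frozenset(c) for c in combinations(s, k))
    return complex_


def collapse(complex_):
    """Apply elementary collapses until no free face remains."""
    complex_ = set(complex_)
    changed = True
    while changed:
        changed = False
        # Deterministic scan order: small simplices first.
        for sigma in sorted(complex_, key=lambda s: (len(s), sorted(s))):
            cofaces = [tau for tau in complex_ if sigma < tau]
            if len(cofaces) == 1:
                # sigma is a free face of cofaces[0]; remove the pair.
                complex_ -= {sigma, cofaces[0]}
                changed = True
                break
    return complex_


filled = collapse(closure([(0, 1, 2)]))
hollow = collapse(closure([(0, 1), (1, 2), (0, 2)]))
print(len(filled), len(hollow))  # 1 6
```

The filled triangle is contractible and collapses to a single vertex, while the hollow triangle has no free faces, so its loop survives untouched. That is the behavior we want from any reduction used ahead of a homology computation.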

Elementary collapse
N. J. Cavanna, M. Jahanseir, and D. R. Sheehy, A Geometric Perspective on Sparse Filtrations, ArXiv e-prints arXiv:1506.03797v1, June 2015.
Batch Collapse:
- Tamal K. Dey, Dayu Shi, and Yusu Wang, SimBa: An Efficient Tool for Approximating Rips-filtration Persistence via Simplicial Batch-collapse, 24th Annual European Symposium on Algorithms (ESA), Aug 2016.
  (presentation slides)
- Tamal K. Dey, Dayu Shi, and Yusu Wang, SimBa: An Efficient Tool for Approximating Rips-filtration Persistence via Simplicial Batch Collapse, Journal of Experimental Algorithmics, 24(1), Feb 2019.
Discrete Morse Theory:
- Robin Forman, A User's Guide to Discrete Morse Theory, Séminaire Lotharingien de Combinatoire, 48 Article B48c, 2002.
- Discrete Morse theory - VideoLectures.NET (requires flash).

5. Parallel and Distributed Computing of PH

Ulrich Bauer, Michael Kerber, and Jan Reininghaus, Clear and Compress: Computing Persistent Homology in Chunks, in Topological Methods in Data Analysis and Visualization III, Springer, 2014.
Ulrich Bauer, Michael Kerber, and Jan Reininghaus, Distributed Computation of Persistent Homology, Proceedings of the Meeting on Algorithm Engineering & Experiments, 31-38, Jan 2014.

6. Output Analysis

Barcodes:

Gunnar Carlsson, Afra Zomorodian, Anne Collins, and Leonidas Guibas, Persistence barcodes for shapes, Proceedings of the 2004 Eurographics/ACM SIGGRAPH Symposium on Geometry Processing (SGP '04), 124-135, ACM, New York, NY, USA 2004.

Persistence Diagrams:

David Cohen-Steiner, Herbert Edelsbrunner, and John Harer, Stability of Persistence Diagrams, Discrete & Computational Geometry, 37(1), 103-120, Jan 2007.
First introduced here: Herbert Edelsbrunner, David Letscher, and Afra Zomorodian, Topological Persistence and Simplification, Journal of Discrete & Computational Geometry, 511-533, 28(4), Nov 2002.

Persistence Landscapes:

Peter Bubenik, Statistical Topological Data Analysis using Persistence Landscapes, Journal of Machine Learning Research, 16(1), 77-102, Jan 2015. Also available at arXiv.org at: ArXiv e-prints arXiv:1207.6437v4, Jan 2015.
Peter Bubenik, The Persistence Landscape and Some of its Properties, ArXiv e-prints arXiv:1810.04963v2, Jan 2019.
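For orientation before reading these two papers: the k-th persistence landscape at a point t is the k-th largest value of max(0, min(t - b, d - t)) over all bars (b, d) in the diagram. The sketch below evaluates that definition directly; the function name `landscape` and the list-of-pairs input format are assumptions for illustration, and real implementations evaluate the function on a grid or store its breakpoints instead.

```python
def landscape(diagram, k, t):
    """Value of the k-th persistence landscape (k = 1, 2, ...) at t.

    diagram: list of finite (birth, death) pairs.
    Each bar contributes a "tent" of height min(t - birth, death - t),
    clipped at zero; the k-th largest tent height is returned.
    """
    tents = sorted(
        (max(0.0, min(t - b, d - t)) for b, d in diagram),
        reverse=True,
    )
    return tents[k - 1] if k <= len(tents) else 0.0


diagram = [(0.0, 4.0), (1.0, 3.0)]
print(landscape(diagram, 1, 2.0))  # 2.0
print(landscape(diagram, 2, 2.0))  # 1.0
print(landscape(diagram, 2, 0.5))  # 0.0
```

Because each landscape is an ordinary real-valued function, landscapes can be averaged and compared with standard statistical tools, which is the main attraction over working with barcodes directly.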

Persistence Images:

Henry Adams, Sofya Chepushtanova, Tegan Emerson, Eric Hanson, Michael Kirby, Francis Motta, Rachel Neville, Chris Peterson, Patrick Shipman, and Lori Ziegelmeier, Persistence Images: A Stable Vector Representation of Persistent Homology, Journal of Machine Learning Research, 18(1), 218-252, Jan 2017.

Applications of Output Types

Frédéric Chazal, David Cohen-Steiner, Marc Glisse, Leonidas J. Guibas, and Steve Y. Oudot, Proximity of Persistence Modules and Their Diagrams, Proceedings of the Twenty-Fifth Annual Symposium on Computational Geometry (SCG '09), 237-246, ACM, New York, NY, USA, 2009.
Violeta Kovacev-Nikolic, Peter Bubenik, Dragan Nikolić, and Giseon Heo, Using persistent homology and dynamical distances to analyze protein binding, ArXiv e-prints arXiv:1412.1394v2, July 2015.
Bernadette J. Stolz, Heather A. Harrington, and Mason A. Porter, Persistent homology of time-dependent functional networks constructed from coupled time series, Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(4), April 2017.

Justin M. Curry, Topological Data Analysis and Cosheaves, ArXiv e-prints arXiv:1411.0613v2, Mar 2015.

7. Mapper

A decent introduction to Mapper is contained in Chazal's tutorial manuscript introducing topological data analysis. I suggest you begin there.

Gurjeet Singh, Facundo Mémoli, and Gunnar E. Carlsson, Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition, Eurographics Symposium on Point-Based Graphics, M. Botsch, R. Pajarola (Editors), 2007.
Course webpage of U. of Iowa Math class on Data Analysis with TDA Mapper

8. Applications of TDA

Marco Piangerelli, Matteo Rucco, Luca Tesei, and Emanuela Merelli, Topological Classifier for Detecting the Emergence of Epileptic Seizures, BMC Research Notes, 11(1), Jun 2018
Paul Bendich, J. S. Marron, Ezra Miller, Alex Pieloch, and Sean Skwerer, Persistent Homology Analysis of Brain Artery Trees, The Annals of Applied Statistics, 198-218, 10(1) Mar 2016.
Jose A. Perea and John Harer, Sliding Windows and Persistence: An Application of Topological Methods to Signal Analysis, Foundations of Computational Mathematics, 799-838, 15(3), June 2015.
Li Li, Wei-Yi Cheng, Benjamin S. Glicksberg, Omri Gottesman, Ronald Tamler, Rong Chen, Erwin P. Bottinger, and Joel T. Dudley, Identification of Type 2 Diabetes Subgroups through Topological Analysis of Patient Similarity, Science Translational Medicine, 7(311), 2015.
P. Y. Lum, G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J. Carlsson, and G. Carlsson, Extracting Insights from the Shape of Complex Data using Topology, Scientific Reports, 3, Article number: 1236, Feb 2013.
Frédéric Chazal, Leonidas J. Guibas, Steve Y. Oudot, and Primoz Skraba, A Persistence-Based Clustering in Riemannian Manifolds, Journal of the ACM, 60(6), Jan 2013.
ToMATo: a Topological Mode Analysis Tool (video)
Source Code Another copy of the Source Code (including a link to the video)
Monica Nicolau, Arnold J. Levine, and Gunnar Carlsson, Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival, Proceedings of the National Academy Of Sciences (PNAS), 7265-7270, 108(17), Feb 2011.

Sheaves

I haven't really looked at these yet....

Algebraic geometry -- Sheaves
Tutorial on Sheaves in Data Analytics (series of lectures; I can't get sound to work with this)

Additional materials that might be helpful

Related Online Course Materials

OSU Course CSE 5559: Computational Topology and Data Analysis

Other Materials that I no longer recommend for background study

## Manifolds

Before delving too far into Morse theory, it is probably a good idea to first review material on manifolds and such. I found [this lecture to be pretty good](https://www.youtube.com/watch?v=WCwoCFdjUcE). This is the same dude that does the Algebraic Topology series above, and he recommends Lectures 17-20 to review manifolds. I have not yet watched those, but will probably backfill with them later. In the meantime, this lecture from his Differential Geometry series should be sufficient for our needs.

## Lectures on Topological Data Analysis (not really; actually a set of lectures on the Ayasdi System)

So far I am really not impressed with these lectures. *Hope to find something else we can actually learn something from;* these are not it. The only one of them that has any bit of merit is Lecture 6. You will not learn much, but it is somewhat interesting.

I am not sure exactly how these [7 lectures come together](https://www.youtube.com/playlist?list=PLGIi7XCwrwYUQ_8nkhsFDiUnIzkPk7PKT), but some of them are interesting. Unfortunately they do not have much detail, so you are really just watching these to gain an idea of what is possible in TDA more than how to do TDA. My summary of these:

1. Lecture 1: useful, but content well summarized at the start of Lecture 3; skip
2. Lecture 2: absolutely horrible and useless; skip
3. Lecture 3: useful and largely subsumes Lecture 1; overview, lacks detail
4. [Lecture 4:](https://www.youtube.com/watch?v=gtFVdGb9Y8w&index=4) yet another restatement of Lectures 1 and 3; slightly more detail (more examples in 3); scan both, but **focus on this one**
5. Lecture 5: interesting demonstration of the use of R for TDA (points us at the "Dream to Learn" webpage); looks to me like it is a lecture on using R learning tools and graphing. Skipping (watched only 5 minutes at 2x)
6. [Lecture 6:](https://www.youtube.com/watch?v=kctyag2Xi8o&index=6) finally a useful lecture. **Watch this one**.
7. Lecture 7: again another version of Lectures 1, 3, and 4. That said, some of the examples are more interesting and more fully described than in the previous lectures. Probably worth a high speed scan.

## Interesting Videos

I really liked this video titled "[How much structure do we see in noise (a topological perspective)?](http://videolectures.net/solomon_skraba_structure_in_noise/)". It reviews the limits of persistence in noise and shows some limits that we might begin to use to filter out noise in TDA-based data analysis. Related to this talk is the paper in this directory named *bobrowski-16.pdf.bz2*.

# Important Related Projects

* [ToMATo: a Topological Mode Analysis Tool](https://github.com/locklin/tomato). Evidently it was taken [from](http://geometrica.saclay.inria.fr/data/ToMATo/)

## Homology Preserving Reduction Techniques

### Tidy Set

* Afra Zomorodian, "The tidy set: A minimal simplicial set for computing homology of clique complexes," *Proc. ACM Symposium of Computational Geometry*, 2010.

### Morse Red

* Konstantin Mischaikow and Vidit Nanda, *Morse theory for filtrations and efficient computation of persistent homology*, Discrete & Computational Geometry, 50, pp. 330-353, 2013.

## A Quick Intro to Approximate Computations

As we get further into this, the idea of short circuiting the data and computing approximate results, averages, and so on is increasingly going to come into play. We are already somewhat familiar with this from the ideas to build structures to speed search (skip lists) and/or partitioning (count-min sketch). So far we've avoided some other, more aggressive approximations. Here is an interesting video titled ["Awesome Big Data Algorithms"](https://www.youtube.com/watch?v=jKBwGlYb13w) on the types of approximate methods that will hopefully broaden your thinking about how approximate computations can work.

## Morse Theory

Still looking for good lectures on this. Tentatively, here are [three lecture hours on Discrete Morse Theory](http://videolectures.net/computationaltopology2013_benedetti_theory/?q=morse%20theory).