Preparation for Research in Topological Data Analysis (TDA)
The primary objective of this project is to advance the application of methods of Topological Data
Analysis (TDA) for machine learning and data mining on higher-dimensional, big data applications.
While TDA shows great promise to discover knowledge beyond current data mining techniques, its
computational and memory requirements grow exponentially, limiting its general use to point clouds
containing fewer than 25K points in ℝ³. My objective is to expand this limit by 3-5
orders of magnitude. This project will use data partitioning and parallelization techniques to attack
the run time and memory requirements of locating the topological features in Big Data point clouds.
The technical details of the partitioning and parallelization we plan to use are described elsewhere.
The purpose of this document is to provide direction and pointers into materials to assist the
interested student to develop a background understanding of the mathematical and computational
underpinnings of TDA.
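To get a feel for the growth mentioned above, here is a minimal sketch that counts the largest number of simplices a complex could contain on a given number of points. The function name `max_simplices` and the chosen point counts are my own illustrative assumptions; real complexes are usually much smaller than this worst case, but the trend is what matters.

```python
from math import comb

def max_simplices(n_points, max_dim):
    """Worst-case number of simplices of dimension <= max_dim on n_points vertices.

    A k-dimensional simplex is a choice of k + 1 vertices, so the bound is
    the sum of binomial coefficients C(n, k + 1) for k = 0 .. max_dim.
    """
    return sum(comb(n_points, k + 1) for k in range(max_dim + 1))

# Worst-case counts up to dimension 2 (vertices, edges, triangles).
for n in (250, 2_500, 25_000):
    print(n, max_simplices(n, 2))
```

With no cap on the dimension the bound is 2^n - 1, which is where the exponential behavior comes from.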
Organizing Your Studies
Topology is an interesting and unusual field of mathematics in which many of us do not have much background.
Hopefully the materials I have highlighted in the following sections will help you get up to speed with
this stuff. Most students will find this material quite new and confusing. It is easy to begin looking
up many details that end up being insignificant to our final end game. That is, for one reason or
another, quite a bit of the detailed formal mathematics becomes unimportant to our purposes. For example,
the algebraic topology materials will worry about the direction of the links between points; in reality this
will become a non-issue. Thus, my recommendation is that you run through the preliminary materials
fairly quickly to get a "big picture" perspective of the topic. This is especially true of the early
videos: watch them quickly and work only to capture the principal ideas, concepts, and vocabulary of these different
topics. Most of the theorems, lemmas, and proofs are interesting but not really all that critical. I
recommend that you watch them with the goal of appreciating simply that the theories hold and the general
approach that is used to prove them (some are pretty interesting arguments).
Finally, there is quite a bit of material and time involved in watching these videos. I strongly
recommend that you play them at higher than normal speed. You might initially play them only at 1.25
speed until you get into the basic vocabulary and then move to higher speeds for subsequent videos. I
will almost always play videos at 1.5 or 2.0 speed without difficulty (of course I am a native English
speaker). In the end, do what works best for you.
Persistent Homology
The materials in this section simply set the context for TDA and persistent homology. They are not strictly
necessary, but nevertheless interesting.
Quick Overview of Persistent Homology
This short video gives a very easy-to-digest
introduction to the key ideas of Persistent Homology, which is the key method for computing
topological features of a space. It also provides a quick, clean demonstration of barcodes and what they
are.
Discussions of TDA on point clouds and Robustness to Noise
Here are two more video lectures that are interesting. The first mostly restates the ideas described in the
above video; the second actually discusses how to use statistical methods with TDA (persistent homology) to
deal with noise.
- Introduction to TDA
- Statistical Techniques in TDA
Basic Background
Here are 18 YouTube video lessons
titled "What is a Manifold". While I haven't yet finished all of them, the presentation is
made from the perspective of point set topology. I strongly recommend them; so far I've thoroughly
enjoyed watching them.
These next two sections are probably the most critical background materials that provide the general
foundations from which you can begin to understand what we're talking about on this topic. I strongly
recommend that you plan to study these materials in multiple iterations. Specifically, I recommend that
you watch Lectures 30 through 35, the "Introduction to Algebraic Homology" videos, at high speed to get a
general sense of Homology. Then I recommend a close study of the webpages of the second subsection. This
should give you a good foundation from which to branch out your studies. In the third subsection, I have
included links to the Napkin project, which includes introductory materials on a wide variety of
mathematical topics.
Introductions to Topology and Homology
A great starting point for our work is captured in the 40-lecture video series on Algebraic Topology. If you want to dispense with
the preliminaries and jump right in, then I strongly recommend you watch at least Lecture
30 and everything after it:
- Lecture 30: Intro to
Algebraic Homology, Part I
- Lecture 31: Intro to
Algebraic Homology, Part II
Overall I have found that all of the lectures are useful and I recommend you spend the time to watch
them all. Do not get hung up on all the math and proofs. What you need is to understand the basic
concepts and terms. Furthermore, do not get hung up on group theory. It is sufficient that
you get the general gist of the ideas he is describing. As your knowledge in this area grows, you can
backfill with other videos to complete your understanding of these additional topics. Furthermore, the
5-part series in the next subsection is the key study material that turned the flurry of concepts in this
background material into reality for me.
Should you prefer a print medium for your studies, I have read good things about these texts:
- R. Ghrist, Elementary
Applied Topology, ed. 1.0, Createspace, 2014.
- J. P. May, A Concise Course in Algebraic
Topology, University of Chicago Press, Sept 1999.
Introductory Tutorials on Topological Data Analysis
Here is the best gentle, non-mathematical
tutorial on Topological Data Analysis that I have found to date. This is a 5-part series with links
to the next page at the end of each. There are numerous formatting errors and a few actual errors in
the materials; most of them are obvious. While the discussion introduces the two main analysis methods
of TDA (namely, persistent homology and mapper), it appears that these 5 parts address
only persistent homology. A nice feature of these pages is that they also discuss the construction of
Vietoris-Rips (VR) complexes, with lots of examples and code. In Part 5, the prose states that there is a separate
set of pages for mapper, but I have not been able to locate them (if you do, please send me the link).
Here is another tutorial, but presented as a paper and not as a collection of web pages.
- Frédéric Chazal and Bertrand Michel, An introduction to Topological Data Analysis: fundamental and practical aspects for
data scientists, ArXiv e-prints arXiv:1710.04019, Oct 2017.
Tutorials on Various Topics in Mathematics
The Napkin Project is a collection of
training materials on mathematics that is geared to the non-mathematician. If you want some background in
some math concept, this is probably where you should begin your studies.
Current State-of-the-art Tools for Computing Persistent Homology
Read everything in this section!!
The following documents are nice overviews of the state of the art in the computation of persistent
homology. They contain pointers to papers that demonstrate the utility of persistent homology in various
fields and even describe some works that rely on accurate representation of minor topological features.
A good overview of the steps that are generally followed in computing persistent
homology is also provided.
- Nia Otter, Mason A. Porter, Ulrike Tillmann, Peter Grindrod and Heather A. Harrington,
A Roadmap for the Computation of Persistent
Homology, ArXiv e-prints arXiv:1506.08903, June 2017.
- Chi Seng Pun, Kelin Xia, and Si Xian Lee, Persistent-Homology-based Machine Learning and its Applications — A
Survey, ArXiv e-prints arXiv:1811.00252v1, Nov 2018.
Simplicial Complex Construction
The following papers discuss the construction and optimization of VR complex representations.
- Afra Zomorodian, Fast
Construction of the Vietoris-Rips Complex, Computers & Graphics, 34(3), 263-271,
June 2010.
- Jean-Daniel Boissonnat, Karthik C. S., and Sébastien Tavenas,
Building Efficient and Compact Data
Structures for Simplicial Complexes, Algorithmica, 79(2), 530--567, Oct 2017.
Also available as: Jean-Daniel Boissonnat, Karthik C. S., and Sébastien Tavenas,
Building Efficient and Compact Data
Structures for Simplicial Complexes, ArXiv e-prints arXiv:1503.07444v4, Nov 2016.
Building/optimizing the original: Jean-Daniel Boissonnat and Clément
Maria, The Simplex Tree: An
Efficient Data Structure for General Simplicial Complexes, Algorithmica 70(3),
406-427, Nov 2014.
- Dominique Attali, André Lieutier, and David Salinas,
Efficient Data Structure for
Representing and Simplifying Simplicial Complexes in High Dimensions, Proceedings of the
twenty-seventh annual symposium on Computational geometry (SoCG '11), 501-509, ACM, New
York, NY, USA, 2011.
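For orientation before reading these papers, here is a minimal brute-force sketch of what a VR complex is: a simplex is included whenever all of its vertices are pairwise within a distance threshold. The function name `vietoris_rips`, the parameter names, and the convention "distance at most epsilon" (some texts use 2·epsilon) are my own assumptions for illustration; this is not the method of any of the papers listed.

```python
from itertools import combinations
from math import dist

def vietoris_rips(points, epsilon, max_dim=2):
    """Return the VR complex as a list of sorted vertex-index tuples."""
    n = len(points)
    # Two vertices are "close" when their Euclidean distance is at most epsilon.
    close = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if dist(points[i], points[j]) <= epsilon
    }
    simplices = [(i,) for i in range(n)]
    for k in range(2, max_dim + 2):  # k = number of vertices in the simplex
        for candidate in combinations(range(n), k):
            # Include the simplex when every pair of its vertices is close.
            if all(pair in close for pair in combinations(candidate, 2)):
                simplices.append(candidate)
    return simplices

# Corners of the unit square: sides have length 1, diagonals about 1.414.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
complex_small = vietoris_rips(square, epsilon=1.0)  # sides only: a hollow loop
complex_large = vietoris_rips(square, epsilon=1.5)  # diagonals join: loop fills in
```

This version tests every subset of vertices, so it is only usable on tiny inputs; the papers above are about doing this construction and storing its result efficiently.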
Computing Persistent Homology
- Afra Zomorodian and Gunnar Carlsson,
Computing Persistent
Homology, Discrete & Computational Geometry, 33(2), 249-274, February 2005.
- Chao Chen and Michael Kerber,
Persistent
homology computation with a twist, Proceedings 27th European Workshop on Computational
Geometry, 2011.
- Vin de Silva, Dmitriy Morozov, and Mikael Vejdemo-Johansson,
Persistent Cohomology and Circular
Coordinates, Discrete & Computational Geometry, 45(4), 737-759, June
2011.
- Vin de Silva, Dmitriy Morozov, and Mikael Vejdemo-Johansson,
Dualities in Persistent
(Co)Homology, Inverse Problems, 27(12), Nov 2011. Also available at
ArXiv e-prints
arXiv:1107.5665v1.
- Ripser is a high performance engine to
compute VR persistence barcodes.
- Ulrich Bauer, Ripser: Efficient
Computation of Vietoris-Rips Persistence Barcodes (video presentation), Special Hausdorff
Program Applied and Computational Algebraic Topology, Hausdorff Research Institute
for Mathematics, Bonn, May 2, 2017.
- Ulrich Bauer, Ripser: Efficient
Computation of Vietoris-Rips Persistence Barcodes (presentation slides), Technical University of
Munich, Computational and Statistical Aspects of Topological Data Analysis, Alan Turing Institute, March
23, 2017.
- Ulrich Bauer, Computation of
Persistent Homology, Part 2: Efficient Computation of Vietoris-Rips Persistence (presentation
slides), Technical University of Munich, Tutorial on Multiparameter Persistence, Computation, and
Applications, Institute for Mathematics and Its Applications, August 14, 2018.
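As a reading aid for the papers above, here is a minimal sketch of the textbook column reduction of a boundary matrix over Z/2, which is the baseline that optimizations such as the "twist" build on. The function name `persistence_pairs` and the set-based column representation are my own assumptions for illustration; none of the tools listed here is implemented this way.

```python
def persistence_pairs(columns):
    """Reduce a Z/2 boundary matrix and return (birth, death) index pairs.

    columns[j] is the set of indices of the codimension-1 faces of simplex j,
    with simplices indexed in filtration order (faces before cofaces).
    """
    reduced = [set(c) for c in columns]
    pivot_owner = {}  # lowest nonzero row index -> column that owns it
    pairs = []
    for j, col in enumerate(reduced):
        # Add earlier columns (mod 2) until the lowest entry is unique or gone.
        while col and max(col) in pivot_owner:
            col ^= reduced[pivot_owner[max(col)]]
        if col:
            pivot_owner[max(col)] = j
            pairs.append((max(col), j))  # simplex max(col) is killed by simplex j
    return pairs

# Filled triangle: vertices 0, 1, 2; edges 3 = {0,1}, 4 = {1,2}, 5 = {0,2};
# triangle 6 with boundary {3, 4, 5}.
triangle = [set(), set(), set(), {0, 1}, {1, 2}, {0, 2}, {3, 4, 5}]
pairs = persistence_pairs(triangle)
```

In this example edges 3 and 4 each merge two components, edge 5 closes a loop, and the triangle fills that loop; vertex 0 stays unpaired and represents the one component that never dies.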
Spring 2019 Breakout
More papers that have to be examined and classified.
- Dominique Attali, André Lieutier, and David Salinas,
Vietoris-Rips Complexes also
Provide Topologically Correct Reconstructions of Sampled Shapes, Proceedings of the
Twenty-Seventh Annual Symposium on Computational Geometry (SoCG '11), 491-500, ACM, New York,
NY, USA, 2011.
- Frédéric Chazal, David Cohen-Steiner, and Quentin
Mérigot, Geometric
Inference for Probability Measures, Foundations of Computational Mathematics, 11(6),
733-751, December 2011.
1. Complexes
There are several types of complexes that are constructed to represent/capture the shape of a topological
space. Some of the more common types are:
cubical complexes,
simplicial complexes,
Δ complexes, Delaunay
Triangulation, and CW
complexes. While both cubical complexes and simplicial complexes are frequently used for computing
persistent homology, simplicial complexes are far more commonly used and they will be the focus of our studies.
Cubical complexes are generally used only with data sets that are well represented by n-cube decompositions
(sometimes used for image analysis). If you want to study cubical complexes, I suggest you begin with
this paper:
2. Simplicial Complexes
- Nice overview of complexes
supported in the GUDHI C++ library
- Notes on
Complexes from Dey's OSU class
- Herbert Edelsbrunner, The Union of Balls
and its Dual Shape, Discrete & Computational Geometry, 415-440, 13(3), June
1995.
An earlier, abbreviated version also appears here: Herbert
Edelsbrunner, The Union of Balls and its
Dual Shape, Proceedings of the Ninth Annual Symposium on Computational Geometry (SCG '93),
218-231, ACM, New York, NY, USA, 1993.
3. Approximating/Sampling Simplicial Complexes
The following papers are closely related to our work to map a large point cloud to a smaller point
cloud for computing Persistent Homology.
- Donald R. Sheehy, Linear-Size
Approximations to the Vietoris-Rips Filtration, Proceedings of the Twenty-Eighth Annual
Symposium on Computational Geometry (SoCG '12), 239-248, ACM, New York, NY, USA, 2012.
- Andrew J. Blumberg, Itamar Gal, Michael A. Mandell, and Matthew
Pancia, Robust
Statistics, Hypothesis Testing, and Confidence Intervals for Persistent Homology on Metric Measure
Spaces, Foundations of Computational Mathematics, 14(4), 745--789, May 2014.
- Frédéric Chazal, Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, and Larry
Wasserman, Stochastic Convergence of
Persistence Landscapes and Silhouettes, Proceedings of the Thirtieth Annual Symposium on
Computational Geometry (SoCG '14), 474-483, 2014.
(presentation slides)
- Frédéric Chazal, Brittany Terese Fasy, Fabrizio Lecci, Bertrand Michel, Alessandro
Rinaldo, and Larry Wasserman, Subsampling
Methods for Persistent Homology, ArXiv e-prints arXiv:1406.1901, June 2014.
- Anindya Moitra, Nickolas O. Malott, and Philip A. Wilsey,
Cluster-based Data Reduction for
Persistent Homology, 2018 IEEE International Conference on Big Data (Big Data), pp. 327-334,
Seattle, WA, USA, 2018.
Witness complexes and graph induced complexes are interesting techniques to obtain smaller simplicial
complexes to use for computing homology.
- Vin De Silva and Gunnar Carlsson, Topological estimation using witness complexes, Proceedings of the First
Eurographics conference on Point-Based Graphics (SPBG'04), Marc Alexa, Markus Gross, Hanspeter
Pfister, and Szymon Rusinkiewicz (Eds.), Eurographics Association, Aire-la-Ville,
Switzerland, 157-166,
2004 (alternate link).
- Jean-Daniel Boissonnat, Leonidas J. Guibas, and Steve
Y. Oudot, Manifold reconstruction in
arbitrary dimensions using witness complexes, Proceedings of the Twenty-Third Annual Symposium on
Computational Geometry (SCG '07), 194-203, ACM, New York, NY, USA, 2007.
Expanded version
also available:
Jean-Daniel Boissonnat, Leonidas J. Guibas, and Steve
Y. Oudot, Manifold Reconstruction in
Arbitrary Dimensions Using Witness Complexes, Discrete & Computational
Geometry, 42(1) 37-70, July 2009.
- Tamal Krishna Dey, Fengtao Fan, and Yusu Wang,
Graph Induced Complex on
Point Data, Proceedings of the Twenty-Ninth Annual Symposium on Computational
Geometry (SoCG '13), 107-116, ACM, New York, NY, USA, 2013.
4. Optimizations to Simplicial Complexes
One of the basic challenges for the application of persistent homology on large data sets is
the size of the simplicial complexes. The above sampling methods and the witness complex
attack this directly by removing some of the points used to construct the simplicial complex. In this
section, we will examine techniques to reduce the constructed complex to a smaller, homotopy-equivalent
form.
- Elementary
collapse
- N. J. Cavanna, M. Jahanseir, and D. R. Sheehy, A Geometric Perspective on Sparse Filtrations, ArXiv e-prints
arXiv:1506.03797v1, June 2015.
- Batch Collapse:
- Tamal K. Dey, Dayu Shi, and Yusu Wang, SimBa: An Efficient Tool for Approximating Rips-filtration Persistence via
Simplicial Batch-collapse, 24th Annual European Symposium on Algorithms (ESA), Aug
2016.
(presentation
slides)
- Tamal K. Dey, Dayu Shi, and Yusu Wang,
SimBa: An Efficient Tool for Approximating Rips-filtration Persistence via Simplicial Batch
Collapse, Journal of Experimental Algorithmics, 24(1), Feb 2019.
- Discrete Morse
Theory:
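To make the idea of shrinking a complex without changing its homotopy type concrete, here is a minimal sketch of repeated elementary collapses. A simplex is a free face when it is a proper face of exactly one other simplex; removing the pair keeps the homotopy type. The function names and the frozenset representation are my own assumptions, and the quadratic search is for illustration only.

```python
def find_free_pair(simplices):
    """Return (free_face, coface) if some simplex has exactly one proper coface."""
    for sigma in simplices:
        cofaces = [tau for tau in simplices if sigma < tau]  # proper supersets
        if len(cofaces) == 1 and len(cofaces[0]) == len(sigma) + 1:
            return sigma, cofaces[0]
    return None

def collapse(simplices):
    """Apply elementary collapses until no free face remains."""
    remaining = set(simplices)
    pair = find_free_pair(remaining)
    while pair is not None:
        remaining -= set(pair)  # remove the free face together with its coface
        pair = find_free_pair(remaining)
    return remaining

def closure(vertex_sets):
    """Turn a list of vertex tuples into a set of frozenset simplices."""
    return {frozenset(s) for s in vertex_sets}

filled = closure([(0,), (1,), (2,), (0, 1), (1, 2), (0, 2), (0, 1, 2)])
hollow = closure([(0,), (1,), (2,), (0, 1), (1, 2), (0, 2)])
filled_core = collapse(filled)  # a filled triangle shrinks to a single vertex
hollow_core = collapse(hollow)  # a hollow triangle has no free face at all
```

The hollow triangle is left untouched because its loop is a genuine topological feature, which is exactly the property that makes collapses safe to apply before computing homology.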
5. Parallel and Distributed Computing of PH
- Ulrich Bauer, Michael Kerber, and Jan Reininghaus,
Clear
and Compress: Computing Persistent Homology in Chunks, in Topological Methods in Data Analysis
and Visualization III, Springer, 2014.
- Ulrich Bauer, Michael Kerber, and Jan Reininghaus,
Distributed Computation of
Persistent Homology, Proceedings of the Meeting on Algorithm Engineering & Experiments,
31-38, Jan 2014.
6. Output Analysis
Barcodes:
- Gunnar Carlsson, Afra Zomorodian, Anne Collins, and Leonidas Guibas,
Persistence barcodes for
shapes, Proceedings of the 2004 Eurographics/ACM SIGGRAPH Symposium on Geometry
Processing (SGP '04), 124-135, ACM, New York, NY, USA 2004.
Persistence Diagrams:
- David Cohen-Steiner, Herbert Edelsbrunner, and John Harer,
Stability of Persistence
Diagrams, Discrete & Computational Geometry, 37(1), 103-120, Jan 2007.
- First introduced here: Herbert Edelsbrunner, David Letscher, and Afra Zomorodian,
Topological Persistence and
Simplification, Discrete & Computational Geometry, 28(4), 511-533,
Nov 2002.
Persistence Landscapes:
Persistence Images:
- Henry Adams, Sofya Chepushtanova, Tegan Emerson, Eric Hanson, Michael Kirby, Francis Motta,
Rachel Neville, Chris Peterson, Patrick Shipman, and Lori Ziegelmeier,
Persistence Images: A Stable
Vector Representation of Persistent Homology, Journal of Machine Learning
Research, 18(1), 218-252, Jan 2017.
Applications of Output Types
- Frédéric Chazal, David Cohen-Steiner, Marc Glisse, Leonidas J. Guibas, and Steve
Y. Oudot, Proximity of Persistence
Modules and Their Diagrams, Proceedings of the Twenty-Fifth Annual Symposium on Computational
Geometry (SCG '09), 237-246, ACM, New York, NY, USA, 2009.
- Violeta Kovacev-Nikolic, Peter Bubenik, Dragan Nikolić, and Giseon Heo,
Using persistent homology and dynamical
distances to analyze protein binding, ArXiv e-prints arXiv:1412.1394v2, July 2015.
- Bernadette J. Stolz, Heather A. Harrington, and Mason A. Porter,
Persistent homology of time-dependent
functional networks constructed from coupled time series, Chaos: An Interdisciplinary Journal
of Nonlinear Science, 27(4), April 2017.
7. Mapper
A decent introduction to Mapper is contained in Chazal's tutorial manuscript introducing topological data analysis. I
suggest you begin there.
8. Applications of TDA
- Marco Piangerelli, Matteo Rucco, Luca Tesei, and Emanuela Merelli,
Topological Classifier
for Detecting the Emergence of Epileptic Seizures, BMC Research Notes,
11(1), Jun 2018.
- Paul Bendich, J. S. Marron, Ezra Miller, Alex Pieloch, and Sean Skwerer,
Persistent Homology Analysis of Brain Artery
Trees, The Annals of Applied Statistics, 10(1), 198-218, Mar 2016.
- Jose A. Perea and John Harer, Sliding Windows and Persistence: An Application of Topological Methods to
Signal Analysis, Foundations of Computational Mathematics, 15(3), 799-838,
June 2015.
- Li Li, Wei-Yi Cheng, Benjamin S. Glicksberg, Omri Gottesman, Ronald Tamler, Rong Chen, Erwin
P. Bottinger, and Joel T. Dudley, Identification of Type 2 Diabetes Subgroups through Topological Analysis of Patient
Similarity, Science Translational Medicine, 7(311), 2015.
- P. Y. Lum, G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan,
J. Carlsson, and G. Carlsson, Extracting
Insights from the Shape of Complex Data using Topology, Scientific Reports, 3, Article
number: 1236, Feb 2013.
- Frédéric Chazal, Leonidas J. Guibas, Steve Y. Oudot, and Primoz
Skraba, A Persistence-Based Clustering in Riemannian Manifolds, Journal of the ACM,
60(6), Jan 2013.
ToMATo: a Topological Mode
Analysis Tool (video)
Source Code
Another copy of the Source Code (including a
link to the video)
- Monica Nicolau, Arnold J. Levine, and Gunnar Carlsson,
Topology based data analysis
identifies a subgroup of breast cancers with a unique mutational profile and excellent
survival, Proceedings of the National Academy of Sciences (PNAS), 108(17), 7265-7270, Feb
2011.
Sheaves
I haven't really looked at these yet....
- Algebraic geometry -- Sheaves
- Series of lectures; I can't get sound to work with
this. Tutorial on Sheaves in Data Analytics
Additional materials that might be helpful
Related Online Course Materials
- OSU Course CSE
5559: Computational Topology and Data Analysis
Other Materials that I no longer recommend for background study
## Manifolds
Before delving too far into Morse theory, it is probably a good idea to first review material on manifolds and
such. I found [this lecture to be pretty good](https://www.youtube.com/watch?v=WCwoCFdjUcE). This is the
same dude that does the Algebraic Topology series above and he recommends Lectures 17-20 to review manifolds.
I have not yet watched those, but probably will backfill with them later. In the meantime, this lecture
from his Differential Geometry series should be sufficient for our needs.
## Lectures on Topological Data Analysis (not really; actually a set of lectures on the Ayasdi System)
So far I am really not impressed with these lectures. *Hope to find something else we can actually
learn something from;* these are not it. The only one of them that has any bit of merit is
Lecture 6. You will not learn much, but it is somewhat interesting.
I am not sure exactly how these
[7 lectures come together](https://www.youtube.com/playlist?list=PLGIi7XCwrwYUQ_8nkhsFDiUnIzkPk7PKT),
but some of them are interesting. Unfortunately they do not have much detail, so you are really just
watching these to gain an idea of what is possible in TDA more than how to do TDA. My summary of
these:
1. Lecture 1: useful but content well summarized at the start of Lecture 3; skip
2. Lecture 2: Absolutely horrible and useless; skip
3. Lecture 3: useful and largely subsumes lecture 1; overview, lacks detail
4. [Lecture 4:](https://www.youtube.com/watch?v=gtFVdGb9Y8w&index=4) yet another restatement of
Lectures 1 and 3; slightly more detail (more examples in 3); scan both, but **focus on this one**
5. Lecture 5: interesting demonstration of the use of R for TDA (points us at the "Dream to Learn"
webpage); looks to me like it is a lecture on using R learning tools and graphing. Skipping
(watched only 5 minutes at 2x)
6. [Lecture 6:](https://www.youtube.com/watch?v=kctyag2Xi8o&index=6) finally a useful lecture.
**watch this one**.
7. Lecture 7: again another version of lectures 1, 3, and 4. that said, some of the examples are
more interesting and more fully described than in the previous lectures. probably worth a high
speed scan.
## Interesting Videos
I really liked this video titled "[How much structure do we see in noise (a topological
perspective)?](http://videolectures.net/solomon_skraba_structure_in_noise/)". It reviews the limits
of persistence in noise and shows some limits that we might begin to use to filter out noise in
TDA-based data analysis. Related to this talk is the paper in this directory named
*bobrowski-16.pdf.bz2*.
# Important Related Projects
* [ToMATo: a Topological Mode Analysis Tool](https://github.com/locklin/tomato). Evidently it was
taken [from](http://geometrica.saclay.inria.fr/data/ToMATo/)
## Homology Preserving Reduction Techniques
### Tidy Set
* Afra Zomorodian, "The tidy set: A minimal simplicial set for computing homology of clique
complexes," *Proc. ACM Symposium of Computational Geometry*, 2010.
### Morse Red
* Konstantin Mischaikow and Vidit Nanda, *Morse theory for filtrations and efficient computation of
persistent homology*, Discrete & Computational Geometry, 50, pp. 330-353, 2013.
## A Quick Intro to Approximate Computations
As we get further into this, the idea of short-circuiting the data and computing approximate
results, averages, and so on is increasingly going to come into play. We are already somewhat
familiar with this from the ideas to build structures to speed search (skip lists) and/or
partitioning (count-min sketch). So far we've avoided some other, more aggressive approximations.
Here is an interesting video titled
["Awesome Big Data Algorithms"](https://www.youtube.com/watch?v=jKBwGlYb13w) on the types of
approximate methods that will hopefully broaden your thinking about how approximate computations can
work.
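As a reminder of how an approximate structure trades accuracy for memory, here is a minimal sketch of the count-min sketch mentioned above. The class name, the default `width` and `depth`, and the use of SHA-256 to derive the row hashes are my own illustrative assumptions, not a description of the code used in this project.

```python
import hashlib

class CountMinSketch:
    """Fixed-size frequency table whose estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )

sketch = CountMinSketch()
for word in ["a", "b", "a", "a", "b", "a", "a"]:
    sketch.add(word)
```

Here `sketch.estimate("a")` is at least 5 and `sketch.estimate("b")` is at least 2; any error is always an overcount, and it shrinks as `width` grows.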
## Morse Theory
Still looking for good lectures on this.
Tentatively, here are
[three lecture hours on Discrete Morse Theory](http://videolectures.net/computationaltopology2013_benedetti_theory/?q=morse%20theory).