RPHash is a streaming algorithm for data clustering based on frequent itemset counting of locality-sensitive hash (LSH) collisions generated by multi-probe, randomly projected data vectors. The proposed algorithm, called Random Projection Hashing or RPHash, accepts sub-optimal computational performance in exchange for improved memory efficiency over other clustering algorithms. Memory efficiency, which is at a premium in streaming and distributed algorithms, allows RPHash to run in a variety of streaming data environments. Two additional features, temporal weighting and privacy, are also provided intrinsically by RPHash.
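To make the idea concrete, here is a minimal Python sketch of the general pattern described above: project each vector with a random matrix, hash the projection to a bucket, count bucket collisions, and report the centroids of the most frequent buckets. It is an illustration under stated assumptions, not the project's implementation. The names `make_projection`, `lsh_key`, and `sketch_cluster` are hypothetical, sign-based hashing stands in for whatever LSH function RPHash uses, an exact `Counter` stands in for a memory-bounded frequency counter, and multi-probe, temporal weighting, and privacy are left out.

```python
import random
from collections import Counter, defaultdict


def make_projection(dim, bits, rng):
    # One Gaussian random direction per hash bit (assumed projection scheme).
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def lsh_key(vec, projection):
    # The sign of each projected coordinate contributes one bit of the bucket key.
    return tuple(
        1 if sum(p * x for p, x in zip(row, vec)) >= 0 else 0
        for row in projection
    )


def sketch_cluster(stream, dim, k, bits=4, seed=0):
    """Single pass over the stream; returns centroids of the k fullest buckets."""
    rng = random.Random(seed)
    projection = make_projection(dim, bits, rng)
    counts = Counter()                        # collisions per bucket (exact, not a sketch)
    sums = defaultdict(lambda: [0.0] * dim)   # running coordinate sums per bucket
    for vec in stream:
        key = lsh_key(vec, projection)
        counts[key] += 1
        acc = sums[key]
        for i, x in enumerate(vec):
            acc[i] += x
    # Buckets with the most collisions are treated as cluster candidates.
    return [[s / n for s in sums[key]] for key, n in counts.most_common(k)]
```

Each input vector is read once and only per-bucket counts and sums are kept, which is the property that makes this style of algorithm attractive for streams. With the exact `Counter` used here, memory grows with the number of distinct buckets (at most `2 ** bits`); a bounded frequency counter would be needed for a real streaming deployment.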
RPHash operates on static data as well as on streaming data. An embedding of the code for Spark (map-reduce) processing is available. The essential components of the RPHash approach are:
The main repositories for this project are maintained here: