MassDAL

MassDAL Public Code Bank :

Sketches, Frequent Items, Changes  (Deltoids)

This page is a library of routines in C and Java for data streaming and other massive data set analysis.
Currently, these are derived from internal testing routines ("proof of concept" implementations), and
so are distributed with no guarantees. In particular, there is relatively little error checking
(parameters in range, memory allocation errors) in these implementations.

Description of C Code:  (Contact Graham Cormode at graham at dimacs.rutgers.edu)

Title
Code
Notes
Count-Min Sketch
countmin.h
countmin.c
Count-Min sketches (Cormode, Muthukrishnan '04). They support:
Finding frequent items, Returning point estimates, Approximating Inner-products
Finding quantiles
Group testing
cgt.h
cgt.c
Described in What's Hot and What's Not: Tracking Frequent Items Dynamically,
PODS 2003.  They support: finding frequent items.
CCFC Sketch
ccfc.h
ccfc.c
Count sketches from Charikar, Chen, Farach-Colton '02. They support:  finding
frequent items, returning point estimates
AMS Sketch
ams.h
ams.c
"Tug of War" sketches due to Alon, Matias and Szegedy 96, and Alon, Gibbons,
Matias and Szegedy, 99.  Some hashing tricks used for faster implementation.
They support: Returning point estimates, Approximating Inner-products, Estimating
the L2 norm of a vector
Stable distributions
and sketches
stable.h
stable.c
Sketches based on stable distributions. Based on Indyk'00, Cormode, Indyk,
Koudas, Muthukrishnan '02,  and Cormode, Data, Indyk, Muthukrishnan, '02.
They support  estimating the L2 norm of a vector,  estimating the L1 norm of a
vector, estimating the L0 norm of a vector, estimating arbitrary Lp norms for 0 < p < 2
Random number
generators
prng.h
prng.c

Common routines
massdal.h
massdal.c

Change (deltoid)
detection
change.h
change.c
Implements the change detection group testing for absolute, relative, and variational changes (Cormode-Muthukrishnan, Infocom 2004).
See also the sample file changewrapper.c for examples of usage.
Frequent items
frequent.h
frequent.c
Algorithm Frequent for insert only streams. They are derived from code used in the paper What's Hot and What's Not: Tracking Frequent Items Dynamically, PODS 2003.
Lossy Counting
lossycount.h
lossycount.c
Lossy Counting algorithm due to Manku and Motwani













hotitems.c

teststab.c

changewrapper.c
makefile
Each of the .h files exports various methods:
Init -- Initialize the data structure. Takes some parameters defining the accuracy and quality of the results. Returns a pointer to an object which must be supplied to the following routines.
Update -- Takes an item identifier. Some routines take an additional parameter of the count to add on for this arrival (positive or negative increment). Others interpret a positive count as an arrival, a negative count as a departure.
Output -- Takes an integer threshold, returns a list of items which should exceed this threshold.
Size -- Returns the space used by the data structure
Destroy -- Destroys the data structure and frees the associated memory.

Example usage of the above routines, finding frequent items on a stream drawn from zipf distributions.
Example usage of the above routines to estimate the L1, L2, and L0 norms of data streams.
Example usage of the change.c library to find items with large changes.
makefile A makefile for the above code
Code BUNDLE
massdalsketches.zip All the above .c and .h files together in one easy to download .zip file.

Description of Java Code: (contact Claudio Tancioni at c.tancioni at tiscali.it)

Title
Code
Notes
Lossy Counting
Lossy-Counting-Java.zip
This works for text streams. README.txt file explains usage.



The following licence applies to all code distributed here. Creative Commons
LicenseThis work is licensed under a Creative Commons License.