Large high-dimensional clustering

28.01.2015

Aims & objectives

Today’s scientific and industrial data-sets often cover all aspects of the well-known big data characteristics (value, velocity, variety, veracity, volume). Especially in astronomy, data analysis methods that scale to big data are key to solving the problems of the field. Features are extracted from pre-processed data, in most cases with a model-driven approach. Both the extracted features and the uncertainties of the model fit are stored in relational databases alongside the original data. Scientists therefore have to define selection criteria explicitly in order to retrieve the objects of interest, and the analysis is limited to the pre-extracted features instead of working on the original data. This requires that suitable features exist in the database and that the nature of the requested objects is known a priori. Rare and odd objects are hard to detect or to filter for follow-up analysis.

To allow for more explorative access to the scientific data, unsupervised methods such as clustering and outlier detection are helpful. Clustering in scientific environments is challenging because of the complexity and size of the data. Even current data-sets can no longer be analyzed efficiently with the current scheme. Data-sets from upcoming projects will increase exponentially in size and complexity; for example, the Square Kilometre Array (SKA) archive will be limited to 1 Exabyte because of the costs projected in 2011. The aim of this project is to provide a powerful method to analyze large data-sets based on similarities of items in high dimensions.

In this research project the science case is the analysis of unlabeled data-sets from the Sloan Digital Sky Survey 3, Data Release 10 (SDSS3 DR10), with a focus on the similarity/dissimilarity relationship between pairs of objects. Each object is represented by a feature vector of approx. 5000 dimensions. Each vector holds the numeric values of a captured spectrum, one per wavelength, with uncorrelated noise. The whole data-set consists of 3 million objects with 60 GBytes of raw data in total. As all objects have to be compared with each other, the resulting complexity is O(n^2) in computation and storage.
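To make the quadratic cost concrete, the following is a minimal sketch of a naive all-pairs comparison on toy data. The squared Euclidean distance, the function name pairwise_sq_distances and the toy array sizes are assumptions chosen for illustration only; they are not the similarity measure or data layout used in the project.

```python
import numpy as np


def pairwise_sq_distances(spectra: np.ndarray) -> np.ndarray:
    """Return the full n x n matrix of squared Euclidean distances.

    Both the work and the result grow with n^2, which is the scaling
    problem described above.
    """
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for all pairs at once
    sq_norms = np.einsum("ij,ij->i", spectra, spectra)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (spectra @ spectra.T)
    np.maximum(d2, 0.0, out=d2)  # clip tiny negative values caused by rounding
    return d2


# Toy stand-in for the spectra: 200 objects with 50 dimensions each (assumed sizes)
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 50))
dist = pairwise_sq_distances(toy)

# Number of matrix entries for the 3 million objects of the science case
n_objects = 3_000_000
full_matrix_entries = n_objects ** 2
```

On the toy data the result is a 200 x 200 matrix; for 3 million objects the same matrix would have 9 × 10^12 entries, so it can neither be computed nor stored naively.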

For the science case described above, a naive full analysis with distance- or density-based clustering algorithms would take 542 days of processing time using a single similarity measure on a 128-core computing cluster. This assumes a time of 2 ms per comparison, including loading and saving data. The 9 × 10^12 comparisons would produce between 24 TByte and 120 PByte of result data, depending on the level of detail. This project aims at developing a method that analyzes this data in an acceptable timeframe.

Research topics

This project develops a method for efficient clustering of large, high-dimensional data-sets in astronomy. Important aspects are:

Developing efficient algorithms that solve n^2 problems with lower computational and storage complexity
Designing similarity metrics for domain-specific research questions
Integrating similarity metrics within the algorithm
Extending similarity metrics with uncertainty quantification methods to improve the result quality and to create a controllable, noise-tolerant metric
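
As an illustration of the last aspect, one possible way to make a distance aware of measurement noise is to weight each wavelength by the combined variance of the two spectra. The sketch below is only an assumption of what such a metric could look like: the function noise_weighted_distance, its noise_scale parameter and the synthetic spectra are hypothetical and do not describe the metric developed in the project.

```python
import numpy as np


def noise_weighted_distance(flux_a, sigma_a, flux_b, sigma_b, noise_scale=1.0):
    """Mean squared flux difference in units of the combined noise variance.

    Assumes independent (uncorrelated) noise per wavelength with known
    standard deviations sigma_a and sigma_b. noise_scale is a hypothetical
    knob: values above 1 make the metric more tolerant of noise.
    """
    flux_a = np.asarray(flux_a, dtype=float)
    flux_b = np.asarray(flux_b, dtype=float)
    variance = noise_scale ** 2 * (
        np.asarray(sigma_a, dtype=float) ** 2 + np.asarray(sigma_b, dtype=float) ** 2
    )
    return float(np.mean((flux_a - flux_b) ** 2 / variance))


# Synthetic example: two noisy observations of the same underlying spectrum
rng = np.random.default_rng(1)
n_wavelengths = 5000
true_flux = np.sin(np.linspace(0.0, 20.0, n_wavelengths)) + 2.0
sigma = np.full(n_wavelengths, 0.1)
obs_1 = true_flux + rng.normal(scale=sigma)
obs_2 = true_flux + rng.normal(scale=sigma)
# A third observation of a different (shifted) spectrum
obs_3 = true_flux + 0.5 + rng.normal(scale=sigma)

d_same = noise_weighted_distance(obs_1, sigma, obs_2, sigma)
d_diff = noise_weighted_distance(obs_1, sigma, obs_3, sigma)
d_tolerant = noise_weighted_distance(obs_1, sigma, obs_3, sigma, noise_scale=2.0)
```

With this weighting, two observations that differ only by noise give a value close to 1, while genuinely different spectra give a clearly larger value. Doubling noise_scale divides the distance by four.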

Consortium

Astroinformatics Group (AIN)
Data Mining and Uncertainty Quantification Group (DMQ)

People

Maximilian Hoecker (DMQ)
Vincent Heuveline
Kai Lars Polsterer (AIN)
Dennis Kügler (AIN)

Contact

Maximilian Hoecker