Large high-dimensional clustering

28. January 2015

Aims & objectives

Today’s scientific and industrial data-sets often exhibit all of the well-known big data characteristics (value, velocity, variety, veracity, volume). In astronomy in particular, analysis methods that scale to big data are key to solving the field's problems. Dedicated features are extracted from pre-processed data, in most cases with a model-driven approach. Both the extracted features and the uncertainties of the model fitting are stored in relational databases alongside the original data. Scientists therefore have to define selection criteria explicitly in order to retrieve the objects of interest. Instead of working on the original data, the analysis is limited to the pre-extracted features. This requires suitable features to be present in the database, as well as a priori knowledge of the nature of the requested objects. Rare and odd objects are therefore hard to detect or to filter for follow-up analysis. To allow more explorative access to the scientific data, unsupervised methods such as clustering and outlier detection are helpful.

Clustering in scientific environments is challenging because of the complexity and size of the data. Even current data-sets can no longer be analyzed efficiently with the existing scheme. Upcoming projects will increase exponentially in size and complexity, e.g. the archive of the Square Kilometre Array (SKA) will be limited to 1 Exabyte because of the costs projected in 2011.

The aim of this project is to provide a powerful method for analyzing large data-sets based on the similarity of items in high dimensions. The science case in this research project is the analysis of unlabeled data-sets from the Sloan Digital Sky Survey 3, Data Release 10 (SDSS3 DR10), with a focus on the similarity/dissimilarity relationship between two objects. Each object is represented by a feature vector of approx. 5000 dimensions. Each vector holds the numeric values of a captured spectrum, with uncorrelated noise, at specific wavelengths. The whole data-set consists of 3 million objects with 60 GBytes of raw data in total. As all objects have to be compared with each other, the resulting complexity is O(n^2) in computation and storage.
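To make the quadratic growth concrete, the following minimal Python sketch compares every object with every other one. The Euclidean distance and the function names are illustrative assumptions, not the similarity measure or code used in the project.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors.

    Assumption: chosen only for illustration; the text does not name
    the similarity measure actually used.
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def all_pairs(vectors):
    """Compare every object with every other one.

    Returns a dict mapping the index pair (i, j), i < j, to the distance.
    Both the number of distance evaluations and the number of stored
    results grow as n * (n - 1) / 2, i.e. O(n^2).
    """
    return {
        (i, j): euclidean(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


if __name__ == "__main__":
    toy = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [9.0, 12.0]]
    result = all_pairs(toy)
    print(len(result), "distances for", len(toy), "objects")
```

Storing only the pairs with i < j halves the work for a symmetric measure, but the growth remains quadratic in the number of objects.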

For this science case, a naive full analysis with distance- or density-based clustering algorithms would take 542 days of processing time using a single similarity measure on a 128-core computing cluster. This assumes 2 ms per comparison, including loading and saving data. The 9 × 10^12 comparisons would produce between 24 TByte and 120 PByte of result data, depending on the level of detail. This project aims to develop a method that analyzes this data in an acceptable time frame.
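A back-of-the-envelope estimate of this kind can be sketched as follows. The function names are hypothetical, and the model assumes perfect parallel scaling with a fixed time per comparison on each core. The numbers it prints depend entirely on these assumptions and may differ from the 542 days quoted above, whose exact derivation is not given here.

```python
SECONDS_PER_DAY = 86_400


def comparison_count(n_objects, symmetric=False):
    """Number of pairwise comparisons.

    symmetric=False counts all n^2 ordered pairs (as in the 9 x 10^12 figure);
    symmetric=True counts each unordered pair once, n * (n - 1) / 2.
    """
    if symmetric:
        return n_objects * (n_objects - 1) // 2
    return n_objects * n_objects


def wall_clock_days(n_objects, seconds_per_comparison, cores, symmetric=False):
    """Idealized runtime in days, assuming perfect scaling over all cores."""
    core_seconds = comparison_count(n_objects, symmetric) * seconds_per_comparison
    return core_seconds / cores / SECONDS_PER_DAY


if __name__ == "__main__":
    # Parameters taken from the text: 3 million objects, 2 ms, 128 cores.
    for symmetric in (False, True):
        days = wall_clock_days(3_000_000, 0.002, 128, symmetric)
        print(f"symmetric={symmetric}: {days:.0f} days")
```

Whatever the exact constants, the runtime in this model scales with the square of the number of objects and only linearly with the number of cores, which is why adding hardware alone does not make the naive approach feasible.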

Research topics

This project develops a method for the efficient clustering of large, high-dimensional data-sets in astronomy. Important aspects are:

Developing efficient algorithms that solve n^2 problems with lower computational and storage complexity
Designing similarity metrics for domain-specific research questions
Integrating similarity metrics within the algorithm
Extending similarity metrics with uncertainty quantification methods to improve result quality and to create a controllable, noise-tolerant metric
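As one illustration of what a noise-tolerant metric could look like, the sketch below weights each wavelength bin by the combined noise variance of the two spectra, in the style of a reduced chi-square. This is an assumed textbook construction, not the metric developed in the project; the function name and the per-bin `sigma` inputs are hypothetical, and the weighting relies on the noise being uncorrelated between bins.

```python
def noise_weighted_distance(flux_a, sigma_a, flux_b, sigma_b):
    """Mean squared flux difference in units of the combined noise variance.

    flux_*  : flux values per wavelength bin
    sigma_* : 1-sigma noise estimates per bin (hypothetical inputs)

    Bins whose combined variance is not positive carry no usable noise
    estimate and are skipped.
    """
    if not (len(flux_a) == len(sigma_a) == len(flux_b) == len(sigma_b)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    used = 0
    for fa, sa, fb, sb in zip(flux_a, sigma_a, flux_b, sigma_b):
        variance = sa * sa + sb * sb  # independent noise adds in variance
        if variance <= 0.0:
            continue
        total += (fa - fb) ** 2 / variance
        used += 1

    if used == 0:
        raise ValueError("no bin with a usable noise estimate")
    return total / used


if __name__ == "__main__":
    a, b = [1.0, 2.0], [2.0, 4.0]
    print(noise_weighted_distance(a, [1.0, 1.0], b, [1.0, 1.0]))
    print(noise_weighted_distance(a, [2.0, 2.0], b, [2.0, 2.0]))
```

With such a weighting, differences in noisy bins contribute less than the same differences in clean bins, so the noise level acts as a control on how tolerant the metric is.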

Consortium

Astroinformatics Group (AIN)
Data Mining and Uncertainty Quantification Group (DMQ)

People

Maximilian Hoecker (DMQ)
Vincent Heuveline
Kai Lars Polsterer (AIN)
Dennis Kügler (AIN)

Contact

Maximilian Hoecker
