269 Pages
English

Advanced data mining techniques for compound objects [Elektronische Ressource] / von Matthias Schubert

-

Gain access to the library to view online
Learn more

Description

Advanced Data MiningTechniques forCompound ObjectsDissertation im Fach Informatikan der Fakult¨at fur¨ Mathematik, Informatik und Statistikder Ludwig-Maximilians-Universit¨at Mun¨ chenvonMatthias SchubertTag der Einreichung: 7. Oktober 2004Tag der mu¨ndlichen Pru¨fung: 9. November 2004Berichterstatter:Prof. Dr. Hans-Peter Kriegel, Ludwig-Maximilians-Universit¨at Mun¨ chenProf. Dr. Martin Ester, Simon Fraser University, British Columbia (Kanada)iiAcknowledgementThere are many people who supported me while I was working on my thesisand I am sorry that I cannot mention all of them in the following. I want toexpress my deep gratitude to all of them.Firstofall,IwouldliketothankProf. Dr. Hans-PeterKriegel,mysupervisorand first referee. He made this work possible by offering me the opportunityto work on my own choice of problems in his excellent research group. Ibenefitted a lot from the opportunities he provided for all of us and enjoyedthe inspiring working atmosphere he created.I want to extend my warmest thanks to Prof. Dr. Martin Ester. He notonly willingly agreed to act as my second referee but also shared a lot of hisknowledge about scientific work and data mining with me. His encourage-ment during our cooperation helped me a lot in doubtful times.Most of the solutions in this thesis were developed in a team and I want toespecially thank the people I published with.

Subjects

Informations

Published by
Published 01 January 2004
Reads 15
Language English
Document size 4 MB

Advanced Data Mining
Techniques for
Compound Objects
Dissertation im Fach Informatik
an der Fakult¨at fur¨ Mathematik, Informatik und Statistik
der Ludwig-Maximilians-Universit¨at Mun¨ chen
von
Matthias Schubert
Tag der Einreichung: 7. Oktober 2004
Tag der mu¨ndlichen Pru¨fung: 9. November 2004
Berichterstatter:
Prof. Dr. Hans-Peter Kriegel, Ludwig-Maximilians-Universit¨at Mun¨ chen
Prof. Dr. Martin Ester, Simon Fraser University, British Columbia (Kanada)iiAcknowledgement
There are many people who supported me while I was working on my thesis
and I am sorry that I cannot mention all of them in the following. I want to
express my deep gratitude to all of them.
Firstofall,IwouldliketothankProf. Dr. Hans-PeterKriegel,mysupervisor
and first referee. He made this work possible by offering me the opportunity
to work on my own choice of problems in his excellent research group. I
benefitted a lot from the opportunities he provided for all of us and enjoyed
the inspiring working atmosphere he created.
I want to extend my warmest thanks to Prof. Dr. Martin Ester. He not
only willingly agreed to act as my second referee but also shared a lot of his
knowledge about scientific work and data mining with me. His encourage-
ment during our cooperation helped me a lot in doubtful times.
Most of the solutions in this thesis were developed in a team and I want to
especially thank the people I published with. I know that working with me
sometimes demands a lot of endurance and often the willingness to follow
my rather broad excursions. I am trying to improve. I especially want to
mention Alexey Pryakhin. The cooperation with him during the supervision
of his diploma thesis and afterwards as a member of our group was a major
influence on the second part of this thesis which I do not want to miss.
Scientific research lives in discussions and therefore I want to thank all of my
colleagues for many interesting conversations and arguments, not to mention
the good times we had.
I would also like to express my deep gratitude to Susanne Grienberger who
was a big help in writing down this thesis. She aided me a lot by carefully
readingthethesisandofferingusefulhintsforpolishingmyEnglish. Further-
more, she often shouldered the administrative burdens for me that are part
of working at an university. An invaluable assistance for technical problems
iiiiv
I received from Franz Krojer. He always came up with fast solutions if more
computing power or additional disc space was needed. So, thank you for
always providing running systems in critical times.
I want to thank my parents for their affection and their help for managing
my life in busy times. Without you, it would have been very difficult to
focus on my research. At last, I want to thank the rest of my family and my
friends. Their belief in me was a driving force behind my efforts.
September 2004,
Matthias SchubertAbstract
KnowledgeDiscoveryinDatabases(KDD)isthenon-trivialprocessofidenti-
fyingvalid, novel, potentiallyuseful, andultimatelyunderstandablepatterns
in large data collections. The most important step within the process of
KDD is data mining which is concerned with the extraction of the valid
patterns. KDD is necessary to analyze the steady growing amount of data
causedbytheenhancedperformanceofmoderncomputersystems. However,
with the growing amount of data the complexity of data objects increases
as well. Modern methods of KDD should therefore examine more complex
objects than simple feature vectors to solve real-world KDD applications ad-
equately. Multi-instance and multi-represented objects are two important
types of object representations for complex objects. Multi-instance objects
consist of a set of object representations that all belong to the same feature
space. Multi-represented objects are constructed as a tuple of feature rep-
resentations where each feature representation belongs to a different feature
space.
The contribution of this thesis is the development of new KDD meth-
ods for the classification and clustering of complex objects. Therefore, the
thesisintroducessolutionsforreal-worldapplicationsthatarebasedonmulti-
instance and multi-represented object representations. On the basis of these
solutions,itisshownthatamoregeneralobjectrepresentationoftenprovides
better results for many relevant KDD applications.
ThefirstpartofthethesisisconcernedwithtwoKDDproblemsforwhich
employing multi-instance objects provides efficient and effective solutions.
The first is the data mining in CAD parts, e.g. the use of hierarchic cluster-
ing for the automatic construction of product hierarchies. The introduced
solution decomposes a single part into a set of feature vectors and compares
them by using a metric on multi-instance objects. Furthermore, multi-step
query processing using a novel filter step is employed, enabling the user to
efficiently process similarity queries. On the basis of this similarity search
system, it is possible to perform several distance based data mining algo-
rithms like the hierarchical clustering algorithm OPTICS to derive product
vvi
hierarchies.
Thesecondimportantapplicationistheclassificationandsearchforcom-
plete websites in the world wide web (WWW). A website is a set of HTML-
documents that is published by the same person, group or organization and
usually serves a common purpose. To perform data mining for websites, the
thesis presents several methods to classify websites. After introducing naive
methodsmodellingwebsitesaswebpages, twomoresophisticatedapproaches
towebsiteclassificationareintroduced. Thefirstapproachusesapreprocess-
ingthatmapssingleHTML-documentswithineachwebsitetoso-calledpage
classes. The second approach directly compares websites as sets of word vec-
tors and uses nearest neighbor classification. To search the WWW for new,
relevant websites, a focused crawler is introduced that efficiently retrieves
relevant websites. This crawler minimizes the number of HTML-documents
and increases the accuracy of website retrieval.
The second part of the thesis is concerned with the data mining in multi-
represented objects. An important example application for this kind of com-
plex objects are proteins that can be represented as a tuple of a protein
sequence and a text annotation. To analyze multi-represented objects, a
clustering method for multi-represented objects is introduced that is based
on the density based clustering algorithm DBSCAN. This method uses all
representations that are provided to find a global clustering of the given
data objects. However, in many applications there already exists a sophisti-
cated class ontology for the given data objects, e.g. proteins. To map new
objects into an ontology a new method for the hierarchical classification of
multi-represented objects is described. The system employs the hierarchical
structure of the ontology to efficiently classify new proteins, using support
vector machines.Zusammenfassung
Knowledge Discovery in Datenbanken (KDD) ist der nicht-triviale Prozess,
neues, gult¨ iges und bisher unbekanntes Wissen aus großen Datenmengen zu
extrahieren. Der wichtigste Schritt im KDD Prozess ist das Data Mining,
das die in den Daten geltenden Muster findet. KDD ist notwendig, um
die stetig wachsenden Datenmengen zu analysieren, die durch die wach-
sende Leistungsf¨ahigkeit moderner Rechensysteme entstanden sind. Aller-
dings steigt auch die Komplexit¨at der Objektdarstellung einzelner Datenob-
jekte an. Moderne KDD Verfahren sollten daher auch mit komplexeren Ob-
jekten als einfachen Merkmalsvektoren umgehen k¨onnen, um reale KDD Ap-
plikationenad¨aquatzul¨osen. ZweiwichtigeArtenvonkomplexenDatenmod-
ellierungen sind mengenwertige und multirepr¨asentierte Objekte. Mengen-
wertigeObjektebestehendabeiauseinerMengevonObjektrepr¨asentationen,
die alle demselben Vektorraum angeh¨oren. Multirepr¨asentierte Objekte sind
durch einen Tupel von Objektrepr¨asentationen gegeben, die jeweils aus un-
terschiedlichen Merkmalsr¨aumen stammen.
DasZieldieserDoktorarbeitistes,neueKDD-VerfahrenimBereichClus-
teringundKlassifikationvonkomplexenObjektenzuentwickeln. Ausgehend
von der Modellierung der Daten als mengenwertige und multirepr¨asentierte
Objekte, werden L¨osungen zu realen Anwendungen vorgestellt. Anhand
dieser L¨osungen wird gezeigt, dass eine allgemeinere Datenmodellierung fur¨
viele relevante Anwendungen zu besseren Ergebnissen fuhr¨ t.
Der erste Teil der Doktorarbeit besch¨aftigt sich mit zwei KDD Prob-
lemen, die unter Verwendung von mengenwertigen Datenobjekten besser
als durch etablierte Verfahren gel¨ost werden k¨onnen. Das erste Problem
ist Data Mining von CAD-Bauteilen, wie z.B. das automatische Erstellen
von Produkthierarchien mit Hilfe des Clustering. Hierzu werden eine Zer-
legung der Bauteile in Mengen von Merkmalsvektoren, eine Metrik auf Vek-
¨tormengen und passende Methoden zur Ahnlichkeitssuche eingefuhr¨ t. Auf
Basis dieses Suchsystems sind dann viele distanzbasierte Data Mining Al-
gorithmen anwendbar, wie zum Beispiel der Clustering-Algorithmus OP-
TICS zur Erstellung von Teilhierarchien. Die zweite Anwendung ist die
viiviii
Kategorisierung und Suche von kompletten Websites im World Wide Web
(WWW).EineWebsitestelltdabeieineMengevonHTML-Dokumentendar,
die von der gleichen Personengruppe mit demselben Zweck im WWW pub-
liziert wurde. Zun¨achst werden mehrere Methoden zur Klassifikation von
Websites vorgestellt. Die einfachsten versuchen dabei Websites genauso wie
einzelne HTML-Dokumente zu klassifizieren. Desweiteren werden zwei fort-
geschrittenereAns¨atzevonKlassifikatorenfur¨ Websiteseingefuhrt.¨ Dererste
Ansatz verwendet einen Vorverarbeitungsschritt, der die einzelnen HTML-
Dokumente auf sogenannte Seitenklassen abbildet. Der zweite Ansatz ver-
gleicht Websites direkt als Mengen von Wortvektoren und wendet N¨achste-
Nachbar-Klassifikation an. Zur Suche neuer Websites im WWW wird ein
fokusierter Webcrawler vorgestellt, der m¨oglichst schnell große Mengen rele-
vanter Websites findet. Das vorgestellte Verfahren minimiert dabei die An-
zahl der geladenen Einzeldokumente und erh¨oht die Trefferrate unter den
gefundenen Ergebnissen.
Der zweite Teil der Arbeit besch¨aftigt sich mit dem Data Mining mul-
tirepr¨asentierter Objekte. Eine wichtige Anwendung fur¨ diesen Bereich ist
das Data Mining von Proteinen, die durch Aminos¨auresequenzen und Text
beschriebenwerden. Zun¨achst, wirdeineClustering-Methodevorgestellt, die
auf dem dichtebasierten Clustering-Algorithmus DBSCAN basiert. Dabei
werden alle vorhandenen Repr¨asentationen verwendet, um eine globale Ein-
teilung der Proteine zu finden. H¨aufig besteht allerdings schon eine von
Experten erstellte Kategorisierung in so genannten Ontologien. Um bisher
noch nicht eingeordnete Objekte in diese Ontologien einzusortieren, wird ein
Verfahren zur hierarchischen Klassifikation von multirepr¨asentierten Objek-
ten vorgestellt. Dieses System nutzt die hierarchische Struktur einer gegebe-
nen Ontologie aus, um die Objekte einer Menge von Klassen mit Hilfe von
Support Vektor Maschinen zuzuordnen.Contents
I Preliminaries 1
1 Introduction 3
1.1 Knowledge Discovery in Databases . . . . . . . . . . . . . . . 4
1.1.1 Data Mining . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.2 Clustering and Classification . . . . . . . . . . . . . . . 9
1.1.3 Data Mining and Complex Objects . . . . . . . . . . . 12
1.2 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . 15
2 Basic Techniques of Knowledge Discovery and Data Mining 19
2.1 Feature Spaces and Data Transformation for KDD . . . . . . . 20
2.1.1 Feature Transformation and Similarity Search . . . . . 20
2.1.2 Data Transformation for Text Documents . . . . . . . 23
2.2 Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . 28
2.2.1 The Task of Clustering . . . . . . . . . . . . . . . . . . 29
2.2.2 Directions of Clustering Algorithms . . . . . . . . . . . 29
2.2.3 Density-Based Clustering . . . . . . . . . . . . . . . . . 32
2.3 Classification Algorithms . . . . . . . . . . . . . . . . . . . . . 44
2.3.1 General Aspects of Classification . . . . . . . . . . . . 44
2.3.2 Important Directions of Classification . . . . . . . . . . 48
II Data Mining in Multi-Instance Objects 59
3 Multi-Instance Objects 61
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2 Multi-Instance Learning . . . . . . . . . . . . . . . . . . . . . 64
ixx CONTENTS
4 Clustering and Similarity Search
in CAD Databases using Multi-Instance Representations 67
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2.1 Feature-Based Similarity . . . . . . . . . . . . . . . . . 69
4.2.2 Geometry-Based Similarity . . . . . . . . . . . . . . . . 70
4.3 Similarity Models for
Voxelized CAD Objects. . . . . . . . . . . . . . . . . . . . . . 70
4.3.1 Shape Histograms . . . . . . . . . . . . . . . . . . . . . 70
4.3.2 Normalization . . . . . . . . . . . . . . . . . . . . . . . 71
4.3.3 Spatial Features . . . . . . . . . . . . . . . . . . . . . . 73
4.4 A Multi-Instance Representation for CAD-Parts . . . . . . . . 77
4.4.1 Reasons for the Use of
Multi-Instance Objects . . . . . . . . . . . . . . . . . . 80
4.4.2 Distance Measures on Multi-Instance Objects . . . . . 81
4.4.3 Answering Similarity Queries on Vector Set Data Effi-
ciently . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.5.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.5.2 Clustering CAD-Parts . . . . . . . . . . . . . . . . . . 87
4.5.3 Evaluation of the Efficiency . . . . . . . . . . . . . . . 92
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5 Website Mining 97
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2 A Graph-Oriented View of the WWW . . . . . . . . . . . . . 101
5.3 Classification of Websites . . . . . . . . . . . . . . . . . . . . . 103
5.3.1 Related Work on Website Classification . . . . . . . . . 104
5.3.2 General Aspects of Websiteation . . . . . . . . 105
5.3.3 Naive Approaches to Website Classification . . . . . . . 106
5.3.4 Classification using Page Classes . . . . . . . . . . . . 107
5.3.5 Classification without Page Classes . . . . . . . . . . . 114
5.3.6 Pruning the Irrelevant Area of a Website . . . . . . . . 120
5.3.7 Evaluation of Website Classifiers . . . . . . . . . . . . 125
5.4 Focused Crawling for Relevant Websites . . . . . . . . . . . . 133
5.4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.4.2 Related Work on Focused Crawling . . . . . . . . . . . 135
5.4.3 Retrieving Websites with Focused Crawling . . . . . . 136
5.4.4 Preliminaries to Focused Website Crawling . . . . . . . 137
5.4.5 A Focused Website Crawler . . . . . . . . . . . . . . . 139