Cooperatively Managing and Exploiting Distributed Co-occurrence Graphs
Erscheinungsdatum: 05.10.2022
Reihe: 10
Band Nummer: 879
Autor: Ass. Prof. Dr.-Ing. Supaporn Simcharoen
Ort: Chachoengsao, Thailand
ISBN: 978-3-18-387910-6
ISSN: 0178-9627
Erscheinungsjahr: 2022
Anzahl Seiten: 132
Anzahl Abbildungen: 68
Anzahl Tabellen: 8
Produktart: Buch (paperback, DINA5)
Produktbeschreibung
The property of terms to occur adjacently in text corpora can be expressed by co-occurrence graphs. Such data structures proved to be useful tools for extracting semantic information in natural language processing. When communities work collaboratively on joint documents, the corresponding co-occurrence graphs may grow prohibitively large to be handled on a single machine. Hence, methods are devised to cooperatively distribute and process co-occurrence graphs in peer-to-peer computing environments. Moreover, a local co-occurrence graph in each peer is built from documents of the users attached to the peer and obtained from elsewhere. Routing and load imbalance of inter-peer traffic are reduced by bio-inspired algorithms designed to work with limited information available on peers locally or on their direct neighbours. Routing paths directly leading to the final destinations are determined in most cases, and memory usage is balanced after a few message exchanges. To show that decentralised co-occurrence graphs can be employed as the basis of a fully integrated, decentralised search engine, a prototype working on the world wide web is built, implemented, and experimentally evaluated. This search engine uses text centroids to represent short word groups constituting queries and long texts in crawling the web for information and matching queries with documents.
Contents
Foreword i
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Community-based Knowledge Management . . . . . . . . . . . . . . . . . . . . . 3
1.3 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 State of the Art 9
2.1 Principals of Classical Search Engines . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 P2P and Content Distribution Systems . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 DHT Systems for Content Search . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 Chord . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 Content Addressable Network (CAN) . . . . . . . . . . . . . . . . . . . . 21
2.3.3 Summarising DHT’s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Decentralised Web Search Engines . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.1 YaCy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.2 FAROO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Co-occurrence Graphs and Text-representing Centroids . . . . . . . . . . . . . . 27
2.5.1 Co-occurrence Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5.2 Text-representing Centroids (TRC) . . . . . . . . . . . . . . . . . . . . . 28
2.5.3 Fast Calculation of TRC’s . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.4 Searching for Text Documents . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5.5 Application of TRC in the WebEngine . . . . . . . . . . . . . . . . . . . 34
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3 Co-occurrence Graphs 38
3.1 Construction from a Text Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Cluster Building on Co-occurrence Graphs . . . . . . . . . . . . . . . . . . . . . 42
3.3 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Contents iii
3.4 Experimental Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.1 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4 Decentralised Co-occurrence Graph Management 60
4.1 Distribution and Decentralisation . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Routing and Updating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.1 Conceptual Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.2 Routing with Limited Local Knowledge . . . . . . . . . . . . . . . . . . 64
4.2.3 An Improved Routing Mechanism . . . . . . . . . . . . . . . . . . . . . . 66
4.2.4 Reduction of Redundant Edges . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.5 Decentralised Lookup and Adding of Words . . . . . . . . . . . . . . . . 68
4.2.6 Multi-Stage Word Insertion . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Experimental Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.1 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5 Dynamic Management of Graph Structures 89
5.1 Analysis of the Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2 Load Balancing Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.3 Experimental Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.3.1 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.3.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.3.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6 The Web Search Engine TheBrain 99
6.1 Major Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2 The Architecture of TheBrain . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3 Searching Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.4 TheBrain as Business Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.5 Selected Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7 Conclusion and Future Works 115
7.1 Contribution and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.2 Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
References 117
Keywords: Kookkurrenzgraph – Kookkurrenzanalyse – P2P-Netzwerk – Clustering-Algorithmus – Dezentrales System – Dezentralisierte Verwalteter Kookkurrenzgraph – Dezentrale Suchmaschine – Textanalysetechniken – Textdarstellenden Zentroiden – Empfehlungsdienstes, Co-occurrence Graph – Co-occurrence Analysis – P2P Network – Clustering Algorithm – Decentralised System – Decentralised Managed Co-occurrence Graph – Decentralised Search Engine – Text Analysis Technique – Text-Representing Centroid – Recommendation Service
* Der VDI-Mitgliedsrabatt gilt nur für Privatpersonen