220 Pages
English

Professional search in pharmaceutical research [electronic resource] / submitted by Alex Kohn













Professional Search in
Pharmaceutical Research


Alex Kohn



















München 2009




















Dissertation
at the Faculty of Mathematics, Informatics and Statistics
of the Ludwig-Maximilians-Universität
München



submitted by
Alex Kohn



Munich, 24 November 2009































First reviewer: Prof. Dr. François Bry
(Ludwig-Maximilians-Universität München)
Second reviewer: Prof. Dr. Steffen Staab
(Universität Koblenz-Landau)
Date of the oral examination: 19 January 2010

Abstract

In the mid-1990s, visiting libraries as a means of retrieving the latest literature was
still a common necessity among professionals. Nowadays, professionals simply
access information by ‘googling’. Indeed, the name of the Web search engine market
leader, “Google”, has become a synonym for searching and retrieving information.
Despite the increased popularity of search as a method for retrieving relevant
information, search engines at the workplace still do not deliver satisfactory results
to professionals.

Search engines, for instance, ignore that the relevance of answers (the satisfaction of
a searcher’s needs) depends not only on the query (the information request) and the
document corpus, but also on the working context (the user’s personal needs,
education, etc.). In effect, an answer that is appropriate for one user might
not be appropriate for another, even though the query and the document
corpus are the same for both. Personalization services that address the context
are therefore becoming more and more popular and are an active field of research.

This is only one of several challenges encountered in ‘professional search’: How can
the working context of the searcher be incorporated into the ranking process? How can
unstructured free-text documents be enriched with semantic information so that the
information need can be expressed precisely at query time? How, and to what extent,
can a company’s knowledge be exploited for search purposes? How should data from
distributed sources be made accessible through one single entry point?

This thesis is devoted to ‘professional search’, i.e. search at the workplace, especially
in industrial research and development. We contribute by compiling and developing
several approaches for facing the challenges mentioned above. The approaches are
implemented in the prototype YASA (Your Adaptive Search Agent), which provides
meta-search, adaptive ranking of search results, and guided navigation, and which uses
domain knowledge to drive the search processes. YASA is deployed in the
pharmaceutical research department of Roche, a major pharmaceutical company, in
Penzberg, where the applied methods were empirically evaluated.

Being confronted with mostly unstructured free-text documents and having barely
any explicit metadata at hand, we faced a serious challenge: incorporating semantics
(i.e. formal knowledge representation) into the search process can only be as good as
the underlying data. Nonetheless, we are able to demonstrate that this issue can be
largely compensated for by incorporating automatic metadata extraction techniques.
The metadata we were able to extract automatically was not perfectly accurate, nor
did the ontology we applied contain particularly “rich semantics”. Still, our
results show that even the little semantics incorporated into the search process
suffices to achieve a significant improvement in search and retrieval.

We thus contribute to the research field of context-based search by incorporating
the working context into the search process, an area that has not yet been
well studied.

Summary

The flood of information that has prevailed since the 1990s, together with the
emergence of new technologies, has shaped the processes of information access in
unprecedented ways. As a result of this change, people nowadays ‘google’ for
information instead of browsing through libraries. Indeed, the name of the current
Web search engine market leader, Google, has become a synonym for searching for
information. This phenomenon particularly affects experts, for whom searching for
information is part of their daily business. Consequently, search engines are the first
choice for finding information not only on the Web but also in company intranets.

Although the use of search engines has become very popular among experts,
especially among professionals in companies, search engines in the intranet still do
not deliver satisfactory results.

One possible cause, among others, is that search engines often ignore the context of
the searcher (personal needs, background knowledge, etc.). In fact, however, the
relevance of a search result depends not only on the actual query and the document
collection, but also on the working context of the searcher. Consequently, given the
same query and an identical corpus, an answer can be relevant for one user and not
for another. Incorporating context into search is an active field of research and is
increasingly taken into account in the personalization services of leading Web search
engines.

Context-based search is only one of many challenges surrounding specialized search
engines: How can the working context of the searcher be incorporated into the
ranking of the result documents? How can existing data be enriched with semantic
information so that the question can be formulated precisely? How, and to what
extent, can the prior knowledge of a company be used to improve search? How
should distributed data be brought together in one search portal?

This dissertation addresses the topic of ‘professional search’ (“Expertensuche”), i.e.
search at the workplace, especially in research and development. One contribution
of this work lies in compiling and developing approaches with which the challenges
mentioned above can be met. The approaches are implemented in the prototype
YASA (Your Adaptive Search Agent), which provides meta-search, adaptive ranking of
search results, and guided navigation. Numerous processes benefit from
domain-specific knowledge. YASA is in productive use in the pharmaceutical research
department of Roche in Penzberg (a major pharmaceutical company), which offers
an ideal environment for the empirical investigation of the applied principles.


The predominant storage of data in the form of unstructured text documents and the
lack of explicit metadata posed a serious challenge. The incorporation of semantics
(traditionally understood as formal knowledge representation) can only be as good
as the underlying data. Nevertheless, we are able to largely circumvent this problem
by incorporating automatic metadata extraction methods. The metadata we were
able to extract was not perfect, nor was the resulting ontology that we used
particularly rich from a semantic point of view. Our results show, however, that even
a little semantics makes information retrieval considerably easier.

The contribution of this work thus lies in the field of context-based search, i.e. the
incorporation of the working context into the search process, an area that has not
yet been well researched.

Acknowledgements

“It is with words as with sunbeams. The more they are condensed, the deeper they
burn.”
Robert Southey (1774 – 1843)

I would like to thank my scientific mentor and advisor Prof. François Bry, who,
always optimistic and positive, helped me to discover my research interests and
to find and shape my ideas. I appreciated the discussions with him, his feedback, and
his friendly attitude throughout my dissertation. I am also grateful to Prof. Steffen
Staab for his willingness to cooperate scientifically. The experience and knowledge he
provided were particularly enlightening and helpful for my research.

Next, I thank my supervising tutor Dr. Alexander Manta at Roche and the company
itself for offering me the scientific and financial opportunity to pursue my doctoral
thesis. In countless discussions, Alexander guided me in times of despair, doubt, and
idea hunting. His open-mindedness gave me a lot of flexibility and freedom during
my research. Special thanks to my colleague Dr. Stefan Klostermann, who provided
valuable insights about research at Roche and guidance during my first year. Thanks
to my colleagues from In Silico Sciences, who gave me advice and support in all
questions related to bioinformatics and statistics. Thanks to the technical staff of
Scientific Research Informatics for their support with servers and hardware
upgrades. Special thanks to all colleagues from Pharma Research who participated in
the evaluation of YASA. Finally, many warm thanks to Tobias Högel, who wrote
his diploma thesis under my supervision, and to Florian Stadler and Marc Gössling,
whom I partially supervised during their internships at Roche. Their contribution and
dedication helped me tremendously.

Thanks to my current and past fellow students in the PMS department at the
University of Munich, Paula-Lavinia Pătrânjan, Sacha Berger, Edgar Stoffel, Tim
Furche, Benedikt Linse, Michael Eckert, Stephan Leutenmayr, Alexander Pohl,
Christoph Wieser, Jakub Kotowski, Klara Weiand, and Olga Poppe, who always made
my time at the university a pleasure. Special thanks to Dr. Norbert Eisinger, who
gave me valuable feedback about my work. Thanks also to our secretary Ingeborg
von Troschke, who helped me a lot with administrative tasks.

Most importantly, I would like to thank my friends and my family for their love and
care. Special thanks to my fiancée, who absorbed so much of my stress, gave me
energy, and made me take my mind off things, even though she is quite busy
pursuing her master's degree.

Last, I would like to thank everyone who pushed me to finish this thesis.


Contents


I Prelude
Chapter 1 Introduction
1.1 Motivation
1.2 Hypotheses
1.3 Contributions
1.4 Structure of the thesis


II Background
Chapter 2 Search for information
2.1 Information retrieval process overview
2.2 Traditional retrieval models
2.2.1 Boolean model
2.2.2 Vector space model
2.2.3 Term weighting
2.2.4 Discussion
2.3 Text processing
2.3.1 Tokenization
2.3.2 Stopword removal
2.3.3 Stemming and lemmatization
2.4 Search in the World Wide Web
2.4.1 PageRank in a nutshell
2.4.2 HITS in a nutshell
2.4.3 Web search engines
2.4.4 Discussion
2.5 Search in an intranet environment
2.5.1 Differences between intranet search and Web search
2.5.2 Open issues in intranet search
2.5.3 Search in structured sources
2.5.4 Enterprise search engines
2.6 Evidence for document relevance
2.6.1 Content evidence
2.6.2 Context evidence
2.6.3 Time evidence
2.6.4 Hyperlink evidence
2.6.5 URL evidence
2.6.6 Feedback evidence
2.7 Precision and recall
Chapter 3 Adaptation in Information Retrieval
3.1 User modeling
3.1.1 User model types
3.2 Personalized search
3.2.1 Personalization process types
3.2.2 Methods for personalized search
3.2.3 Personalized search based on the search history
3.2.4 Adaptation of search results based on the search history
3.2.5 Other applications of personalized search
3.3 Recommender systems
3.4 Algorithms for recommender systems
3.4.1 Memory-based algorithms
3.4.2 Model-based algorithms
3.5 Discussion
Chapter 4 Semantic Technologies
4.1 Semantic Web
4.1.1 Resource Description Framework (RDF)
4.1.2 RDF Schema (RDFS)
4.1.3 Web Ontology Language (OWL)
4.2 F-Logic
4.2.1 The is-a hierarchy
4.2.2 The object base
4.2.3 Rules and queries
4.2.4 F-Logic vs. OWL-DL
4.3 Semantic technologies in Information Retrieval
4.3.1 State of the art
4.3.2 Discussion
4.4 Semantic technologies in life sciences
4.4.1 State of the art
4.4.2 Discussion


III Core
Chapter 5 Characteristics of professional search in pharmaceutical research
5.1 Evaluation of the initial situation in the department investigated
5.1.1 Intranet web
5.1.2 File shares
5.1.3 Databases and applications
5.1.4 Search engine usage on the PRPZ-WebSite: a log file analysis
5.1.5 Empirical studies of information acquisition
5.2 Quality characteristics of a professional search tool
5.2.1 One single entry point
5.2.2 Role-specific ranking
5.2.3 Guided navigation
5.2.4 Exploit existing knowledge
5.2.5 Professional search beyond research and development