A Question Answering Dataset over the DBLP Scholarly Knowledge Graph

Document type: Master's thesis
Institute: Department of Computer Science (Fachbereich Informatik)
Language: English
Year of creation: 2023
Publication date:
Free keywords (English): Knowledge Graph Question Answering, Semantic Parsing, Scholarly Knowledge Graph
Free keywords (German): Wissensgraphen-Fragebeantwortung, Semantisches Parsing, Wissenschaftlicher Wissensgraph
DDC subject group: Computer science
BK classification: 54.75

Abstract in English:

Knowledge Graph Question Answering (KGQA) is a popular approach for retrieving information from knowledge graphs (KGs). The development of KGQA systems for querying public KGs such as Wikidata, DBpedia, and Freebase has been an active area of research over the past decade. These systems can answer questions about general information and world facts found in these KGs. One limitation of KGQA systems is that they are tied to the KGs they were developed for and are therefore incapable of answering out-of-domain questions. Besides the KGs themselves, KGQA datasets are another building block for KGQA systems, and the publicly available datasets have likewise been developed for the above-mentioned KGs. A useful application of KGQA is in the scholarly domain, yet only a few scholarly KGQA datasets are publicly available. The recent release of the RDF data graph of DBLP, a computer science bibliographic database, makes the development of a KGQA dataset for the DBLP KG feasible. In this work, a KGQA dataset called DBLP-QuAD is developed for the scholarly DBLP KG. The dataset consists of 10,000 question-SPARQL query pairs distributed among 10 different simple and complex query types. DBLP-QuAD poses challenges for KGQA systems by including augmented entity and literal surface forms, as well as compositional and zero-shot questions in the test sets. In addition, this thesis develops and evaluates a semantic parsing baseline on DBLP-QuAD. An evaluation of the T5 model, the current state-of-the-art (SOTA) model for semantic parsing, on DBLP-QuAD shows its shortcomings on compositional and zero-shot questions. The main contribution of this thesis is DBLP-QuAD, the largest and first scholarly KGQA dataset for the DBLP KG. DBLP-QuAD introduces challenging entity linking problems and invites research on the generalization ability of large language models.
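
To make the notion of a question-SPARQL query pair and of evaluating a semantic parser more concrete, the following is a minimal sketch in Python. Everything in it is an illustrative assumption: the `QAPair` fields, the example questions, the `https://example.org/...` placeholder IRIs, and the `exact_match_accuracy` helper are hypothetical and do not reflect the actual DBLP-QuAD schema, the DBLP KG vocabulary, or the evaluation protocol used in the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    """One hypothetical question-SPARQL pair (field names are assumptions)."""
    question: str
    sparql: str


def normalize(query: str) -> str:
    """Collapse all runs of whitespace so layout differences are ignored."""
    return " ".join(query.split())


def exact_match_accuracy(gold: list[QAPair], predicted: list[str]) -> float:
    """Fraction of predicted queries equal to the gold query after normalization."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    hits = sum(normalize(g.sparql) == normalize(p) for g, p in zip(gold, predicted))
    return hits / len(gold)


# Placeholder data: the IRIs below are invented for illustration only.
gold = [
    QAPair(
        question="Who wrote paper 1?",
        sparql="SELECT ?a WHERE { <https://example.org/paper/1> "
               "<https://example.org/authoredBy> ?a }",
    ),
    QAPair(
        question="When was paper 1 published?",
        sparql="SELECT ?y WHERE { <https://example.org/paper/1> "
               "<https://example.org/yearOfPublication> ?y }",
    ),
]

# Stand-ins for the output of a semantic parser (e.g. a sequence-to-sequence model).
predictions = [
    # Same query as the gold one, only formatted differently: counts as a match.
    "SELECT ?a WHERE {\n  <https://example.org/paper/1> "
    "<https://example.org/authoredBy> ?a\n}",
    # Wrong predicate: counts as a miss.
    "SELECT ?y WHERE { <https://example.org/paper/1> "
    "<https://example.org/authoredBy> ?y }",
]

accuracy = exact_match_accuracy(gold, predictions)
print(accuracy)  # 0.5
```

Exact string match is a deliberately strict choice for this sketch: two queries that differ only in variable names would count as a mismatch even though they are semantically equivalent. A common alternative is to execute both queries against the KG and compare the returned answers.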

Copyright notice

The German Copyright Act (Urheberrechtsgesetz, UrhG) applies without restriction to documents offered in electronic form over data networks. In particular:

Individual reproductions, e.g. copies and printouts, may be made only for private and other personal use (Section 53 of the Copyright Act). The production and distribution of further reproductions is permitted only with the express permission of the author.

The user is responsible for complying with the legal provisions and may be held liable for misuse.