Robustness of Various Web Scraping Methods
URL | http://edoc.sub.uni-hamburg.de/informatik/volltexte/2025/287/ |
---|---|
Document type: | Bachelor Thesis |
Institute: | Department of Informatics (Fachbereich Informatik) |
Language: | English |
Year of creation: | 2024 |
Publication date: | 20.01.2025 |
Free keywords (English): | Web Scraping, Website Change, Robustness, Large Language Models |
DDC subject group: | Computer science |
BK classification: | 54.00 |
Abstract in English:
Web scraping is a powerful technique for the targeted extraction of data from a website. To extract an element, a query is written that points to the data on the website. When a website changes, these queries often break, leading to unreliable data collection. A query that withstands changes over a longer period of time is considered robust. Since the relative robustness of different web scraping methods is largely unexplored, this thesis provides a comprehensive comparison between them. This work also introduces and analyses two new scraping methods based on Large Language Models (LLMs) that have been little researched to date: LLM Minimize and LLM Screenshot. LLM Minimize scrapes using an LLM and a minimized version of the HTML. LLM Screenshot scrapes using an LLM and a screenshot of the web page. These two methods are compared with the three state-of-the-art methods XPath, CSS Path and CSS Selector. The analysis is carried out on a large dataset of websites and evaluated manually to determine which methods are best suited for long-term web scraping. The evaluation shows that the LLM approaches are significantly more robust against change than the state-of-the-art static approaches. However, when the element being searched for changes, they return an incorrect element much more frequently than no element at all, which makes error detection much more difficult. The LLM queries in this work are not generated automatically; as a result, they have an extremely high generation time. Their application time is also significantly higher than that of the static queries. The analysis also measures website change from various perspectives and performs a robustness analysis that takes these measurements into account. This is used to predict the probability of success of queries.
Across all website change categories, LLM Minimize and LLM Screenshot show the weakest correlation between website change and whether the query still works, as well as the highest tolerance towards change.
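The notion of a query "breaking" can be illustrated with a minimal sketch. It is not taken from the thesis: the three page snapshots, the two queries and the helper names `run_query` and `classify` are invented for illustration, and Python's standard-library `xml.etree.ElementTree` (which supports only a subset of XPath) stands in for a real scraping stack. The sketch distinguishes the three outcomes the abstract mentions: the correct element, an incorrect element, or none at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical snapshots of the same page (invented for illustration).
OLD_PAGE = """<html><body>
  <div id="header">Shop</div>
  <div id="product"><span class="price">19.99</span></div>
</body></html>"""

# Change 1: a banner is inserted before the product block.
NEW_PAGE = """<html><body>
  <div id="header">Shop</div>
  <div id="banner"><span class="note">Sale</span></div>
  <div id="product"><span class="price">19.99</span></div>
</body></html>"""

# Change 2: the class of the target element is renamed.
RENAMED_PAGE = """<html><body>
  <div id="header">Shop</div>
  <div id="product"><span class="amount">19.99</span></div>
</body></html>"""

# A path-style query that depends on element position,
# and a selector-style query that depends on an attribute.
POSITIONAL_QUERY = "./body/div[2]/span"
ATTRIBUTE_QUERY = ".//span[@class='price']"


def run_query(page, query):
    """Return the text of the first element matched by the query, or None."""
    root = ET.fromstring(page)
    element = root.find(query)
    return None if element is None else element.text


def classify(result, expected):
    """Label a query result as 'correct', 'incorrect' or 'none'."""
    if result is None:
        return "none"
    return "correct" if result == expected else "incorrect"


if __name__ == "__main__":
    pages = {"old": OLD_PAGE, "banner added": NEW_PAGE, "class renamed": RENAMED_PAGE}
    queries = {"positional": POSITIONAL_QUERY, "attribute": ATTRIBUTE_QUERY}
    for page_name, page in pages.items():
        for query_name, query in queries.items():
            outcome = classify(run_query(page, query), expected="19.99")
            print(f"{page_name:14} {query_name:10} -> {outcome}")
```

In this toy setup, the positional query silently returns the wrong element once the banner is added, while the attribute query returns nothing once the class is renamed. The second kind of failure is easy to detect automatically; the first is not, which is the error-detection problem the abstract describes.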
Copyright notice
Documents offered in electronic form via data networks are subject without restriction to the German Copyright Act (Urheberrechtsgesetz, UrhG). In particular:
Individual reproductions, e.g. copies and printouts, may be made only for private and other personal use (Section 53 of the Copyright Act). The production and distribution of further reproductions is permitted only with the express permission of the author.
Users are themselves responsible for complying with the legal provisions and may be held liable in the event of misuse.