Accelerating Data Compression in HDF5 through Parallel Filter Processing
| URL | http://edoc.sub.uni-hamburg.de/informatik/volltexte/2026/318/ |
|---|---|
| Document type: | Bachelor Thesis |
| Institute: | Department of Informatics |
| Language: | English |
| Year of creation: | 2025 |
| Publication date: | 02.02.2026 |
| Free keywords (German): | Wissenschaftliches Rechnen , HPC , HDF5 , Parallelrechnen |
| Free keywords (English): | Scientific Computing , HPC , HDF5 , parallel computing |
| DDC subject group: | Computer science |
| BK classification: | 54.25 |
Abstract in English:
Modern science and engineering create and accumulate huge amounts of data, which are persisted through tools such as HDF5 so that they remain available for further analysis, display and many other operations. Making this data processing more efficient is critical given today's growing data volumes, not only to save time but also to use the available resources efficiently. This thesis aimed to provide a working prototype and an analysis of the parallel application of data filters within the HDF5 environment, with special emphasis on HDF5-registered filters such as LZ4 compression. The prototype is embedded in the HDF5 framework, can be called like any other provided function and is ready to use after building the project, while standard library behavior is maintained for all other use cases. The implementation is based on a general-purpose POSIX thread pool whose functionality varies with the registered callback functions, so it can be reused in later stages of development. The analysis compares the performance of the standard library with that of the prototype on identical example datasets and includes a static analysis of the program code. In addition, CPU utilization and I/O performance are evaluated. The results suggest great potential for a fully implemented design that includes most capabilities of the stock HDF5 library. Near-full CPU utilization is shown with little to no waiting for I/O completion, which cuts runtime considerably. The prototype, related work and the analysis show the substantial improvement that is possible by adapting the existing framework to a multithreaded solution while still maintaining full standard behavior. Using a multithreaded approach to compress data chunks in parallel and write them to disk saves time and resources and distributes load more efficiently over existing infrastructure.
As this prototype shows potential but is limited to two-dimensional datasets with integer values, future work could focus on implementing the full range of datatypes, filters and functionality in a parallel manner, extending this efficiency gain across the whole library where applicable.
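The abstract describes a general-purpose POSIX thread pool driven by registered callback functions. As a rough illustration of that idea only, the following C sketch shows one way such a pool could be structured. It is not the thesis's code and does not use any HDF5 API: all names (`pool_t`, `pool_submit`, `checksum_chunk`, `checksum_in_parallel`) are hypothetical, and a simple per-chunk checksum stands in for a real filter such as LZ4 compression.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names throughout; not the thesis's or HDF5's API. */
typedef void (*task_fn)(void *arg);          /* registered callback */
typedef struct task { task_fn fn; void *arg; struct task *next; } task_t;
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t has_work;
    task_t *head, *tail;                     /* FIFO queue of pending tasks */
    int shutting_down, n_threads;
    pthread_t *threads;
} pool_t;

static void *worker(void *p) {
    pool_t *pool = p;
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        while (!pool->head && !pool->shutting_down)
            pthread_cond_wait(&pool->has_work, &pool->lock);
        task_t *t = pool->head;
        if (!t) {                            /* shutting down, queue drained */
            pthread_mutex_unlock(&pool->lock);
            return NULL;
        }
        pool->head = t->next;
        if (!pool->head) pool->tail = NULL;
        pthread_mutex_unlock(&pool->lock);
        t->fn(t->arg);                       /* run the callback unlocked */
        free(t);
    }
}

/* Lets workers finish all queued tasks, joins them, frees the pool. */
void pool_destroy(pool_t *pool) {
    pthread_mutex_lock(&pool->lock);
    pool->shutting_down = 1;
    pthread_cond_broadcast(&pool->has_work);
    pthread_mutex_unlock(&pool->lock);
    for (int i = 0; i < pool->n_threads; i++)
        pthread_join(pool->threads[i], NULL);
    pthread_mutex_destroy(&pool->lock);
    pthread_cond_destroy(&pool->has_work);
    free(pool->threads);
    free(pool);
}

pool_t *pool_create(int n_threads) {
    assert(n_threads > 0);
    pool_t *pool = calloc(1, sizeof *pool);
    if (!pool) return NULL;
    pool->threads = calloc((size_t)n_threads, sizeof *pool->threads);
    if (!pool->threads) { free(pool); return NULL; }
    pthread_mutex_init(&pool->lock, NULL);
    pthread_cond_init(&pool->has_work, NULL);
    for (int i = 0; i < n_threads; i++) {
        if (pthread_create(&pool->threads[i], NULL, worker, pool) != 0) break;
        pool->n_threads++;
    }
    if (pool->n_threads == 0) { pool_destroy(pool); return NULL; }
    return pool;
}

int pool_submit(pool_t *pool, task_fn fn, void *arg) {
    task_t *t = malloc(sizeof *t);
    if (!t) return -1;
    t->fn = fn; t->arg = arg; t->next = NULL;
    pthread_mutex_lock(&pool->lock);
    if (pool->tail) pool->tail->next = t; else pool->head = t;
    pool->tail = t;
    pthread_cond_signal(&pool->has_work);
    pthread_mutex_unlock(&pool->lock);
    return 0;
}

/* Stand-in for a per-chunk filter: sums the chunk instead of compressing. */
enum { N_VALUES = 1024, CHUNK_LEN = 64, N_CHUNKS = N_VALUES / CHUNK_LEN };
typedef struct { const int *data; size_t len; long result; } chunk_job_t;

static void checksum_chunk(void *arg) {
    chunk_job_t *job = arg;
    long sum = 0;
    for (size_t i = 0; i < job->len; i++) sum += job->data[i];
    job->result = sum;
}

/* Splits 0..N_VALUES-1 into chunks, processes each chunk on the pool. */
long checksum_in_parallel(int n_threads) {
    int values[N_VALUES];
    chunk_job_t jobs[N_CHUNKS];
    for (int i = 0; i < N_VALUES; i++) values[i] = i;
    pool_t *pool = pool_create(n_threads);
    if (!pool) return -1;
    for (int c = 0; c < N_CHUNKS; c++) {
        jobs[c] = (chunk_job_t){ values + c * CHUNK_LEN, CHUNK_LEN, 0 };
        if (pool_submit(pool, checksum_chunk, &jobs[c]) != 0)
            checksum_chunk(&jobs[c]);        /* fall back to running inline */
    }
    pool_destroy(pool);                      /* waits for all chunks */
    long total = 0;
    for (int c = 0; c < N_CHUNKS; c++) total += jobs[c].result;
    return total;
}
```

Calling `checksum_in_parallel(4)` distributes the 16 chunks over four worker threads. Because the pool only sees a function pointer and an opaque argument, a different callback (for example one that compresses a chunk) could be registered without changing the pool itself, which is the kind of reuse the abstract alludes to.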
Copyright notice
Documents offered in electronic form over data networks are fully subject to the German Copyright Act (Urheberrechtsgesetz, UrhG). In particular:
Individual reproductions, e.g. copies and printouts, may be made only for private and other personal use (Section 53 of the Copyright Act). The production and distribution of further reproductions is permitted only with the express consent of the author.
Users are themselves responsible for complying with the legal provisions and may be held liable in the event of misuse.


