Presentation
Distributed document deduplication over SLURM-based HPC environments.
Description
The rise of LLMs (Large Language Models) has increased the need
not only for larger amounts of data, but also for higher data quality. To
achieve this, data engineering procedures such as deduplication are
applied. Deduplication is a computationally intensive task that
typically requires HPC (High Performance Computing) environments when
dealing with large datasets. This work presents a large-scale, open-source scientific pipeline to
perform exact deduplication over a corpus of data. The proposed
approach uses the distributed file system as a feasible and simple
way to address common limitations and challenges that arise when
working with shared SLURM-based HPC environments.
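To make the idea concrete, the minimal sketch below (an illustration, not the authors' implementation) shows how exact deduplication can be split into two stages that communicate only through files on the shared file system and map naturally onto a SLURM job array: each array task hashes one shard and writes its hashes next to the data, and a single follow-up task merges the hash files to flag duplicates. The shard layout (JSON Lines files with a "text" field), directory arguments, and the use of SLURM_ARRAY_TASK_ID are assumptions made for the example.

```python
"""Minimal sketch of exact deduplication split into SLURM-friendly stages.

Illustrative only: the shard format (JSON Lines with a "text" field),
paths, and SLURM_ARRAY_TASK_ID usage are assumptions, not details taken
from the pipeline described above.
"""

import hashlib
import json
import os
import sys
from pathlib import Path


def hash_document(text: str) -> str:
    """Hex digest identifying the exact (whitespace-normalized) content."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def hash_shard(shard_path: Path, out_dir: Path) -> None:
    """Stage 1, one SLURM array task per shard: write (hash, doc id) pairs
    to a per-shard file on the shared file system."""
    out_path = out_dir / f"{shard_path.stem}.hashes"
    with shard_path.open() as src, out_path.open("w") as dst:
        for line_no, line in enumerate(src):
            doc = json.loads(line)
            dst.write(f"{hash_document(doc['text'])}\t{shard_path.stem}:{line_no}\n")


def find_duplicates(hash_dir: Path) -> set[str]:
    """Stage 2, run once after the array finishes: keep the first occurrence
    of each hash and report every later occurrence as a duplicate."""
    seen: set[str] = set()
    duplicates: set[str] = set()
    for hash_file in sorted(hash_dir.glob("*.hashes")):
        with hash_file.open() as fh:
            for record in fh:
                digest, doc_id = record.rstrip("\n").split("\t")
                if digest in seen:
                    duplicates.add(doc_id)
                else:
                    seen.add(digest)
    return duplicates


if __name__ == "__main__":
    shard_dir, out_dir = Path(sys.argv[1]), Path(sys.argv[2])
    out_dir.mkdir(parents=True, exist_ok=True)
    task_id = os.environ.get("SLURM_ARRAY_TASK_ID")
    if task_id is not None:
        # Stage 1: this array task hashes exactly one shard.
        shards = sorted(shard_dir.glob("*.jsonl"))
        hash_shard(shards[int(task_id)], out_dir)
    else:
        # Stage 2: merge all per-shard hash files and print duplicate doc ids.
        for doc_id in sorted(find_duplicates(out_dir)):
            print(doc_id)
```

In a setup like this, stage 1 would typically be submitted as a SLURM job array (one task per shard) and stage 2 as a dependent job; the per-shard hash files on the distributed file system stand in for any explicit inter-node communication, which is the kind of simplicity the shared file system affords in this setting.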