Presentation

Predicting Dataset Popularity for Improved Distributed Content Caching in High Energy Physics
Description

In High Energy Physics (HEP), large-scale experiments generate massive amounts of data that are distributed globally. To reduce redundant data transfers and improve analysis efficiency, a disk caching system named XCache is used to manage data accesses. By analyzing 11 months of access logs (4.5 million requests), we identified patterns in dataset usage and developed a predictive model to forecast the popularity of frequently accessed datasets.
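
As an illustration of the kind of popularity analysis described above, the following minimal sketch counts accesses per dataset per ISO week and ranks datasets by total accesses. The record layout (an ISO 8601 timestamp paired with a dataset name) and the function names `weekly_popularity` and `top_datasets` are assumptions made for this example; the actual XCache log format and the analysis pipeline used in the study are not described here.

```python
from collections import Counter, defaultdict
from datetime import datetime


def weekly_popularity(records):
    """Count accesses per dataset for each (ISO year, ISO week).

    `records` is an iterable of (iso_timestamp, dataset_name) pairs.
    This layout is hypothetical, not the real XCache log schema.
    """
    counts = defaultdict(Counter)
    for timestamp, dataset in records:
        year, week, _weekday = datetime.fromisoformat(timestamp).isocalendar()
        counts[(year, week)][dataset] += 1
    return counts


def top_datasets(records, k):
    """Return the names of the k most frequently accessed datasets."""
    totals = Counter(dataset for _timestamp, dataset in records)
    return [name for name, _count in totals.most_common(k)]


# Tiny synthetic example (made-up dataset names, not real log data).
records = [
    ("2023-01-02T10:00:00", "dsA"),
    ("2023-01-03T11:00:00", "dsA"),
    ("2023-01-03T12:00:00", "dsB"),
    ("2023-01-10T09:00:00", "dsA"),
]
per_week = weekly_popularity(records)
most_popular = top_datasets(records, 1)
```

A per-week count series of this kind is one plausible input for a sequence model that forecasts future accesses.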

Our exploratory data analysis showed that pinning the most popular datasets in the cache could significantly improve access efficiency. We therefore implemented an LSTM model to predict dataset accesses and optimize cache policies.
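
To make the effect of pinning concrete, here is a small sketch of a least-recently-used (LRU) cache in which a chosen set of datasets is never evicted. It is a toy model rather than XCache's actual policy: the class name `PinnedLRUCache` is hypothetical, every dataset is assumed to occupy one equal-sized slot, and pinned datasets are assumed to be preloaded, so every access to them counts as a hit.

```python
from collections import OrderedDict


class PinnedLRUCache:
    """Toy LRU cache with a set of pinned (never evicted) datasets."""

    def __init__(self, capacity, pinned=()):
        self.capacity = capacity
        self.pinned = set(pinned)
        if len(self.pinned) > capacity:
            raise ValueError("pinned set does not fit in the cache")
        self.lru = OrderedDict()  # unpinned entries, oldest first
        self.hits = 0
        self.misses = 0

    def access(self, dataset):
        """Record one access; return True on a hit, False on a miss."""
        if dataset in self.pinned:
            self.hits += 1  # assumption: pinned datasets are preloaded
            return True
        if dataset in self.lru:
            self.lru.move_to_end(dataset)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        free_slots = self.capacity - len(self.pinned)
        if free_slots > 0:
            if len(self.lru) >= free_slots:
                self.lru.popitem(last=False)  # evict least recently used
            self.lru[dataset] = True
        return False


# Synthetic trace in which dataset "A" is accessed repeatedly.
trace = ["A", "B", "C", "A", "D", "A", "E", "A"]

plain = PinnedLRUCache(capacity=2)
pinned = PinnedLRUCache(capacity=2, pinned={"A"})
for name in trace:
    plain.access(name)
    pinned.access(name)
```

On this synthetic trace the plain LRU cache evicts "A" before it is reused, while the pinned variant keeps it resident. In practice, the pinned set would be chosen from the predicted popularity rather than fixed by hand.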

The model demonstrates strong predictive performance with a low mean relative error of 0.779 across the training and test sets. Future work will incorporate anomaly detection techniques to improve robustness. This study highlights the potential of LSTM models in optimizing distributed content caching in HEP.
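
For reference, one common definition of mean relative error is the average of |predicted - actual| / |actual| over all samples. The sketch below implements that definition; whether the study uses exactly this formula is an assumption, and the `eps` guard against division by zero is an addition made for this example.

```python
def mean_relative_error(actual, predicted, eps=1e-9):
    """Average of |predicted - actual| / |actual| over paired samples.

    `eps` (an assumption of this sketch) avoids division by zero when
    an actual value is 0.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    total = 0.0
    for a, p in zip(actual, predicted):
        total += abs(p - a) / max(abs(a), eps)
    return total / len(actual)


# Made-up access counts, for illustration only.
example_error = mean_relative_error([10, 20, 40], [12, 15, 40])
```

In this example the per-sample relative errors are 0.2, 0.25, and 0, giving a mean of 0.15.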