LakeHarbor: Making Structures First-Class Citizens in Data Lakes
Published in ICDE (International Conference on Data Engineering), 2024
Recommended citation: H. Yamada, M. Kitsuregawa, and K. Goda. 2024. LakeHarbor: Making Structures First-Class Citizens in Data Lakes. In ICDE, 5583-5592.
This paper introduces LakeHarbor, a new data management paradigm that makes structures (e.g., indexes) first-class citizens in data lakes. Under the LakeHarbor paradigm, a data lake system flexibly constructs structures from registered access method functions. By exploiting those functions, it executes data processing jobs efficiently using the potential parallelism inherent in the structures, without sacrificing flexible data processing such as schema-on-read. The paper also presents ReDe, a prototype data processing engine that implements LakeHarbor, together with a motivating evaluation and a case study of ReDe that explore the potential of LakeHarbor.