Abstract
Cloud object stores such as Amazon S3 are some of the largest and most cost-effective storage systems on the planet, making them an attractive target for storing large data warehouses and data lakes. Unfortunately, their implementation as key-value stores makes it difficult to achieve ACID transactions and high performance: metadata operations such as listing objects are expensive, and consistency guarantees are limited. In this paper, we present Delta Lake, an open-source ACID table storage layer over cloud object stores, initially developed at Databricks. Delta Lake uses a transaction log that is compacted into Apache Parquet format to provide ACID properties, time travel, and significantly faster metadata operations for large tabular datasets (e.g., the ability to quickly search billions of table partitions for those relevant to a query). It also leverages this design to provide high-level features such as automatic data layout optimization, upserts, caching, and audit logs. Delta Lake tables can be accessed from Apache Spark, Hive, Presto, Redshift, and other systems. Delta Lake is deployed at thousands of Databricks customers that process exabytes of data per day, with the largest instances managing exabyte-scale datasets and billions of objects.
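As a rough illustration of how an ordered transaction log can give atomic commits and time travel, the toy sketch below uses a local directory as a stand-in for the object store. It is not Delta Lake's actual protocol or file format: the function names `commit` and `snapshot`, the `op` and `path` fields, and the zero-padded JSON file names are all assumptions made for this example, and the compaction of the log into Parquet is omitted.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Try to write log record `version`; return False if it already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes file creation fail if the name is taken, so only one
        # writer can claim a given version number (a "put if absent").
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    # Simplification: a real store would have to publish the record's name
    # and contents in one atomic step; here the sketch is single-threaded.
    with os.fdopen(fd, "w") as f:
        json.dump(actions, f)
    return True


def snapshot(log_dir, as_of=None):
    """Replay log records in order and return the set of live data files."""
    live = set()
    # Zero-padded names sort lexicographically in version order.
    for name in sorted(os.listdir(log_dir)):
        version = int(name.split(".")[0])
        if as_of is not None and version > as_of:
            break
        with open(os.path.join(log_dir, name)) as f:
            for action in json.load(f):
                if action["op"] == "add":
                    live.add(action["path"])
                elif action["op"] == "remove":
                    live.discard(action["path"])
    return live


log_dir = tempfile.mkdtemp()
first = commit(log_dir, 0, [{"op": "add", "path": "part-0.parquet"}])
second = commit(log_dir, 1, [{"op": "remove", "path": "part-0.parquet"},
                             {"op": "add", "path": "part-1.parquet"}])
# A second writer racing for version 1 loses and would have to retry.
conflict = commit(log_dir, 1, [{"op": "add", "path": "part-2.parquet"}])
latest = snapshot(log_dir)          # current table contents
old = snapshot(log_dir, as_of=0)    # "time travel" to version 0
```

In this sketch, readers never list data files directly; they derive the table's contents from the log, which is why replaying up to an earlier version yields an earlier view of the table.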