Our Apache Hadoop®-based data platform ingests hundreds of petabytes of analytical data with minimal latency and stores it in a data lake built on top of the Hadoop Distributed File System (HDFS). We use Apache Hudi™ as our ingestion table format and Apache Parquet™ as the underlying file format. Our data platform leverages Apache Hive™, Apache Presto™, and Apache Spark™ for both interactive and long-running queries, serving the myriad needs of different teams at Uber. Uber's growth over the last few years has exponentially increased both the volume of data and the associated access loads required to process it. As data volume grows, so do the associated storage and compute costs, resulting in growing hardware purchasing requirements, higher resource usage, and even out-of-memory (OOM) errors or long GC pauses.

The main goal of this blog is to address storage cost efficiency, but the side benefits also include reduced CPU, IO, and network consumption. We started several initiatives to reduce storage cost, including setting a TTL (time to live) on old partitions, moving data from hot/warm to cold storage, and reducing data size at the file format level. In this blog, we will focus on reducing the data size in storage at the file format level, essentially in Parquet.

Uber data is ingested into HDFS and registered as either raw or modeled tables, mainly in the Parquet format, with a small portion in the ORC file format. Our initiatives and the discussion in this blog center on Parquet.

Apache Parquet is a columnar storage file format that supports nested data and is widely used by analytic frameworks. In a columnar storage format, data is laid out so that the values of a column are stored next to one another rather than next to the other fields of their row. This makes the subsequent encoding and compression of the column data more efficient. The Parquet format is depicted in the diagram below: the file is divided into row groups, each row group consists of a column chunk per column, and each column chunk is divided into pages, on which encoding and compression are performed.

Figure 1: Apache Parquet File Format Structure

When a Parquet file is written, the column data is assembled into pages, which are the unit for encoding and compression. Parquet provides a variety of encoding mechanisms, including plain, dictionary encoding, run-length encoding (RLE), delta encoding, and byte stream split encoding. Most encodings work best on runs of continuous zeros, which improve encoding efficiency. After encoding, compression further reduces the data size without losing information. Parquet supports several compression codecs, including SNAPPY, GZIP, LZO, BROTLI, LZ4, and ZSTD. Generally, choosing the right compression method is a trade-off between compression ratio and read/write speed.

Uber's data lake ingestion platform uses Apache Hudi, which was bootstrapped at Uber, as the table format, and Parquet is Hudi's first-class file format. Parquet is also the first-class storage format in Spark, and Presto favors Parquet as well, with many optimizations built around it. Hive is the only engine that still produces data in the ORC format, and even there a significant portion of Hive-generated tables are in Parquet. In short, Parquet is the major file format at Uber.
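To make the ratio-versus-speed trade-off concrete, here is a minimal sketch of how the codec choice is expressed when writing Parquet from Spark. It assumes a Scala/Spark environment whose bundled Parquet writer supports ZSTD; the toy dataset and output paths are made up for illustration and are not from Uber's pipeline.

```scala
import org.apache.spark.sql.SparkSession

object ParquetCompressionDemo extends App {
  val spark = SparkSession.builder()
    .appName("parquet-compression-demo")
    .master("local[*]")
    .getOrCreate()

  // Toy data; any DataFrame is written the same way.
  val df = spark.range(0, 1000000).toDF("id")

  // The compression codec is a per-write option: heavier codecs such as ZSTD
  // usually shrink files more than SNAPPY at the cost of extra CPU time.
  df.write.option("compression", "snappy").parquet("/tmp/demo_parquet_snappy")
  df.write.option("compression", "zstd").parquet("/tmp/demo_parquet_zstd")

  // The same choice can also be made session-wide.
  spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

  spark.stop()
}
```

Comparing the sizes of the two output directories is a quick way to see how much a codec switch buys for a given table.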
Snappy is a compression format introduced by Google, and I am trying my hand at it using Scala and Java libraries. I initially used the SnappyFlows Scala library for compression/decompression. Then I tried SnappyCodec, which is a Java library for compression/decompression. I am usually under the impression that if the file format is the same, then each library should apply the same compression/decompression logic. To my surprise, I found that compressing raw data into .snappy using SnappyCodec and then decompressing it with the Scala library SnappyFlows returns a different result. When I try to decompress using SnappyFlows#decompress:

val decompressed = Source.single(rawData).via(SnappyFlows.decompress).runWith(Sink.fold(ByteString.empty)(_ ++ _))

it complains: "The future returned an exception of type: me.…, with message: Invalid header". As a consumer, I definitely do not want to tweak the header. My question here is: is it that SnappyFlows uses a framing format while SnappyCodec is a native snappy library producing the raw format?
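If that framing-format hypothesis is right, the behavior above is expected: the two snappy wire formats are not interchangeable. The sketch below does not use the exact libraries from the question; it uses org.xerial.snappy (snappy-java), which exposes both the raw block format and the framing format, just to show that the two produce different byte streams. The object and variable names are mine for illustration.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets

import org.xerial.snappy.{Snappy, SnappyFramedOutputStream}

object SnappyFormatsDemo extends App {
  val raw      = "hello snappy " * 100
  val rawBytes = raw.getBytes(StandardCharsets.UTF_8)

  // Raw (block) snappy: just the compressed payload, no stream header.
  val block = Snappy.compress(rawBytes)

  // Framed snappy: a stream identifier chunk followed by framed data chunks.
  val buffer    = new ByteArrayOutputStream()
  val framedOut = new SnappyFramedOutputStream(buffer)
  framedOut.write(rawBytes)
  framedOut.close()
  val framed = buffer.toByteArray

  println(s"raw block bytes: ${block.length}")
  println(s"framed bytes:    ${framed.length}")

  // Round-tripping each format with its own decompressor works fine.
  val roundTrip = new String(Snappy.uncompress(block), StandardCharsets.UTF_8)
  println(s"raw round trip ok: ${roundTrip == raw}")

  // Feeding the raw block to a framed decompressor (or vice versa) fails,
  // because the framed reader expects a stream identifier it never finds;
  // that is the same kind of "Invalid header" failure described above.
}
```

In other words, the fix is usually not to tweak any headers but to make sure the producer and the consumer agree on which snappy format, raw block or framed stream, is being used.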