The Data Source #21 | Apache Arrow DataFusion - A Primer
Welcome to The Data Source, your monthly newsletter covering top innovations and trends across cloud infrastructure, developer tools, and data.
Subscribe now and never miss an issue 👇🏼
Apache Arrow DataFusion is a query engine, originally created by Andy Grove, designed for building robust data-centric systems. Developed in Rust, a language known for its performance and memory safety, DataFusion serves as a foundational component for processing large datasets.
In the evolving landscape of data processing and analytics platforms, DataFusion distinguishes itself with its seamless integration with the Apache Arrow format. Apache Arrow, a cross-language development platform for in-memory data, is designed to expedite analytical processing. Leveraging Arrow's columnar memory layout and zero-copy data exchange capabilities, DataFusion enhances performance and efficiency in handling large datasets. This integration minimizes data movement and serialization overhead, thereby facilitating faster query execution and reducing computational costs.
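The performance argument for a columnar layout is easy to see in miniature. The sketch below is plain Rust with no DataFusion dependency, and every name in it is illustrative: it contrasts a row-oriented and an Arrow-style column-oriented representation of the same records. Aggregating two fields from the columnar form streams through dense, contiguous buffers, which is what lets engines like DataFusion scan, cache, and vectorize efficiently.

```rust
// Row-oriented: each record's fields are stored together.
struct TradeRow {
    symbol_id: u32,
    price: f64,
    size: u64,
}

// Column-oriented (Arrow-style): each field lives in its own
// contiguous buffer, so a scan of `prices` reads one dense array.
struct TradeColumns {
    symbol_ids: Vec<u32>,
    prices: Vec<f64>,
    sizes: Vec<u64>,
}

fn total_notional_rows(rows: &[TradeRow]) -> f64 {
    // Walks every row, pulling fields out of interleaved storage.
    rows.iter().map(|r| r.price * r.size as f64).sum()
}

fn total_notional_columns(cols: &TradeColumns) -> f64 {
    // Streams through two dense buffers; friendly to caches and SIMD.
    cols.prices
        .iter()
        .zip(&cols.sizes)
        .map(|(p, s)| p * *s as f64)
        .sum()
}

fn main() {
    let rows = vec![
        TradeRow { symbol_id: 1, price: 10.0, size: 3 },
        TradeRow { symbol_id: 2, price: 20.0, size: 2 },
    ];
    let cols = TradeColumns {
        symbol_ids: rows.iter().map(|r| r.symbol_id).collect(),
        prices: rows.iter().map(|r| r.price).collect(),
        sizes: rows.iter().map(|r| r.size).collect(),
    };
    // Both layouts compute the same answer: 10*3 + 20*2 = 70.
    assert_eq!(total_notional_rows(&rows), total_notional_columns(&cols));
    println!("{}", total_notional_columns(&cols));
}
```

The zero-copy point follows from the same picture: because Arrow fixes the byte layout of those buffers, two processes (or two libraries in one process) can hand a batch back and forth by sharing the buffers rather than re-serializing them.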
Within the competitive market, various platforms offer unique strengths. Spark provides versatility and scalability alongside a rich ecosystem of libraries. Presto excels in interactive SQL querying over large datasets. Flink prioritizes real-time stream processing with low latency and high throughput. Dask competes in scalable data processing, particularly for Python-centric workflows. ClickHouse stands out for its prowess in real-time analytics.
The growing community around DataFusion suggests that more advancements will be made in the project. What I’m most excited about is seeing whether DataFusion ends up challenging Apache Spark's dominance, given Spark's widespread adoption and associated complexities around data movement and resource provisioning.
Currently, DataFusion is widely adopted across diverse data processing use cases:
Specialized analytical database systems like HoraeDB leverage DataFusion for in-depth data analysis.
General-purpose systems such as Ballista rely on DataFusion for diverse data processing tasks similar to Apache Spark.
Projects like Blaze use DataFusion as a native Spark runtime replacement, offering improved performance.
Streaming data platforms like Synnada benefit from DataFusion's capabilities in processing continuous data streams, suited for transaction-oriented systems.
Research platforms like Flock utilize DataFusion for experimenting with new ideas in data storage and analysis.
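What all of these projects share is that DataFusion embeds as a library rather than running as a separate server. A minimal sketch of that pattern is below, using the `datafusion` crate's `SessionContext`; the `tokio` dependency, the `events.parquet` file, and the column names are assumptions for illustration, not taken from any of the projects above.

```rust
// Minimal sketch of embedding DataFusion in a Rust program.
// Assumes `datafusion` and `tokio` as Cargo dependencies, and a
// hypothetical local file "events.parquet" with a `user_id` column.
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Create a session, register a Parquet file as a table,
    // and run plain SQL against it; results come back as Arrow batches.
    let ctx = SessionContext::new();
    ctx.register_parquet("events", "events.parquet", ParquetReadOptions::default())
        .await?;
    let df = ctx
        .sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id")
        .await?;
    df.show().await?;
    Ok(())
}
```

Because the engine is just a crate, projects like Blaze and Comet can swap it in underneath an existing system, while databases like HoraeDB and InfluxDB build their query layer on top of it.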
⚙️ Tools to know 🔨
Arroyo, a Rust-based distributed stream processing engine
📙 We built a new SQL Engine on Arrow and DataFusion
Arroyo unveiled a revamped SQL engine powered by Apache Arrow and the DataFusion SQL toolkit. Initially, Arroyo relied on DataFusion solely for SQL parsing; it now integrates DataFusion's physical plans, operators, and expressions as well, enhancing flexibility and compatibility, especially for streaming operations. This shift mirrors the momentum within the Rust data community, with DataFusion spearheading efforts to bolster streaming capabilities alongside its established batch processing prowess.
Comet, a plugin for Apache Spark enabling native query execution
📙 Apple’s Comet Brings Fast Vector Processing to Apache Spark
Apple has released a plug-in that brings vectorized, native query execution to Apache Spark, augmenting the platform's suitability for large-scale machine learning data analysis. Comet is built on the adaptable Apache DataFusion query engine, written in Rust, and uses the Arrow columnar data format to accelerate Spark query execution.
InfluxDB, a Time Series Database
📙 Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0
InfluxDB 3.0 faces the challenge of efficiently handling incoming data and making it queryable. Instead of building a new engine from scratch, InfluxData chose to use Apache Arrow DataFusion for querying time series data using SQL. This enables them to implement custom features like late arrival resolution and time series gap filling with ease, and extend functionalities like the Flux language in their cloud environment.
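Gap filling, mentioned above, means materializing output rows for time buckets that recorded no data. The toy Python below is not InfluxDB's implementation and every name in it is illustrative; it only shows what the operation does for a regularly spaced series.

```python
def fill_gaps(points, start, stop, step, fill=None):
    """Return one (timestamp, value) pair per bucket in [start, stop),
    inserting `fill` for buckets that have no recorded point.

    `points` maps timestamp -> value; timestamps align to `step`.
    """
    return [(t, points.get(t, fill)) for t in range(start, stop, step)]

# A series sampled every 10s, with readings missing at t=10 and t=30.
readings = {0: 1.5, 20: 1.7}
print(fill_gaps(readings, 0, 40, 10))
# -> [(0, 1.5), (10, None), (20, 1.7), (30, None)]
```

In a SQL engine the same effect is achieved inside the query plan, so downstream consumers see a dense series without post-processing; late arrival resolution is the complementary problem of slotting delayed points back into the buckets they belong to.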
ParadeDB, PostgreSQL for Search & Analytics
📙 pg_analytics: Transforming Postgres into a Fast OLAP Database
Building an advanced analytical database in Postgres is expensive and challenging. Early attempts like Greenplum and subsequent products from Citus and Timescale have struggled to match the performance of non-Postgres databases. This has led many companies to favor alternatives like Elasticsearch. However, with the rise of embeddable query engines like DataFusion, the ParadeDB team believes the project can outpace many OLAP databases in query speed, indicating a shift away from building query engines from scratch within databases. By integrating with DataFusion, ParadeDB can continuously inherit upstream improvements to query performance.
Let’s Chat! ☎️
Are you a data practitioner leveraging DataFusion or interested in chatting about the broader landscape of data processing / analytics tools? Please reach out as I’d love to swap notes!
My Twitter (@psomrah) & 📩 (priyanka@work-bench.com) are always open, so please reach out to chat.
Priyanka 🌊
…
I’m a Principal at Work-Bench, an early-stage, enterprise software-focused VC firm based in NYC. Our sweet spot for investment is the Pre-Seed & Seed stage, which often coincides with building out a startup’s early go-to-market motion. In the data, dev tools & infra world, we’ve invested in companies like Cockroach Labs, Arthur, Alkymi, Streamdal and others.