Arrow acts as a new high-performance interface between various systems. It is also focused on supporting a wide variety of industry-standard programming languages. Java, C, C++, Python, Ruby, and JavaScript implementations are in progress.
The reference Arrow implementations contain a number of distinct software components:
- Columnar vector and table-like containers (similar to data frames) supporting flat or nested types
- Fast, language agnostic metadata messaging layer (using Google's Flatbuffers library)
- Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files
- Low-overhead IO interfaces to files on disk, HDFS (C++ only)
- Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)
- Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)
- Conversions to and from other in-memory data structures (e.g. Python's pandas library)
https://arrow.apache.org/
Apache Arrow and the "10 Things I Hate About Pandas" - http://wesmckinney.com/blog/apache-arrow-pandas-internals/
No comments:
Post a Comment