Tuesday, January 16, 2018

Apache Arrow

"Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.

Arrow acts as a new high-performance interface between various systems. It is also focused on supporting a wide variety of industry-standard programming languages. Java, C, C++, Python, Ruby, and JavaScript implementations are in progress.

The reference Arrow implementations contain a number of distinct software components:
  • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types
  • Fast, language agnostic metadata messaging layer (using Google's Flatbuffers library)
  • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files
  • Low-overhead IO interfaces to files on disk, HDFS (C++ only)
  • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)
  • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)
  • Conversions to and from other in-memory data structures (e.g. Python's pandas library)"
https://github.com/apache/arrow

https://arrow.apache.org/

Apache Arrow and the "10 Things I Hate About Pandas" - http://wesmckinney.com/blog/apache-arrow-pandas-internals/
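
To make the "columnar memory format" in the overview above a bit more concrete, here is a minimal pure-Python sketch of how a nullable string column can be laid out as three contiguous buffers: a validity bitmap, an int32 offsets buffer, and a data buffer. It does not use the Arrow libraries at all; the function names (build_string_column, read_string) are my own, and the sketch skips the buffer padding and alignment that the real format requires, so treat it as an illustration of the idea rather than the actual implementation.

```python
import struct


def build_string_column(values):
    """Pack a list of str-or-None into (validity, offsets, data) buffers."""
    validity = bytearray((len(values) + 7) // 8)  # one bit per value
    offsets = [0]                                 # n + 1 entries for n values
    data = bytearray()                            # all UTF-8 bytes, back to back
    for i, value in enumerate(values):
        if value is not None:
            validity[i // 8] |= 1 << (i % 8)      # least-significant bit first
            data += value.encode("utf-8")
        offsets.append(len(data))                 # a null repeats the previous offset
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)
    return bytes(validity), offsets_buf, bytes(data)


def read_string(column, i):
    """Read value i straight out of the buffers, or None if it is null."""
    validity, offsets_buf, data = column
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    start, end = struct.unpack_from("<2i", offsets_buf, 4 * i)
    # memoryview slices the data buffer without copying it first
    return memoryview(data)[start:end].tobytes().decode("utf-8")


column = build_string_column(["arrow", None, "pandas"])
print(read_string(column, 0), read_string(column, 1), read_string(column, 2))
```

Because every value lives at a computable position inside a few flat buffers, a reader in another language only needs the buffers and the type description to get at the data, which is roughly what makes sharing without per-value serialization possible.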




 
