Tuesday, November 9, 2021

Polars

Polars is a blazingly fast DataFrame library implemented in Rust, using Apache Arrow (arrow2) as its memory model.

The goal of Polars is to be a lightning-fast DataFrame library that uses all available cores on your machine.

Polars is semi-lazy. It lets you do most of your work eagerly, similar to pandas, but it also provides a powerful expression syntax that is optimized and executed on Polars' query engine.

Polars also supports full lazy query execution that allows for more query optimization.

Polars keeps track of your query in a logical plan. This plan is optimized and reordered before it runs. When a result is requested, Polars distributes the available work to different executors that use the algorithms available in the eager API to come up with the result. Because the whole query context is known to the optimizer and executors of the logical plan, processes dependent on separate data sources can be parallelized on the fly.
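To make the idea of rewriting a plan before running it more concrete, here is a toy sketch in plain Python. It is not Polars' implementation: the Scan, Filter and Select node types and the optimize and execute helpers are hypothetical names invented for this illustration. The optimizer folds filter and column-selection steps into the scan, so rows and columns are dropped while the data is read rather than afterwards.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]

@dataclass
class Scan:                       # hypothetical plan node: read the source
    rows: List[Row]
    predicate: Optional[Callable[[Row], bool]] = None  # filter applied during the scan
    columns: Optional[List[str]] = None                # columns kept during the scan

@dataclass
class Filter:                     # hypothetical plan node: keep matching rows
    child: object
    predicate: Callable[[Row], bool]

@dataclass
class Select:                     # hypothetical plan node: keep some columns
    child: object
    columns: List[str]

def optimize(plan):
    """Fold Filter and Select nodes into the Scan beneath them."""
    if isinstance(plan, Scan):
        return plan
    child = optimize(plan.child)
    if isinstance(plan, Filter):
        old, new = child.predicate, plan.predicate
        pred = new if old is None else (lambda r: old(r) and new(r))
        return Scan(child.rows, pred, child.columns)
    return Scan(child.rows, child.predicate, plan.columns)

def execute(plan):
    """Run a plan node by node and return the resulting rows."""
    if isinstance(plan, Scan):
        out = []
        for row in plan.rows:
            if plan.predicate is None or plan.predicate(row):
                cols = plan.columns if plan.columns is not None else list(row)
                out.append({c: row[c] for c in cols})
        return out
    rows = execute(plan.child)
    if isinstance(plan, Filter):
        return [r for r in rows if plan.predicate(r)]
    return [{c: r[c] for c in plan.columns} for r in rows]

source = [
    {"city": "Oslo", "temp": 3.0},
    {"city": "Lima", "temp": 19.5},
    {"city": "Oslo", "temp": 5.0},
]
plan = Select(Filter(Scan(source), lambda r: r["temp"] > 4.0), ["city"])
optimized = optimize(plan)

print(execute(plan))
print(execute(optimized))
```

The unoptimized plan materializes every row and column before filtering and projecting. The optimized plan is a single scan that skips unwanted rows and columns up front, and it produces the same result.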

Below is a concise list of the features that allow Polars to meet its goals:

  • Copy-on-write (COW) semantics
    • "Free" clones
    • Cheap appends
  • Appending without clones
  • Column oriented data storage
    • No block manager (i.e., predictable performance)
  • Missing values indicated with bitmask
    • NaN is distinct from a missing value
    • Bitmask optimizations
  • Efficient algorithms
  • Query optimizations
    • Predicate pushdown
      • Filtering at scan level
    • Projection pushdown
      • Projection at scan level
    • Simplify expressions
    • Parallel execution of physical plan
  • SIMD vectorization
  • NumPy universal functions
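The "missing values indicated with bitmask" item above can be illustrated with a small plain-Python sketch. It is a simplified model, not Polars or Arrow code: a column is a list of values plus an integer used as a validity bitmask, where bit i set means slot i holds a real value (least-significant bit first, as in Arrow's validity bitmaps). The helper names is_valid and null_count are assumptions made for this example.

```python
import math

# Four slots: slot 1 holds a real NaN, slot 2 is missing (its stored
# value is only a placeholder and must be ignored).
values = [1.5, float("nan"), 0.0, 4.0]
validity = 0b1011  # bits 0, 1 and 3 set: slots 0, 1 and 3 are valid

def is_valid(mask: int, i: int) -> bool:
    """Return True if slot i is marked as present in the bitmask."""
    return (mask >> i) & 1 == 1

def null_count(mask: int, length: int) -> int:
    """Count missing slots by counting set bits, without touching the values."""
    return length - bin(mask).count("1")

for i, v in enumerate(values):
    if not is_valid(validity, i):
        print(i, "missing")
    elif math.isnan(v):
        print(i, "NaN (a valid float, not missing)")
    else:
        print(i, v)

print("null count:", null_count(validity, len(values)))
```

Because missingness lives in the mask rather than in the values, NaN stays an ordinary floating-point value, and questions such as "how many values are missing?" can be answered from the mask alone.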

https://pola-rs.github.io/polars-book/user-guide/index.html 

https://github.com/pola-rs/polars 

https://www.kdnuggets.com/2021/05/pandas-faster-pypolars.html 
