Wednesday, November 10, 2021

Anbox

Anbox is a container-based approach to boot a full Android system on a regular GNU/Linux system like Ubuntu. In other words: Anbox will let you run Android on your Linux system without the slowness of virtualization.

Anbox uses Linux namespaces (user, pid, uts, net, mount, ipc) to run a full Android system in a container and provide Android applications on any GNU/Linux-based platform.

The Android inside the container has no direct access to any hardware; all hardware access goes through the Anbox daemon on the host. Anbox reuses what Android implemented within the QEMU-based emulator for OpenGL ES accelerated rendering. The Android system inside the container uses several pipes to communicate with the host system and sends all hardware access commands through them.

https://github.com/anbox/anbox 

DisCo

Extracting actionable insight from complex unlabeled scientific data is an open challenge and key to unlocking data-driven discovery in science. Complementary and alternative to supervised machine learning approaches, unsupervised physics-based methods based on behavior-driven theories hold great promise. Due to computational limitations, practical application on real-world domain science problems has lagged far behind theoretical development. We present our first step towards bridging this divide - DisCo - a high-performance distributed workflow for the behavior-driven local causal state theory. DisCo provides a scalable unsupervised physics-based representation learning method that decomposes spatiotemporal systems into their structurally relevant components, which are captured by the latent local causal state variables. Complex spatiotemporal systems are generally highly structured and organize around a lower-dimensional skeleton of coherent structures, and in several firsts we demonstrate the efficacy of DisCo in capturing such structures from observational and simulated scientific data. To the best of our knowledge, DisCo is also the first application software developed entirely in Python to scale to over 1000 machine nodes, providing good performance while ensuring domain scientists' productivity. We developed scalable, performant methods optimized for Intel many-core processors that will be upstreamed to open-source Python library packages. Our capstone experiment, using the newly developed DisCo workflow and libraries, performs unsupervised spacetime segmentation analysis of CAM5.1 climate simulation data, processing an unprecedented 89.5 TB in 6.6 minutes end-to-end on 1024 Intel Haswell nodes of the Cori supercomputer, achieving 91% weak-scaling and 64% strong-scaling efficiency.

https://arxiv.org/abs/1909.11822 

https://github.com/adamrupe/DisCo 

 

Tuesday, November 9, 2021

Polars

Polars is a blazingly fast DataFrames library implemented in Rust using Apache Arrow as its memory model.

The goal of Polars is to be a lightning-fast DataFrame library that utilizes all available cores on your machine.

Polars is semi-lazy. It allows you to do most of your work eagerly, similar to pandas, but it also provides a powerful expression syntax that is optimized and executed on Polars' query engine.

Polars also supports full lazy query execution that allows for more query optimization.

Polars keeps track of your query in a logical plan. This plan is optimized and reordered before running it. When a result is requested, Polars distributes the available work to different executors that use the algorithms available in the eager API to come up with the result. Because the whole query context is known to the optimizer and executors of the logical plan, processes dependent on separate data sources can be parallelized on the fly.
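
To make the eager/lazy distinction concrete, here is a minimal sketch in Python. The column names and data are illustrative, and the API follows the Polars user guide of this era:

```python
import polars as pl

df = pl.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})

# Eager: runs immediately, much like pandas.
eager = df.groupby("group").agg([pl.col("value").sum()])

# Lazy: .lazy() builds a logical plan; nothing executes until .collect(),
# which gives the query engine a chance to optimize the whole plan first.
lazy = (
    df.lazy()
    .filter(pl.col("value") > 1)
    .groupby("group")
    .agg([pl.col("value").sum()])
    .collect()
)
```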

Below is a concise list of the features that allow Polars to meet its goals (a pushdown sketch follows the list):

  • Copy-on-write (COW) semantics
    • "Free" clones
    • Cheap appends
  • Appending without clones
  • Column-oriented data storage
    • No block manager (i.e., predictable performance)
  • Missing values indicated with bitmask
    • NaN values are distinct from missing values
    • Bitmask optimizations
  • Efficient algorithms
  • Query optimizations
    • Predicate pushdown
      • Filtering at scan level
    • Projection pushdown
      • Projection at scan level
    • Simplify expressions
    • Parallel execution of physical plan
  • SIMD vectorization
  • NumPy universal functions
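
As promised above, a small sketch of predicate and projection pushdown with the lazy API; the file name and column names are hypothetical:

```python
import polars as pl

# scan_csv is lazy: the filter (predicate) and the column selection
# (projection) are pushed down into the scan, so only matching rows
# and the two requested columns are ever read.
lf = (
    pl.scan_csv("measurements.csv")              # hypothetical file
    .select([pl.col("station"), pl.col("temp")])
    .filter(pl.col("temp") > 0.0)
)
df = lf.collect()  # the optimized plan runs only here
```

In the lazy API you can also inspect the optimized logical plan (for example with describe_optimized_plan() in the Polars of this era) to verify that the filter was indeed moved into the scan.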

https://pola-rs.github.io/polars-book/user-guide/index.html 

https://github.com/pola-rs/polars 

https://www.kdnuggets.com/2021/05/pandas-faster-pypolars.html 

Hub

Hub is a dataset format with a simple API for creating, storing, and collaborating on AI datasets of any size. The Hub data layout enables rapid transformations and streaming of data while training models at scale. Hub is used by Google, Waymo, Red Cross, Oxford University, and Omdena.

Hub includes the following features:

  • Storage agnostic API: Use the same API to upload, download, and stream datasets to/from AWS S3/S3-compatible storage, GCP, Activeloop cloud, local storage, as well as in-memory.
  • Compressed storage: Store images and audio in their native compression, decompressing them only when needed, e.g., when training a model.
  • Lazy NumPy-like slicing: Treat your S3 or GCP datasets as if they are a collection of NumPy arrays in your system's memory. Slice them, index them, or iterate through them. Only the bytes you ask for will be downloaded! (See the sketch after this list.)
  • Dataset version control: Commits, branches, checkout - Concepts you are already familiar with in your code repositories can now be applied to your datasets as well.
  • Third-party integrations: Hub comes with built-in integrations for PyTorch and TensorFlow. Train your model with a few lines of code - we even take care of dataset shuffling. :)
  • Distributed transforms: Rapidly apply transformations on your datasets using multi-threading, multi-processing, or our built-in Ray integration.
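
A minimal sketch of the lazy-slicing behavior described above, following the examples in the Hub README; the dataset path and tensor name (images) are illustrative:

```python
import hub

# Open a remotely stored dataset; no image bytes are fetched yet.
ds = hub.load("hub://activeloop/mnist-train")  # illustrative path

# Slicing builds a lazy view over the remote data.
view = ds.images[0:32]

# Only this call downloads (and decompresses) the requested bytes.
batch = view.numpy()
print(batch.shape)
```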

https://github.com/activeloopai/Hub 

https://www.kdnuggets.com/2021/11/after-hdf5-data-storage-format-deep-learning.html 

PetscSF

PetscSF, the communication component of the Portable, Extensible Toolkit for Scientific Computation (PETSc), is designed to provide PETSc's communication infrastructure suitable for exascale computers that utilize GPUs and other accelerators. PetscSF provides a simple application programming interface (API) for managing common communication patterns in scientific computations by using a star-forest graph representation. PetscSF supports several implementations based on MPI and NVSHMEM, whose selection is based on the characteristics of the application or the target architecture. An efficient and portable model for network and intra-node communication is essential for implementing large-scale applications. The Message Passing Interface, which has been the de facto standard for distributed memory systems, has developed into a large, complex API that does not yet provide high performance on the emerging heterogeneous CPU-GPU-based exascale systems. In this paper, we discuss the design of PetscSF, show how it can overcome some difficulties of working directly with MPI on GPUs, and demonstrate its performance, scalability, and novel features.
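
PetscSF itself is a C library, but the star-forest idea is easy to sketch: each rank owns some "root" values, every "leaf" holds a (rank, index) reference to a root, and a broadcast pushes root values out to the leaves. The toy below models that pattern in plain Python dictionaries; it makes no PETSc calls, and the layout is invented for illustration:

```python
# Toy star forest: rank -> the root values that rank owns.
roots = {0: [10.0, 20.0], 1: [30.0]}
# rank -> leaf references, each an (owner_rank, root_index) pair.
leaves = {0: [(1, 0)], 1: [(0, 1), (0, 0)]}

def sf_broadcast(roots, leaves):
    """Copy each referenced root value out to the leaf that points at it -
    the root-to-leaf pattern PetscSFBcastBegin/End performs across ranks."""
    return {rank: [roots[owner][idx] for owner, idx in refs]
            for rank, refs in leaves.items()}

print(sf_broadcast(roots, leaves))  # {0: [30.0], 1: [20.0, 10.0]}
```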

https://arxiv.org/abs/2102.13018 

https://www.nextplatform.com/2021/03/01/rethinking-mpi-for-gpu-accelerated-supercomputers/ 

 

Arkouda

We have developed a software package, called Arkouda, which allows a user to interactively issue massively parallel computations on distributed data using functions and syntax that mimic NumPy, the underlying computational library used in the vast majority of Python data science workflows. The computational heart of Arkouda is a Chapel interpreter that accepts a pre-defined set of commands from a client (currently implemented in Python) and uses Chapel's built-in machinery for multi-locale and multithreaded execution. Arkouda has benefited greatly from Chapel's distinctive features and has also helped guide the development of the language.
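
On the client side this looks like NumPy. A minimal sketch, assuming an arkouda_server is already running on localhost at the default port:

```python
import arkouda as ak

ak.connect("localhost", 5555)   # attach to a running arkouda_server

a = ak.randint(0, 10, 10**6)    # the array lives on the server, not in Python;
                                # scale the size up on a real cluster
total = (a * 2).sum()           # NumPy-like syntax, executed by the Chapel server
print(total)

ak.disconnect()
```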

In early applications, users of Arkouda have tended to iterate rapidly between multi-node execution with Arkouda and single-node analysis in Python, relying on Arkouda to filter a large dataset down to a smaller collection suitable for analysis in Python, and then feeding the results back into Arkouda computations on the full dataset. This paradigm has already proved very fruitful for exploratory data analysis (EDA). Our goal is to enable users to progress seamlessly from EDA to specialized algorithms by making Arkouda an integration point for HPC implementations of expensive kernels like FFTs, sparse linear algebra, and graph traversal. With Arkouda serving the role of a shell, a data scientist could explore, prepare, and call optimized HPC libraries on massive datasets, all within the same interactive session.

Arkouda is not trying to replace Pandas but to allow for some Pandas-style operations at a much larger scale. In our experience, Pandas can handle dataframes of up to about 500 million rows before performance becomes a real issue, provided that you run on a sufficiently capable compute server. Arkouda breaks the shared-memory paradigm and scales its operations to dataframes with over 200 billion rows. In practice, we have run Arkouda server operations on columns of one trillion elements on 512 compute nodes, yielding a dataframe of more than 20 TB in Arkouda.

https://github.com/Bears-R-Us/arkouda 

https://arkouda.readthedocs.io/en/latest/ 

https://www.youtube.com/watch?v=hzLbJF-fvjQ&t=3s 

https://www.youtube.com/watch?v=g-G_Z_3pgUE 

 

Chapel

Chapel is a modern programming language designed for productive parallel computing at scale. Chapel's design and implementation have been undertaken with portability in mind, permitting Chapel to run on multicore desktops and laptops, commodity clusters, and the cloud, in addition to the high-end supercomputers for which it was originally designed.

Why Chapel?  Because it simplifies parallel programming through elegant support for:

  • distributed arrays that can leverage thousands of nodes' memories and cores
  • a global namespace supporting direct access to local or remote variables
  • data parallelism to trivially use the cores of a laptop, cluster, or supercomputer
  • task parallelism to create concurrency within a node or across the system

Chapel Characteristics

  • productive: code tends to be as readable and writable as Python
  • scalable: runs on laptops, clusters, the cloud, and HPC systems
  • fast: performance competes with or beats C/C++ & MPI & OpenMP
  • portable: compiles and runs in virtually any *nix environment
  • open-source: hosted on GitHub, permissively licensed

 

https://github.com/chapel-lang/chapel 

https://github.com/Bears-R-Us/arkouda 

https://github.com/pnnl/chgl 

https://github.com/marcoscleison/awesome-chapel 

https://www.youtube.com/channel/UCHmm27bYjhknK5mU7ZzPGsQ 

https://news.ycombinator.com/item?id=22708041