Tuesday, November 9, 2021

Hub

Hub is a dataset format with a simple API for creating, storing, and collaborating on AI datasets of any size. The Hub data layout enables rapid transformations and streaming of data while training models at scale. Hub is used by Google, Waymo, Red Cross, Oxford University, and Omdena.

Hub includes the following features:

  • Storage agnostic API: Use the same API to upload, download, and stream datasets to/from AWS S3/S3-compatible storage, GCP, Activeloop cloud, local storage, as well as in-memory.
  • Compressed storage: Store images and audio in their native compression, decompressing them only when needed, for example, when training a model.
  • Lazy NumPy-like slicing: Treat your S3 or GCP datasets as if they are a collection of NumPy arrays in your system's memory. Slice them, index them, or iterate through them. Only the bytes you ask for will be downloaded!
  • Dataset version control: Commits, branches, checkout. Concepts you are already familiar with from your code repositories can now be applied to your datasets as well.
  • Third-party integrations: Hub comes with built-in integrations for PyTorch and TensorFlow. Train your model with a few lines of code - we even take care of dataset shuffling. :)
  • Distributed transforms: Rapidly apply transformations on your datasets using multi-threading, multi-processing, or our built-in Ray integration.
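
To make the "compressed storage" and "lazy slicing" ideas above more concrete, here is a minimal conceptual sketch in plain Python. It is not Hub's API or implementation: the class name LazyChunkedStore, its chunk_size parameter, and the fetch_count counter are all hypothetical, and an in-memory dict of zlib-compressed chunks stands in for objects in a remote bucket. It only illustrates the general pattern of fetching and decompressing just the chunks that a slice touches.

```python
import zlib


class LazyChunkedStore:
    """Toy stand-in for a remote, chunked dataset (hypothetical, not Hub's API)."""

    def __init__(self, samples, chunk_size=4):
        # `samples` is assumed to be a list of ints in the range 0..255.
        self.chunk_size = chunk_size
        self.length = len(samples)
        self.fetch_count = 0  # number of chunks "downloaded" so far
        # Each chunk is kept as compressed bytes, mimicking objects in a bucket.
        self._chunks = {}
        for start in range(0, self.length, chunk_size):
            raw = bytes(samples[start:start + chunk_size])
            self._chunks[start // chunk_size] = zlib.compress(raw)

    def _fetch(self, chunk_id):
        # Decompression happens only here, when a chunk is actually requested.
        self.fetch_count += 1
        return zlib.decompress(self._chunks[chunk_id])

    def __len__(self):
        return self.length

    def __getitem__(self, key):
        if isinstance(key, slice):
            indices = range(*key.indices(self.length))
        else:
            if key < 0:
                key += self.length
            if not 0 <= key < self.length:
                raise IndexError("index out of range")
            indices = [key]

        fetched = {}  # chunks already fetched for this one request
        values = []
        for i in indices:
            chunk_id, offset = divmod(i, self.chunk_size)
            if chunk_id not in fetched:
                fetched[chunk_id] = self._fetch(chunk_id)
            values.append(fetched[chunk_id][offset])
        return values if isinstance(key, slice) else values[0]


# 20 samples split into 5 chunks of 4; the slice below touches only 2 of them.
store = LazyChunkedStore(list(range(20)), chunk_size=4)
window = store[5:11]
print(window, "chunks fetched:", store.fetch_count)
```

In this toy version, slicing a small window reads two of the five chunks and leaves the rest compressed and untouched, which is the behavior the feature list describes at a much larger scale.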

https://github.com/activeloopai/Hub 

https://www.kdnuggets.com/2021/11/after-hdf5-data-storage-format-deep-learning.html 
