Deck Chairs and Fiddles: PyTables

Wednesday, April 19, 2017

PyTables

"PyTables is a package for managing hierarchical datasets, designed to cope efficiently and easily with extremely large amounts of data. You can download PyTables and use it for free. Documentation, examples of use and presentations are available on the project website.

PyTables is built on top of the HDF5 library, using the Python language and the NumPy package. It features an object-oriented interface that, combined with C extensions for the performance-critical parts of the code (generated using Cython), makes it a fast yet extremely easy-to-use tool for interactively browsing, processing and searching very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space (especially if on-the-fly compression is used) than in other solutions such as relational or object-oriented databases.

PyTables takes advantage of the object orientation and introspection capabilities offered by Python, the powerful data management features of HDF5, NumPy’s flexibility, and Numexpr’s high-performance manipulation of large sets of objects organized in a grid-like fashion to provide these features:

Support for table entities: You can tailor your data by adding or deleting records in your tables. Large numbers of rows (up to 2**63, much more than will fit into memory) are supported as well.
Multidimensional and nested table cells: You can declare a column to consist of values with any number of dimensions, not just scalars, which are the only dimensionality allowed by the majority of relational databases. You can even declare columns that are made of other columns (of different types).
Indexing support for columns of tables: Very useful if you have large tables and want to quickly look up values in columns that satisfy some criteria.
Support for numerical arrays: NumPy arrays can be used as a useful complement to tables for storing homogeneous data.
Enlargeable arrays: You can add new elements to existing arrays on disk along any one dimension you choose. You can also access just a slice of your datasets by using the powerful extended slicing mechanism, without needing to load the complete dataset into memory.
Variable length arrays: The number of elements in these arrays can vary from row to row. This provides a lot of flexibility when dealing with complex data.
Supports a hierarchical data model: Allows the user to clearly structure all data. PyTables builds up an object tree in memory that replicates the underlying file data structure. Access to objects in the file is achieved by walking through and manipulating this object tree. The object tree is also built lazily, for efficiency.
User defined metadata: Besides supporting system metadata (like the number of rows of a table, shape, flavor, etc.), the user may specify arbitrary metadata (for example, room temperature, or the protocol of the IP traffic that was collected) that complements the meaning of the actual data.
Ability to read/modify generic HDF5 files: PyTables can access a wide range of objects in generic HDF5 files, like compound type datasets (that can be mapped to Table objects), homogeneous datasets (that can be mapped to Array objects) or variable length record datasets (that can be mapped to VLArray objects). If a dataset is not supported, it will be mapped to a special UnImplemented class that lets the user see that the data is there, although it will be unreachable (you will still be able to access the attributes and some metadata in the dataset). With that, PyTables can probably access and modify most of the HDF5 files out there.
Data compression: Supports data compression (using the Zlib, LZO, bzip2 and Blosc compression libraries) out of the box. This is important when you have repetitive data patterns and don’t want to spend time searching for an optimized way to store them (saving you time spent analyzing your data organization).
High performance I/O: On modern systems storing large amounts of data, tables and array objects can be read and written at a speed only limited by the performance of the underlying I/O subsystem. Moreover, if your data is compressible, even that limit is surmountable!
Support for files bigger than 2 GB: PyTables automatically inherits this capability from the underlying HDF5 library (assuming your platform supports the C long long integer, or, on Windows, __int64).
Architecture-independent: PyTables has been carefully coded (as HDF5 itself) with little-endian/big-endian byte ordering issues in mind. So, you can write a file on a big-endian machine (like a Sparc or MIPS) and read it on a little-endian machine (like an Intel or Alpha) without problems. In addition, it has been tested successfully on 64-bit platforms (Intel-64, AMD-64, PowerPC-G5, MIPS, UltraSparc) using code generated with 64-bit aware compilers."
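
To make a few of the features above concrete, here is a minimal sketch of how they might look in code. It assumes PyTables is installed and importable as tables. The file layout (a "lab" group holding "readings", "trace" and "events"), the Reading record and the function names build_demo and summarize are all hypothetical, invented for this illustration. It touches on a compressed table with a multidimensional cell, a column index, an in-kernel query, user metadata, an enlargeable array and a variable length array.

```python
import os
import tempfile

import numpy as np
import tables as tb


class Reading(tb.IsDescription):
    """Hypothetical record layout, invented for this sketch."""
    sensor = tb.StringCol(8, pos=0)
    temperature = tb.Float64Col(pos=1)
    samples = tb.Int32Col(shape=(3,), pos=2)  # a multidimensional table cell


def build_demo(path):
    """Write a small HDF5 file that exercises several PyTables features."""
    # Zlib compression is applied transparently on write and read.
    filters = tb.Filters(complevel=5, complib="zlib")
    with tb.open_file(path, mode="w", title="Demo file") as h5:
        group = h5.create_group("/", "lab", "Lab measurements")

        # Table entity: rows are appended through the table's row buffer.
        table = h5.create_table(group, "readings", Reading, filters=filters)
        row = table.row
        for i in range(1000):
            row["sensor"] = f"s{i % 4}".encode("ascii")
            row["temperature"] = 15.0 + (i % 20)  # values from 15.0 to 34.0
            row["samples"] = [i, i + 1, i + 2]
            row.append()
        table.flush()

        # User-defined metadata stored next to the data.
        table.attrs.room_temperature = 21.5
        # Column index to speed up later lookups on this column.
        table.cols.temperature.create_index()

        # Enlargeable array: the 0 in the shape marks the growable dimension.
        trace = h5.create_earray(group, "trace", atom=tb.Float64Atom(),
                                 shape=(0, 2), filters=filters)
        trace.append(np.zeros((5, 2)))
        trace.append(np.ones((3, 2)))

        # Variable length array: each row can hold a different element count.
        events = h5.create_vlarray(group, "events", atom=tb.Int32Atom())
        events.append([1, 2, 3])
        events.append([4])


def summarize(path):
    """Read the file back, loading only the pieces that are asked for."""
    with tb.open_file(path, mode="r") as h5:
        table = h5.root.lab.readings
        trace = h5.root.lab.trace
        # In-kernel query: the condition string is evaluated by PyTables.
        hot = [r["temperature"] for r in table.where("temperature > 33.0")]
        return {
            "nrows": int(table.nrows),
            "hot_rows": len(hot),
            "indexed": bool(table.cols.temperature.is_indexed),
            "room_temperature": float(table.attrs.room_temperature),
            "first_samples": table[0]["samples"].tolist(),
            "trace_shape": tuple(int(n) for n in trace.shape),
            "trace_slice": trace[4:6].tolist(),  # reads just two rows
            "event_lengths": [len(v) for v in h5.root.lab.events],
        }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.h5")
        build_demo(demo_path)
        print(summarize(demo_path))
```

Swapping complib for "blosc", "lzo" or "bzip2" should work the same way where the corresponding library is available in your PyTables build; zlib is used here only because it is the most widely available choice.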

http://www.pytables.org/

Large Data Analysis With Python - http://www.pytables.org/docs/LargeDataAnalysis.pdf

ViTables is a component of the PyTables family. It is a GUI for browsing and editing files in both PyTables and HDF5 formats. It is developed using Python and PyQt (the Python bindings to Qt), so it can run on any platform that supports these components.

http://vitables.org/
