MIR file format
===============

MIR is backed by the `HDF5 <https://www.hdfgroup.org/solutions/hdf5/>`_ data
file format. HDF5 is an extensible and open standard, composed of platform
independent technologies which are available under Open Source licenses. The
HDF5 format has been designed with the objective of creating data stores that
are self-describing, flexible, and provide extremely fast and efficient access
to the stored data.

.. note:: Although not mandatory, MIR data files should have the ``.mir``
   extension (lowercase), for easier identification on the file system.

HDF5 key concepts
-----------------

Although HDF5 key concepts are best explained by the `HDF5 User's Guide
<https://support.hdfgroup.org/HDF5/doc/UG/>`_, a quick overview is given below
for the reader's convenience:

Group
   A collection of HDF5 objects (including other groups). As suggested by its
   name "Hierarchical Data Format", an HDF5 file is hierarchically structured,
   like a tree. This tree structure is very similar to the file system
   structures employed on UNIX systems, with directories and files: HDF5
   groups are analogous to the directories, HDF5 datasets to the files. Every
   HDF5 data file has at least one object, the root group. All other objects
   (including groups) are either members of the root group or its descendants.

Dataset
   A multidimensional array of data elements, together with supporting
   metadata. An HDF5 dataset is an object composed of a collection of data
   elements and metadata that stores a description of the data elements, the
   data layout, and all other information necessary to write, read, and
   interpret the stored data.

Data type
   A description of a specific class of data element, including its storage
   layout as a pattern of bits. HDF5 data types implement a flexible,
   extensible, and portable mechanism for specifying and discovering the
   storage layout of the data elements, determining how to interpret the
   elements (for example, as floating-point numbers), and for transferring
   data between different compatible layouts. Atomic data types are
   indivisible; composite data types are composed of multiple elements of
   atomic data types. In addition to the standard types, users can define
   additional custom data types.

Attribute
   A small metadata object attached to a group or a dataset. Attributes are a
   critical part of what makes HDF5 a "self-describing" format. They are small
   named pieces of data attached directly to group or dataset objects, and
   they are the official way to store metadata in HDF5.

Mapping MIR concepts onto HDF5
------------------------------

In general, MIR concepts map cleanly onto their HDF5 counterparts:

========= =========
MIR       HDF5
========= =========
Data type Data type
Attribute Attribute
Dataset   Dataset
Layer     Group
========= =========

Detailed MIR/HDF5 data type map:

- integer numbers are stored as 32-bit signed integers, using the
  ``H5T_STD_I32LE`` HDF5 data type
- floating point numbers are stored as IEEE 754 binary64, using the
  ``H5T_IEEE_F64LE`` HDF5 data type
- all Unicode strings are UTF-8 encoded

Additionally, each MIR layer (HDF5 group) is a child of the root HDF5 group,
as illustrated by the sketch below.
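For illustration, here is a minimal sketch of this mapping using the h5py
bindings. The file, layer, dataset, and attribute names (``example.mir``,
``signal``, ``samples``, ``values``, ``description``) are hypothetical
examples; only the data type mapping above is prescribed by MIR.

.. code-block:: python

   import h5py
   import numpy as np

   # A minimal sketch, assuming h5py; all names below are hypothetical.
   with h5py.File("example.mir", "w") as f:
       # MIR layer -> HDF5 group, a direct child of the root group
       layer = f.create_group("signal")

       # integers: 32-bit signed, little-endian (H5T_STD_I32LE)
       layer.create_dataset("samples", data=np.arange(10, dtype="<i4"))

       # floating point: IEEE 754 binary64, little-endian (H5T_IEEE_F64LE)
       layer.create_dataset("values",
                            data=np.linspace(0.0, 1.0, 10, dtype="<f8"))

       # MIR attribute -> HDF5 attribute; h5py stores Python str as UTF-8
       layer.attrs["description"] = "a sample layer"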
Recommended optimizations
-------------------------

.. _hdf5-chunking:

Chunked storage layout
^^^^^^^^^^^^^^^^^^^^^^

What is chunking:

   The storage layout defines how the raw data values in the dataset are
   physically stored on disk. There are three ways that a dataset can be
   stored: contiguous, chunked, and compact.

   If the storage layout is contiguous, then the raw data values will be
   stored physically adjacent to each other in the HDF5 file (in one
   contiguous block). This is the default layout for a dataset.

   With a chunked storage layout the data is stored in equal-sized blocks or
   chunks of a pre-defined size. The HDF5 library always writes and reads an
   entire chunk. Each chunk is stored as a separate contiguous block in the
   HDF5 file, and a chunk index keeps track of the chunks associated with a
   dataset. Chunked storage makes it possible to resize datasets and, because
   the data is stored in fixed-size chunks, to use compression filters.

   -- https://support.hdfgroup.org/HDF5/Tutor/layout.html

Although not mandatory, it is still strongly advised to use the chunked
storage layout when creating MIR data files, to improve processing
performance:

   It is commonly used when subsetting very large datasets. Using the chunking
   layout can greatly improve performance when subsetting large datasets,
   because only the chunks required will need to be accessed.

   However, it is easy to use chunking without considering the consequences of
   the chunk size, which can lead to strikingly poor performance. If a very
   small chunk size is specified for a dataset, it can cause the dataset to be
   excessively large, and it can result in degraded performance when accessing
   the dataset: the smaller the chunk size, the more chunks HDF5 has to keep
   track of, and the more time it will take to search for a chunk. Conversely,
   an entire chunk has to be read and uncompressed before performing an
   operation, so there can be a performance penalty for reading a small subset
   if the chunk size is substantially larger than the subset.

   -- https://support.hdfgroup.org/HDF5/Tutor/layout.html

Compression and error detection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When an HDF5 dataset is created, optional filters can be specified. These
filters are added to the data transfer pipeline when data is read or written.
The standard HDF5 library includes filters that implement transparent
compression, data shuffling, and error detection.

.. warning:: To apply a filter to an HDF5 dataset, the dataset must be created
   with a :ref:`chunked storage layout <hdf5-chunking>`.

Although not mandatory, it is still strongly advised to apply the following
filters when creating MIR data files, to minimize disk storage requirements
and improve processing performance:

Compression
   Enable the transparent compression filter to save storage space. Data is
   compressed on the way to disk and automatically decompressed when read.
   Once the dataset has been created with a particular compression filter
   applied, data may be read and written as normal, with no special steps
   required. Although many interesting third-party compression filters are
   supported, HDF5 itself provides only two pre-defined compression filters by
   default: ZLIB and SZIP. SZIP cannot be used freely due to licensing issues,
   therefore ZLIB is recommended for maximum portability. For best overall
   ZLIB performance, the `PyTables optimization guide
   <https://www.pytables.org/usersguide/optimization.html>`_ advises using the
   lowest compression level (1).

Shuffle
   Enable the shuffle filter to improve the compression ratio. Block-oriented
   compressors work better when presented with runs of similar values; the
   shuffle filter rearranges the bytes in the chunk to make such runs more
   likely, which may improve the compression ratio. There is no significant
   speed penalty.

Fletcher32
   Adds an error detection checksum to each chunk to detect data corruption;
   attempts to read corrupted chunks will fail with an error. There is no
   significant speed penalty.
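Putting these recommendations together, the sketch below creates a chunked
dataset with ZLIB compression at level 1, shuffle, and Fletcher32 enabled,
again via h5py. The file, layer, and dataset names, the data, and the chunk
shape are hypothetical; the chunk shape in particular should be tuned to the
typical read pattern of the application, per the guidance above.

.. code-block:: python

   import h5py
   import numpy as np

   data = np.random.random((10000, 1000))  # binary64 by default

   with h5py.File("optimized.mir", "w") as f:
       layer = f.create_group("signal")
       layer.create_dataset(
           "values",
           data=data,
           chunks=(1000, 100),   # chunked layout; a hypothetical chunk shape
           compression="gzip",   # ZLIB, the portable pre-defined filter
           compression_opts=1,   # lowest level, per the PyTables advice
           shuffle=True,         # byte-shuffle to improve compression ratio
           fletcher32=True,      # per-chunk checksum for error detection
       )

Reading such a dataset back requires no special steps: the HDF5 library
applies the filters transparently in the data transfer pipeline.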