HEP001: Column-Oriented Tabular Data in HDF5

HDF Group

1 Introduction

HDF5 has stored tabular data from its earliest days, and several distinct idioms have grown up around that use case. HEP001 proposes a column-oriented storage layout, in the spirit of Apache Parquet, Apache Arrow, and Feather, that lives natively as an HDF5 group and combines cleanly with the multidimensional array datasets that are HDF5’s traditional strength.

1.1 A short overview of tabular data in HDF5

The first adopted idiom is the HDF5 Table specification, part of the HDF5 High-Level Library and implemented through its H5TB API. A table is a single one-dimensional dataset whose datatype is an HDF5 compound (record) type, decorated with attributes such as CLASS="TABLE", VERSION, TITLE, FIELD_0_NAME … FIELD_N_NAME, FIELD_0_FILL … FIELD_N_FILL, and NROWS. Rows of the logical table become elements of the dataset; columns become fields of the compound datatype. This layout is simple and portable, but it is fundamentally row-oriented: every row occupies contiguous bytes, every column shares the same chunking, and changing a single column’s datatype, chunk shape, or compression filter requires rewriting the entire dataset.

The second influential idiom is PyTables, a Python package that layers a rich query engine on top of HDF5. PyTables likewise stores a table as a single one-dimensional dataset of a compound type, decorated with its own CLASS="TABLE", VERSION, TITLE, FIELD_N_FILL, NROWS, and PYTABLES_FORMAT_VERSION attributes. To this it adds a family of companion structures for indexing, in-kernel queries, and compression with Blosc. PyTables popularized the idea that a useful table format needs more than data bytes: it needs indexes, metadata, and conventions that let tools reason about the data.

The third and most recent influence is Anndata, which treats a table (dataframe) as an HDF5 group rather than a single compound dataset. Each column of the dataframe is stored as its own one-dimensional dataset inside that group. The group carries attributes that tell Anndata how to reassemble the columns into a dataframe — encoding-type="dataframe", encoding-version, column-order (a UTF-8 string array giving the column order), and _index (the name of the column that supplies row labels). Anndata extends this convention with dedicated encodings for nullable integers, nullable booleans, categoricals, sparse matrices, and more. This layout is column-oriented, and it is the closest existing HDF5 practice to what modern analytical engines expect.

1.2 Why columnar, and why now

Row-oriented tables pack the values of all columns for a given row together. That packing is ideal for use cases that touch whole rows (appending rows to a log, reading a few complete records by index), but it imposes three practical limits:

  1. Uniform chunking and filtering. Every column in the table shares the same chunk shape and the same filter pipeline, because both are properties of the single compound dataset. A wide table that mixes a dense float column (well-served by shuffle + Zstd) with a high-cardinality string column (well-served by a dictionary encoding or Blosc bitshuffle) is forced to compromise.

  2. Whole-row I/O for column queries. Selecting one column out of a hundred still reads every column’s bytes, because those bytes are interleaved within each chunk. Analytical workloads routinely scan a few columns of a wide table, and row orientation amplifies I/O in proportion to row width.

  3. Schema evolution. Adding or removing a column changes the compound datatype of the whole dataset, which in HDF5 terms means rewriting every chunk. Columnar layout makes schema evolution a matter of creating or deleting a sibling dataset.

Columnar formats such as Parquet, ORC, and Arrow Feather have become the lingua franca of analytical tabular data precisely because they decouple each column’s physical storage. HEP001 brings the same property to HDF5 while keeping everything the HDF5 ecosystem already has: a single container file, portable self-describing metadata, hierarchical groups, and lossless access to every tool in the HDF5 stack.

A decisive advantage of HDF5 over dedicated columnar formats is that tabular data does not have to live alone. For example, an HDF5 file can hold:

Links, object references, and region references can tie rows of the table to slabs of the image cube without duplicating bytes. Analysts can query the table to identify events of interest and then dereference the rows into pixel-space regions in the same file, on the same storage, in a single API. Columnar tools and array tools meet in the middle.

1.3 Scope and non-goals

HEP001 specifies:

HEP001 does not specify:

2 Conformance

The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY in this document are to be interpreted as described in RFC 2119 and RFC 8174 when, and only when, they appear in all capitals.

A file, group, or dataset is HEP001-conformant when it satisfies every MUST in the section that applies to it. A producer is HEP001-conformant when every table group it writes is conformant; a consumer is conformant when it can read any conformant table group without data loss.

3 Terminology

The following terms are used throughout this specification.

Table group
An HDF5 group that represents one column-oriented table. Identified by the CLASS attribute (see §7.1).
Dataset name
The name of the HDF5 hard link that connects an HDF5 dataset with its parent HDF5 group.
Column dataset
An HDF5 dataset of rank 1 that is a direct child of a table group and represents one column of the table. Datasets inside the reserved CATEGORIES and SEARCH_INDEXES subgroups are not column datasets. The dataset’s name is the column name.
Row index column
A column dataset referenced by the table group’s INDEX_COLUMNS attribute and which therefore supplies row labels for the table. Row index columns are otherwise indistinguishable from any other column dataset; the designation is made at the table-group level, not on the column itself.
Row
A position i in the half-open range [0, NROWS) (see §7.3) within every column dataset of the table group. Every column dataset MUST have the same first-dimension extent and that extent MUST be ≥ NROWS, so the same i refers to the same logical row everywhere.
NROWS
The number of logical rows currently in the table. A scalar uint64 attribute on the table group, defined in §7.3.
Categories dataset
An HDF5 dataset of rank 1 stored under the CATEGORIES child group of a table group that holds the label values backing one or more categorical columns. See §8.7.
Search index dataset
An HDF5 dataset stored under the SEARCH_INDEXES child group of a table group that accelerates queries over one or more column datasets. Each kind of search index is specified in §10.

4 Data model overview

A HEP001 table is an HDF5 group whose direct children are the table’s columns (one or more of which MAY be designated as row index columns via the table group’s INDEX_COLUMNS attribute) and, optionally, two reserved subgroups: CATEGORIES, holding the label datasets that back categorical columns, and SEARCH_INDEXES, holding query-acceleration structures. The table’s authoritative row count lives in the table group’s NROWS attribute (see §7.3); every column dataset has the same first-dimension extent, and rows [0, NROWS) are the table’s data.

The rest of this document specifies each building block: the table group (§7), column datasets (§8), row index columns (§9), and the four kinds of search-index dataset (§10).

5 Object references

Several HEP001 attributes link objects within a table group by HDF5 object reference: INDEX_COLUMNS (§7), CATEGORIES (§8.7), SEARCH_INDEX_LIST (§10), and VALUES (§10.6). Every such reference MUST use the HDF5 standard reference datatype H5T_STD_REF (introduced in HDF5 1.12). H5T_STD_REF is a unified datatype that can carry object, dataset-region, and attribute references, so future HEP revisions can introduce finer-grained linkages without a new on-disk reference format.

Producers MUST NOT write the deprecated object-reference datatype H5T_STD_REF_OBJ. A consumer MAY reject, as non-conformant, any HEP001 reference attribute whose datatype is not H5T_STD_REF.

6 Boolean attributes

HDF5 has no native boolean datatype, and the wider HDF5 ecosystem has not converged on one encoding. HEP001 fixes a single, self-describing form so that boolean attributes are unambiguous on disk.

Every attribute that this specification calls boolean MUST be stored as an HDF5 enumerated datatype with:

In HDF5 DDL (as emitted by h5dump), the datatype is described as:

H5T_ENUM {
   H5T_STD_I8LE;
   "FALSE"    0;
   "TRUE"     1;
}

Producers MUST NOT store a HEP001 boolean as a plain integer, an H5T_BITFIELD, or a string. A consumer determines truth from the enumerated integer value (0 = false, 1 = true).

7 The table group

7.1 Identification: the CLASS attribute

Every HEP001 table group MUST carry an attribute named CLASS with:

A consumer MUST identify a group as a HEP001 table group by, and only by, the presence of a CLASS attribute whose string value equals COLUMN_TABLE. A producer MUST NOT write a CLASS="COLUMN_TABLE" attribute on any group that does not satisfy the rest of this specification.

7.2 The VERSION attribute

Every HEP001 table group MUST carry a scalar, fixed-length ASCII attribute named VERSION whose value is the HEP001 revision the table conforms to. Producers MUST size the attribute to hold the value being written. For this revision the value is "1.0".

HEP001 uses a two-component MAJOR.MINOR version with the major/minor semantics of Semantic Versioning: MAJOR increments on a backward-incompatible change to the data model (one an existing conformant consumer can no longer read), and MINOR on a backward-compatible addition. HEP001 omits SemVer’s third (PATCH) component deliberately since it would carry no actionable meaning. A consumer that prefers a SemVer-style triple MAY read an absent PATCH component as 0 — so "1.0" is equivalent to "1.0.0". Consumers MUST compare VERSION values numerically and MUST refuse to process a table whose MAJOR exceeds the highest MAJOR they implement.
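The version rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function names and the constant HIGHEST_SUPPORTED_MAJOR are hypothetical, and the VERSION string is assumed to have already been read from the attribute.

```python
HIGHEST_SUPPORTED_MAJOR = 1  # hypothetical: newest MAJOR this reader implements


def parse_version(text):
    """Split a two-component VERSION string into (major, minor) integers."""
    parts = text.rstrip("\x00 ").split(".")  # tolerate fixed-length padding
    if len(parts) != 2:
        raise ValueError(f"malformed VERSION: {text!r}")
    return int(parts[0]), int(parts[1])


def check_readable(text):
    """Return (major, minor), or raise if MAJOR is newer than this reader."""
    major, minor = parse_version(text)
    if major > HIGHEST_SUPPORTED_MAJOR:
        raise RuntimeError(
            f"VERSION {text} exceeds supported MAJOR {HIGHEST_SUPPORTED_MAJOR}"
        )
    return major, minor
```

Comparing the parsed tuples rather than the raw strings is what makes the comparison numeric: as strings, "1.10" sorts before "1.9", while as tuples (1, 10) is greater than (1, 9).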

7.3 The NROWS attribute

Every HEP001 table group MUST carry a scalar NROWS attribute of datatype uint64 whose value is the number of logical rows currently in the table. NROWS is the table’s authoritative row count: rows [0, NROWS) of every column dataset are part of the table, and rows [NROWS, extent), when present, are reserved storage.

NROWS and a column dataset’s first-dimension extent are related but not equal in general. Every column’s extent MUST be ≥ NROWS (see §12), but it MAY exceed NROWS when a producer has preallocated space for future appends. A consumer MUST determine the table’s row count from NROWS and MUST NOT infer it from extent — doing so would silently include preallocated slots or post-crash residue as if they were table rows.
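A minimal Python sketch of this rule, modeling each column as an in-memory sequence rather than an HDF5 dataset. The function name logical_rows and the dictionary layout are illustrative assumptions and not part of HEP001.

```python
def logical_rows(columns, nrows):
    """Return every column truncated to the logical rows [0, nrows).

    `columns` maps a column name to a sequence holding the full stored
    extent, which may include a preallocated tail beyond `nrows`.
    """
    for name, values in columns.items():
        if len(values) < nrows:
            raise ValueError(
                f"column {name!r}: extent {len(values)} is smaller than NROWS {nrows}"
            )
    # The row count comes from NROWS, never from the stored extent.
    return {name: values[:nrows] for name, values in columns.items()}
```

With NROWS = 3 and an extent of 5, the last two stored slots are reserved storage and never reach the caller.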

The attribute is borrowed from the long-established HDF5 Table and PyTables conventions, where it plays the same role. Centralizing the row count in one place — independent of any column’s storage state — gives producers a single-attribute commit point for atomic appends (see §11) and gives consumers a single place to look for the table’s size.

For a freshly created, empty table, NROWS = 0. Because every column dataset’s first-dimension extent need only be ≥ NROWS, any extent is then valid, including 0 (a zero-length column is permitted and is the natural form of a brand-new table).

7.4 Optional table group attributes

The table group MAY carry the following attributes.

TITLE
Scalar fixed-length UTF-8 string. Human-readable title of the table, mirroring HDF5 Table and PyTables. Purely descriptive.
INDEX_COLUMNS
One-dimensional attribute of HDF5 object references whose elements point to the column datasets that serve as row labels for this table, in hierarchical order from outermost to innermost level. For example, a table indexed by (donor_id, sample_id, cell_id) writes INDEX_COLUMNS = [ref(donor_id), ref(sample_id), ref(cell_id)]. An empty array or absent attribute means the table has no row labels and rows are positional only. Every reference MUST resolve to a column dataset that is a direct child of the table group and MUST NOT be a null reference. Row index columns apply to the table as a whole: every column in the table is labeled by them. See §9.
column-order
One-dimensional fixed-length UTF-8 string attribute whose elements are the names of the column datasets in their logical order. When present, it fully determines the column order presented to users; when absent, the logical column order is implementation-defined. Producers SHOULD write column-order whenever a table has more than one column. The attribute name uses a hyphen (not an underscore) to match Anndata.
_index
Scalar fixed-length UTF-8 string. The dataset name of the column that supplies the primary row labels for this table (for example, "row_id"). The name is borrowed from Anndata’s _index. When both INDEX_COLUMNS and _index are present, _index MUST equal the dataset name of the column referenced by INDEX_COLUMNS[0].
encoding-type / encoding-version
Scalar fixed-length UTF-8 strings, both optional. These names are borrowed from Anndata’s DataFrame convention (§13.3). A producer MAY set encoding-type="dataframe" and encoding-version="0.2.0" if it separately wants Anndata software to attempt reading the group, but HEP001 neither requires nor interprets them and makes no guarantee that the result is a usable Anndata DataFrame (§15).
description
Scalar fixed-length UTF-8 string. Free-text description of the table’s contents. Longer and richer than TITLE; intended for documentation viewers.
units_vocabulary
Scalar fixed-length UTF-8 string. Names the vocabulary or authority that interprets units strings on columns of this table — for example "UDUNITS-2", "UCUM", "QUDT", or a URL. When present on the table group, it applies as a default to every column whose own units_vocabulary is absent.

7.5 Placement of the table group

The table group MAY be located anywhere in an HDF5 file’s hierarchy. In the simplest case the root group of the file is the table group — the file holds one table and nothing else — and all columns live at the top level. Tables MAY also be nested under named groups to organize many tables, or sited beside multidimensional array datasets that they describe.

7.6 Self-contained contents

A table group is the complete representation of a single column-oriented table. Every HDF5 object — group, dataset, named datatype, or link — that is a descendant of the table group MUST exist solely in service of the data model defined here.

The only objects permitted anywhere in the HDF5 hierarchy below a table group are:

Any descendant of a table group that does not match one of the categories above is non-conformant. A producer MUST NOT place under a table group:

Producers that wish to colocate such content with a table MUST do so by placing it as a sibling of the table group, or under an unrelated parent group elsewhere in the file. Object references, region references, and external links MAY be used to associate the table’s columns with that external content without violating this rule.

A consumer that encounters a descendant of a table group not described by this revision SHOULD emit a diagnostic identifying the offending HDF5 path. A consumer operating in strict-conformance mode MAY refuse to process the table; a consumer operating in lenient mode MAY ignore the offending descendant and continue.

Consumers MAY treat the table group as a closed, self-describing unit: copying, moving, renaming, deleting, or versioning the table group is guaranteed to act on the entire table and on nothing else.

8 Column datasets

8.1 Required properties

Each column of a HEP001 table MUST be stored as an HDF5 dataset that:

The name of the HDF5 dataset is the column name. Any name acceptable as an HDF5 link name (UTF-8, excluding / and NUL) is permitted. Producers MUST NOT use any HEP001 reserved name (§13.2) as a column name. Producers SHOULD also avoid names that begin with an underscore, which are reserved for Anndata-aligned attribute names (_index) and may be claimed by future HEPs.

8.2 Column discovery

The only HDF5 datasets that are direct children of a table group are its column datasets: categories datasets live one level down inside the CATEGORIES subgroup (§8.7), and search-index datasets inside the SEARCH_INDEXES subgroup (§10). A consumer therefore enumerates a table’s columns as exactly the rank-1 datasets that are direct children of the table group; the CATEGORIES and SEARCH_INDEXES children are groups, not datasets, and are skipped automatically.

When column-order (§7) is present it lists exactly these column datasets and fixes their order; when absent, the set of columns is still fully determined by this rule, and only their order is implementation-defined.
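The interaction between discovery and column-order can be sketched in Python. The sketch assumes the caller has already collected the names of the rank-1 datasets that are direct children of the table group; the function name is hypothetical, and sorting by name when column-order is absent is only one possible implementation-defined choice.

```python
def logical_column_order(child_datasets, column_order=None):
    """Return the table's column names in logical order.

    `child_datasets` holds the names of the rank-1 datasets found directly
    under the table group; `column_order` is the value of the column-order
    attribute, or None when the attribute is absent.
    """
    discovered = set(child_datasets)
    if column_order is None:
        # Order is implementation-defined here; sorting is an assumption.
        return sorted(discovered)
    listed = list(column_order)
    if len(listed) != len(set(listed)) or set(listed) != discovered:
        raise ValueError("column-order must list exactly the column datasets")
    return listed
```

Rejecting a column-order that omits or invents a column keeps the attribute consistent with the discovery rule instead of silently preferring one over the other.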

8.3 Datatypes

The datatype of a column dataset MAY be any HDF5 datatype. Consumers that encounter a datatype they do not recognize SHOULD expose the column’s raw datatype instead of quietly ignoring it.

Two caveats apply:

  1. Compound datatypes at the column level are permitted, but a table group is not a mechanism for storing a nested row-oriented table. When a compound-typed column is used, its fields MUST be logically atomic.

  2. Variable-length datatypes (e.g. variable-length UTF-8 strings, ragged numeric arrays) are permitted but generally discouraged, as their storage characteristics are poorly suited to columnar access patterns. Producers SHOULD prefer fixed-length equivalents where possible.

8.4 Chunking and filters

Because each column is its own HDF5 dataset, each column MAY independently select:

This per-column flexibility is the core motivation for HEP001 and is normative: producers MUST NOT require columns to share chunk shape or filters. Consumers MUST treat each column’s storage layout independently.

8.5 Missing values (fill values)

A column dataset’s HDF5 fill value identifies rows whose value is missing. Producers MUST set the dataset’s fill value explicitly via the dataset creation property list (H5Pset_fill_value), placing the dataset into the H5D_FILL_VALUE_USER_DEFINED state, and MUST choose a fill value that lies outside the column’s logical value range. A producer MAY declare that range explicitly with the two attributes valid_min and valid_max on the column dataset (each a scalar of the column’s element datatype). If present, the chosen fill value MUST lie strictly outside [valid_min, valid_max].

Consumers MUST retrieve the fill value (H5Pget_fill_value on the dataset’s creation property list) and identify missing rows with the canonical missing-value test:

missing(value, fill_value) = isnan(fill_value) ? isnan(value) : value == fill_value

For every column whose fill value is not a NaN bit pattern — including every column whose datatype is not floating-point, and every column whose producer chose any of the recommended fill values in Table 1 — the isnan(fill_value) branch is always false and the test reduces to ordinary bit-equality value == fill_value. The same applies even to columns that contain no missing values — no row satisfies the test, and the column simply has no missing rows.

For floating-point columns with a NaN bit pattern as the fill value, the test reduces to isnan(value). IEEE 754 makes NaN != NaN, so a literal value == fill_value test would miss every fill-marked row in such columns; consumers MUST use the isnan(value) branch instead. HEP001 does not recommend NaN as a fill value — it conflates “the producer marked this row missing” with “the result of a floating-point computation was indeterminate” — but it permits NaN for producers who need a zero-cost round-trip with NaN-native ecosystems.
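The canonical missing-value test translates directly into Python. The sketch below assumes values have already been read into Python objects; the helper names is_nan and is_missing are illustrative.

```python
import math


def is_nan(x):
    """True only for a floating-point NaN; other types are never NaN."""
    return isinstance(x, float) and math.isnan(x)


def is_missing(value, fill_value):
    """Canonical test: a NaN fill matches any NaN, otherwise use equality."""
    if is_nan(fill_value):
        return is_nan(value)
    return value == fill_value
```

The explicit is_nan branch matters because float("nan") == float("nan") evaluates to False, so a plain equality test would never flag a NaN-filled row.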

For producers that have no domain-specific constraint forcing a different choice, the table below lists recommended fill values for each datatype family.

Table 1: Recommended fill values

| HDF5 datatype | Recommended fill value | Hex (canonical bit pattern) |
|---|---|---|
| int8 | -127 | 0x81 |
| int16 | -32767 | 0x8001 |
| int32 | -2147483647 | 0x80000001 |
| int64 | -9223372036854775807 | 0x8000000000000001 |
| uint8 | 255 | 0xFF |
| uint16 | 65535 | 0xFFFF |
| uint32 | 4294967295 | 0xFFFFFFFF |
| uint64 | 18446744073709551615 | 0xFFFFFFFFFFFFFFFF |
| float32 | 9.9692099683868690e+36 | 0x7CF00000 |
| float64 | 9.9692099683868690e+36 | 0x479E000000000000 |
| fixed string | "" | n/a |
| vlen string | "" | n/a |
| enumeration | MISSING | n/a |

The recommended sentinels above are chosen as follows.

Integers. Signed-integer sentinels are INT*_MIN + 1 rather than INT*_MIN itself, leaving a one-value safety margin against operations that land on the type’s minimum (e.g., abs(INT*_MIN) is undefined behavior in C). Unsigned sentinels are the type’s maximum.

Floating point. Two choices are acceptable:

  1. The recommended non-NaN sentinel 9.9692099683868690e+36. It is exactly representable in both float32 and float64, so its value survives width casts unchanged and the equality branch of the canonical missing-value test works without rounding tolerances. This is the recommended choice for new tables that do not need byte-level round-tripping with NaN-native ecosystems.

  2. Any NaN bit pattern (quiet, signaling, signed, with arbitrary payload). The canonical missing-value test then takes the isnan(value) branch. This choice is permitted but not recommended on its own merits; producers should use it when zero-cost interoperability with pandas, Anndata, NumPy, R, or other NaN-as-missing ecosystems matters more than disambiguating “missing” from “result of an indeterminate computation”.

Strings. The empty string is the recommended sentinel because it is rarely a meaningful value in practice.

Enumerations. The recommended sentinel is a designated enum member named MISSING. Producers using an enum column with missing values MUST include such a member in the enum type definition; the integer code backing the MISSING member then serves as the column’s fill value.
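The integer and floating-point rows of Table 1 can be re-derived with Python's standard struct module. This is a verification sketch only; the helper names are hypothetical, and the bit patterns are shown in big-endian byte order for readability regardless of the on-disk byte order.

```python
import struct

RECOMMENDED_FLOAT_FILL = 9.9692099683868690e+36


def signed_fill(bits):
    """INT_MIN + 1 for a two's-complement integer of the given width."""
    return -(2 ** (bits - 1)) + 1


def unsigned_fill(bits):
    """Maximum value of an unsigned integer of the given width."""
    return 2 ** bits - 1


def hex_pattern(fmt, value):
    """Big-endian bit pattern of `value` packed with struct format `fmt`."""
    return "0x" + struct.pack(">" + fmt, value).hex().upper()


def survives_float32(value):
    """True when a float64 value round-trips through float32 unchanged."""
    return struct.unpack(">f", struct.pack(">f", value))[0] == value
```

For example, hex_pattern("b", signed_fill(8)) gives the int8 pattern and hex_pattern("f", RECOMMENDED_FLOAT_FILL) the float32 pattern, while survives_float32(RECOMMENDED_FLOAT_FILL) checks the exact-representability property that the equality branch of the missing-value test relies on.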

For datatypes not in the table:

Producers whose column domain includes any of the recommended sentinels above MUST choose a different fill value via H5Pset_fill_value. For columns with a natural numeric range (integer and floating-point columns), producers MUST also declare that range via valid_min and valid_max and choose the fill outside it. For columns without a natural numeric range (strings, opaque), the alternative fill value alone constitutes the override; no additional attribute is required.

8.6 Column attributes

A column dataset MAY carry any of the following attributes.

SEARCH_INDEX_LIST
One-dimensional array of HDF5 object references. Each reference points to a search-index dataset (see §10) in the SEARCH_INDEXES subgroup that accelerates queries on this column.
CATEGORIES
Scalar HDF5 object reference. Used only for categorical columns. Points to the dataset that holds the categorical values (see §8.7).
valid_min / valid_max
Scalar attributes of the same datatype as the column, specifying the minimum and maximum range of the column’s values. See §8.5.
units
Scalar fixed-length UTF-8 string. Physical units of the column’s values. Absence of units implies dimensionless data. Units are interpreted according to units_vocabulary (on the column, or inherited from the table group).
units_vocabulary
Scalar fixed-length UTF-8 string. Identifies the units vocabulary that interprets units. When present on a column, it overrides the table group’s units_vocabulary for that column. MAY be a short name ("UDUNITS-2") or a URL.
description
Scalar fixed-length UTF-8 string. Plain-text description of the column’s contents, provenance, or semantics.

8.7 Categorical columns

A categorical column stores integer codes that index into a separate categories dataset holding the actual label values. The code at row i is the zero-based position of that row’s label in the categories dataset.
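Decoding a categorical column then amounts to a lookup, as in the Python sketch below. It assumes the codes, the categories dataset, and the column's fill value have already been read into memory; representing a missing category as None and rejecting out-of-range codes are choices of this sketch, not requirements of HEP001.

```python
def decode_categorical(codes, categories, fill_value):
    """Map zero-based integer codes to labels; the fill value decodes to None."""
    labels = []
    for code in codes:
        if code == fill_value:
            labels.append(None)  # missing category
        elif 0 <= code < len(categories):
            labels.append(categories[code])
        else:
            raise ValueError(f"code {code} is outside the categories dataset")
    return labels
```

The explicit range check prevents a negative code other than the fill value from silently indexing from the end of the list, which is what plain Python indexing would do.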

The CATEGORIES group

A table group MAY contain a direct child group named CATEGORIES. When present, it MUST hold every categories dataset for the table, and no other objects. It carries no required attributes of its own. Categories datasets in CATEGORIES MAY have any name; the binding between a column and its categories dataset is the column’s CATEGORIES object reference (§5), never a parsed name.

Categorical column requirements

A categorical column MUST:

  1. Have an integer datatype (any width, signed or unsigned). The column’s missing value (the dataset’s fill value) denotes a missing category.

  2. Carry a scalar CATEGORIES attribute whose value is an HDF5 object reference resolving to a categories dataset in the table group’s CATEGORIES subgroup.

A categories dataset MUST:

  1. Live directly under the table group’s CATEGORIES subgroup.

  2. Have rank 1, with any datatype appropriate to the label values.

A categories dataset MAY:

  1. Carry a scalar boolean attribute ordered (matching Anndata’s ordered categoricals), encoded per §6. Producers MUST set ordered to true exactly when the order of entries in the categories dataset is semantically meaningful.

  2. Back more than one categorical column. Several categorical columns that share a common label set MAY reference the same categories dataset through their CATEGORIES attributes; producers need not duplicate a shared code book.

A categories dataset is not a column dataset: it does not count toward the table’s columns, and it MUST NOT appear in the table group’s column-order (§12).

9 Row index columns

A row index column is a column dataset referenced by the table group’s INDEX_COLUMNS attribute (see §7). Row index columns supply row labels for the table as a whole — every column in the table is labeled by every row index column. They are ordinary column datasets in every other respect, and they SHOULD also appear in column-order like any other column.

9.1 Hierarchy

When INDEX_COLUMNS contains more than one reference, the order is the row-label hierarchy from outermost to innermost level. For example, INDEX_COLUMNS = [ref(donor_id), ref(sample_id), ref(cell_id)] declares a three-level row index in which donor_id is the outermost grouping and cell_id is the innermost row identifier.

9.2 Typical uses

10 Search indexes

Search indexes accelerate queries over column values. They do not change the logical table; they are derivative, recomputable data. A conformant consumer MAY ignore any or all search indexes and still return correct answers, only more slowly.

10.1 The SEARCH_INDEXES group

A table group MAY contain a direct child group named SEARCH_INDEXES. When present, it MUST hold every search-index dataset for the table, together with any accompanying datasets those search indexes require and no other objects. It carries no required attributes of its own. Datasets in SEARCH_INDEXES MAY have any name.

A search-index dataset is distinguished from an accompanying dataset by the KIND attribute (§10.3): every search-index dataset MUST carry KIND, and an accompanying dataset MUST NOT carry KIND.

10.2 Linking columns to search indexes

Each column dataset that benefits from a search index MUST reference that search index from its own SEARCH_INDEX_LIST attribute — a 1-D array of HDF5 object references to search-index datasets in the SEARCH_INDEXES subgroup. The column-side attribute is the only linkage; search-index datasets do not carry a back-pointer to the columns they accelerate. To determine the column that a given search-index dataset applies to, scan the column datasets of the table group and identify the one whose SEARCH_INDEX_LIST references that search-index dataset.

10.3 Common per-index attributes

Every search-index dataset MUST carry a scalar fixed-length ASCII attribute KIND whose value is one of the strings defined below:

| KIND value | Purpose | Defined in |
|---|---|---|
| CHUNK_MINMAX | Per-chunk min and max of an orderable column | §10.4 |
| SORTED_ROWS | Row-position permutation of a column | §10.5 |
| BITMAP | Per-value bitmap of a low-cardinality column | §10.6 |
| CHUNK_BLOOM | Per-chunk Bloom filter of a column | §10.7 |

Future HEPs MAY register additional KIND values. Consumers MUST treat unknown KIND values as “ignore this search index”.

A search-index dataset MAY also carry a description attribute (scalar fixed-length UTF-8 string), per the descriptive-annotation convention used elsewhere in this spec (see §13.2, rule 4). Producers SHOULD use description to record provenance: the timestamp of index construction, the producer software and version, and any hyperparameters not captured by the index family’s own attributes. Consumers MAY ignore description for query-execution purposes; it is purely informational.

10.4 Chunk min/max search index (KIND = CHUNK_MINMAX)

A chunk min/max index accelerates range and equality predicates over an orderable column by letting the engine skip chunks whose value range does not overlap the predicate. The column’s datatype MUST have a HEP001-defined order — the same orders enumerated for SORTED_ROWS under Ordering in §10.5 (integers, floating-point, boolean, strings, opaque, and enumerations) — and min and max are computed under that order. The datatypes that SORTED_ROWS excludes for lack of a defined order (object and region references, compound, array, and variable-length-array datatypes) likewise MUST NOT carry a CHUNK_MINMAX index.

Shape: The search-index dataset is 1-D with length equal to the number of chunks of the source column dataset that contain logical-table rows:

Chunks lying entirely in the column’s preallocated tail ([NROWS, extent), see §7.3) are not indexed.
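One way to compute that length is sketched below in Python. It assumes the column is chunked along its single dimension with a fixed chunk length and that chunk 0 starts at row 0, so a chunk contains logical-table rows exactly when its first row is below NROWS. The function name is hypothetical.

```python
def indexed_chunk_count(nrows, chunk_len):
    """Number of chunks that hold at least one row in [0, nrows)."""
    if chunk_len <= 0:
        raise ValueError("chunk length must be positive")
    return -(-nrows // chunk_len)  # ceiling division on integers
```

Under these assumptions a column with NROWS = 101 and a chunk length of 100 has two indexed chunks, however large its preallocated extent is.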

Datatype: An HDF5 compound datatype with the following fields, in declaration order:

Semantics: min and max are computed over only those elements of the chunk that the canonical missing-value test (§8.5) classifies as non-missing, and that are themselves orderable. For floating-point columns this means: NaN values are excluded (because IEEE 754 does not order them, regardless of whether NaN is the column’s fill value), and elements that match the column’s non-NaN fill value (if any) are excluded. For integer and other orderable types, only elements equal to the column’s fill value are excluded. When the chunk has no non-missing, orderable elements (fill_count + nan_count == n for floating-point columns, or fill_count == n for other orderable types), min and max MUST be set to the column’s fill value.
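These semantics can be illustrated with a pure-Python sketch that summarizes one chunk and then uses the summary to skip chunks for a range predicate. Several points are assumptions of the sketch rather than statements of the specification: the record is modeled as a dictionary, n is taken to be the number of logical rows in the chunk, NaN rows are counted only in nan_count even when NaN is the fill value, and the function names are hypothetical.

```python
import math


def _is_nan(x):
    return isinstance(x, float) and math.isnan(x)


def chunk_minmax(chunk, fill_value):
    """Summarize the logical rows of one chunk as a CHUNK_MINMAX-style record."""
    fill_is_nan = _is_nan(fill_value)
    nan_count = sum(1 for v in chunk if _is_nan(v))
    # Assumption: a NaN fill is counted by nan_count, not by fill_count.
    fill_count = 0 if fill_is_nan else sum(1 for v in chunk if v == fill_value)
    kept = [v for v in chunk
            if not _is_nan(v) and (fill_is_nan or v != fill_value)]
    if kept:
        lo, hi = min(kept), max(kept)
    else:
        lo = hi = fill_value  # no non-missing, orderable elements
    return {"min": lo, "max": hi, "nan_count": nan_count,
            "fill_count": fill_count, "n": len(chunk)}


def chunk_may_match(record, lo, hi):
    """False when the chunk provably holds no value in the range [lo, hi]."""
    if record["nan_count"] + record["fill_count"] == record["n"]:
        return False
    return record["min"] <= hi and record["max"] >= lo
```

A True result from chunk_may_match only means the chunk cannot be skipped; the engine still has to read the chunk and evaluate the predicate row by row.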

Applicability: Each CHUNK_MINMAX search-index dataset applies to exactly one column, identified by the column whose SEARCH_INDEX_LIST references it. The column’s HDF5 datatype MUST have a HEP001-defined order, as enumerated for SORTED_ROWS under Ordering (§10.5); otherwise no CHUNK_MINMAX index is permitted. A producer MAY build separate CHUNK_MINMAX indexes for several columns, but a single search-index dataset MUST NOT cover multiple columns because chunks of different columns are independent.

Additional attributes on the search-index dataset:

All other per-index data lives in the compound datatype’s fields (min, max, nan_count, fill_count, n); these are HDF5 datatype fields, not HDF5 attributes.

10.5 Sorted-row permutation index (KIND = SORTED_ROWS)

A sorted-row index stores row positions in sorted order of a column’s values, enabling binary search and range scans without reading the full column.

Shape: 1-D, length equal to NROWS (see §7.3). Rows of the source column lying in the preallocated tail [NROWS, extent) are not part of the permutation.

Datatype: An unsigned integer wide enough to address every row of the source column (typically uint64).

Semantics: Element i of the index is the row position r such that, under the ordering defined below, the i-th rank of the column’s values lives at row r. Ties between rows with identical values MUST be broken by increasing r, so the permutation is total and deterministic.

Ordering: A SORTED_ROWS index MUST only be built over a column whose HDF5 datatype has a HEP001-defined order, as enumerated below:

The following datatypes do not have a defined sorting order, so a SORTED_ROWS index MUST NOT be built on them:

Rows whose value matches a non-NaN column fill value under the canonical missing-value test (§8.5) MUST appear immediately before the NaN tail (if any), in increasing r order, and the fill_tail_length attribute MUST record the count. For datatypes that have no NaN concept, the NaN tail is empty (nan_tail_length = 0) and the fill-tail rows appear at the very end of the permutation. When the column’s fill value is itself NaN, all missing rows are NaN rows and are placed in the NaN tail (see the floating-point ordering rule above); the fill tail is then empty (fill_tail_length = 0).
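The layout of the permutation (ordered rows, then the fill tail, then the NaN tail) can be sketched as below. The function name sorted_rows is hypothetical. The sketch assumes ascending order for non-missing values, approximates the canonical missing-value test (§8.5) by equality with the fill value, and assumes that rows inside the NaN tail are also listed in increasing r; none of these details are fixed by the paragraph above alone.

```python
import math

def _is_nan(x):
    return isinstance(x, float) and math.isnan(x)

def sorted_rows(values, fill=None):
    """Return (permutation, fill_tail_length, nan_tail_length).

    `values` holds rows [0, NROWS) only; the preallocated tail is excluded.
    """
    ordered, fill_tail, nan_tail = [], [], []
    for r, v in enumerate(values):
        if _is_nan(v):
            nan_tail.append(r)        # NaN rows, including a NaN fill value
        elif fill is not None and v == fill:
            fill_tail.append(r)       # rows matching a non-NaN fill value
        else:
            ordered.append(r)
    # Ties between equal values are broken by increasing row position r.
    ordered.sort(key=lambda r: (values[r], r))
    return ordered + fill_tail + nan_tail, len(fill_tail), len(nan_tail)
```

With values [5.0, NaN, 2.0, -1.0, 2.0, -1.0] and fill value -1.0, the permutation is [2, 4, 0, 3, 5, 1] with fill_tail_length = 2 and nan_tail_length = 1.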

Additional attributes:

Applicability: Each SORTED_ROWS search-index dataset applies to exactly one column, identified by the column whose SEARCH_INDEX_LIST references it. The column’s HDF5 datatype MUST have a HEP001-defined order, as enumerated under Ordering above; otherwise no SORTED_ROWS index is permitted.

10.6Bitmap index (KIND = BITMAP)

A bitmap index accelerates equality predicates on low-cardinality columns.

Shape: 2-D of shape (K, ceil(NROWS / 8)), where K is the number of distinct values (or categories) indexed and NROWS is the table’s row count (see §7.3). Rows of the source column lying in the preallocated tail [NROWS, extent) are not indexed.

Datatype: uint8. Bit r % 8 (where bit 0 is the byte’s least significant bit) of byte floor(r / 8) of row k is set if the column’s value at row r equals the k-th indexed value. Because the storage element is uint8, HDF5 performs no byte swapping on read or write, so the bytes on disk are exactly the bytes the producer wrote.

Accompanying values dataset: A sibling 1-D dataset, under SEARCH_INDEXES, holds the K indexed values in the same datatype as the source column. Its name is linked from the bitmap via a scalar VALUES object-reference attribute. This values dataset is an accompanying dataset, not a search-index dataset: it MUST NOT carry a KIND attribute (see §10).
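The bit layout and the role of the values dataset can be illustrated with an in-memory sketch. The function names are hypothetical, the bitmap is modelled as a list of bytearray rows rather than an HDF5 dataset, and the order in which the K values are listed (sorted here) is an assumption; the sketch also indexes every distinct value, including any fill value.

```python
def build_bitmap(column, nrows):
    """Return (values, bitmap) for rows [0, nrows) of `column`."""
    rows = column[:nrows]                  # tail [NROWS, extent) is not indexed
    values = sorted(set(rows))             # the K indexed values (order assumed)
    width = (nrows + 7) // 8               # ceil(NROWS / 8) bytes per bitmap row
    bitmap = [bytearray(width) for _ in values]
    slot = {v: k for k, v in enumerate(values)}
    for r, v in enumerate(rows):
        # Bit r % 8 (LSB = bit 0) of byte r // 8 in row k.
        bitmap[slot[v]][r // 8] |= 1 << (r % 8)
    return values, bitmap

def rows_equal_to(values, bitmap, nrows, target):
    """Answer `column == target` from the bitmap alone."""
    if target not in values:
        return []
    k = values.index(target)
    return [r for r in range(nrows) if (bitmap[k][r // 8] >> (r % 8)) & 1]
```

For a nine-row column a, b, a, c, a, b, a, a, c the bitmap has shape (3, 2), and the row for c holds the bytes 0x08, 0x01 (rows 3 and 8).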

Additional attributes on the bitmap dataset:

Applicability: Each BITMAP search-index dataset applies to exactly one column, identified by the column whose SEARCH_INDEX_LIST references it.

10.7Per-chunk Bloom filter index (KIND = CHUNK_BLOOM)

A per-chunk Bloom filter accelerates equality predicates on high-cardinality columns by giving a fast negative answer for chunks that provably do not contain the queried value.

Shape: 2-D of shape (n_chunks, m_bytes), where n_chunks is the number of chunks of the source column that contain logical-table rows (ceil(NROWS / chunk_length) for a chunked column, 1 for a contiguous column, or 0 when NROWS = 0; see §7.3) and m_bytes is the Bloom-filter byte length per chunk (constant across chunks). Chunks lying entirely in the column’s preallocated tail [NROWS, extent) are not indexed.
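The shape rule reduces to a few lines of integer arithmetic. The sketch below uses hypothetical function names and represents a contiguous column by chunk_length=None.

```python
def bloom_n_chunks(nrows, chunk_length=None):
    """Number of indexed chunks; chunk_length=None models a contiguous column."""
    if nrows == 0:
        return 0
    if chunk_length is None:
        return 1
    return -(-nrows // chunk_length)       # ceil(NROWS / chunk_length)

def bloom_shape(nrows, chunk_length, m_bits):
    """Shape (n_chunks, m_bytes) of a CHUNK_BLOOM dataset."""
    if m_bits % 8 != 0:
        raise ValueError("m_bits must be a multiple of 8")
    return (bloom_n_chunks(nrows, chunk_length), m_bits // 8)
```

For example, a column with NROWS = 101 and a chunk length of 100 has two indexed chunks, even if its extent has been preallocated to 1000 rows.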

Datatype: uint8. Each row is the packed bit array of one chunk’s Bloom filter. Because the storage element is uint8, HDF5 performs no byte swapping on read or write — the bytes on disk are exactly the bytes the producer wrote.

Bit packing: Filter bit g (where 0 ≤ g < m_bits) is stored at bit position g % 8 (where bit 0 is the byte’s least significant bit) of byte g / 8 of the chunk’s row. Equivalently, setting filter bit g is row[g >> 3] |= (uint8_t)1 << (g & 7). m_bits MUST be a multiple of 8, so that m_bytes = m_bits / 8 is exact.

Hash scheme: HEP001 prescribes — for interoperability — Kirsch–Mitzenmacher double hashing h_i(x) = (h_a(x) + i * h_b(x)) mod (8 * m_bytes) for i = 0 … k − 1. Both h_a and h_b come from a single invocation of MurmurHash3_x64_128 (the 128-bit, 64-bit-tuned variant of MurmurHash3) over the value’s canonical byte representation (see below), seeded with the value of the seed attribute. The low 64 bits of the 128-bit output become h_a; the high 64 bits become h_b. The number of hash functions k is stored as an attribute.
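The double-hashing formula and the bit packing can be sketched together. Two caveats apply. First, the Python standard library has no MurmurHash3, so the sketch plugs in a stand-in built on BLAKE2b (standin_hash128, a hypothetical helper); filters built with it are not HEP001-conformant, and a real implementation would pass a MurmurHash3_x64_128 function as hash128 instead. Second, the sketch evaluates h_a + i * h_b in exact integer arithmetic; whether a 64-bit wraparound is intended before the modulo is not stated here and is treated as an open assumption.

```python
import hashlib

def standin_hash128(data: bytes, seed: int):
    """NOT MurmurHash3_x64_128: a stdlib stand-in so the sketch runs."""
    digest = hashlib.blake2b(seed.to_bytes(8, "little") + data,
                             digest_size=16).digest()
    h_a = int.from_bytes(digest[:8], "little")   # "low 64 bits"
    h_b = int.from_bytes(digest[8:], "little")   # "high 64 bits"
    return h_a, h_b

def bit_positions(h_a, h_b, k, m_bytes):
    """h_i = (h_a + i * h_b) mod (8 * m_bytes) for i = 0 .. k - 1."""
    m_bits = 8 * m_bytes
    return [(h_a + i * h_b) % m_bits for i in range(k)]

def bloom_add(row: bytearray, data: bytes, k, seed, hash128=standin_hash128):
    h_a, h_b = hash128(data, seed)
    for g in bit_positions(h_a, h_b, k, len(row)):
        row[g >> 3] |= 1 << (g & 7)              # bit g % 8 of byte g // 8

def bloom_might_contain(row, data: bytes, k, seed, hash128=standin_hash128):
    h_a, h_b = hash128(data, seed)
    return all((row[g >> 3] >> (g & 7)) & 1
               for g in bit_positions(h_a, h_b, k, len(row)))
```

A consumer skips a chunk only when bloom_might_contain returns False for that chunk's row; a True result means the chunk still has to be read.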

Canonical byte representation: Bloom-filter hashes MUST be computed over a column value’s canonical byte representation, defined here. The canonical form is independent of how the column is stored on disk: a column written in big-endian MUST be transcoded to its canonical form before hashing, and a consumer evaluating a query MUST canonicalize the query value identically before testing the filter.

Additional attributes:

Applicability: Each CHUNK_BLOOM search-index dataset applies to exactly one column, identified by the column whose SEARCH_INDEX_LIST references it. The column’s HDF5 datatype MUST belong to one of the kinds enumerated under Canonical byte representation; otherwise no CHUNK_BLOOM index is permitted.

11Writing and appending data

This section specifies how producers add, remove, or rewrite rows of a HEP001 table and how consumers interpret the table during and after those operations. The contract is anchored on the table group’s NROWS attribute (§7.3).

11.1How consumers interpret NROWS

NROWS is the authoritative count of rows currently in the table. A consumer MUST treat rows [0, NROWS) of every column dataset as the table’s data and MUST ignore rows [NROWS, extent) of any column dataset, even when those rows hold values that are not the column’s fill value. The tail is reserved storage, not data; its contents have no semantic meaning under HEP001 and MAY contain arbitrary bytes left behind by a previous write, a previous truncation, or HDF5’s own fill-value mechanism.

The same cutoff applies to search indexes. A consumer MUST consult only those index entries that describe rows or chunks within [0, NROWS). Tail entries — for example, a CHUNK_MINMAX row describing a chunk that lies entirely in [NROWS, extent) — MAY be present as residue from preallocation or from a previous, larger table state, and MUST be ignored.
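A consumer-side sketch of the cutoff, using plain Python lists in place of HDF5 datasets. Both function names are hypothetical, and the second assumes a per-chunk index (such as CHUNK_MINMAX) with one entry per chunk of length chunk_length.

```python
def logical_rows(column, nrows):
    """Rows [0, NROWS) are data; rows [NROWS, extent) are reserved storage."""
    return column[:nrows]

def live_chunk_entries(index_entries, nrows, chunk_length):
    """Keep only per-chunk index entries whose chunk overlaps [0, NROWS)."""
    n_live = -(-nrows // chunk_length)     # ceil(NROWS / chunk_length)
    return index_entries[:n_live]
```

With NROWS = 5 and a chunk length of 4, only the first two per-chunk entries are consulted; a third entry describing rows 8 to 11 is residue and is ignored.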

11.2Appending rows

A producer that appends K new rows to a table MUST perform the operation in the following order, treating step 5 as the single commit point that publishes the new rows to readers:

  1. Read the current NROWS (call it N_old).

  2. Extend every column dataset so that its first-dimension extent is ≥ N_old + K. A column that already has spare capacity from a prior preallocation needs no extension; the requirement is the post-condition extent ≥ N_old + K on every column.

  3. Write the new row values into rows [N_old, N_old + K) of each column. Writes MAY proceed in any order across columns.

  4. Update every search index on every affected column so that, after step 5 commits, the index correctly describes rows [0, N_old + K). For index families that support efficient incremental updates (CHUNK_MINMAX, CHUNK_BLOOM), a producer MAY simply append new entries; for families that do not (SORTED_ROWS, BITMAP), the producer typically rebuilds the index from scratch. A producer that cannot perform a consistent index update MUST delete the affected indexes — and remove the corresponding references from each column’s SEARCH_INDEX_LIST — before step 5.

  5. Commit by writing NROWS = N_old + K as the final step. The attribute update SHOULD be followed by an H5Fflush (or equivalent) before the producer reports the append as complete.

The ordering is load-bearing for crash recovery. Until step 5 commits, every consumer that opens the file still sees NROWS = N_old and therefore the table exactly as it was before the append. A producer that crashes anywhere in steps 1–4 leaves a file whose observable state is identical to the pre-append state: the unused tail in [N_old, extent) is reserved storage, and any index residue beyond N_old is ignored. No cleanup is required for readers to use the file correctly. A subsequent producer that wants to retry the append SHOULD either overwrite the unused tail or extend further.
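The append order and its crash behaviour can be illustrated with a toy in-memory model. Everything here is a stand-in: the Table class, the append_rows function, and the crash_before_commit flag are hypothetical, columns are Python lists rather than HDF5 datasets, and step 4 is omitted because the toy table carries no search indexes.

```python
class Table:
    """Toy stand-in for a table group: list columns plus an NROWS counter."""
    def __init__(self, names):
        self.columns = {name: [] for name in names}
        self.nrows = 0

    def read(self):
        # Consumers see only rows [0, NROWS) of every column.
        return {name: col[:self.nrows] for name, col in self.columns.items()}

def append_rows(table, new_rows, crash_before_commit=False):
    k = len(next(iter(new_rows.values())))
    n_old = table.nrows                                   # step 1
    for col in table.columns.values():                    # step 2
        col.extend([None] * max(0, n_old + k - len(col)))
    for name, vals in new_rows.items():                   # step 3
        table.columns[name][n_old:n_old + k] = vals
    # Step 4 (update or delete search indexes) would go here.
    if crash_before_commit:
        raise RuntimeError("simulated crash before step 5")
    table.nrows = n_old + k                               # step 5: commit
```

If append_rows raises before the last line, read() still returns the pre-append table, and a retry simply overwrites the unused tail.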

HDF5 itself does not guarantee atomic ordering between the data writes of step 3, the index writes of step 4, and the attribute update of step 5 unless the producer issues explicit H5Fflush calls between them. A producer that requires strong durability across a crash MUST issue an H5Fflush after step 4 and again after step 5, so that the on-disk state cannot show NROWS = N_old + K together with unwritten column data or stale indexes.

11.3Preallocation

A producer MAY extend column datasets past the current NROWS to amortize the cost of H5Dset_extent across many small appends — for example, extending by one chunk’s worth of rows at a time and filling that chunk over several append batches. The extended tail is reserved storage; its contents have no semantic meaning under HEP001 until a subsequent commit (step 5 above) increases NROWS to cover them.

A producer that preallocates SHOULD ensure that all columns in the same table group remain at equal first-dimension extents after each operation (see §12). The simplest and recommended discipline is to preallocate every column by the same number of rows at the same time.
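One possible preallocation discipline, rounding the required extent up to a whole number of chunks and applying the same target to every column, is sketched below. The function names are hypothetical and extents are modelled as a plain dictionary from column name to first-dimension extent.

```python
def preallocated_extent(needed_rows, chunk_length):
    """Smallest multiple of chunk_length that is >= needed_rows."""
    return -(-needed_rows // chunk_length) * chunk_length

def extend_all(extents, needed_rows, chunk_length):
    """Return new extents: every column grows to the same target."""
    target = max(max(extents.values()),
                 preallocated_extent(needed_rows, chunk_length))
    return {name: target for name in extents}
```

With a chunk length of 1000, a table that needs room for 1001 rows is extended to 2000 rows on every column, and the next 999 single-row appends need no further extension.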

11.4Truncation

A producer MAY shrink the logical table by writing a smaller NROWS value. The truncation is logical: the column datasets MAY retain their old extents, with rows [new_NROWS, old_NROWS) becoming reserved storage. A producer that wants to reclaim physical space MUST rewrite each column dataset to its new extent — typically via the h5repack utility or an equivalent rewriting tool.

The same index-consistency rule that governs appends applies on truncation: the producer MUST update every affected search index to match the new NROWS, or delete it (and remove the corresponding SEARCH_INDEX_LIST entry on the column) before committing the new NROWS value.
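The conservative option (delete every index, then commit) can be sketched on a dictionary model of a table group. The function name and the dictionary layout are hypothetical; a producer that can shrink its indexes consistently would update them instead of deleting them.

```python
def truncate(table, new_nrows):
    """Logically truncate: drop all search indexes, then commit NROWS last."""
    if new_nrows > table["NROWS"]:
        raise ValueError("truncation cannot grow the table")
    for column in table["columns"].values():
        for index_name in column["SEARCH_INDEX_LIST"]:
            table["SEARCH_INDEXES"].pop(index_name, None)  # delete the index
        column["SEARCH_INDEX_LIST"] = []                   # and its reference
    table["NROWS"] = new_nrows                             # commit point
```

The column data itself is left in place: rows [new_NROWS, old_NROWS) simply become reserved storage.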

11.5In-place updates

A producer MAY rewrite individual cells of the table — change the value at one or more row positions < NROWS — without altering NROWS. Producers MUST update every affected search index to reflect the new values, or delete it before committing the change. In-place updates do not benefit from the single-attribute commit point that NROWS provides; producers that require atomic semantics for in-place edits MUST arrange them externally (for example, by writing to a fresh column dataset and swapping it in under a future revision of this HEP, or by using application-level coordination outside HDF5).

12Consistency requirements

A conformant table group satisfies all of the following at all times:

  1. The table group MUST carry a scalar NROWS attribute of datatype uint64 (see §7.3).

  2. Every column dataset (including row index columns) in the same table group MUST have the same first-dimension extent, and that extent MUST be ≥ NROWS.

  3. Every search-index dataset in the SEARCH_INDEXES group MUST carry a KIND attribute. The only other datasets permitted in the SEARCH_INDEXES group are the accompanying datasets those search indexes require, and an accompanying dataset MUST NOT carry a KIND attribute.

  4. Every reference in a column’s SEARCH_INDEX_LIST MUST resolve to a search-index dataset under the table group’s SEARCH_INDEXES subgroup.

  5. Every categorical column’s CATEGORIES reference MUST resolve to a categories dataset in the table group’s CATEGORIES subgroup (§8.7). The CATEGORIES subgroup, when present, MUST contain only categories datasets, and every categories dataset MUST be referenced by at least one categorical column’s CATEGORIES attribute.

  6. column-order, when present, MUST list every column dataset of the table exactly once and MUST NOT list any dataset that is not a column dataset (in particular, not a categories dataset or a search-index dataset).

  7. Every reference in the table group’s INDEX_COLUMNS attribute (when present) MUST resolve to a column dataset that is a direct child of the table group and MUST NOT be a null reference; when _index is also present, it MUST equal the dataset name of the column referenced by INDEX_COLUMNS[0].

  8. Every categorical column’s fill value (see §8.5) MUST NOT collide with a valid integer code in the linked categories dataset’s index range, so that the canonical missing-value test (§8.5) unambiguously denotes “missing category” rather than “valid value at category index fill.”

  9. Every search-index dataset’s content MUST correctly describe its source column for row positions in [0, NROWS). Tail entries that describe positions ≥ NROWS — for example, residue from preallocation or from a prior truncation — MAY be present and MUST be ignored by consumers.

A producer that mutates a table (appends rows, truncates, rewrites a column, etc.) MUST follow §11 — either updating the affected search indexes consistently or deleting them before committing the mutation through NROWS.
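Several of these requirements lend themselves to a mechanical check. The sketch below covers rules 2, 6, 7, and 8 only, on a hypothetical dictionary description of a table group (column extents, an optional n_categories count, and an integer fill code); it is an illustration of the checks, not a conformance tool.

```python
def check_table(table):
    """Return a list of violated rules (empty when none are found)."""
    problems = []
    nrows = table["NROWS"]
    columns = table["columns"]
    extents = {c["extent"] for c in columns.values()}
    if len(extents) > 1:
        problems.append("rule 2: column extents differ")
    if any(e < nrows for e in extents):
        problems.append("rule 2: extent < NROWS")
    order = table.get("column-order")
    if order is not None and sorted(order) != sorted(columns):
        problems.append("rule 6: column-order must list each column once")
    index_columns = table.get("INDEX_COLUMNS", [])
    if any(name not in columns for name in index_columns):
        problems.append("rule 7: INDEX_COLUMNS entry is not a column")
    if "_index" in table and index_columns and table["_index"] != index_columns[0]:
        problems.append("rule 7: _index differs from INDEX_COLUMNS[0]")
    for name, c in columns.items():
        n_cat = c.get("n_categories")
        if n_cat is not None and 0 <= c["fill"] < n_cat:
            problems.append(f"rule 8: fill value of {name} is a valid code")
    return problems
```

Applied to a description of the minimal table in §14.1 (with an assumed fill code of -1 for label), the function returns an empty list.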

13Reserved names

HEP001 follows the long-standing HDF Group High-Level API practice, established by the HDF5 Table, Image, and Dimension Scales specifications, of writing reserved attribute and group names in fixed-length uppercase ASCII. The intent is that a reader scanning an HDF5 file can tell at a glance which names belong to the specification and which were chosen by the producer of the data.

13.1Naming rules

  1. Every attribute or group name introduced as part of the HEP001 specification MUST be written in fixed-length uppercase ASCII, with underscores as the only word separator (for example, CLASS, INDEX_COLUMNS, SEARCH_INDEXES).

  2. Producers MUST NOT use any reserved name listed in Section 13.2 for a column dataset, a search-index dataset, a user-supplied attribute, or any purpose other than the one this HEP assigns to it.

  3. Names that HEP001 deliberately borrows from other ecosystems, currently only from Anndata’s DataFrame layout (§15), are exempt from rule 1 and MUST be written exactly as those ecosystems write them. They are listed in Section 13.3.

  4. Several attributes that align with broader scientific HDF5 community practice are exempt from rule 1 and MUST be written in lowercase, so that generic metadata harvesters and existing tools can discover them on a HEP001 table without case-folding.

    The descriptive annotation attributes units, units_vocabulary, and description are lowercase and carry no contractual meaning: their presence, absence, or value does not change how a HEP001 consumer interprets the table or any of its objects. They are defined alongside the objects that may carry them (The table group, Column datasets) and do not appear in the reserved-name catalog. Any future descriptive annotation attribute introduced by this HEP or a successor MUST follow the same lowercase convention.

    The value-domain attributes valid_min and valid_max, when present on a column dataset, carry contractual meaning: the column’s fill value MUST lie strictly outside [valid_min, valid_max] (see §8.5). They appear in the reserved-name catalog (Section 13.2) despite being lowercase.

  5. Per-search-index attributes that are private to a specific search-index family — including configuration parameters, declarations of the algorithm used, and computed output metrics — are not part of the reserved name contract and are written in lowercase snake_case. They are documented with the index family that defines them (Search indexes).

  6. KIND values (the string contents of the KIND attribute) are themselves reserved tokens and follow the same uppercase rule as reserved names.

13.2Reserved name catalog

The complete set of HEP001 reserved names is listed below.

Group names

CATEGORIES
The reserved subgroup of a table group that holds every categories dataset for the table. See Section 8.7. The token CATEGORIES names both this group and the column attribute of the same name below; the two are unambiguous because one is an HDF5 link name and the other an attribute name, but producers should be aware of the overload.
SEARCH_INDEXES
The reserved subgroup of a table group that holds every search-index dataset for the table. See Search indexes.

Table group attribute names

CLASS
Identifies the group as a HEP001 table group. See Section 7.1.
VERSION
HEP001 revision the table conforms to.
NROWS
Scalar uint64. Number of logical rows currently in the table. See §7.3.
TITLE
Human-readable title of the table (optional).
INDEX_COLUMNS
A 1-D array attribute of HDF5 object references, whose elements point to the column datasets that serve as row labels for the table, in hierarchical order from outermost to innermost level. See Row index columns.

Column dataset attribute names

SEARCH_INDEX_LIST
Object references to the search-index datasets that accelerate queries on this column.
CATEGORIES
Object reference to the categories dataset, in the table group’s CATEGORIES subgroup, that backs a categorical column. Shares its token with the CATEGORIES group above.
valid_min, valid_max (lowercase, by exception)
Inclusive lower and upper bounds of the column’s logical value range. Each is a scalar attribute whose datatype matches the column’s element datatype. See §8.5.

Search-index and categories dataset attribute names

KIND
ASCII enum that identifies the family of a search-index dataset.
VALUES
Object reference, on a BITMAP search-index dataset, to its accompanying values dataset.

KIND attribute values

See Search indexes for their meaning. Consumers MUST treat unknown values as “ignore this search index”.
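A producer can guard against accidental collisions with a small check before creating a column or a user attribute. The sketch below is illustrative: the set lists only the group and attribute names named in this catalog, the helper names are hypothetical, and permitting digits in reserved-style names is an assumption that rule 1 of §13.1 does not address.

```python
import re

# Group and attribute names taken from the catalog above.
RESERVED_NAMES = {
    "CATEGORIES", "SEARCH_INDEXES",
    "CLASS", "VERSION", "NROWS", "TITLE", "INDEX_COLUMNS",
    "SEARCH_INDEX_LIST", "valid_min", "valid_max",
    "KIND", "VALUES",
}

# Uppercase ASCII words separated by single underscores (digits assumed OK).
_RESERVED_STYLE = re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*")

def is_reserved_style(name):
    """True when `name` has the form rule 1 requires of HEP001 names."""
    return _RESERVED_STYLE.fullmatch(name) is not None

def user_name_allowed(name):
    """True when `name` may be used for a column or user-supplied attribute."""
    return name not in RESERVED_NAMES
```

For example, a column named energy is allowed, while a column named NROWS is rejected.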

13.3Names shared with Anndata

The following attribute names and string values are borrowed from Anndata’s DataFrame layout (§15) and are written in lowercase, exactly as Anndata writes them, so that a producer targeting an Anndata converter does not have to case-fold. They are reserved for their Anndata-defined meaning and MUST NOT be repurposed:

Of these, HEP001 itself uses column-order, _index, and ordered (§7, §6); encoding-type, encoding-version, and the string values are optional pass-through names that HEP001 does not interpret. A producer MAY omit any of these names, but if it writes one at all it MUST use this exact form (lowercase, with the hyphens and underscore as shown).

14Worked examples

14.1A minimal table

A table of four columns — row_id (the row index), ts, energy, label — and no search indexes.

/my_table                          (Group)
  CLASS             = "COLUMN_TABLE"   (ASCII, fixed length)
  VERSION           = "1.0"            (ASCII, fixed length)
  NROWS             = N                (uint64, scalar)
  TITLE             = "Sample run"     (UTF-8)
  column-order      = ["row_id", "ts", "energy", "label"]  (UTF-8, 1-D)
  INDEX_COLUMNS     = [ref(row_id)]    (1-D object references)
  _index            = "row_id"         (UTF-8; primary row-label name)

/my_table/row_id                   (Dataset, uint64, shape (N,))
  description       = "Globally unique event identifier."

/my_table/ts                       (Dataset, int64, shape (N,))
  units             = "s"
  units_vocabulary  = "UDUNITS-2"
  description       = "Event timestamp."

/my_table/energy                   (Dataset, float32, shape (N,))
  units             = "MeV"

/my_table/label                    (Dataset, int8, shape (N,))
  description       = "Class label."
  CATEGORIES        = ref(CATEGORIES/label__CATEGORIES)

/my_table/CATEGORIES               (Group)

/my_table/CATEGORIES/label__CATEGORIES   (Dataset, vlen UTF-8, shape (3,))
  ordered           = false

14.2Adding a chunk min/max search index

Extending §14.1 with a CHUNK_MINMAX index on ts:

/my_table/ts
  SEARCH_INDEX_LIST = [ref(SEARCH_INDEXES/ts__chunk_minmax)]
  …

/my_table/SEARCH_INDEXES                       (Group)
/my_table/SEARCH_INDEXES/ts__chunk_minmax      (Dataset,
    compound {min: int64, max: int64, nan_count: uint64,
              fill_count: uint64, n: uint64}, shape (n_chunks,))
  KIND          = "CHUNK_MINMAX"

14.3A complete layout

15Relationship to Anndata

Anndata is the most direct inspiration for HEP001: HEP001 adopts the same group-of-one-dataset-per-column shape and deliberately reuses several of Anndata’s attribute names (column-order, _index, ordered) so that converters between the two are straightforward. HEP001 is not, however, a drop-in Anndata DataFrame format, and this document does not specify Anndata read/write conformance. The two layouts diverge on points that matter for typical tabular data: Anndata encodes categoricals and nullable columns as subgroups (with codes/categories or values/mask children) and tags every column array with an encoding-type, whereas HEP001 keeps each column as a single rank-1 dataset and records missing values through fill values. Because an HDF5 link resolves to one object, a column that Anndata stores as a subgroup cannot simultaneously be a HEP001 column dataset. A single group can therefore satisfy both specifications only for the restricted case of dense numeric and variable-length string columns with no categorical or nullable columns; anything richer requires an explicit import/export step. HEP001 reserves the Anndata-derived attribute names (§13.3) so that producers may write them when targeting such a converter; of these, it interprets only column-order, _index, and ordered, and assigns no HEP001 meaning to the pass-through names encoding-type and encoding-version.

16Security considerations

Search indexes are unsigned, untrusted derivative data. A consumer that trusts a table’s column data MUST NOT, by default, trust the correctness of a search index found in the same file: a tampered CHUNK_MINMAX can cause the consumer to skip chunks that do in fact satisfy a predicate. Consumers SHOULD offer a mode that verifies a search index against the column it covers, or that ignores search indexes entirely. Producers SHOULD document the provenance of search indexes in the table group’s description or in each search-index dataset’s description attribute when that matters to their users.
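A verification mode for CHUNK_MINMAX can be as simple as re-scanning the column and checking that no stored value falls outside its chunk's recorded bounds. The sketch below is one such check under stated assumptions: the function name is hypothetical, entries are reduced to (min, max) pairs, the missing-value test is approximated by equality with the fill value, and only the safety property (bounds never exclude a real value) is tested, not exactness.

```python
import math

def suspicious_chunks(column, entries, nrows, chunk_length, fill):
    """Return indices of live chunks whose bounds exclude a stored value."""
    n_live = -(-nrows // chunk_length)       # entries past this are residue
    bad = []
    for c in range(n_live):
        if c >= len(entries):
            bad.append(c)                    # a live chunk with no entry
            continue
        lo, hi = entries[c]
        start = c * chunk_length
        stop = min(start + chunk_length, nrows)
        for v in column[start:stop]:
            if isinstance(v, float) and math.isnan(v):
                continue                     # NaN is excluded from min/max
            if v == fill:
                continue                     # missing values are excluded
            if not (lo <= v <= hi):
                bad.append(c)                # pruning on this entry is unsafe
                break
    return bad
```

A consumer that finds any suspicious chunk can fall back to ignoring the index for that column.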

17References