1 Introduction
HDF5 has stored tabular data from its earliest days, and several distinct idioms have grown up around that use case. HEP001 proposes a column-oriented storage layout, in the spirit of Apache Parquet, Apache Arrow, and Feather, that lives natively as an HDF5 group and combines cleanly with the multidimensional array datasets that are HDF5’s traditional strength.
1.1 A short overview of tabular data in HDF5
The first adopted idiom is the HDF5 Table specification,
part of the HDF5 High-Level Library and implemented through its H5TB
API. A table is a single one-dimensional dataset whose datatype is an
HDF5 compound (record) type,
decorated with attributes such as CLASS="TABLE", VERSION, TITLE,
FIELD_0_NAME … FIELD_N_NAME, FIELD_0_FILL … FIELD_N_FILL, and NROWS.
Rows of the logical table become elements of the dataset; columns become
fields of the compound datatype. This layout is simple and portable, but it
is fundamentally row-oriented: every row occupies contiguous bytes, every
column shares the same chunking, and changing a single column’s datatype,
chunk shape, or compression filter requires rewriting the entire dataset.
The second influential idiom is PyTables, a Python
package that has layered a rich query engine on top of HDF5. PyTables likewise
stores a table as a single one-dimensional dataset of a compound type decorated
with its own CLASS="TABLE", VERSION, TITLE, FIELD_N_FILL, NROWS, and
PYTABLES_FORMAT_VERSION attributes. It also adds a rich family of companion
structures for indexing, in-kernel queries, and compression with Blosc. PyTables
popularized the idea that a useful table format needs more than data bytes:
it needs indexes, metadata, and conventions that let tools reason about the
data.
The third and most recent influence is
Anndata, which treats a table (dataframe)
as an HDF5 group rather than a single compound dataset. Each column of the
dataframe is stored as its own one-dimensional dataset inside that group. The
group carries attributes that tell Anndata how to reassemble the columns into a
dataframe — encoding-type="dataframe", encoding-version, column-order (a
UTF-8 string array giving the column order), and _index (the name of the
column that supplies row labels). Anndata extends this convention with dedicated
encodings for nullable integers, nullable booleans, categoricals, sparse
matrices, and more. This layout is column-oriented, and it is the closest
existing HDF5 practice to what modern analytical engines expect.
1.2 Why columnar, and why now
Row-oriented tables pack all the column values of each row together. That packing is ideal for use cases that scan whole rows (appending rows to a log, reading a few complete records by index), but it imposes three practical limits:
Uniform chunking and filtering. Every column in the table shares the same chunk shape and the same filter pipeline, because both are properties of the single compound dataset. A wide table that mixes a dense float column (well-served by shuffle + Zstd) with a high-cardinality string column (well-served by a dictionary encoding or Blosc bitshuffle) is forced to compromise.
Whole-row I/O for column queries. Selecting one column out of a hundred still reads every column’s bytes, because those bytes are interleaved within each chunk. Analytical workloads routinely scan a few columns of a wide table, and row orientation amplifies I/O in proportion to row width.
Schema evolution. Adding or removing a column changes the compound datatype of the whole dataset, which in HDF5 terms means rewriting every chunk. Columnar layout makes schema evolution a matter of creating or deleting a sibling dataset.
Columnar formats such as Parquet, ORC, and Arrow Feather have become the lingua franca of analytical tabular data precisely because they decouple each column’s physical storage. HEP001 brings the same property to HDF5 while keeping everything the HDF5 ecosystem already has: a single container file, portable self-describing metadata, hierarchical groups, and lossless access to every tool in the HDF5 stack.
A decisive advantage of HDF5 over dedicated columnar formats is that tabular data does not have to live alone. For example, an HDF5 file can hold:
a table group with the experiment’s per-event observations,
a multidimensional dataset of raw sensor images addressed by those events,
a dense array of calibration coefficients indexed by detector channel,
and any number of nested groups carrying derived products.
Links, object references, and region references can tie rows of the table to slabs of the image cube without duplicating bytes. Analysts can query the table to identify events of interest and then dereference the rows into pixel-space regions in the same file, on the same storage, in a single API. Columnar tools and array tools meet in the middle.
1.3 Scope and non-goals
HEP001 specifies:
the structure of a group that holds a column-oriented table,
the identifying attributes that let software recognize such a group,
how individual columns are represented as HDF5 datasets,
how row-label indexes relate columns and the table group,
how search indexes accelerate queries over column values, including a normative chunk-min/max index and framework definitions for sorted-row, bitmap, and per-chunk Bloom-filter indexes.
HEP001 does not specify:
a query language or execution engine,
semantic conventions for domain-specific column units (outside the
units/units_vocabulary attribute pair, which is descriptive only),
how tables are committed, versioned, or synchronized; those are the province of the storage layer and, perhaps, future HEPs.
2 Conformance
The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY in this document are to be interpreted as described in RFC 2119 and RFC 8174 when, and only when, they appear in all capitals.
A file, group, or dataset is HEP001-conformant when it satisfies every MUST in the section that applies to it. A producer is HEP001-conformant when every table group it writes is conformant; a consumer is conformant when it can read any conformant table group without data loss.
3 Terminology
The following terms are used throughout this specification.
- Table group
- An HDF5 group that represents one column-oriented table. Identified by the
CLASS attribute (see §7.1).
- Dataset name
- The name of the HDF5 hard link that connects an HDF5 dataset with its parent HDF5 group.
- Column dataset
- An HDF5 dataset of rank 1 that is a direct child of a table group and
represents one column of the table. Datasets inside the reserved
CATEGORIES and SEARCH_INDEXES subgroups are not column datasets. The dataset’s name is the column name.
- Row index column
- A column dataset referenced by the table group’s
INDEX_COLUMNS attribute and which therefore supplies row labels for the table. Row index columns are otherwise indistinguishable from any other column dataset; the designation is made at the table-group level, not on the column itself.
- Row
- A position i in the half-open range [0, NROWS) (see §7.3) within every column dataset of the table group. Every column dataset MUST have the same first-dimension extent and that extent MUST be ≥ NROWS, so the same i refers to the same logical row everywhere.
- NROWS
- The number of logical rows currently in the table. A scalar
uint64 attribute on the table group, defined in §7.3.
- Categories dataset
- An HDF5 dataset of rank 1 stored under the
CATEGORIES child group of a table group that holds the label values backing one or more categorical columns. See Section 8.7.
- Search index dataset
- An HDF5 dataset stored under the
SEARCH_INDEXES child group of a table group that accelerates queries over one or more column datasets. Each kind of search index is specified in Search indexes.
4 Data model overview
A HEP001 table is an HDF5 group whose direct children are the table’s
columns (one or more of which MAY be designated as row index columns via
the table group’s INDEX_COLUMNS attribute) and, optionally, two reserved
subgroups: CATEGORIES, holding the label datasets that back categorical
columns, and SEARCH_INDEXES, holding query-acceleration structures. The
table’s authoritative row count lives in the table group’s NROWS
attribute (see §7.3); every column dataset has
the same first-dimension extent, and rows [0, NROWS) are the table’s
data.
The rest of this document specifies each building block: the table group (The table group), column datasets (Column datasets), row index columns (Row index columns), and the four kinds of search-index dataset (Search indexes).
5 Object references
Several HEP001 attributes link objects within a table group by HDF5 object
reference: INDEX_COLUMNS (§7), CATEGORIES
(§8.7), SEARCH_INDEX_LIST (§10),
and VALUES (§10.6). Every such reference MUST use the HDF5
standard reference datatype H5T_STD_REF (introduced in HDF5 1.12).
H5T_STD_REF is a unified datatype that can carry object, dataset-region, and
attribute references (so future HEP revisions can introduce finer-grained
linkages without a new on-disk reference format).
Producers MUST NOT write the deprecated object-reference datatype
H5T_STD_REF_OBJ. A consumer MAY reject, as non-conformant, any HEP001
reference attribute whose datatype is not H5T_STD_REF.
6 Boolean attributes
HDF5 has no native boolean datatype, and the wider HDF5 ecosystem has not converged on one encoding. HEP001 fixes a single, self-describing form so that boolean attributes are unambiguous on disk.
Every attribute that this specification calls boolean MUST be stored as an HDF5 enumerated datatype with:
base type H5T_STD_I8LE (signed 8-bit integer, little-endian), and
exactly two members, FALSE mapped to 0 and TRUE mapped to 1.
In HDF5 DDL (as emitted by h5dump), the datatype is described as:
H5T_ENUM {
H5T_STD_I8LE;
"FALSE" 0;
"TRUE" 1;
}
Producers MUST NOT store a HEP001 boolean as a plain integer, an H5T_BITFIELD,
or a string. A consumer determines truth from the enumerated integer value (0
= false, 1 = true).
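As a rough illustration of this encoding, the sketch below models the two-member enumeration with Python's standard `enum` module instead of an HDF5 binding. The names `Hep001Bool` and `decode_bool` are hypothetical, and rejecting codes other than 0 and 1 is an assumption, since the text above defines only the two members.

```python
import struct
from enum import IntEnum


class Hep001Bool(IntEnum):
    """In-memory mirror of the two-member enumeration (hypothetical name)."""
    FALSE = 0
    TRUE = 1


def decode_bool(raw: bytes) -> bool:
    """Decode one stored H5T_STD_I8LE element into a Python bool."""
    (code,) = struct.unpack("<b", raw)  # signed 8-bit integer, little-endian
    # Assumption: any code other than 0 or 1 is rejected (ValueError).
    return bool(Hep001Bool(code))
```

A reader built this way maps the byte `0x01` to `True` and `0x00` to `False`, and fails loudly on anything else.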
7 The table group
7.1 Identification: the CLASS attribute
Every HEP001 table group MUST carry an attribute named CLASS with:
datatype: null-terminated 13-byte fixed-length ASCII string (exactly the length of the value below plus a NUL byte),
shape: scalar,
value:
COLUMN_TABLE.
A consumer MUST identify a group as a HEP001 table group by, and only by, the
presence of a CLASS attribute whose string value equals COLUMN_TABLE. A
producer MUST NOT write a CLASS="COLUMN_TABLE" attribute on any group that does
not satisfy the rest of this specification.
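A minimal sketch of the identification rule, assuming the group's attributes have already been read into a plain Python mapping (the `attrs` dictionary and both function names are hypothetical stand-ins, not part of any HDF5 API):

```python
CLASS_VALUE = "COLUMN_TABLE"


def encode_class_attribute() -> bytes:
    """Bytes of the CLASS attribute: the ASCII value plus one NUL terminator."""
    return CLASS_VALUE.encode("ascii") + b"\x00"


def is_table_group(attrs: dict) -> bool:
    """Return True only when CLASS is present and equals COLUMN_TABLE.

    attrs is a hypothetical mapping of attribute name to raw bytes or str.
    """
    value = attrs.get("CLASS")
    if isinstance(value, bytes):
        # Drop the NUL terminator and anything after it before comparing.
        value = value.split(b"\x00", 1)[0].decode("ascii", errors="replace")
    return value == CLASS_VALUE
```

The encoded attribute is 13 bytes long, which matches the fixed length required above.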
7.2 The VERSION attribute
Every HEP001 table group MUST carry a scalar, fixed-length ASCII attribute named
VERSION whose value is the HEP001 revision the table conforms to. Producers
MUST size the attribute to hold the value being written. For this revision the
value is "1.0".
HEP001 uses a two-component MAJOR.MINOR version with the major/minor semantics
of Semantic Versioning: MAJOR increments on a backward-incompatible
change to the data model (one an existing conformant consumer can no longer
read), and MINOR on a backward-compatible addition. HEP001 omits SemVer’s
third (PATCH) component deliberately since it would carry no actionable
meaning. A consumer that prefers a SemVer-style triple MAY read an absent
PATCH component as 0 — so "1.0" is equivalent to "1.0.0". Consumers MUST
compare VERSION values numerically and MUST refuse to process a table whose MAJOR
exceeds the highest MAJOR they implement.
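The version check can be sketched as follows. This example assumes that "numerically" means comparing MAJOR and MINOR as integers, component by component, so that "1.10" is later than "1.9"; the constant `SUPPORTED_MAJOR` and both function names are hypothetical.

```python
SUPPORTED_MAJOR = 1  # highest MAJOR this hypothetical consumer implements


def parse_version(text: str) -> tuple:
    """Split a MAJOR.MINOR string into a tuple of two integers."""
    major, minor = text.split(".")
    return int(major), int(minor)


def can_process(text: str, supported_major: int = SUPPORTED_MAJOR) -> bool:
    """Refuse tables whose MAJOR exceeds the highest MAJOR implemented."""
    major, _minor = parse_version(text)
    return major <= supported_major
```

Comparing the parsed tuples avoids the trap of comparing the strings, or a single decimal number, where "1.10" would sort before "1.9".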
7.3 The NROWS attribute
Every HEP001 table group MUST carry a scalar NROWS attribute of datatype
uint64 whose value is the number of logical rows currently in the table.
NROWS is the table’s authoritative row count: rows [0, NROWS) of every
column dataset are part of the table, and rows [NROWS, extent), when
present, are reserved storage.
NROWS and a column dataset’s first-dimension extent are related but not
equal in general. Every column’s extent MUST be ≥ NROWS (see
§12), but it MAY exceed NROWS when a producer
has preallocated space for future appends. A consumer MUST determine the
table’s row count from NROWS and MUST NOT infer it from extent — doing so
would silently include preallocated slots or post-crash residue as if they
were table rows.
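To make the distinction concrete, here is a small sketch in which a column is represented by an ordinary Python sequence; `logical_rows` is a hypothetical helper, and the sequence length stands in for the dataset's first-dimension extent.

```python
def logical_rows(column_values, nrows: int):
    """Return only rows [0, NROWS); anything beyond is reserved storage."""
    extent = len(column_values)
    if extent < nrows:
        # A conformant column always has extent >= NROWS.
        raise ValueError(f"extent {extent} is smaller than NROWS {nrows}")
    return column_values[:nrows]
```

For a column preallocated to five slots with `NROWS = 3`, the helper returns the first three values and never exposes the two reserved slots.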
The attribute is borrowed from the long-established HDF5 Table and PyTables conventions, where it plays the same role. Centralizing the row count in one place — independent of any column’s storage state — gives producers a single-attribute commit point for atomic appends (see §11) and gives consumers a single place to look for the table’s size.
For a freshly created, empty table, NROWS = 0. Every column dataset’s
first-dimension extent MUST be ≥ 0 (a zero-length column is permitted
and is the natural form of a brand-new table).
7.4 Optional table group attributes
The table group MAY carry the following attributes.
- TITLE
- Scalar fixed-length UTF-8 string. Human-readable title of the table, mirroring HDF5 Table and PyTables. Purely descriptive.
- INDEX_COLUMNS
- One-dimensional attribute of HDF5 object references whose elements point
to the column datasets that serve as row labels for this table, in
hierarchical order from outermost to innermost level. For example, a table
indexed by (donor_id, sample_id, cell_id) writes INDEX_COLUMNS = [ref(donor_id), ref(sample_id), ref(cell_id)]. An empty array or absent attribute means the table has no row labels and rows are positional only. Every reference MUST resolve to a column dataset that is a direct child of the table group and MUST NOT be a null reference. Row index columns apply to the table as a whole: every column in the table is labeled by them. See Row index columns.
- column-order
- One-dimensional fixed-length UTF-8 string attribute whose elements are
the names of the column datasets in their logical order. When present,
it fully determines the column order presented to users; when absent,
the logical column order is implementation-defined. Producers SHOULD
write column-order whenever a table has more than one column. The attribute name uses a hyphen (not an underscore) to match Anndata.
- _index
- Scalar fixed-length UTF-8 string. The dataset name of the column that
supplies the primary row labels for this table (for example,
"row_id"). The name is borrowed from Anndata’s _index. When both INDEX_COLUMNS and _index are present, _index MUST equal the dataset name of the column referenced by INDEX_COLUMNS[0].
- encoding-type / encoding-version
- Scalar fixed-length UTF-8 strings, both optional. These names are borrowed
from Anndata’s DataFrame convention (§13.3). A
producer MAY set encoding-type="dataframe" and encoding-version="0.2.0" if it separately wants Anndata software to attempt reading the group, but HEP001 neither requires nor interprets them and makes no guarantee that the result is a usable Anndata DataFrame (§15).
- description
- Scalar fixed-length UTF-8 string. Free-text description of the table’s
contents. Longer and richer than TITLE; intended for documentation viewers.
- units_vocabulary
- Scalar fixed-length UTF-8 string. Names the vocabulary or authority that
interprets units strings on columns of this table, for example "UDUNITS-2", "UCUM", "QUDT", or a URL. When present on the table group, it applies as a default to every column whose own units_vocabulary is absent.
7.5 Placement of the table group
The table group MAY be located anywhere in an HDF5 file’s hierarchy. In the simplest case the root group of the file is the table group — the file holds one table and nothing else — and all columns live at the top level. Tables MAY also be nested under named groups to organize many tables, or sited beside multidimensional array datasets that they describe.
7.6 Self-contained contents
A table group is the complete representation of a single column-oriented table. Every HDF5 object — group, dataset, named datatype, or link — that is a descendant of the table group MUST exist solely in service of the data model defined here.
The only objects permitted anywhere in the HDF5 hierarchy below a table group are:
the column datasets of the table (Column datasets), which MAY include row index columns designated by the table group’s INDEX_COLUMNS attribute (Row index columns);
the reserved CATEGORIES subgroup and everything it contains (Section 8.7);
the reserved SEARCH_INDEXES subgroup and everything it contains (Search indexes).
Any descendant of a table group that does not match one of the categories above is non-conformant. A producer MUST NOT place under a table group:
unrelated tables, including other HEP001 table groups;
multidimensional array datasets that are not derived from, or exclusively bound to, the columns of this table;
user-defined subgroups for provenance, schema, derived statistics, or any other auxiliary content not enumerated above.
Producers that wish to colocate such content with a table MUST do so by placing it as a sibling of the table group, or under an unrelated parent group elsewhere in the file. Object references, region references, and external links MAY be used to associate the table’s columns with that external content without violating this rule.
A consumer that encounters a descendant of a table group not described by this revision SHOULD emit a diagnostic identifying the offending HDF5 path. A consumer operating in strict-conformance mode MAY refuse to process the table; a consumer operating in lenient mode MAY ignore the offending descendant and continue.
Consumers MAY treat the table group as a closed, self-describing unit: copying, moving, renaming, deleting, or versioning the table group is guaranteed to act on the entire table and on nothing else.
8 Column datasets
8.1 Required properties
Each column of a HEP001 table MUST be stored as an HDF5 dataset that:
is a direct child of the table group,
has rank 1 shape,
has the same first-dimension extent as every other column dataset in the same table group, and that extent is ≥ NROWS (see §7.3 and §12).
The name of the HDF5 dataset is the column name. Any
name acceptable as an HDF5 link name (UTF-8, excluding / and NUL) is
permitted. Producers MUST NOT use any HEP001 reserved name
(Section 13.2) as a column name. Producers SHOULD also
avoid names that begin with an underscore, which are reserved for
Anndata-aligned attribute names (_index) and may be claimed by future HEPs.
8.2 Column discovery
The only HDF5 datasets that are direct children of a table group are its column
datasets: categories datasets live one level down inside the CATEGORIES
subgroup (§8.7), and search-index datasets inside
the SEARCH_INDEXES subgroup (§10). A
consumer therefore enumerates a table’s columns as exactly the rank-1 datasets
that are direct children of the table group; the CATEGORIES and
SEARCH_INDEXES children are groups, not datasets, and are skipped
automatically.
When column-order (§7) is present it lists
exactly these column datasets and fixes their order; when absent, the set of
columns is still fully determined by this rule, and only their order is
implementation-defined.
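The discovery rule can be sketched without any HDF5 binding by modelling a table group's direct children as a dictionary. Everything here is a stand-in: the `Dataset` class, the use of a nested `dict` for a subgroup, and the choice of sorted order when column-order is absent are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Stand-in for an HDF5 dataset; only its shape matters here."""
    shape: tuple


def discover_columns(children: dict, column_order=None) -> list:
    """Return column names from a mapping of link name to child object.

    A Dataset value models a dataset; a dict value models a subgroup such
    as CATEGORIES or SEARCH_INDEXES and is skipped.
    """
    names = [name for name, obj in children.items()
             if isinstance(obj, Dataset) and len(obj.shape) == 1]
    if column_order is None:
        # Order is implementation-defined; sorting is one possible choice.
        return sorted(names)
    if sorted(column_order) != sorted(names):
        raise ValueError("column-order must list exactly the column datasets")
    return list(column_order)
```

With children `b`, `a`, `CATEGORIES`, and `SEARCH_INDEXES`, the helper reports two columns, and a column-order of `["b", "a"]` fixes their presentation order.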
8.3 Datatypes
The datatype of a column dataset MAY be any HDF5 datatype. Consumers that encounter a datatype they do not recognize SHOULD expose the column’s raw datatype instead of quietly ignoring it.
Two caveats apply:
Compound datatypes at the column level are permitted, but a table group is not a mechanism for storing a nested row-oriented table. When a compound-typed column is used, its fields MUST be logically atomic.
Variable-length datatypes (e.g. variable-length UTF-8 strings, ragged numeric arrays) are permitted but generally discouraged, as their storage characteristics are poorly suited to columnar access patterns. Producers SHOULD prefer fixed-length equivalents where possible.
8.4 Chunking and filters
Because each column is its own HDF5 dataset, each column MAY independently select:
chunk shape (typically (N,) for N rows per column chunk),
dataset creation properties such as fill value, allocation time, and track-times,
the filter pipeline.
This per-column flexibility is the core motivation for HEP001 and is normative: producers MUST NOT require columns to share chunk shape or filters. Consumers MUST treat each column’s storage layout independently.
8.5 Missing values (fill values)
A column dataset’s HDF5 fill value identifies rows whose value is missing.
Producers MUST set the dataset’s fill value explicitly via the dataset creation
property list (H5Pset_fill_value), placing the dataset into the
H5D_FILL_VALUE_USER_DEFINED state, and MUST choose a fill value that lies
outside the column’s logical value range. A producer MAY declare that range
explicitly with the two attributes valid_min and valid_max on the column
dataset (each a scalar of the column’s element datatype). If present, the chosen
fill value MUST lie strictly outside [valid_min, valid_max].
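A producer-side check of this rule might look like the sketch below. The function name is hypothetical, and treating a NaN fill value as lying outside every range is an assumption based on NaN being unordered with respect to all other values.

```python
import math


def fill_is_outside_range(fill_value, valid_min, valid_max) -> bool:
    """True when fill_value lies strictly outside [valid_min, valid_max]."""
    if isinstance(fill_value, float) and math.isnan(fill_value):
        # Assumption: NaN is never inside a closed numeric interval.
        return True
    return fill_value < valid_min or fill_value > valid_max
```

For an int8 column declared with `valid_min = 0` and `valid_max = 100`, a fill value of -127 passes, while 0 or 100 would be rejected because both endpoints belong to the range.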
Consumers MUST retrieve the fill value (H5Pget_fill_value, applied to the dataset’s creation property list) and identify
missing rows with the canonical missing-value test:
missing(value, fill_value) = isnan(fill_value) ? isnan(value) : value == fill_value
For every column whose fill value is not a NaN bit pattern (including
every column whose datatype is not floating-point, and every column whose
producer chose any of the recommended fill values in
Table 1), the isnan(fill_value) branch is always
false and the test reduces to ordinary bit-equality value == fill_value.
The same applies even to columns that contain no missing values — no row
satisfies the test, and the column simply has no missing rows.
For floating-point columns with a NaN bit pattern as the
fill value, the test reduces to isnan(value). IEEE 754 makes
NaN != NaN, so a literal value == fill_value test would miss every
fill-marked row in such columns; consumers MUST use the isnan(value)
branch instead. HEP001 does not recommend NaN as a fill value — it
conflates “the producer marked this row missing” with “the result of a
floating-point computation was indeterminate” — but it permits NaN for
producers who need a zero-cost round-trip with NaN-native ecosystems.
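The canonical missing-value test translates almost directly into code. The following is a minimal sketch operating on plain Python values; `is_missing` is a hypothetical name, and the `isinstance` guards exist only because `math.isnan` does not accept strings.

```python
import math


def _is_nan(x) -> bool:
    """NaN check that tolerates non-float values such as strings."""
    return isinstance(x, float) and math.isnan(x)


def is_missing(value, fill_value) -> bool:
    """isnan(fill_value) ? isnan(value) : value == fill_value"""
    if _is_nan(fill_value):
        return _is_nan(value)
    return value == fill_value
```

With a NaN fill value the equality branch is never reached, which is what keeps NaN-marked rows from being missed.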
For producers that have no domain-specific constraint forcing a different choice, the table below lists recommended fill values for each datatype family.
Table 1:Recommended fill values
| HDF5 datatype | Recommended fill value | Hex (canonical bit pattern) |
|---|---|---|
| int8 | -127 | 0x81 |
| int16 | -32767 | 0x8001 |
| int32 | -2147483647 | 0x80000001 |
| int64 | -9223372036854775807 | 0x8000000000000001 |
| uint8 | 255 | 0xFF |
| uint16 | 65535 | 0xFFFF |
| uint32 | 4294967295 | 0xFFFFFFFF |
| uint64 | 18446744073709551615 | 0xFFFFFFFFFFFFFFFF |
| float32 | 9.9692099683868690e+36 | 0x7CF00000 |
| float64 | 9.9692099683868690e+36 | 0x479E000000000000 |
| fixed string | "" | n/a |
| vlen string | "" | n/a |
| enumeration | MISSING | n/a |
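The numeric rows of Table 1 can be cross-checked with Python's standard `struct` module. In this sketch the dictionary and function names are hypothetical, and big-endian packing is used only so that the hex text reads in the same digit order as the table; it says nothing about the byte order used on disk.

```python
import struct

# name: (struct format, recommended fill value, hex from Table 1)
RECOMMENDED_FILLS = {
    "int8": (">b", -127, "81"),
    "int16": (">h", -32767, "8001"),
    "int32": (">i", -2147483647, "80000001"),
    "int64": (">q", -9223372036854775807, "8000000000000001"),
    "uint8": (">B", 255, "ff"),
    "uint16": (">H", 65535, "ffff"),
    "uint32": (">I", 4294967295, "ffffffff"),
    "uint64": (">Q", 18446744073709551615, "ffffffffffffffff"),
    "float32": (">f", 9.9692099683868690e+36, "7cf00000"),
    "float64": (">d", 9.9692099683868690e+36, "479e000000000000"),
}


def bit_pattern(fmt: str, value) -> str:
    """Pack value with the given struct format and return lowercase hex."""
    return struct.pack(fmt, value).hex()


def mismatches() -> list:
    """Names of table rows whose packed bytes differ from the listed hex."""
    return [name for name, (fmt, value, expected) in RECOMMENDED_FILLS.items()
            if bit_pattern(fmt, value) != expected]
```

Calling `mismatches()` returns the rows, if any, where the decimal value and the hex pattern disagree.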
The recommended sentinels above are chosen as follows.
Integers. Signed-integer sentinels are INT*_MIN + 1 rather than
INT*_MIN itself, leaving a one-value safety margin against operations
that land on the type’s minimum (e.g., abs(INT*_MIN) is undefined
behavior in C). Unsigned sentinels are the type’s maximum.
Floating point. Two choices are acceptable:
The recommended non-NaN sentinel 9.9692099683868690e+36. Exactly representable in both float32 and float64, it keeps its exact value under width casts, so the equality branch of the canonical missing-value test works without rounding-tolerance gymnastics. This is the recommended choice for new tables that do not need byte-level round-tripping with NaN-native ecosystems.
Any NaN bit pattern (quiet, signaling, signed, with arbitrary payload). The canonical missing-value test then takes the isnan(value) branch. This choice is permitted but not recommended on its own merits; producers should use it when zero-cost interoperability with pandas, Anndata, NumPy, R, or other NaN-as-missing ecosystems matters more than disambiguating “missing” from “result of an indeterminate computation”.
Strings. The empty string is the recommended sentinel because it is rarely a meaningful value in practice.
Enumerations. The recommended sentinel is a designated enum member
named MISSING. Producers using an enum column with missing values MUST
include such a member in the enum type definition; the integer code
backing the MISSING member then serves as the column’s fill value.
For datatypes not in the table:
float16 is too narrow for a generic sentinel.
Boolean (1-bit) cannot represent a missing sentinel alongside its two valid values. Producers MUST widen such columns to uint8 and use a value greater than 1 (typically 2).
Producers whose column domain includes any of the recommended sentinels
above MUST choose a different fill value via H5Pset_fill_value. For
columns with a natural numeric range (integer and floating-point
columns), producers MUST also declare that range via valid_min and
valid_max and choose the fill outside it. For columns without a
natural numeric range (strings, opaque), the alternative fill value
alone constitutes the override; no additional attribute is required.
8.6 Column attributes
A column dataset MAY carry any of the following attributes.
- SEARCH_INDEX_LIST
- One-dimensional array of HDF5 object references. Each reference points
to a search-index dataset (see §10) in the
SEARCH_INDEXES subgroup that accelerates queries on this column.
- CATEGORIES
- Scalar HDF5 object reference. Used only for categorical columns. Points to the dataset that holds the categorical values (see §8.7).
- valid_min / valid_max
- Scalar attributes of the same datatype as the column, specifying the minimum and maximum range of the column’s values. See §8.5.
- units
- Scalar fixed-length UTF-8 string. Physical units of the column’s values.
Absence of units implies dimensionless data. Units are interpreted according to units_vocabulary (on the column, or inherited from the table group).
- units_vocabulary
- Scalar fixed-length UTF-8 string. Identifies the units vocabulary that
interprets units. When present on a column, it overrides the table group’s units_vocabulary for that column. MAY be a short name ("UDUNITS-2") or a URL.
- description
- Scalar fixed-length UTF-8 string. Plain-text description of the column’s contents, provenance, or semantics.
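The inheritance rule for units_vocabulary is a simple two-level lookup. The sketch below assumes attributes have been read into plain dictionaries; `effective_units_vocabulary` is a hypothetical helper name.

```python
def effective_units_vocabulary(column_attrs: dict, table_attrs: dict):
    """Column-level units_vocabulary wins; otherwise fall back to the table's.

    Returns None when neither level declares a vocabulary.
    """
    if "units_vocabulary" in column_attrs:
        return column_attrs["units_vocabulary"]
    return table_attrs.get("units_vocabulary")
```

A column carrying `units_vocabulary = "UCUM"` in a table that declares `"UDUNITS-2"` is interpreted with UCUM, while its siblings without the attribute inherit UDUNITS-2.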
8.7 Categorical columns
A categorical column stores integer codes that index into a separate
categories dataset holding the actual label values. The code at row i
is the zero-based position of that row’s label in the categories dataset.
The CATEGORIES group
A table group MAY contain a direct child group named CATEGORIES. When
present, it MUST hold every categories dataset for the table, and no other
objects. It carries no required attributes of its own. Categories datasets
in CATEGORIES MAY have any name; the binding between a column and its
categories dataset is the column’s CATEGORIES object reference
(§5), never a parsed name.
Categorical column requirements
A categorical column MUST:
Have an integer datatype (any width, signed or unsigned). The column’s missing value (the dataset’s fill value) denotes a missing category.
Carry a scalar CATEGORIES attribute whose value is an HDF5 object reference resolving to a categories dataset in the table group’s CATEGORIES subgroup.
A categories dataset MUST:
Live directly under the table group’s CATEGORIES subgroup.
Have rank 1, with any datatype appropriate to the label values.
A categories dataset MAY:
Carry a scalar boolean attribute ordered (matching Anndata’s ordered categoricals), encoded per §6. Producers MUST set ordered to true exactly when the order of entries in the categories dataset is semantically meaningful.
Back more than one categorical column. Several categorical columns that share a common label set MAY reference the same categories dataset through their CATEGORIES attributes; producers need not duplicate a shared code book.
A categories dataset is not a column dataset: it does not count toward the
table’s columns, and it MUST NOT appear in the table group’s column-order
(§12).
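Decoding a categorical column amounts to a lookup with one special case for the fill value. The sketch below uses plain lists for the codes and the categories dataset; `decode_categorical` is a hypothetical name, and raising on an out-of-range code is an assumption, since the text does not say how a consumer should react.

```python
def decode_categorical(codes, categories, fill_value):
    """Map integer codes to labels; the fill value decodes to None."""
    labels = []
    for code in codes:
        if code == fill_value:
            labels.append(None)  # missing category
        elif 0 <= code < len(categories):
            labels.append(categories[code])  # zero-based position
        else:
            # Assumption: codes outside the categories dataset are an error.
            raise ValueError(f"code {code} is outside the categories dataset")
    return labels
```

The explicit bounds check matters in Python, where a negative code would otherwise index from the end of the list without complaint.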
9 Row index columns
A row index column is a column dataset referenced by the table group’s
INDEX_COLUMNS attribute (see §7). Row index
columns supply row labels for the table as a whole — every column in the table
is labeled by every row index column. They are ordinary column datasets in every
other respect, and they SHOULD also appear in column-order like any other
column.
9.1 Hierarchy
When INDEX_COLUMNS contains more than one reference, the order is the
row-label hierarchy from outermost to innermost level. For example,
INDEX_COLUMNS = [ref(donor_id), ref(sample_id), ref(cell_id)] declares
a three-level row index in which donor_id is the outermost grouping
and cell_id is the innermost row identifier.
9.2 Typical uses
Single row index. The common case: one column (often a string of sample IDs, or an unsigned-integer ordinal) supplies row labels for the entire table.
Hierarchical row index. A small number of tables have a meaningful row-label hierarchy, for example donor → sample → cell in single-cell genomics, or year → quarter → ticker in financial time series. INDEX_COLUMNS lists the level columns in order.
No row index. A table whose rows are identified solely by their position (the N-th row is “row N”) SHOULD omit INDEX_COLUMNS entirely. Producers MAY equivalently write INDEX_COLUMNS as an empty 1-D object-reference array; consumers MUST treat the two forms as semantically identical.
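All three cases reduce to one routine if the resolved index columns are passed as a list of value sequences, outermost level first. This is only a sketch: `row_labels` is a hypothetical helper, and returning tuples for labelled rows and bare positions for unlabelled ones is a presentation choice, not something the text prescribes.

```python
def row_labels(index_columns, nrows: int) -> list:
    """Build one label per logical row.

    index_columns holds the values of each row index column, outermost
    level first; None or an empty list means the table has no row index.
    """
    if not index_columns:
        return list(range(nrows))  # positional rows only
    return [tuple(level[i] for level in index_columns) for i in range(nrows)]
```

Passing `None` and passing an empty list give the same result, mirroring the requirement that an absent and an empty INDEX_COLUMNS be treated identically.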
10 Search indexes
Search indexes accelerate queries over column values. They do not change the logical table; they are derivative, recomputable data. A conformant consumer MAY ignore any or all search indexes and still return correct answers, only more slowly.
10.1 The SEARCH_INDEXES group
A table group MAY contain a direct child group named SEARCH_INDEXES. When
present, it MUST hold every search-index dataset for the table, together with
any accompanying datasets those search indexes require and no other objects.
It carries no required attributes of its own. Datasets in SEARCH_INDEXES MAY
have any name.
A search-index dataset is distinguished from an accompanying dataset by the
KIND attribute (§10.3): every search-index
dataset MUST carry KIND, and an accompanying dataset MUST NOT carry KIND.
10.2 Linking columns to search indexes
Each column dataset that benefits from a search index MUST reference
that search index from its own SEARCH_INDEX_LIST attribute — a 1-D
array of HDF5 object references to search-index datasets in the
SEARCH_INDEXES subgroup. The column-side attribute is the only
linkage; search-index datasets do not carry a back-pointer to the
columns they accelerate. To determine the column that a given
search-index dataset applies to, scan the column datasets of the table
group and identify the one whose SEARCH_INDEX_LIST references that
search-index dataset.
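Because the linkage is one-directional, finding the owner of a search index is a scan over the columns. In the sketch below, plain strings stand in for HDF5 object references and a dictionary stands in for the table group; both are simplifications, and `columns_for_index` is a hypothetical name.

```python
def columns_for_index(columns: dict, index_name: str) -> list:
    """Return the columns whose SEARCH_INDEX_LIST references index_name.

    columns maps column name to the list of search-index names it
    references (names stand in for HDF5 object references here).
    """
    return [name for name, refs in columns.items() if index_name in refs]
```

An empty result means no column references the index, which a consumer could treat as an orphaned search-index dataset.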
10.3 Common per-index attributes
Every search-index dataset MUST carry a scalar fixed-length ASCII
attribute KIND whose value is one of the strings defined below:
| KIND value | Purpose | Defined in |
|---|---|---|
| CHUNK_MINMAX | Per-chunk min and max of an orderable column | §10.4 |
| SORTED_ROWS | Row-position permutation of a column | §10.5 |
| BITMAP | Per-value bitmap of a low-cardinality column | §10.6 |
| CHUNK_BLOOM | Per-chunk Bloom filter of a column | §10.7 |
Future HEPs MAY register additional KIND values. Consumers MUST treat
unknown KIND values as “ignore this search index”.
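The "ignore unknown KIND" rule can be expressed as a filter. This sketch assumes the KIND attribute of each search-index dataset has already been read into a dictionary; `KNOWN_KINDS` lists the four values from the table above, and `usable_indexes` is a hypothetical name.

```python
KNOWN_KINDS = {"CHUNK_MINMAX", "SORTED_ROWS", "BITMAP", "CHUNK_BLOOM"}


def usable_indexes(indexes: dict) -> dict:
    """Keep only search indexes whose KIND this consumer understands.

    indexes maps search-index dataset name to its KIND string.
    """
    return {name: kind for name, kind in indexes.items() if kind in KNOWN_KINDS}
```

Dropping an unknown index never affects correctness, because search indexes are derivative data and queries can fall back to scanning the column.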
A search-index dataset MAY also carry a description attribute (scalar
fixed-length UTF-8 string), per the descriptive-annotation convention
used elsewhere in this spec (see Reserved names, rule 4).
Producers SHOULD use description to record provenance — the timestamp
of index construction, the producer software and version, and any
hyperparameters not captured by the index family’s own attributes.
Consumers MAY ignore description for query-execution purposes; it is
purely informational.
10.4 Chunk min/max search index (KIND = CHUNK_MINMAX)
A chunk min/max index accelerates range and equality predicates over an
orderable column by letting the engine skip chunks whose value range does
not overlap the predicate. The column’s datatype MUST have a HEP001-defined
order — the same orders enumerated for SORTED_ROWS under Ordering in
§10.5 (integers, floating-point, boolean, strings,
opaque, and enumerations) — and min and max are computed under that
order. The datatypes that SORTED_ROWS excludes for lack of a defined
order (object and region references, compound, array, and
variable-length-array datatypes) likewise MUST NOT carry a CHUNK_MINMAX
index.
Shape: The search-index dataset is 1-D with length equal to the number of chunks of the source column dataset that contain logical-table rows: ceil(NROWS / chunk_length) for a chunked column, 1 for a contiguous column (or 0 when NROWS = 0).
Chunks lying entirely in the column’s preallocated tail
([NROWS, extent), see §7.3) are not indexed.
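The length rule can be written as a small helper. This is a sketch only; the function name `indexed_chunk_count` and the convention that `chunk_length=None` models a contiguous column are assumptions made for the example.

```python
def indexed_chunk_count(nrows, chunk_length=None):
    """Number of index entries; chunk_length=None models a contiguous column."""
    if nrows == 0:
        return 0
    if chunk_length is None:
        return 1
    return -(-nrows // chunk_length)  # integer form of ceil(nrows / chunk_length)


counts = [indexed_chunk_count(n, 256) for n in (0, 1000, 1024, 1025)]
```

Chunks that exist only because of preallocation never enter the count, since the formula depends on NROWS rather than on the dataset extent.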
Datatype: An HDF5 compound datatype with the following fields, in declaration order:
min: same datatype as the source column’s element type.
max: same datatype as the source column’s element type.
nan_count: uint64. The number of IEEE 754 NaN values in the chunk, regardless of whether NaN is the column’s fill value. For non-floating-point columns this field MUST be present and set to 0.
fill_count: uint64. The number of elements in the chunk that the canonical missing-value test (§8.5) classifies as missing. When the column’s fill value is itself a NaN bit pattern, every NaN element is missing by definition, and fill_count equals nan_count for that chunk; this overlap is intentional and consumers MAY rely on fill_count alone as the chunk’s missing-row count regardless of the column’s fill choice.
n: uint64. The number of logical rows (i.e., rows in [0, NROWS); see §7.3) covered by this chunk. For every chunk other than the last, n equals the column’s chunk length; for the last data-bearing chunk, n MAY be strictly less than the chunk length when NROWS is not a multiple of it.
Semantics: min and max are computed over only those elements of
the chunk that the canonical missing-value test (§8.5)
classifies as non-missing, and that are themselves orderable. For
floating-point columns this means: NaN values are excluded (because IEEE 754
does not order them, regardless of whether NaN is the column’s fill value),
and elements that match the column’s non-NaN fill value (if any) are
excluded. For integer and other orderable types, only elements equal to
the column’s fill value are excluded. When the chunk has no non-missing,
orderable elements (for floating-point columns, fill_count + nan_count == n
when the fill value is not NaN, or nan_count == n when the fill value is NaN,
since fill_count then equals nan_count; for other orderable types,
fill_count == n), min and max MUST be set to the column’s fill value.
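As an illustration of these semantics, the sketch below builds one index record for the logical rows of a floating-point chunk. It assumes that the canonical missing-value test of §8.5 (not reproduced in this section) reduces to "equal to the fill value, or NaN when the fill value is NaN"; the function name `minmax_entry` and the dictionary representation of the compound record are hypothetical.

```python
import math


def minmax_entry(chunk, fill):
    """Build one CHUNK_MINMAX record for the logical rows of a float chunk."""
    fill_is_nan = math.isnan(fill)
    nan_count = sum(1 for v in chunk if math.isnan(v))
    if fill_is_nan:
        fill_count = nan_count  # every NaN element is a missing element
    else:
        fill_count = sum(1 for v in chunk if v == fill)
    # Keep only values that are neither NaN nor equal to a non-NaN fill.
    usable = [v for v in chunk if not math.isnan(v) and (fill_is_nan or v != fill)]
    lo, hi = (min(usable), max(usable)) if usable else (fill, fill)
    return {"min": lo, "max": hi, "nan_count": nan_count,
            "fill_count": fill_count, "n": len(chunk)}


nan = float("nan")
mixed = minmax_entry([1.5, nan, -9999.0, 3.0], fill=-9999.0)
empty = minmax_entry([-9999.0, nan], fill=-9999.0)
nan_fill = minmax_entry([nan, 2.0], fill=nan)
```

In the `empty` case no orderable element remains, so both `min` and `max` fall back to the fill value, as the paragraph above requires.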
Applicability: Each CHUNK_MINMAX search-index dataset applies to
exactly one column, identified by the column whose SEARCH_INDEX_LIST
references it. The column’s HDF5 datatype MUST have a HEP001-defined
order, as enumerated for SORTED_ROWS under Ordering
(§10.5); otherwise no CHUNK_MINMAX index is
permitted. A producer MAY build separate CHUNK_MINMAX indexes
for several columns, but a single search-index dataset MUST NOT cover
multiple columns because chunks of different columns are independent.
Additional attributes on the search-index dataset:
KIND: "CHUNK_MINMAX" (see §10.3).
All other per-index data lives in the compound datatype’s fields
(min, max, nan_count, fill_count, n); these are HDF5 datatype
fields, not HDF5 attributes.
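A consumer might use these records for chunk skipping along the following lines. This is a sketch under stated assumptions: records are modeled as dictionaries, the predicate is a closed range [lo, hi], and the name `chunks_to_scan` is hypothetical. Because fill_count equals nan_count when the fill value is NaN, the sketch counts only nan_count in that case to decide whether a chunk has any orderable value.

```python
def chunks_to_scan(entries, lo, hi, fill_is_nan=False):
    """Indices of chunks that may hold a value in [lo, hi]; the rest are skipped."""
    keep = []
    for i, e in enumerate(entries):
        if fill_is_nan:
            unorderable = e["nan_count"]
        else:
            unorderable = e["fill_count"] + e["nan_count"]
        if unorderable == e["n"]:
            continue  # min/max hold the fill value, not real data
        if e["max"] < lo or e["min"] > hi:
            continue  # value range does not overlap the predicate
        keep.append(i)
    return keep


entries = [
    {"min": 0, "max": 9, "nan_count": 0, "fill_count": 0, "n": 10},
    {"min": 10, "max": 19, "nan_count": 0, "fill_count": 2, "n": 10},
    {"min": -1, "max": -1, "nan_count": 0, "fill_count": 10, "n": 10},  # all fill
]
hits = chunks_to_scan(entries, 5, 12)
```

The third record shows why the all-missing check comes first: its `min` and `max` equal the fill value (-1 here), so a naive range test for -1 would wrongly select it.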
10.5Sorted-row permutation index (KIND = SORTED_ROWS)¶
A sorted-row index stores row positions in sorted order of a column’s values, enabling binary search and range scans without reading the full column.
Shape: 1-D, length equal to NROWS (see §7.3).
Rows of the source column lying in the preallocated tail
[NROWS, extent) are not part of the permutation.
Datatype: An unsigned integer wide enough to address every row of the
source column (typically uint64).
Semantics: Element i of the index is the row position r such
that, under the ordering defined below, the i-th rank of the column’s
values lives at row r. Ties between rows with identical values MUST
be broken by increasing r, so the permutation is total and
deterministic.
Ordering: A SORTED_ROWS index MUST only be built over a column
whose HDF5 datatype has a HEP001-defined order, as enumerated below:
Signed and unsigned integers (any width): standard arithmetic order.
Floating-point values (float16, float32, float64): IEEE 754 numerical order over finite values and the two infinities. Negative zero MUST compare equal to positive zero. NaN values are not ordered; rows whose value is NaN MUST be placed at the end of the permutation, in increasing r order, and the nan_tail_length attribute MUST record the count. The NaN tail and the fill tail are always disjoint by construction: a row is placed in the NaN tail if its value is NaN, and otherwise in the fill tail if its value equals a non-NaN fill. When the column’s fill value is itself NaN, every missing row is a NaN row and is placed in the NaN tail; fill_tail_length is then 0, and nan_tail_length equals the column’s total missing-row count.
Boolean values: false (0x00) sorts before true (0x01).
Fixed- and variable-length strings: lexicographic comparison over the UTF-8 byte sequence, that is, byte-wise, with no byte-order mark and no Unicode normalization (no NFC, NFD, NFKC, or NFKD conversion is applied). For HDF5 fixed-length strings, the column’s storage-padding bytes, namely NUL bytes (0x00) for H5T_STR_NULLTERM and H5T_STR_NULLPAD and space bytes (0x20) for H5T_STR_SPACEPAD, MUST be stripped from the trailing end of each string before comparison, so that the same logical string sorts identically whether stored fixed- or variable-length. Strings whose declared HDF5 character set is H5T_CSET_ASCII MUST be ordered as if they were UTF-8 (ASCII is a strict subset). HEP001 does not specify locale-sensitive collation (e.g., POSIX strcoll, ICU UCA); byte-wise comparison is the only conformant rule because it is deterministic across implementations, platforms, and runtime locales.
Opaque (H5T_OPAQUE) values: byte-wise lexicographic comparison over the raw bytes of the value; the opaque tag, if any, is not part of the comparison.
Enum datatypes: ordered by the underlying integer codes, using the integer rule above.
The following datatypes do not have a defined sorting order, so a
SORTED_ROWS index MUST NOT be built on them:
HDF5 object and region references (no canonical order over references);
compound datatypes (a future HEP MAY register a canonical multi-field ordering);
array and variable-length-array datatypes.
Rows whose value matches a non-NaN column fill value under the
canonical missing-value test (§8.5) MUST appear
immediately before the NaN tail (if any), in increasing r order,
and the fill_tail_length attribute MUST record the count. For
datatypes that have no NaN concept, the NaN tail is empty
(nan_tail_length = 0) and the fill-tail rows appear at the very end
of the permutation. When the column’s fill value is itself NaN, all
missing rows are NaN rows and are placed in the NaN tail (see the
floating-point ordering rule above); the fill tail is then empty
(fill_tail_length = 0).
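The ordering and tail rules for a floating-point column can be sketched as below. The function name `sorted_rows` is hypothetical, and the sketch assumes that the missing-value test of §8.5 amounts to equality with the fill value (or NaN-ness when the fill value is NaN).

```python
import math


def sorted_rows(values, fill):
    """Return (permutation, fill_tail_length, nan_tail_length) for a float column."""
    fill_is_nan = math.isnan(fill)
    ordered, fill_tail, nan_tail = [], [], []
    for r, v in enumerate(values):
        if math.isnan(v):
            nan_tail.append(r)
        elif not fill_is_nan and v == fill:
            fill_tail.append(r)
        else:
            ordered.append(r)
    # list.sort is stable, so equal values (including -0.0 and 0.0) keep increasing r.
    ordered.sort(key=lambda r: values[r])
    return ordered + fill_tail + nan_tail, len(fill_tail), len(nan_tail)


nan = float("nan")
perm, fill_tail_length, nan_tail_length = sorted_rows(
    [3.0, nan, -9999.0, 0.0, -0.0, 1.0, 3.0], fill=-9999.0
)
```

Rows 3 and 4 (0.0 and -0.0) compare equal and therefore stay in row order, the fill row precedes the NaN row at the end, and a binary search only needs to consider the first `len(perm) - fill_tail_length - nan_tail_length` entries.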
Additional attributes:
KIND: "SORTED_ROWS".
nan_tail_length, fill_tail_length: scalar uint64. Both MUST be present; either MAY be 0.
ordered: scalar boolean (§6). MUST be true for SORTED_ROWS; reserved for future indexes that permit partial orderings.
Applicability: Each SORTED_ROWS search-index dataset applies to
exactly one column, identified by the column whose SEARCH_INDEX_LIST
references it. The column’s HDF5 datatype MUST have a HEP001-defined
order, as enumerated under Ordering above; otherwise no
SORTED_ROWS index is permitted.
10.6Bitmap index (KIND = BITMAP)¶
A bitmap index accelerates equality predicates on low-cardinality columns.
Shape: 2-D of shape (K, ceil(NROWS / 8)), where K is the number of
distinct values (or categories) indexed and NROWS is the table’s row
count (see §7.3). Rows of the source column
lying in the preallocated tail [NROWS, extent) are not indexed.
Datatype: uint8. Bit r % 8 (where bit 0 is the byte’s least
significant bit) of byte r / 8 of row k is set if the column’s value
at row r equals the k-th indexed value. Because the storage element
is uint8, HDF5 performs no byte swapping on read or write — the bytes
on disk are exactly the bytes the producer wrote.
Accompanying values dataset: A sibling 1-D dataset, under SEARCH_INDEXES,
holds the K indexed values in the same datatype as the source column. It
is linked from the bitmap via a scalar VALUES object-reference attribute. This
values dataset is an accompanying dataset, not a search-index dataset: it MUST
NOT carry a KIND attribute (see §10).
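The bit layout can be illustrated with a small in-memory sketch. The names `build_bitmap` and `rows_equal_to` are hypothetical; each bitmap row is modeled as a `bytearray` and the values dataset as a Python list.

```python
def build_bitmap(column, values):
    """Return K rows of ceil(NROWS / 8) bytes; bit r % 8 of byte r // 8 marks row r."""
    n_bytes = (len(column) + 7) // 8
    rows = [bytearray(n_bytes) for _ in values]
    slot = {v: k for k, v in enumerate(values)}
    for r, v in enumerate(column):
        k = slot.get(v)
        if k is not None:  # a value outside the indexed set sets no bit
            rows[k][r >> 3] |= 1 << (r & 7)
    return rows


def rows_equal_to(rows, values, target, nrows):
    """Row positions whose bit is set in the bitmap row of target."""
    bits = rows[values.index(target)]
    return [r for r in range(nrows) if bits[r >> 3] >> (r & 7) & 1]


column = ["low", "high", "low", "medium", "high",
          "low", "low", "medium", "high", "low"]
values = ["low", "medium", "high"]
bitmap = build_bitmap(column, values)
high_rows = rows_equal_to(bitmap, values, "high", len(column))
```

With ten rows each bitmap row occupies two bytes; the unused high bits of the second byte stay zero.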
Additional attributes on the bitmap dataset:
KIND: "BITMAP".
VALUES: scalar HDF5 object reference to the values dataset.
ordered: scalar boolean (§6). When true, the entries in the values dataset (linked via VALUES) are listed in a semantically meaningful order, for example, a numerically-sorted set of distinct values or an ordinal category sequence such as ["low", "medium", "high"]. The bitmap’s k-th row corresponds to the k-th entry of the values dataset under that order. When false or absent, the order of the values dataset is arbitrary (typically insertion order) and consumers MUST NOT infer any semantic ordering from it.
Applicability: Each BITMAP search-index dataset applies to
exactly one column, identified by the column whose SEARCH_INDEX_LIST
references it.
10.7Per-chunk Bloom filter index (KIND = CHUNK_BLOOM)¶
A per-chunk Bloom filter accelerates equality predicates on high-cardinality columns by giving a fast negative answer for chunks that provably do not contain the queried value.
Shape: 2-D of shape (n_chunks, m_bytes), where n_chunks is the
number of chunks of the source column that contain logical-table rows
(ceil(NROWS / chunk_length) for a chunked column, 1 for a contiguous
column, or 0 when NROWS = 0; see §7.3) and
m_bytes is the Bloom-filter byte length per chunk (constant across
chunks). Chunks lying entirely in the column’s preallocated tail
[NROWS, extent) are not indexed.
Datatype: uint8. Each row is the packed bit array of one chunk’s
Bloom filter. Because the storage element is uint8, HDF5 performs no
byte swapping on read or write — the bytes on disk are exactly the
bytes the producer wrote.
Bit packing: Filter bit g (where 0 ≤ g < m_bits) is stored at
bit position g % 8 (where bit 0 is the byte’s least significant bit)
of byte g / 8 of the chunk’s row. Equivalently, setting filter bit
g is row[g >> 3] |= (uint8_t)1 << (g & 7). m_bits MUST be a
multiple of 8, so that m_bytes = m_bits / 8 is exact.
Hash scheme: HEP001 prescribes — for interoperability — Kirsch–Mitzenmacher
double hashing h_i(x) = (h_a(x) + i * h_b(x)) mod (8 * m_bytes) for i = 0 … k − 1. Both h_a and h_b come from a single invocation of
MurmurHash3_x64_128 (the 128-bit, 64-bit-tuned variant of MurmurHash3) over
the value’s canonical byte representation (see below), seeded with the value
of the seed attribute. The low 64 bits of the 128-bit output become h_a; the
high 64 bits become h_b. The number of hash functions k is stored as an
attribute.
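The double-hashing and bit-packing rules can be sketched as follows. The sketch deliberately takes h_a and h_b as inputs rather than computing them: obtaining them requires a MurmurHash3_x64_128 implementation, which is not shown here. The function names are hypothetical. One further assumption: the sum h_a + i * h_b is evaluated with exact (unbounded) integer arithmetic before the modulo; the text above does not say whether a 64-bit wraparound is intended first, and the two readings differ when m_bits is not a power of two.

```python
def bloom_bit_positions(h_a, h_b, k, m_bits):
    """Positions h_i = (h_a + i * h_b) mod m_bits for i = 0 .. k - 1."""
    if m_bits % 8 != 0:
        raise ValueError("m_bits must be a multiple of 8")
    return [(h_a + i * h_b) % m_bits for i in range(k)]


def bloom_insert(row, h_a, h_b, k):
    """Set the k filter bits of one value in a chunk's packed filter row."""
    for g in bloom_bit_positions(h_a, h_b, k, 8 * len(row)):
        row[g >> 3] |= 1 << (g & 7)


def bloom_may_contain(row, h_a, h_b, k):
    """False means the chunk provably lacks the value; True means 'maybe'."""
    return all(row[g >> 3] >> (g & 7) & 1
               for g in bloom_bit_positions(h_a, h_b, k, 8 * len(row)))


# Stand-in hash halves; a real producer derives them from MurmurHash3_x64_128.
H_A, H_B = 0x0123456789ABCDEF, 0xFEDCBA9876543210
row = bytearray(16)  # m_bytes = 16, so m_bits = 128
bloom_insert(row, H_A, H_B, k=3)
positions = bloom_bit_positions(H_A, H_B, 3, 128)
```

A consumer evaluating an equality predicate would call `bloom_may_contain` once per chunk row and read only the chunks for which it returns true.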
Canonical byte representation: Bloom-filter hashes MUST be computed over a column value’s canonical byte representation, defined here. The canonical form is independent of how the column is stored on disk: a column written in big-endian MUST be transcoded to its canonical form before hashing, and a consumer evaluating a query MUST canonicalize the query value identically before testing the filter.
Signed integers (int8, int16, int32, int64): the value’s two’s-complement representation in the column’s storage width, packed little-endian. An int8 is one byte; an int64 is exactly eight.
Unsigned integers (uint8 … uint64): the value’s unsigned representation in the column’s storage width, packed little-endian.
Floating-point values (float16, float32, float64): the value’s IEEE 754 binary encoding at the column’s storage width, packed little-endian. NaN values MUST NOT be inserted into a Bloom filter (different NaN bit patterns would yield different filter bits and break interoperability), and consumers MUST NOT query for NaN against a CHUNK_BLOOM index. The rule applies whether NaN is a real-data value in the column or the column’s chosen fill value: in the latter case, NaN elements are missing values, and missing values are by convention excluded from Bloom-filter membership testing anyway. Producers MUST normalize negative zero to positive zero before hashing.
Boolean values: a single byte, 0x00 for false or 0x01 for true.
Fixed- and variable-length strings: the string’s UTF-8 byte sequence, with no byte-order mark and no Unicode normalization (no NFC, NFD, NFKC, or NFKD conversion is applied). For HDF5 fixed-length strings, the column’s storage-padding bytes, namely NUL bytes (0x00) for H5T_STR_NULLTERM and H5T_STR_NULLPAD and space bytes (0x20) for H5T_STR_SPACEPAD, MUST be stripped from the trailing end of each string before hashing, so that the same logical string hashes identically regardless of the column’s declared padding mode. Strings whose declared HDF5 character set is H5T_CSET_ASCII MUST be hashed as if they were UTF-8 (ASCII is a strict subset).
Opaque (H5T_OPAQUE) values: the raw bytes of the value in column order; the opaque tag, if any, is not part of the hashed bytes.
HDF5 object and region references: out of scope for this revision. Producers MUST NOT build a CHUNK_BLOOM index over a reference-typed column.
Compound, enum, array, and variable-length-array datatypes: out of scope for this revision. Producers MUST NOT build a CHUNK_BLOOM index over composite-typed columns. A future HEP MAY register canonical encodings for these cases.
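A possible rendering of these canonical forms with Python's `struct` module is sketched below. The helper names are hypothetical; `width` is the column's storage width in bytes, and `raw` is assumed to be the stored UTF-8 bytes of a single string element.

```python
import math
import struct

_INT_CODES = {1: "b", 2: "h", 4: "i", 8: "q"}
_FLOAT_CODES = {2: "e", 4: "f", 8: "d"}


def canonical_int(value, width, signed=True):
    """Little-endian integer at the column's storage width."""
    code = _INT_CODES[width] if signed else _INT_CODES[width].upper()
    return struct.pack("<" + code, value)


def canonical_float(value, width):
    """Little-endian IEEE 754 encoding; NaN is rejected, -0.0 becomes +0.0."""
    if math.isnan(value):
        raise ValueError("NaN must not be hashed into a CHUNK_BLOOM index")
    if value == 0.0:
        value = 0.0  # replaces -0.0 with +0.0
    return struct.pack("<" + _FLOAT_CODES[width], value)


def canonical_bool(value):
    return b"\x01" if value else b"\x00"


def canonical_string(raw, padding="H5T_STR_NULLTERM"):
    """Strip trailing storage padding from the bytes of a fixed-length string."""
    pad = b" " if padding == "H5T_STR_SPACEPAD" else b"\x00"
    return raw.rstrip(pad)


examples = [
    canonical_int(-2, 2),
    canonical_int(258, 4, signed=False),
    canonical_float(1.0, 4),
    canonical_string(b"abc\x00\x00"),
]
```

Both the producer and the consumer would pass the resulting bytes to the hash function, which is what makes the filter independent of the column's on-disk byte order.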
Additional attributes:
KIND: "CHUNK_BLOOM".
k: scalar uint16. The number of hash functions.
m_bits: scalar uint64. Equal to 8 * m_bytes; stored explicitly for clarity.
hash_family: scalar fixed-length ASCII string; for this revision MUST be "murmur3_x64_128_double".
seed: scalar uint32. The seed passed to MurmurHash3_x64_128, whose reference implementation accepts a 32-bit seed. Default 0. Producers MAY change the seed to mitigate adversarial inputs; consumers MUST read the seed from this attribute rather than assuming 0.
Applicability: Each CHUNK_BLOOM search-index dataset applies to
exactly one column, identified by the column whose SEARCH_INDEX_LIST
references it. The column’s HDF5 datatype MUST belong to one of the
kinds enumerated under Canonical byte representation; otherwise no
CHUNK_BLOOM index is permitted.
11Writing and appending data¶
This section specifies how producers add, remove, or rewrite rows of a
HEP001 table and how consumers interpret the table during and after
those operations. The contract is anchored on the table group’s NROWS
attribute (§7.3).
11.1How consumers interpret NROWS¶
NROWS is the authoritative count of rows currently in the table.
A consumer MUST treat rows [0, NROWS) of every column dataset as
the table’s data and MUST ignore rows [NROWS, extent) of any
column dataset, even when those rows hold values that are not the
column’s fill value. The tail is reserved storage, not data; its
contents have no semantic meaning under HEP001 and MAY contain
arbitrary bytes left behind by a previous write, a previous
truncation, or HDF5’s own fill-value mechanism.
The same cutoff applies to search indexes. A consumer MUST consult
only those index entries that describe rows or chunks within
[0, NROWS). Tail entries — for example, a CHUNK_MINMAX row
describing a chunk that lies entirely in [NROWS, extent) — MAY be
present as residue from preallocation or from a previous, larger table
state, and MUST be ignored.
11.2Appending rows¶
A producer that appends K new rows to a table MUST perform the
operation in the following order, treating step 5 as the single commit
point that publishes the new rows to readers:
1. Read the current NROWS (call it N_old).
2. Extend every column dataset so that its first-dimension extent is ≥ N_old + K. A column that already has spare capacity from a prior preallocation needs no extension; the requirement is the post-condition extent ≥ N_old + K on every column.
3. Write the new row values into rows [N_old, N_old + K) of each column. Writes MAY proceed in any order across columns.
4. Update every search index on every affected column so that, after step 5 commits, the index correctly describes rows [0, N_old + K). For index families that support efficient incremental updates (CHUNK_MINMAX, CHUNK_BLOOM), a producer MAY simply append new entries; for families that do not (SORTED_ROWS, BITMAP), the producer typically rebuilds the index from scratch. A producer that cannot perform a consistent index update MUST delete the affected indexes, and remove the corresponding references from each column’s SEARCH_INDEX_LIST, before step 5.
5. Commit by writing NROWS = N_old + K as the final step. The attribute update SHOULD be followed by an H5Fflush (or equivalent) before the producer reports the append as complete.
The ordering is load-bearing for crash recovery. Until step 5 commits,
every consumer that opens the file still sees NROWS = N_old and
therefore the table exactly as it was before the append. A producer
that crashes anywhere in steps 1–4 leaves a file whose observable
state is identical to the pre-append state: the unused tail in
[N_old, extent) is reserved storage, and any index residue beyond
N_old is ignored. No cleanup is required for readers to use the
file correctly. A subsequent producer that wants to retry the append
SHOULD either overwrite the unused tail or extend further.
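The commit ordering and its crash behavior can be demonstrated on a toy in-memory model. This is not HDF5 code: the class `ToyTable`, the function `append_rows`, and the `crash_before_commit` flag are hypothetical, Python lists stand in for column datasets, and index maintenance (step 4) is omitted because the toy table has no indexes.

```python
class ToyTable:
    """In-memory stand-in for a table group: NROWS plus equally sized columns."""

    def __init__(self, names):
        self.nrows = 0
        self.columns = {name: [] for name in names}

    def read(self):
        # Consumers see only rows [0, NROWS); the tail is reserved storage.
        return {name: col[: self.nrows] for name, col in self.columns.items()}


def append_rows(table, new_rows, crash_before_commit=False):
    n_old = table.nrows                                    # step 1
    k = len(next(iter(new_rows.values())))
    for col in table.columns.values():                     # step 2: extend
        col.extend([None] * max(0, n_old + k - len(col)))
    for name, values in new_rows.items():                  # step 3: write
        table.columns[name][n_old:n_old + k] = values
    if crash_before_commit:
        raise RuntimeError("simulated crash before step 5")
    table.nrows = n_old + k                                # step 5: commit


t = ToyTable(["ts", "energy"])
append_rows(t, {"ts": [1, 2], "energy": [0.5, 0.7]})
try:
    append_rows(t, {"ts": [3], "energy": [0.9]}, crash_before_commit=True)
except RuntimeError:
    pass
after_crash = t.read()
append_rows(t, {"ts": [3], "energy": [0.9]})  # retry overwrites the unused tail
after_retry = t.read()
```

After the simulated crash the columns already hold a third element, yet `read` still returns two rows because NROWS was never updated; the retry reuses that tail without extending further.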
HDF5 itself does not guarantee atomic ordering between the data writes
of step 3, the index writes of step 4, and the attribute update of
step 5 unless the producer issues explicit H5Fflush calls between
them. A producer that requires strong durability
across a crash MUST issue an H5Fflush after step 4 and again after
step 5, so that the on-disk state cannot show NROWS = N_old + K
together with unwritten column data or stale indexes.
11.3Preallocation¶
A producer MAY extend column datasets past the current NROWS to
amortize the cost of H5Dset_extent across many small appends — for
example, extending by one chunk’s worth of rows at a time and filling
that chunk over several append batches. The extended tail is reserved
storage; its contents have no semantic meaning under HEP001 until a
subsequent commit (step 5 above) increases NROWS to cover them.
A producer that preallocates SHOULD ensure that all columns in the same table group remain at equal first-dimension extents after each operation (see §12). The simplest and recommended discipline is to preallocate every column by the same number of rows at the same time.
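One way to follow this discipline is to compute a single target extent and apply it to every column. The sketch below rounds the required row count up to a whole number of chunks; the function name `preallocated_extent` is hypothetical, and it assumes all columns share the same chunk length, which the specification does not require.

```python
def preallocated_extent(required_rows, chunk_length):
    """Smallest multiple of chunk_length that is >= required_rows."""
    return -(-required_rows // chunk_length) * chunk_length


# Every column of the table would be extended to the same value.
extent = preallocated_extent(1000, 256)
```

When columns use different chunk lengths, a producer could instead pick the maximum of the per-column results so that the extents stay equal.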
11.4Truncation¶
A producer MAY shrink the logical table by writing a smaller NROWS
value. The truncation is logical: the column datasets MAY retain their
old extents, with rows [new_NROWS, old_NROWS) becoming reserved
storage. A producer that wants to reclaim physical space MUST rewrite
each column dataset to its new extent — typically via the h5repack
utility or an equivalent rewriting tool.
The same index-consistency rule that governs appends applies on
truncation: the producer MUST update every affected search index to
match the new NROWS, or delete it (and remove the corresponding
SEARCH_INDEX_LIST entry on the column) before committing the new
NROWS value.
11.5In-place updates¶
A producer MAY rewrite individual cells of the table — change the
value at one or more row positions < NROWS — without altering
NROWS. Producers MUST update every affected search index to reflect
the new values, or delete it before committing the change. In-place
updates do not benefit from the single-attribute commit point that
NROWS provides; producers that require atomic semantics for
in-place edits MUST arrange them externally (for example, by writing
to a fresh column dataset and swapping it in under a future revision
of this HEP, or by using application-level coordination outside HDF5).
12Consistency requirements¶
A conformant table group satisfies all of the following at all times:
The table group MUST carry a scalar NROWS attribute of datatype uint64 (see §7.3).
Every column dataset (including row index columns) in the same table group MUST have the same first-dimension extent, and that extent MUST be ≥ NROWS.
Every search-index dataset in the SEARCH_INDEXES group MUST carry a KIND attribute. The only other datasets permitted in the SEARCH_INDEXES group are the accompanying datasets those search indexes require, and an accompanying dataset MUST NOT carry a KIND attribute.
Every reference in a column’s SEARCH_INDEX_LIST MUST resolve to a search-index dataset under the table group’s SEARCH_INDEXES subgroup.
Every categorical column’s CATEGORIES reference MUST resolve to a categories dataset in the table group’s CATEGORIES subgroup (§8.7). The CATEGORIES subgroup, when present, MUST contain only categories datasets, and every categories dataset MUST be referenced by at least one categorical column’s CATEGORIES attribute.
column-order, when present, MUST list every column dataset of the table exactly once and MUST NOT list any dataset that is not a column dataset (in particular, not a categories dataset or a search-index dataset).
Every reference in the table group’s INDEX_COLUMNS attribute (when present) MUST resolve to a column dataset that is a direct child of the table group and MUST NOT be a null reference; when _index is also present, it MUST equal the dataset name of the column referenced by INDEX_COLUMNS[0].
Every categorical column’s fill value (see §8.5) MUST NOT collide with a valid integer code in the linked categories dataset’s index range, so that the canonical missing-value test (§8.5) unambiguously denotes “missing category” rather than “valid value at category index fill.”
Every search-index dataset’s content MUST correctly describe its source column for row positions in [0, NROWS). Tail entries that describe positions ≥ NROWS (for example, residue from preallocation or from a prior truncation) MAY be present and MUST be ignored by consumers.
A producer that mutates a table (appends rows, truncates, rewrites a
column, etc.) MUST follow §11 — either
updating the affected search indexes consistently or deleting them
before committing the mutation through NROWS.
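A validator for a few of the structural requirements above might look like the following sketch. It covers only the extent and column-order rules, operates on a toy description rather than an HDF5 file, and uses the hypothetical name `check_table`.

```python
def check_table(nrows, extents, column_order=None):
    """Return a list of violations for a toy table description.

    extents maps each column dataset name to its first-dimension extent.
    """
    problems = []
    if len(set(extents.values())) > 1:
        problems.append("column extents differ")
    if any(e < nrows for e in extents.values()):
        problems.append("a column extent is smaller than NROWS")
    if column_order is not None and sorted(column_order) != sorted(extents):
        problems.append("column-order does not list every column exactly once")
    return problems


ok = check_table(10, {"a": 16, "b": 16}, ["b", "a"])
bad = check_table(10, {"a": 16, "b": 8})
```

Checks on references (SEARCH_INDEX_LIST, CATEGORIES, INDEX_COLUMNS) and on index content would need access to the file itself and are left out.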
13Reserved names¶
HEP001 follows the long-standing HDF Group High-Level API practice — established by the HDF5 Table, Image, and Dimension Scales specifications — of writing reserved attribute and group names in fixed-length ASCII, UPPERCASE. The intent is that a reader scanning an HDF5 file can tell at a glance which names belong to the specification and which were chosen by the producer of the data.
13.1Naming rules¶
1. Every attribute or group name introduced as part of the HEP001 specification MUST be written in fixed-length uppercase ASCII, with underscores as the only word separator (for example, CLASS, INDEX_COLUMNS, SEARCH_INDEXES).
2. Producers MUST NOT use any reserved name listed in Section 13.2 for a column dataset, a search-index dataset, a user-supplied attribute, or any other purpose other than the one this HEP assigns to it.
3. Names that HEP001 deliberately borrows from other ecosystems, currently only from Anndata’s DataFrame layout (§15), are exempt from rule 1 and MUST be written exactly as those ecosystems write them. They are listed in Section 13.3.
4. Several attributes that align with broader scientific HDF5 community practice are exempt from rule 1 and MUST be written in lowercase, so that generic metadata harvesters and existing tools can discover them on a HEP001 table without case-folding.
   - The descriptive annotation attributes units, units_vocabulary, and description are lowercase and carry no contractual meaning: their presence, absence, or value does not change how a HEP001 consumer interprets the table or any of its objects. They are defined alongside the objects that may carry them (The table group, Column datasets) and do not appear in the reserved-name catalog. Any future descriptive annotation attribute introduced by this HEP or a successor MUST follow the same lowercase convention.
   - The value-domain attributes valid_min and valid_max, when present on a column dataset, carry contractual meaning: the column’s fill value MUST lie strictly outside [valid_min, valid_max] (see §8.5). They appear in the reserved-name catalog (Section 13.2) despite being lowercase.
5. Per-search-index attributes that are private to a specific search-index family, including configuration parameters, declarations of the algorithm used, and computed output metrics, are not part of the reserved name contract and are written in lowercase snake_case. They are documented with the index family that defines them (Search indexes).
6. KIND values (the string contents of the KIND attribute) are themselves reserved tokens and follow the same uppercase rule as reserved names.
13.2Reserved name catalog¶
The complete set of HEP001 reserved names is listed below.
Group names¶
CATEGORIES: The reserved subgroup of a table group that holds every categories dataset for the table. See Section 8.7. The token CATEGORIES names both this group and the column attribute of the same name below; the two are unambiguous because one is an HDF5 link name and the other an attribute name, but producers should be aware of the overload.
SEARCH_INDEXES: The reserved subgroup of a table group that holds every search-index dataset for the table. See Search indexes.
Table group attribute names¶
CLASS: Identifies the group as a HEP001 table group. See Section 7.1.
VERSION: HEP001 revision the table conforms to.
NROWS: Scalar uint64. Number of logical rows currently in the table. See §7.3.
TITLE: Human-readable title of the table (optional).
INDEX_COLUMNS: A 1-D array attribute of HDF5 object references, whose elements point to the column datasets that serve as row labels for the table, in hierarchical order from outermost to innermost level. See Row index columns.
Column dataset attribute names¶
SEARCH_INDEX_LIST: Object references to the search-index datasets that accelerate queries on this column.
CATEGORIES: Object reference to the categories dataset, in the table group’s CATEGORIES subgroup, that backs a categorical column. Shares its token with the CATEGORIES group above.
valid_min, valid_max (lowercase, by exception): Inclusive lower and upper bounds of the column’s logical value range. Each is a scalar attribute whose datatype matches the column’s element datatype. See §8.5.
Search-index and categories dataset attribute names¶
KIND: ASCII enum that identifies the family of a search-index dataset.
VALUES: Object reference, on a BITMAP search-index dataset, to its accompanying values dataset.
KIND attribute values¶
CHUNK_MINMAX
SORTED_ROWS
BITMAP
CHUNK_BLOOM
See Search indexes for their meaning. Consumers MUST treat unknown values as “ignore this search index”.
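A tool could encode the catalog as constants, as in the sketch below. The constant and function names are hypothetical, and the check assumes an exact, case-sensitive match against the names listed in this section.

```python
RESERVED_NAMES = frozenset({
    "CATEGORIES", "SEARCH_INDEXES",                          # group names
    "CLASS", "VERSION", "NROWS", "TITLE", "INDEX_COLUMNS",   # table group attributes
    "SEARCH_INDEX_LIST", "valid_min", "valid_max",           # column attributes
    "KIND", "VALUES",                                        # search-index attributes
})
KNOWN_KINDS = frozenset({"CHUNK_MINMAX", "SORTED_ROWS", "BITMAP", "CHUNK_BLOOM"})


def is_reserved(name):
    """True when a producer must not reuse name for its own purposes."""
    return name in RESERVED_NAMES


def usable_index(kind):
    """Consumers ignore search indexes whose KIND they do not recognize."""
    return kind in KNOWN_KINDS


checks = [is_reserved("NROWS"), is_reserved("energy"),
          usable_index("BITMAP"), usable_index("ZONE_MAP")]
```

Here `"ZONE_MAP"` is an invented KIND value used only to show the ignore rule for unknown values.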
13.3Names shared with Anndata¶
The following attribute names and string values are borrowed from Anndata’s DataFrame layout (§15) and are written in lowercase, exactly as Anndata writes them, so that a producer targeting an Anndata converter does not have to case-fold. They are reserved for their Anndata-defined meaning and MUST NOT be repurposed:
attribute names: column-order, _index, encoding-type, encoding-version, ordered;
string values: "dataframe" and "categorical" when written into an encoding-type attribute.
Of these, HEP001 itself uses column-order, _index, and ordered
(§7, §6);
encoding-type, encoding-version, and the string values are optional
pass-through names that HEP001 does not interpret. A producer MAY omit any of
these names, but if it writes one at all it MUST use this exact form
(lowercase, with the hyphens and underscore as shown).
14Worked examples¶
14.1A minimal table¶
A table of four columns — row_id (the row index), ts, energy, label —
and no search indexes.
/my_table (Group)
    CLASS = "COLUMN_TABLE" (ASCII, fixed length)
    VERSION = "1.0" (ASCII, fixed length)
    NROWS = N (uint64, scalar)
    TITLE = "Sample run" (UTF-8)
    column-order = ["row_id", "ts", "energy", "label"] (UTF-8, 1-D)
    INDEX_COLUMNS = [ref(row_id)] (1-D object references)
    _index = "row_id" (UTF-8; primary row-label name)
/my_table/row_id (Dataset, uint64, shape (N,))
    description = "Globally unique event identifier."
/my_table/ts (Dataset, int64, shape (N,))
    units = "s"
    units_vocabulary = "UDUNITS-2"
    description = "Event timestamp."
/my_table/energy (Dataset, float32, shape (N,))
    units = "MeV"
/my_table/label (Dataset, int8, shape (N,))
    description = "Class label."
    CATEGORIES = ref(CATEGORIES/label__CATEGORIES)
/my_table/CATEGORIES (Group)
/my_table/CATEGORIES/label__CATEGORIES (Dataset, vlen UTF-8, shape (3,))
    ordered = false
14.2Adding a chunk min/max search index¶
Extending §14.1 with a CHUNK_MINMAX index on ts:
/my_table/ts
    SEARCH_INDEX_LIST = [ref(SEARCH_INDEXES/ts__chunk_minmax)]
    …
/my_table/SEARCH_INDEXES (Group)
/my_table/SEARCH_INDEXES/ts__chunk_minmax (Dataset,
        compound {min: int64, max: int64, nan_count: uint64,
        fill_count: uint64, n: uint64}, shape (n_chunks,))
    KIND = "CHUNK_MINMAX"
14.3A complete layout¶
15Relationship to Anndata¶
Anndata is the most direct inspiration for HEP001: HEP001 adopts the same
group-of-one-dataset-per-column shape and deliberately reuses several of
Anndata’s attribute names (column-order, _index, ordered) so that
converters between the two are straightforward. HEP001 is not, however, a
drop-in Anndata DataFrame format, and this document does not specify
Anndata read/write conformance. The two layouts diverge on points that
matter for typical tabular data — Anndata encodes categoricals and nullable
columns as subgroups (with codes/categories or values/mask
children) and tags every column array with an encoding-type, whereas
HEP001 keeps each column as a single rank-1 dataset and records missing
values through fill values. Because an HDF5 link resolves to one object,
a column that Anndata stores as a subgroup cannot simultaneously be a HEP001
column dataset. A single group can therefore satisfy both specifications
only for the restricted case of dense numeric and variable-length string
columns with no categorical or nullable columns; anything richer requires an
explicit import/export step. HEP001 reserves the Anndata-derived attribute
names (§13.3) so producers may write them when
targeting such a converter; of these, it interprets only column-order,
_index, and ordered, and assigns encoding-type and encoding-version no
HEP001 meaning.
16Security considerations¶
Search indexes are unsigned, untrusted derivative data. A consumer that
trusts a table’s column data MUST NOT, by default, trust the correctness
of a search index found in the same file: a tampered CHUNK_MINMAX can
cause the consumer to skip chunks that do in fact satisfy a predicate.
Consumers SHOULD offer a mode that verifies a search index against the
column it covers, or that ignores search indexes entirely. Producers
SHOULD document the provenance of search indexes in the table group’s
description or in each search-index dataset’s description attribute
when that matters to their users.
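A verification mode for CHUNK_MINMAX could recompute each record and compare it with the stored one, as in the sketch below. It is limited to an integer column, checks only `min`, `max`, and `n`, ignores tail entries beyond the last data-bearing chunk, and uses hypothetical names throughout; `column` is assumed to hold exactly the logical rows [0, NROWS).

```python
def verify_chunk_minmax(column, entries, chunk_length, fill):
    """Recompute min/max per chunk of an integer column and compare with the index."""
    n_chunks = -(-len(column) // chunk_length)
    if len(entries) < n_chunks:
        return False  # some data-bearing chunk has no index entry
    for i in range(n_chunks):
        chunk = column[i * chunk_length:(i + 1) * chunk_length]
        usable = [v for v in chunk if v != fill]
        expected = (min(usable), max(usable)) if usable else (fill, fill)
        e = entries[i]
        if (e["min"], e["max"]) != expected or e["n"] != len(chunk):
            return False
    return True


column = [5, 1, -1, 7, 2, 9]  # fill value is -1
good = [{"min": 1, "max": 7, "n": 4}, {"min": 2, "max": 9, "n": 2}]
tampered = [{"min": 1, "max": 4, "n": 4}, {"min": 2, "max": 9, "n": 2}]
results = (verify_chunk_minmax(column, good, 4, -1),
           verify_chunk_minmax(column, tampered, 4, -1))
```

The tampered record understates the first chunk's maximum, which would make a query for the value 7 skip a chunk that contains it; verification costs a full scan of the column, so it is best offered as an opt-in mode.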
17References¶
HDF5 file format specification. The HDF Group. https://docs.hdfgroup.org/hdf5/develop/_f_m_t3.html
HDF5 Table specification. HDF5 High-Level Library, The HDF Group. https://support.hdfgroup.org/documentation/hdf5/latest/_t_b_l_s_p_e_c.html
PyTables File Format. PyTables Users’ Guide. https://www.pytables.org/usersguide/file_format.html
Anndata on-disk format (DataFrames). Anndata documentation. https://anndata.readthedocs.io/en/stable/fileformat-prose.html#dataframes
Apache Parquet format specification. https://parquet.apache.org/docs/file-format/
Apache Arrow columnar format. https://arrow.apache.org/docs/format/Columnar.html
RFC 2119: Key words for use in RFCs to Indicate Requirement Levels. S. Bradner, 1997. https://datatracker.ietf.org/doc/html/rfc2119
RFC 8174: Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words. B. Leiba, 2017. https://datatracker.ietf.org/doc/html/rfc8174
IEEE Std 754-2019: IEEE Standard for Floating-Point Arithmetic. IEEE, 2019. DOI: 10.1109/IEEESTD.2019.8766229
Semantic Versioning 2.0.0. T. Preston-Werner. https://semver.org/spec/v2.0.0.html
Unicode Standard Annex #15: Unicode Normalization Forms. The Unicode Consortium. https://www.unicode.org/reports/tr15/
MurmurHash3. A. Appleby, SMHasher project. https://github.com/aappleby/smhasher/wiki/MurmurHash3
A. Kirsch and M. Mitzenmacher, “Less Hashing, Same Performance: Building a Better Bloom Filter,” in Algorithms – ESA 2006 (LNCS 4168), pp. 456–467, 2006. DOI: 10.1007/11841036_42