Reading and Writing

Reading and writing AnnData objects.

Reading

Use anndata.io.read_h5ad to load a previously saved .h5ad file, or val.datasets.polis.load to import directly from a Polis conversation.

Writing

valency_anndata.write

write(
    filename: Path | str,
    adata: AnnData,
    *,
    include: Sequence[str] | None = None,
    ext: Literal["h5", "csv", "txt", "npz"] | None = None,
    compression: Literal["gzip", "lzf"] | None = "gzip",
    compression_opts: int | None = None,
) -> None

Write an AnnData object to file with automatic sanitization.

Wraps scanpy.write but first copies and sanitizes adata so that problematic fields (mixed-type uns["statements"] columns) do not cause serialization errors.
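
The sanitization step itself is not described on this page. As a rough illustration only, the sketch below shows one plausible way to make mixed-type columns safe to serialize: copy the table and cast any column that holds more than one Python type to strings. The helper `coerce_mixed_columns` and the column names are hypothetical and are not part of valency_anndata.

```python
def coerce_mixed_columns(table):
    """Return a copy of `table` in which mixed-type columns are cast to str.

    `table` maps column names to lists of values. None is treated as a
    missing value and is left untouched. This is an illustrative stand-in,
    not the library's actual sanitizer.
    """
    cleaned = {}
    for name, values in table.items():
        # Collect the distinct Python types among the non-missing values.
        kinds = {type(v) for v in values if v is not None}
        if len(kinds) > 1:
            # Mixed types: store everything as strings so the column is homogeneous.
            cleaned[name] = [None if v is None else str(v) for v in values]
        else:
            # Homogeneous column: copy as-is so the input is never mutated.
            cleaned[name] = list(values)
    return cleaned


# Hypothetical statements table with one int/str mixed column.
statements = {
    "statement_id": [0, 1, 2],
    "moderated": [1, "-1", None],
}
sanitized = coerce_mixed_columns(statements)
```

As with `write`, the input is left untouched and only the copy is changed.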

Parameters:

filename : Path | str (required)
    Output path. If the filename has no file extension, it is interpreted the same way as in scanpy.write.
adata : AnnData (required)
    Annotated data matrix. Not mutated; a sanitized copy is written.
include : Sequence[str] | None (default: None)
    When not None, only the listed "namespace/key" paths are kept in the written file. Glob patterns are supported (e.g. "obsm/X_*", "obs/kmeans_*"). Valid namespaces are obs, var, obsm, varm, layers, uns, obsp, and varp. X and raw are always retained.
ext : Literal['h5', 'csv', 'txt', 'npz'] | None (default: None)
    File extension from which to infer file format.
compression : Literal['gzip', 'lzf'] | None (default: 'gzip')
    See the h5py dataset docs (https://docs.h5py.org/en/latest/high/dataset.html).
compression_opts : int | None (default: None)
    See the h5py dataset docs (https://docs.h5py.org/en/latest/high/dataset.html).

Notes

Cluster labels and missing values. Clustering columns (kmeans_*) are stored as categorical arrays (https://anndata.readthedocs.io/en/latest/fileformat-prose.html) in the h5ad file. The on-disk encoding uses integer codes that index into a categories array, with -1 reserved for missing entries. This means two distinct "absent" semantics survive the round-trip:

  • Label -1 (e.g. HDBSCAN noise points) is a real category in the categories array. It is a valid cluster assignment meaning "this participant was clustered but not assigned to any group."
  • NaN / pd.NA means the participant was never part of the clustering subset (e.g. excluded by mask_obs). On disk this is represented by code -1, which points to no category.

After reading the file back with anndata.read_h5ad, you can distinguish the two with pandas.isna:

labels = adata.obs["kmeans_polis"]
noise  = labels == -1     # clustered, but no group
unseen = labels.isna()    # not in the clustering subset
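
To make the codes/categories distinction concrete, here is a minimal pure-Python sketch of the encoding idea described above. It is not anndata's implementation; the helpers `encode_categorical` and `decode_categorical` are hypothetical and only mimic the layout, with code -1 as the missing-value sentinel.

```python
import math


def _is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def encode_categorical(labels):
    """Encode labels as (codes, categories); missing values get code -1."""
    categories = sorted({v for v in labels if not _is_missing(v)})
    position = {cat: i for i, cat in enumerate(categories)}
    codes = [-1 if _is_missing(v) else position[v] for v in labels]
    return codes, categories


def decode_categorical(codes, categories):
    """Invert the encoding; code -1 decodes to None (missing)."""
    return [None if c == -1 else categories[c] for c in codes]


# The label -1 marks a noise point; None and NaN mark participants
# that were never part of the clustering subset.
labels = [0, 1, -1, None, 1, float("nan")]
codes, categories = encode_categorical(labels)
decoded = decode_categorical(codes, categories)

noise = [v == -1 for v in decoded]     # clustered, but no group
unseen = [v is None for v in decoded]  # not in the clustering subset
```

Here the label -1 ends up as an ordinary entry in `categories` with its own non-negative code, while missing entries carry code -1 and point to no category, so the two cases stay distinguishable after decoding.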

Examples:

Basic — write everything:

val.write("conversation.h5ad", adata)

Advanced — selectively include keys with glob patterns:

val.write(
    "export.h5ad",
    adata,
    include=["obsm/X_pca", "obsm/X_pacmap", "obs/kmeans_*", "uns/*"],
)
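
The `include` patterns used above can be pictured as glob matching over "namespace/key" strings. The following is a minimal sketch of that idea using the standard-library `fnmatch` module. The helper `select_paths`, its error handling, and the example key names are assumptions for illustration and may differ from how valency_anndata filters internally.

```python
from fnmatch import fnmatchcase

# Namespaces listed as valid in the `include` parameter description.
VALID_NAMESPACES = {"obs", "var", "obsm", "varm", "layers", "uns", "obsp", "varp"}


def select_paths(available, include):
    """Return the entries of `available` that match any pattern in `include`.

    Both arguments hold "namespace/key" strings; patterns may contain
    glob wildcards such as "*".
    """
    for pattern in include:
        namespace, sep, _key = pattern.partition("/")
        if not sep or namespace not in VALID_NAMESPACES:
            raise ValueError(f"invalid include pattern: {pattern!r}")
    return [
        path
        for path in available
        if any(fnmatchcase(path, pattern) for pattern in include)
    ]


# Hypothetical contents of an AnnData object, flattened to paths.
available = [
    "obs/kmeans_polis",
    "obs/n_votes",
    "obsm/X_pca",
    "obsm/X_pacmap",
    "obsm/X_umap",
    "uns/statements",
    "layers/raw_votes",
]
kept = select_paths(
    available,
    ["obsm/X_pca", "obsm/X_pacmap", "obs/kmeans_*", "uns/*"],
)
```

In this sketch, paths that match no pattern (such as the hypothetical "obs/n_votes") are dropped, which mirrors the documented behavior that only listed paths are kept.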
Source code in src/valency_anndata/_write.py
def write(
    filename: Path | str,
    adata: AnnData,
    *,
    include: Sequence[str] | None = None,
    ext: Literal["h5", "csv", "txt", "npz"] | None = None,
    compression: Literal["gzip", "lzf"] | None = "gzip",
    compression_opts: int | None = None,
) -> None:
    """Write an [AnnData][anndata.AnnData] object to file with automatic sanitization.

    Wraps [scanpy.write][] but first copies and sanitizes `adata` so that
    problematic fields (mixed-type ``uns["statements"]`` columns) do not
    cause serialization errors.

    Parameters
    ----------
    filename
        Output path.  If the filename has no file extension it is interpreted
        the same way as [scanpy.write][].
    adata
        Annotated data matrix.  **Not** mutated — a sanitized copy is written.
    include
        When not ``None``, only the listed ``"namespace/key"`` paths are kept
        in the written file.  Glob patterns are supported
        (e.g. ``"obsm/X_*"``, ``"obs/kmeans_*"``).  Valid namespaces are
        ``obs``, ``var``, ``obsm``, ``varm``, ``layers``, ``uns``, ``obsp``,
        and ``varp``.  ``X`` and ``raw`` are always retained.
    ext
        File extension from which to infer file format.
    compression
        See `h5py dataset docs <https://docs.h5py.org/en/latest/high/dataset.html>`_.
    compression_opts
        See `h5py dataset docs <https://docs.h5py.org/en/latest/high/dataset.html>`_.

    Notes
    -----
    **Cluster labels and missing values.**
    Clustering columns (``kmeans_*``) are stored as
    `categorical arrays <https://anndata.readthedocs.io/en/latest/fileformat-prose.html>`_
    in the h5ad file. The on-disk encoding uses integer *codes* that index
    into a *categories* array, with ``-1`` reserved for missing entries.
    This means two distinct "absent" semantics survive the round-trip:

    * **Label ``-1``** (e.g. HDBSCAN noise points) is a real category in
      the categories array. It is a valid cluster assignment meaning
      "this participant was clustered but not assigned to any group."
    * **``NaN`` / ``pd.NA``** means the participant was never part of the
      clustering subset (e.g. excluded by ``mask_obs``). On disk this is
      represented by code ``-1``, which points to no category.

    After reading the file back with :func:`anndata.read_h5ad`, you can
    distinguish the two with :func:`pandas.isna`::

        labels = adata.obs["kmeans_polis"]
        noise  = labels == -1     # clustered, but no group
        unseen = labels.isna()    # not in the clustering subset

    Examples
    --------
    Basic — write everything:

    ```py
    val.write("conversation.h5ad", adata)
    ```

    Advanced — selectively include keys with glob patterns:

    ```py
    val.write(
        "export.h5ad",
        adata,
        include=["obsm/X_pca", "obsm/X_pacmap", "obs/kmeans_*", "uns/*"],
    )
    ```
    """
    sanitized = _sanitize_for_export(adata)
    if include is not None:
        _filter_adata(sanitized, include)
    sc.write(
        filename,
        sanitized,
        ext=ext,
        compression=compression,
        compression_opts=compression_opts,
    )