Tools

valency-anndata methods

valency_anndata.tools.recipe_polis

recipe_polis(
    adata: AnnData,
    *,
    participant_vote_threshold: int = 7,
    key_added_pca: str = "X_pca_polis",
    key_added_kmeans: str = "kmeans_polis",
    mask_var: str | None = None,
    inplace: bool = True,
)

Projects and clusters participants as in [Small et al., 2021].

Expects sparse vote matrix .X with {+1, 0, -1} and NaN values.
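
A minimal sketch of that encoding, using only numpy and anndata (illustrative values; real conversations typically come from the val.datasets loaders and also carry the statement metadata used by step 1 of the recipe below):

import numpy as np
import anndata as ad

# Participants × statements: +1 agree, -1 disagree, 0 pass, NaN = not voted.
votes = np.array([
    [ 1.0,   0.0,  np.nan, -1.0],
    [np.nan, 1.0,   1.0,    0.0],
    [-1.0,  np.nan, 1.0,    1.0],
])
adata = ad.AnnData(votes)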

Recipe Steps
  1. Masks out meta and moderated-out statements with zeros.
  2. Imputes missing matrix votes with statement-wise means (sketched below).
  3. Runs standard PCA on the imputed matrix.
  4. Runs sparsity-aware scaling on PCA projections.
  5. Calculates a participant mask using the participant_vote_threshold (7 votes by default).
  6. On unmasked rows, calculates k-means clustering for 2 ≤ k ≤ 5, selecting the optimal k via silhouette scores.
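
For intuition on step 2, here is a minimal numpy-only sketch of statement-wise mean imputation (not the library's implementation):

import numpy as np

# Toy vote matrix: rows are participants, columns are statements; NaN = missing vote.
votes = np.array([
    [ 1.0,   np.nan, -1.0],
    [ 1.0,    0.0,   np.nan],
    [np.nan, -1.0,    1.0],
])
statement_means = np.nanmean(votes, axis=0)                 # mean of observed votes per statement
imputed = np.where(np.isnan(votes), statement_means, votes) # fill missing votes with those means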

Parameters:

participant_vote_threshold : int, default 7
    Vote threshold at which each participant will be included in clustering.
key_added_pca : str, default 'X_pca_polis'
    If not specified, the PCA embedding is stored as .obsm['X_pca_polis'], the loadings as .varm['X_pca_polis'], and the PCA parameters in .uns['X_pca_polis']. If specified, all are stored instead at [key_added_pca].
key_added_kmeans : str, default 'kmeans_polis'
    .obs key under which to add the cluster labels.
mask_var : str | None, default None
    Column name in adata.var to use for masking statements before PCA. If provided, only statements where mask_var is True will be used. If None, uses all statements.
inplace : bool, default True
    Perform the computation in place or return the result.

Returns:

.obsm['X_pca_polis' | key_added_pca]
    PCA representation of data.
.varm['X_pca_polis' | key_added_pca]
    The principal components containing the loadings.
.uns['X_pca_polis' | key_added_pca]['variance_ratio']
    Ratio of explained variance.
.uns['X_pca_polis' | key_added_pca]['variance']
    Explained variance, equivalent to the eigenvalues of the covariance matrix.
.obs['kmeans_polis' | key_added_kmeans]
    Array of dim (number of samples) that stores the subgroup id ('0', '1', …) for each participant.
.uns['kmeans_polis' | key_added_kmeans]['params']
    A dict with the values for the k-means parameters.

Examples:

Basic usage:

import valency_anndata as val
adata = val.datasets.aufstehen()
val.tools.recipe_polis(adata)
val.viz.embedding(adata, basis="pca_polis", color="kmeans_polis")

Use with highly variable statement filtering:

import valency_anndata as val
adata = val.datasets.aufstehen()
# First identify highly variable statements
val.preprocessing.highly_variable_statements(adata, n_top_statements=100)
# Run Polis recipe using only highly variable statements for PCA
val.tools.recipe_polis(adata, mask_var="highly_variable")
# Visualize the results
val.viz.embedding(adata, basis="pca_polis", color="kmeans_polis")
Source code in src/valency_anndata/tools/_polis.py
def recipe_polis(
    adata: AnnData,
    *,
    participant_vote_threshold: int = 7,
    key_added_pca: str = "X_pca_polis",
    key_added_kmeans: str = "kmeans_polis",
    mask_var: str | None = None,
    inplace: bool = True,
):
    """
    Projects and clusters participants as of [[Small _et al._,
    2021](http://dx.doi.org/10.6035/recerca.5516)].

    Expects sparse vote matrix [`.X`][anndata.AnnData.X] with `{+1, 0, -1}`
    and `NaN` values.

    Recipe Steps
    ------------

    1. Masks out meta and moderated-out statements with zeros.
    2. Imputes missing matrix votes with statement-wise means.
    3. Runs standard PCA on the imputed matrix.
    4. Runs sparsity-aware scaling on PCA projections.
    5. Calculates a participant mask using 7-vote threshold.
    6. On unmasked rows, calculates k-means clustering for 2 ≤ k ≤ 5,
       selecting the optimal k via silhouette scores.

    Parameters
    ----------

    participant_vote_threshold
        Vote threshold at which each participant will be included in clustering.
    key_added_pca
        If not specified, the PCA embedding is stored as
        [`.obsm`][anndata.AnnData.obsm]`['X_pca_polis']`, the loadings as
        [`.varm`][anndata.AnnData.varm]`['X_pca_polis']`, and the PCA parameters
        in [`.uns`][anndata.AnnData.uns]`['X_pca_polis']`.
        If specified, all are stored instead at `[key_added_pca]`.
    key_added_kmeans
        [`.obs`][anndata.AnnData.obs] key under which to add the cluster labels.
    mask_var
        Column name in `adata.var` to use for masking statements before PCA.
        If provided, only statements where `mask_var` is True will be used.
        If None, uses all statements.
    inplace
        Perform computation inplace or return result.

    Returns
    -------

    .obsm['X_pca_polis' | key_added]
        PCA representation of data.
    .varm['X_pca_polis' | key_added]
        The principal components containing the loadings.
    .uns['X_pca_polis' | key_added]['variance_ratio']
        Ratio of explained variance.
    .uns['X_pca_polis' | key_added]['variance']
        Explained variance, equivalent to the eigenvalues of the covariance matrix.
    .obs['kmeans_polis' | key_added]
        Array of dim (number of samples) that stores the subgroup id ('0', '1', …) for each cell.
    .uns['kmeans_polis' | key_added]['params']
        A dict with the values for the k-means parameters.

    Examples
    --------
    Basic usage:

    ```py
    import valency_anndata as val
    adata = val.datasets.aufstehen()
    val.tools.recipe_polis(adata)
    val.viz.embedding(adata, basis="pca_polis", color="kmeans_polis")
    ```

    Use with highly variable statement filtering:

    ```py
    import valency_anndata as val
    adata = val.datasets.aufstehen()
    # First identify highly variable statements
    val.preprocessing.highly_variable_statements(adata, n_top_statements=100)
    # Run Polis recipe using only highly variable statements for PCA
    val.tools.recipe_polis(adata, mask_var="highly_variable")
    # Visualize the results
    val.viz.embedding(adata, basis="pca_polis", color="kmeans_polis")
    ```

    """
    if not inplace:
        adata = adata.copy()

    # Preconditions
    assert isinstance(adata.X, np.ndarray)

    # 1. Mask statements with zeros
    _zero_mask(
        adata,
        key_added_var_mask="zero_mask",
        key_added_layer="X_masked",
    )

    # 2. Impute
    val.preprocessing.impute(
        adata,
        strategy="mean",
        source_layer="X_masked",
        target_layer="X_masked_imputed_mean",
    )

    # 3. PCA (unscaled)
    pca_kwargs = {
        "layer": "X_masked_imputed_mean",
        "key_added": "X_pca_masked_unscaled",
    }
    if mask_var is None:
        # Explicitly disable highly_variable filtering so PCA doesn't silently
        # filter statements in ways the polis recipe is not expecting.
        pca_kwargs["use_highly_variable"] = False
    else:
        pca_kwargs["mask_var"] = mask_var

    val.tools.pca(adata, **pca_kwargs)

    # 4. Scale PCA using sparsity data
    _sparsity_aware_scaling(
        adata,
        use_rep="X_pca_masked_unscaled",
        key_added=key_added_pca,
    )

    # Create cluster mask for threshold
    _cluster_mask(
        adata,
        participant_vote_threshold=participant_vote_threshold,
        key_added_obs_mask="cluster_mask",
    )

    # 5. KMeans clustering
    val.tools.kmeans(
        adata,
        use_rep=key_added_pca,
        # Force kmeans to only run on first two principal components.
        n_pcs=2,
        k_bounds=(2, 5),
        init="polis",
        mask_obs="cluster_mask",
        key_added=key_added_kmeans,
        inplace=inplace,
    )

    if not inplace:
        return adata

valency_anndata.tools.recipe_polis2_statements

recipe_polis2_statements(
    adata: AnnData,
    *,
    show_progress: bool = False,
    inplace: bool = True,
) -> AnnData | None

Embed and cluster statements (the var axis) using the Polis v2 pipeline.

Reads free-text statement content from .var["content"], produces dense embeddings, projects them to 2-D with UMAP, and attaches a hierarchy of cluster labels — all stored on the var axis so that the results live alongside the statements that produced them.

Requires the optional polis2 dependency group:

    pip install valency-anndata[polis2]

Recipe Steps
  1. Embeds each statement's text into a high-dimensional vector space and stores the result in .varm["content_embedding"].
  2. Projects the embeddings to 2-D with UMAP and stores the coordinates in .varm["content_umap"].
  3. Builds a hierarchy of clustering layers (finest → coarsest) and stores them in .varm["evoc_polis2"] (shape n_var × num_layers) with the coarsest layer also surfaced as the categorical column .var["evoc_polis2_top"] (see the sketch below).
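
Once the recipe has run (as in the examples below), the layer stacking can be inspected directly; a minimal sketch using the keys documented above:

import numpy as np

layers = np.asarray(adata.varm["evoc_polis2"])   # shape (n_var, num_layers)
finest = layers[:, 0]                            # bottom layer: finest clusters
coarsest = layers[:, -1]                         # top layer, also stored in .var["evoc_polis2_top"]
n_noise = int((coarsest == -1).sum())            # -1 marks noise/unassigned statements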

Parameters:

adata : AnnData (required)
    AnnData object whose .var["content"] column contains the statement text strings.
show_progress : bool, default False
    Show embedding progress bar. When False (the default), warnings and progress output from the model-loading libraries are also suppressed.
inplace : bool, default True
    If True (default), mutate adata and return None. If False, operate on a copy and return it.

Returns:

Depending on inplace, returns None or the modified AnnData.

.varm['content_embedding']
    Dense text embeddings, shape (n_var, embed_dim).
.varm['content_umap']
    2-D UMAP projection of the embeddings, shape (n_var, 2).
.varm['evoc_polis2']
    Stacked layers of clustering labels, shape (n_var, num_layers). Column 0 is the finest/bottom; column -1 is the coarsest/top. -1 = noise.
.var['evoc_polis2_top']
    Categorical column taken from the coarsest clustering layer (i.e. evoc_polis2[:, -1]).

Examples:

adata = val.datasets.polis.chile_protests(translate_to="en")

with val.viz.schematic_diagram(diff_from=adata):
    val.tools.recipe_polis2_statements(adata)

val.viz.embedding(
    # Transpose .var and .obs axes for plotting
    adata.transpose(),
    basis="content_umap",
    color=["evoc_polis2_top", "moderation_state"],
)

Source code in src/valency_anndata/tools/_polis2.py
def recipe_polis2_statements(adata: AnnData, *, show_progress: bool = False, inplace: bool = True) -> AnnData | None:
    """Embed and cluster **statements** (the var axis) using the Polis v2 pipeline.

    Reads free-text statement content from ``.var["content"]``, produces
    dense embeddings, projects them to 2-D with UMAP, and attaches a
    hierarchy of cluster labels — all stored on the **var** axis so that
    the results live alongside the statements that produced them.

    Requires the optional ``polis2`` dependency group::

        pip install valency-anndata[polis2]

    Recipe Steps
    ------------

    1. Embeds each statement's text into a high-dimensional vector space
       and stores the result in ``.varm["content_embedding"]``.
    2. Projects the embeddings to 2-D with UMAP and stores the coordinates
       in ``.varm["content_umap"]``.
    3. Builds a hierarchy of clustering layers (finest → coarsest) and
       stores them in ``.varm["evoc_polis2"]`` (shape ``n_var × num_layers``)
       with the coarsest layer also surfaced as the categorical column
       ``.var["evoc_polis2_top"]``.

    Parameters
    ----------
    adata :
        AnnData object whose ``.var["content"]`` column contains the
        statement text strings.
    show_progress :
        Show embedding progress bar.  When ``False`` (the default),
        warnings and progress output from the model-loading libraries
        are also suppressed.
    inplace :
        If ``True`` (default), mutate *adata* and return ``None``.
        If ``False``, operate on a copy and return it.

    Returns
    -------
    Depending on *inplace*, returns ``None`` or the modified ``AnnData``.

    .varm['content_embedding']
        Dense text embeddings, shape ``(n_var, embed_dim)``.
    .varm['content_umap']
        2-D UMAP projection of the embeddings, shape ``(n_var, 2)``.
    .varm['evoc_polis2']
        Stacked layers of clustering labels, shape ``(n_var, num_layers)``.
        Column 0 is the finest/bottom; column -1 is the coarsest/top.  ``-1`` = noise.
    .var['evoc_polis2_top']
        Categorical column taken from the coarsest clustering layer
        (i.e. ``evoc_polis2[:, -1]``).

    Examples
    --------

    ```py
    adata = val.datasets.polis.chile_protests(translate_to="en")

    with val.viz.schematic_diagram(diff_from=adata):
        val.tools.recipe_polis2_statements(adata)
    ```

    <img src="../../assets/documentation-examples/tools--polis2--schematic.png">

    ```py
    val.viz.embedding(
        # Transpose .var and .obs axes for plotting
        adata.transpose(),
        basis="content_umap",
        color=["evoc_polis2_top", "moderation_state"],
    )
    ```

    <img src="../../assets/documentation-examples/tools--polis2--plot.png">
    """
    if not inplace:
        adata = adata.copy()

    texts = adata.var["content"].tolist()

    # Suppress noisy warnings / loggers from HF Hub, sentence-transformers
    # and umap during model loading, unless the caller opted into progress.
    @contextmanager
    def _no_op() -> Generator[None, None, None]:
        yield

    ctx = _quiet() if not show_progress else _no_op()
    with ctx:
        adata.varm["content_embedding"] = _embed_statements(texts, show_progress=show_progress)
        content_embedding = np.asarray(adata.varm["content_embedding"])
        adata.varm["content_umap"] = _project_umap(content_embedding)
        cluster_layers = _create_cluster_layers(content_embedding)

    adata.varm["evoc_polis2"] = np.array(cluster_layers).T
    adata.var["evoc_polis2_top"] = adata.varm["evoc_polis2"][:, -1]
    adata.var["evoc_polis2_top"] = (
        adata.var["evoc_polis2_top"]
        # -1 = noise/unassigned; convert to NA so scanpy renders as lightgray.
        .where(adata.var["evoc_polis2_top"] != -1)
        # Nullable int so NAs survive; category for discrete colormap.
        .astype("Int64")
        .astype("category")
    )

    if not inplace:
        return adata

valency_anndata.tools.kmeans

kmeans(
    adata: AnnData,
    use_rep: Optional[str] = None,
    n_pcs: Optional[int] = None,
    k_bounds: Optional[Tuple[int, int]] = None,
    init: Literal[
        "k-means++", "random", "polis"
    ] = "k-means++",
    init_centers: Optional[ndarray] = None,
    random_state: Optional[int] = None,
    mask_obs: NDArray[bool_] | str | None = None,
    key_added: str = "kmeans",
    inplace: bool = True,
) -> AnnData | None

Apply BestPolisKMeans clustering to an AnnData object.

Parameters:

adata : AnnData (required)
    Input data. Must have .X as a numpy array.
use_rep : Optional[str], default None
    Representation to use for clustering. If None, use 'X_pca' if present in adata.obsm, otherwise fall back to adata.X.
n_pcs : Optional[int], default None
    Number of dimensions to use from the selected representation. If given, only the first n_pcs columns are used.
k_bounds : Optional[Tuple[int, int]], default None
    Minimum and maximum number of clusters to try. Defaults to [2, 5].
init : Literal['k-means++', 'random', 'polis'], default 'k-means++'
    Initialization method for KMeans.
init_centers : Optional[ndarray], default None
    Initial cluster centers to use.
random_state : Optional[int], default None
    Random seed for reproducibility.
mask_obs : NDArray[bool_] | str | None, default None
    Restrict clustering to a certain set of observations. The mask is specified as a boolean array or a string referring to an array in anndata.AnnData.obs.
key_added : str, default 'kmeans'
    Name of the column to store cluster labels in adata.obs.
inplace : bool, default True
    If True, modify adata in place and return None. If False, return a copy with the clustering added.

Returns:

AnnData or None
    Returns a copy if inplace=False, otherwise modifies in place.
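
A minimal usage sketch, reusing the aufstehen dataset from the recipe_polis examples above together with the PCA representation and participant mask that recipe writes (key names are the recipe defaults documented above):

import valency_anndata as val

adata = val.datasets.aufstehen()
val.tools.recipe_polis(adata)                    # writes .obsm["X_pca_polis"] and .obs["cluster_mask"]
val.tools.kmeans(
    adata,
    use_rep="X_pca_polis",
    n_pcs=2,
    k_bounds=(2, 5),
    init="polis",
    mask_obs="cluster_mask",                     # participant mask written by recipe_polis
    key_added="kmeans_custom",
)
adata.obs["kmeans_custom"].value_counts()        # cluster sizes
adata.uns["kmeans_custom"]["params"]["best_k"]   # selected number of clusters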

Source code in src/valency_anndata/tools/_kmeans.py
def kmeans(
    adata: AnnData,
    use_rep: Optional[str] = None,
    n_pcs: Optional[int] = None,
    k_bounds: Optional[Tuple[int, int]] = None,
    init: Literal["k-means++", "random", "polis"] = "k-means++",
    init_centers: Optional[np.ndarray] = None,
    random_state: Optional[int] = None,
    mask_obs: NDArray[np.bool_] | str | None = None,
    key_added: str = "kmeans",
    inplace: bool = True,
) -> AnnData | None:
    """
    Apply BestPolisKMeans clustering to an AnnData object.

    Parameters
    ----------
    adata :
        Input data. Must have `.X` as a numpy array.
    use_rep
        Representation to use for clustering. If ``None``, use ``'X_pca'`` if
        present in ``adata.obsm``, otherwise fall back to ``adata.X``.
    n_pcs
        Number of dimensions to use from the selected representation. If given,
        only the first ``n_pcs`` columns are used.
    k_bounds :
        Minimum and maximum number of clusters to try. Defaults to [2, 5].
    init :
        Initialization method for KMeans. Defaults to 'k-means++'.
    init_centers :
        Initial cluster centers to use.
    random_state :
        Random seed for reproducibility.
    mask_obs :
        Restrict clustering to a certain set of observations. The mask is
        specified as a boolean array or a string referring to an array in
        [anndata.AnnData.obs][].
    key_added :
        Name of the column to store cluster labels in `adata.obs`.
    inplace :
        If True, modify `adata` in place and return None.
        If False, return a copy with the clustering added.

    Returns
    -------
    AnnData or None
        Returns a copy if `inplace=False`, otherwise modifies in place.
    """
    X = _choose_representation(adata, use_rep=use_rep, n_pcs=n_pcs)

    if not isinstance(X, np.ndarray):
        raise ValueError("Selected representation must be a numpy array.")

    if k_bounds is None:
        k_bounds_list = [2, 5]
    else:
        k_bounds_list = list(k_bounds)

    mask = _check_mask(adata, mask_obs, "obs")
    if mask is None:
        X_cluster = X
    else:
        X_cluster = X[mask]
        if X_cluster.shape[0] == 0:
            raise ValueError("mask_obs excludes all observations.")

    best_kmeans = BestPolisKMeans(
        k_bounds=k_bounds_list,
        init=init,
        init_centers=init_centers,
        random_state=random_state,
    )
    best_kmeans.fit(X_cluster)

    if not best_kmeans.best_estimator_:
        raise RuntimeError("BestPolisKMeans did not find a valid estimator.")

    raw_labels = best_kmeans.best_estimator_.labels_

    if mask is None:
        full_labels = raw_labels
    else:
        # dtype=object keeps labels from casting to float.
        full_labels = np.full(adata.n_obs, np.nan, dtype=object)
        full_labels[mask] = raw_labels

    labels = pd.Categorical(full_labels)

    def _write_kmeans_result(adata_out: AnnData) -> None:
        adata_out.obs[key_added] = labels

        kmeans_params = dict(
            k_bounds=k_bounds_list,
            best_k=best_kmeans.best_k_,
            best_score=best_kmeans.best_score_,
            init=init,
            random_state=random_state,
            use_rep=use_rep,
            n_pcs=n_pcs,
        )

        adata_out.uns[key_added] = {}
        adata_out.uns[key_added]["params"] = kmeans_params

    if inplace:
        _write_kmeans_result(adata)
        return None
    else:
        adata_copy = adata.copy()
        _write_kmeans_result(adata_copy)
        return adata_copy

valency_anndata.tools.pacmap

pacmap(
    adata: AnnData,
    *,
    layer: str = "X_imputed",
    n_neighbors: Optional[int] = None,
    n_components: int = 2,
    mask_var: str | None = None,
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | None

Compute PaCMAP dimensionality reduction.

Parameters:

adata : AnnData (required)
    AnnData object.
layer : str, default 'X_imputed'
    Layer to use for computation.
n_neighbors : Optional[int], default None
    Number of neighbors for PaCMAP.
n_components : int, default 2
    Number of dimensions for the embedding.
mask_var : str | None, default None
    Column name in adata.var to use for masking variables. If provided, only variables where mask_var is True will be used.
key_added : str | None, default None
    Key under which to store the embedding in adata.obsm. Default is "X_pacmap".
copy : bool, default False
    Return a copy instead of modifying adata in place.

Returns:

AnnData | None
    Returns AnnData if copy=True, otherwise returns None.
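
A minimal usage sketch. PaCMAP reads from a layer (default "X_imputed") that must be free of NaN values; the zero-fill below is only for illustration and is not the library's imputation:

import numpy as np
import valency_anndata as val

adata = val.datasets.aufstehen()
# Fill missing votes so the layer is dense and NaN-free (illustrative only).
adata.layers["X_imputed"] = np.nan_to_num(np.asarray(adata.X))
val.tools.pacmap(adata, n_components=2)
adata.obsm["X_pacmap"].shape                     # (n_obs, 2)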

Source code in src/valency_anndata/tools/_pacmap.py
def pacmap(
    adata: AnnData,
    *,
    layer: str = "X_imputed",
    n_neighbors: Optional[int] = None,
    n_components: int = 2,
    mask_var: str | None = None,
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | None:
    """
    Compute PaCMAP dimensionality reduction.

    Parameters
    ----------
    adata
        AnnData object.
    layer
        Layer to use for computation. Default is "X_imputed".
    n_neighbors
        Number of neighbors for PaCMAP.
    n_components
        Number of dimensions for the embedding. Default is 2.
    mask_var
        Column name in `adata.var` to use for masking variables.
        If provided, only variables where `mask_var` is True will be used.
    key_added
        Key under which to store the embedding in `adata.obsm`.
        Default is "X_pacmap".
    copy
        Return a copy instead of modifying adata in place.

    Returns
    -------
    AnnData | None
        Returns AnnData if `copy=True`, otherwise returns None.
    """
    adata = adata.copy() if copy else adata

    key_obsm, key_uns = ("X_pacmap", "pacmap") if key_added is None else [key_added] * 2

    start = logg.info("computing PaCMAP")

    from pacmap import PaCMAP

    estimator = PaCMAP(
        n_components=n_components,
        n_neighbors=n_neighbors,
    )

    # Get data from layer, optionally filtering by mask_var
    X = adata.layers[layer]
    if mask_var is not None:
        mask = adata.var[mask_var].values
        X = X[:, mask]

    X_reduced = estimator.fit_transform(X)

    adata.obsm[key_obsm] = X_reduced

    return adata if copy else None

valency_anndata.tools.localmap

localmap(
    adata: AnnData,
    *,
    layer: str = "X_imputed",
    n_neighbors: Optional[int] = None,
    n_components: int = 2,
    mask_var: str | None = None,
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | None

Compute LocalMAP dimensionality reduction.

Parameters:

adata : AnnData (required)
    AnnData object.
layer : str, default 'X_imputed'
    Layer to use for computation.
n_neighbors : Optional[int], default None
    Number of neighbors for LocalMAP.
n_components : int, default 2
    Number of dimensions for the embedding.
mask_var : str | None, default None
    Column name in adata.var to use for masking variables. If provided, only variables where mask_var is True will be used.
key_added : str | None, default None
    Key under which to store the embedding in adata.obsm. Default is "X_localmap".
copy : bool, default False
    Return a copy instead of modifying adata in place.

Returns:

AnnData | None
    Returns AnnData if copy=True, otherwise returns None.
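
LocalMAP follows the same pattern as pacmap above; a minimal sketch (again with an illustrative NaN fill, here under a custom key):

import numpy as np
import valency_anndata as val

adata = val.datasets.aufstehen()
adata.layers["X_imputed"] = np.nan_to_num(np.asarray(adata.X))   # illustrative NaN fill
val.tools.localmap(adata, n_components=2, key_added="X_localmap_2d")
adata.obsm["X_localmap_2d"].shape                                # (n_obs, 2)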

Source code in src/valency_anndata/tools/_pacmap.py
def localmap(
    adata: AnnData,
    *,
    layer: str = "X_imputed",
    n_neighbors: Optional[int] = None,
    n_components: int = 2,
    mask_var: str | None = None,
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | None:
    """
    Compute LocalMAP dimensionality reduction.

    Parameters
    ----------
    adata
        AnnData object.
    layer
        Layer to use for computation. Default is "X_imputed".
    n_neighbors
        Number of neighbors for LocalMAP.
    n_components
        Number of dimensions for the embedding. Default is 2.
    mask_var
        Column name in `adata.var` to use for masking variables.
        If provided, only variables where `mask_var` is True will be used.
    key_added
        Key under which to store the embedding in `adata.obsm`.
        Default is "X_localmap".
    copy
        Return a copy instead of modifying adata in place.

    Returns
    -------
    AnnData | None
        Returns AnnData if `copy=True`, otherwise returns None.
    """
    adata = adata.copy() if copy else adata

    key_obsm, key_uns = ("X_localmap", "localmap") if key_added is None else [key_added] * 2

    start = logg.info("computing LocalMAP")

    from pacmap import LocalMAP

    estimator = LocalMAP(
        n_components=n_components,
        n_neighbors=n_neighbors,
    )

    # Get data from layer, optionally filtering by mask_var
    X = adata.layers[layer]
    if mask_var is not None:
        mask = adata.var[mask_var].values
        X = X[:, mask]

    X_reduced = estimator.fit_transform(X)

    adata.obsm[key_obsm] = X_reduced

    return adata if copy else None

scanpy methods (inherited)

Note

These methods are simply quick convenience wrappers around methods in scanpy, a tool for single-cell gene expression. They will use terms like "cells", "genes" and "counts", but you can think of these as "participants", "statements" and "votes".

See scanpy.tl for more methods you can experiment with via the val.scanpy.tl namespace.

valency_anndata.tools.pca

pca(
    data: AnnData | ndarray | CSBase,
    n_comps: int | None = None,
    *,
    layer: str | None = None,
    zero_center: bool = True,
    svd_solver: SvdSolver | None = None,
    chunked: bool = False,
    chunk_size: int | None = None,
    random_state: _LegacyRandom = 0,
    return_info: bool = False,
    mask_var: NDArray[bool_] | str | None | Empty = _empty,
    use_highly_variable: bool | None = None,
    dtype: DTypeLike = "float32",
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | ndarray | CSBase | None

Principal component analysis [Pedregosa et al., 2011].

Computes PCA coordinates, loadings and variance decomposition. Uses the following implementations (and defaults for svd_solver):

  • chunked=False, zero_center=True
    • numpy.ndarray, scipy.sparse.spmatrix, or scipy.sparse.sparray: sklearn PCA ('arpack')
    • dask.array.Array, dense: dask-ml PCA ('auto'); consider svd_solver='covariance_eigh' to reduce memory usage (see dask/dask-ml#985)
    • dask.array.Array, sparse or svd_solver='covariance_eigh': custom implementation ('covariance_eigh')
  • chunked=False, zero_center=False
    • numpy.ndarray, scipy.sparse.spmatrix, or scipy.sparse.sparray: sklearn TruncatedSVD ('randomized')
    • dask.array.Array: dask-ml TruncatedSVD ('tsqr'); this implementation cannot handle sparse chunks, try manually densifying them
  • chunked=True (zero_center ignored)
    • numpy.ndarray, scipy.sparse.spmatrix, or scipy.sparse.sparray: sklearn IncrementalPCA ('auto')
    • dask.array.Array: dask-ml IncrementalPCA ('auto'); this implementation densifies sparse chunks and therefore has increased memory usage

Parameters:

data : AnnData | ndarray | CSBase (required)
    The (annotated) data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
n_comps : int | None, default None
    Number of principal components to compute. Defaults to 50, or 1 - minimum dimension size of selected representation.
layer : str | None, default None
    If provided, which element of layers to use as expression values for PCA.
zero_center : bool, default True
    If True, compute (or approximate) PCA from the covariance matrix. If False, perform a truncated SVD instead of PCA.

    Our default PCA algorithms (see svd_solver) support implicit zero-centering, and can therefore operate efficiently on sparse data.
svd_solver : SvdSolver | None, default None
    SVD solver to use. See the table above for which solver class is used based on chunked and zero_center, as well as the default solver for each class when svd_solver=None.

    Efficient computation of the principal components of a sparse matrix currently only works with the 'arpack' or 'covariance_eigh' solver.

    None
        Choose automatically based on solver class (see table above).
    'arpack'
        ARPACK wrapper in SciPy (scipy.sparse.linalg.svds). Not available for dask arrays.
    'covariance_eigh'
        Classic eigendecomposition of the covariance matrix, suited for tall-and-skinny matrices. With dask, the array must be CSR or dense and chunked as (N, adata.shape[1]).
    'randomized'
        Randomized algorithm from [Halko et al., 2009]. For dask arrays, this will use dask.array.linalg.svd_compressed.
    'auto'
        Choose automatically depending on the size of the problem: will use 'full' for small shapes and 'randomized' for large shapes.
    'tsqr'
        "Tall-and-skinny QR" algorithm from [Benson et al., 2013]. Only available for dense dask arrays.

    Changed in version 1.9.3: default value changed from 'arpack' to None.
    Changed in version 1.4.5: default value changed from 'auto' to 'arpack'.
chunked : bool, default False
    If True, perform an incremental PCA on segments of chunk_size. Automatically zero centers and ignores settings of zero_center, random_seed and svd_solver. If False, perform a full PCA/truncated SVD (see svd_solver and zero_center). See the table above for which solver class is used.
chunk_size : int | None, default None
    Number of observations to include in each chunk. Required if chunked=True was passed.
random_state : _LegacyRandom, default 0
    Change to use different initial states for the optimization.
return_info : bool, default False
    Only relevant when not passing an AnnData: see "Returns".
dtype : DTypeLike, default 'float32'
    Numpy data type string to which to convert the result.
key_added : str | None, default None
    If not specified, the embedding is stored as .obsm['X_pca'], the loadings as .varm['PCs'], and the parameters in .uns['pca']. If specified, the embedding is stored as .obsm[key_added], the loadings as .varm[key_added], and the parameters in .uns[key_added].
copy : bool, default False
    If an AnnData is passed, determines whether a copy is returned. Is ignored otherwise.

Returns:

If data is array-like and return_info=False was passed, this function returns the PCA representation of data as an array of the same type as the input array. Otherwise, it returns None if copy=False, else an updated AnnData object. Sets the following fields:

.obsm['X_pca' | key_added] : scipy.sparse.csr_matrix | scipy.sparse.csc_matrix | numpy.ndarray (shape (adata.n_obs, n_comps))
    PCA representation of data.
.varm['PCs' | key_added] : numpy.ndarray (shape (adata.n_vars, n_comps))
    The principal components containing the loadings.
.uns['pca' | key_added]['variance_ratio'] : numpy.ndarray (shape (n_comps,))
    Ratio of explained variance.
.uns['pca' | key_added]['variance'] : numpy.ndarray (shape (n_comps,))
    Explained variance, equivalent to the eigenvalues of the covariance matrix.
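
A minimal sketch of calling this wrapper under a custom key. The toy matrix is an assumption for illustration; real vote matrices contain NaN and should be imputed (or run through recipe_polis) first:

import numpy as np
import anndata as ad
import valency_anndata as val

rng = np.random.default_rng(0)
adata = ad.AnnData(rng.choice([-1.0, 0.0, 1.0], size=(50, 20)))  # toy NaN-free vote-like matrix
val.tools.pca(adata, n_comps=5, key_added="X_pca_votes")
adata.obsm["X_pca_votes"].shape                  # (50, 5)
adata.uns["X_pca_votes"]["variance_ratio"]       # explained variance per component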

Source code in .venv/lib/python3.10/site-packages/scanpy/preprocessing/_pca/__init__.py
@_doc_params(
    mask_var_hvg=doc_mask_var_hvg,
)
def pca(  # noqa: PLR0912, PLR0913, PLR0915
    data: AnnData | np.ndarray | CSBase,
    n_comps: int | None = None,
    *,
    layer: str | None = None,
    zero_center: bool = True,
    svd_solver: SvdSolver | None = None,
    chunked: bool = False,
    chunk_size: int | None = None,
    random_state: _LegacyRandom = 0,
    return_info: bool = False,
    mask_var: NDArray[np.bool_] | str | None | Empty = _empty,
    use_highly_variable: bool | None = None,
    dtype: DTypeLike = "float32",
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | np.ndarray | CSBase | None:
    r"""Principal component analysis :cite:p:`Pedregosa2011`.

    Computes PCA coordinates, loadings and variance decomposition.
    Uses the following implementations (and defaults for `svd_solver`):

    .. list-table::
       :header-rows: 1
       :stub-columns: 1

       - -
         - :class:`~numpy.ndarray`, :class:`~scipy.sparse.spmatrix`, or :class:`~scipy.sparse.sparray`
         - :class:`dask.array.Array`
       - - `chunked=False`, `zero_center=True`
         - sklearn :class:`~sklearn.decomposition.PCA` (`'arpack'`)
         - - *dense*: dask-ml :class:`~dask_ml.decomposition.PCA`\ [#high-mem]_ (`'auto'`)
           - *sparse* or `svd_solver='covariance_eigh'`: custom implementation (`'covariance_eigh'`)
       - - `chunked=False`, `zero_center=False`
         - sklearn :class:`~sklearn.decomposition.TruncatedSVD` (`'randomized'`)
         - dask-ml :class:`~dask_ml.decomposition.TruncatedSVD`\ [#dense-only]_ (`'tsqr'`)
       - - `chunked=True` (`zero_center` ignored)
         - sklearn :class:`~sklearn.decomposition.IncrementalPCA` (`'auto'`)
         - dask-ml :class:`~dask_ml.decomposition.IncrementalPCA`\ [#densifies]_ (`'auto'`)

    .. [#high-mem] Consider `svd_solver='covariance_eigh'` to reduce memory usage (see :issue:`dask/dask-ml#985`).
    .. [#dense-only] This implementation can not handle sparse chunks, try manually densifying them.
    .. [#densifies] This implementation densifies sparse chunks and therefore has increased memory usage.

    Parameters
    ----------
    data
        The (annotated) data matrix of shape `n_obs` × `n_vars`.
        Rows correspond to cells and columns to genes.
    n_comps
        Number of principal components to compute. Defaults to 50,
        or 1 - minimum dimension size of selected representation.
    layer
        If provided, which element of layers to use for PCA.
    zero_center
        If `True`, compute (or approximate) PCA from covariance matrix.
        If `False`, performa a truncated SVD instead of PCA.

        Our default PCA algorithms (see `svd_solver`) support implicit zero-centering,
        and therefore efficiently operating on sparse data.
    svd_solver
        SVD solver to use.
        See table above to see which solver class is used based on `chunked` and `zero_center`,
        as well as the default solver for each class when `svd_solver=None`.

        Efficient computation of the principal components of a sparse matrix
        currently only works with the `'arpack`' or `'covariance_eigh`' solver.

        `None`
            Choose automatically based on solver class (see table above).
        `'arpack'`
            ARPACK wrapper in SciPy (:func:`~scipy.sparse.linalg.svds`).
            Not available for *dask* arrays.
        `'covariance_eigh'`
            Classic eigendecomposition of the covariance matrix, suited for tall-and-skinny matrices.
            With dask, array must be CSR or dense and chunked as `(N, adata.shape[1])`.
        `'randomized'`
            Randomized algorithm from :cite:t:`Halko2009`.
            For *dask* arrays, this will use :func:`~dask.array.linalg.svd_compressed`.
        `'auto'`
            Choose automatically depending on the size of the problem:
            Will use `'full'` for small shapes and `'randomized'` for large shapes.
        `'tsqr'`
            “tall-and-skinny QR” algorithm from :cite:t:`Benson2013`.
            Only available for dense *dask* arrays.

        .. versionchanged:: 1.9.3
           Default value changed from `'arpack'` to None.
        .. versionchanged:: 1.4.5
           Default value changed from `'auto'` to `'arpack'`.
    chunked
        If `True`, perform an incremental PCA on segments of `chunk_size`.
        Automatically zero centers and ignores settings of `zero_center`, `random_seed` and `svd_solver`.
        If `False`, perform a full PCA/truncated SVD (see `svd_solver` and `zero_center`).
        See table above for which solver class is used.
    chunk_size
        Number of observations to include in each chunk.
        Required if `chunked=True` was passed.
    random_state
        Change to use different initial states for the optimization.
    return_info
        Only relevant when not passing an :class:`~anndata.AnnData`:
        see “Returns”.
    {mask_var_hvg}
    layer
        Layer of `adata` to use as expression values.
    dtype
        Numpy data type string to which to convert the result.
    key_added
        If not specified, the embedding is stored as
        :attr:`~anndata.AnnData.obsm`\ `['X_pca']`, the loadings as
        :attr:`~anndata.AnnData.varm`\ `['PCs']`, and the the parameters in
        :attr:`~anndata.AnnData.uns`\ `['pca']`.
        If specified, the embedding is stored as
        :attr:`~anndata.AnnData.obsm`\ ``[key_added]``, the loadings as
        :attr:`~anndata.AnnData.varm`\ ``[key_added]``, and the the parameters in
        :attr:`~anndata.AnnData.uns`\ ``[key_added]``.
    copy
        If an :class:`~anndata.AnnData` is passed, determines whether a copy
        is returned. Is ignored otherwise.

    Returns
    -------
    If `data` is array-like and `return_info=False` was passed,
    this function returns the PCA representation of `data` as an
    array of the same type as the input array.

    Otherwise, it returns `None` if `copy=False`, else an updated `AnnData` object.
    Sets the following fields:

    `.obsm['X_pca' | key_added]` : :class:`~scipy.sparse.csr_matrix` | :class:`~scipy.sparse.csc_matrix` | :class:`~numpy.ndarray` (shape `(adata.n_obs, n_comps)`)
        PCA representation of data.
    `.varm['PCs' | key_added]` : :class:`~numpy.ndarray` (shape `(adata.n_vars, n_comps)`)
        The principal components containing the loadings.
    `.uns['pca' | key_added]['variance_ratio']` : :class:`~numpy.ndarray` (shape `(n_comps,)`)
        Ratio of explained variance.
    `.uns['pca' | key_added]['variance']` : :class:`~numpy.ndarray` (shape `(n_comps,)`)
        Explained variance, equivalent to the eigenvalues of the
        covariance matrix.

    """
    logg_start = logg.info("computing PCA")
    if layer is not None and chunked:
        # Current chunking implementation relies on pca being called on X
        msg = "Cannot use `layer` and `chunked` at the same time."
        raise NotImplementedError(msg)

    # chunked calculation is not randomized, anyways
    if svd_solver in {"auto", "randomized"} and not chunked:
        logg.info(
            "Note that scikit-learn's randomized PCA might not be exactly "
            "reproducible across different computational platforms. For exact "
            "reproducibility, choose `svd_solver='arpack'`."
        )
    if return_anndata := isinstance(data, AnnData):
        if layer is None and not chunked and is_backed_type(data.X):
            msg = f"PCA is not implemented for matrices of type {type(data.X)} with chunked as False"
            raise NotImplementedError(msg)
        adata = data.copy() if copy else data
    elif pkg_version("anndata") < Version("0.8.0rc1"):
        adata = AnnData(data, dtype=data.dtype)
    else:
        adata = AnnData(data)

    # Unify new mask argument and deprecated use_highly_varible argument
    mask_var_param, mask_var = _handle_mask_var(
        adata, mask_var, use_highly_variable=use_highly_variable
    )
    del use_highly_variable
    adata_comp = adata[:, mask_var] if mask_var is not None else adata

    if n_comps is None:
        min_dim = min(adata_comp.n_vars, adata_comp.n_obs)
        n_comps = min_dim - 1 if min_dim <= settings.N_PCS else settings.N_PCS

    logg.info(f"    with {n_comps=}")

    x = _get_obs_rep(adata_comp, layer=layer)
    if is_backed_type(x) and layer is not None:
        msg = f"PCA is not implemented for matrices of type {type(x)} from layers"
        raise NotImplementedError(msg)
    # See: https://github.com/scverse/scanpy/pull/2816#issuecomment-1932650529
    if (
        pkg_version("anndata") < Version("0.9")
        and mask_var is not None
        and isinstance(x, np.ndarray)
    ):
        warnings.warn(
            "When using a mask parameter with anndata<0.9 on a dense array, the PCA"
            "can have slightly different results due the array being column major "
            "instead of row major.",
            UserWarning,
            stacklevel=2,
        )

    # check_random_state returns a numpy RandomState when passed an int but
    # dask needs an int for random state
    if not isinstance(x, DaskArray):
        random_state = check_random_state(random_state)
    elif not isinstance(random_state, int):
        msg = f"random_state needs to be an int, not a {type(random_state).__name__} when passing a dask array"
        raise TypeError(msg)

    if chunked:
        if (
            not zero_center
            or random_state
            or (svd_solver is not None and svd_solver != "arpack")
        ):
            logg.debug("Ignoring zero_center, random_state, svd_solver")

        incremental_pca_kwargs = dict()
        if isinstance(x, DaskArray):
            from dask.array import zeros
            from dask_ml.decomposition import IncrementalPCA

            incremental_pca_kwargs["svd_solver"] = _handle_dask_ml_args(
                svd_solver, IncrementalPCA
            )
        else:
            from numpy import zeros
            from sklearn.decomposition import IncrementalPCA

        x_pca = zeros((x.shape[0], n_comps), x.dtype)

        pca_ = IncrementalPCA(n_components=n_comps, **incremental_pca_kwargs)

        for chunk, _, _ in adata_comp.chunked_X(chunk_size):
            chunk_dense = chunk.toarray() if isinstance(chunk, CSBase) else chunk
            pca_.partial_fit(chunk_dense)

        for chunk, start, end in adata_comp.chunked_X(chunk_size):
            chunk_dense = chunk.toarray() if isinstance(chunk, CSBase) else chunk
            x_pca[start:end] = pca_.transform(chunk_dense)
    elif zero_center:
        if isinstance(x, CSBase) and (
            pkg_version("scikit-learn") < Version("1.4") or svd_solver == "lobpcg"
        ):
            if svd_solver not in (
                {"lobpcg"} | get_literal_vals(SvdSolvPCASparseSklearn)
            ):
                if svd_solver is not None:
                    msg = (
                        f"Ignoring {svd_solver=} and using 'arpack', "
                        "sparse PCA with sklearn < 1.4 only supports 'lobpcg' and 'arpack'."
                    )
                    warnings.warn(msg, UserWarning, stacklevel=2)
                svd_solver = "arpack"
            elif svd_solver == "lobpcg":
                msg = (
                    f"{svd_solver=} for sparse relies on legacy code and will not be supported in the future. "
                    "Also the lobpcg solver has been observed to be inaccurate. Please use 'arpack' instead."
                )
                warnings.warn(msg, FutureWarning, stacklevel=2)
            x_pca, pca_ = _pca_compat_sparse(
                x, n_comps, solver=svd_solver, random_state=random_state
            )
        else:
            if not isinstance(x, DaskArray):
                from sklearn.decomposition import PCA

                svd_solver = _handle_sklearn_args(
                    svd_solver, PCA, sparse=isinstance(x, CSBase)
                )
                pca_ = PCA(
                    n_components=n_comps,
                    svd_solver=svd_solver,
                    random_state=random_state,
                )
            elif isinstance(x._meta, CSBase) or svd_solver == "covariance_eigh":
                from ._dask import PCAEighDask

                if random_state != 0:
                    msg = f"Ignoring {random_state=} when using a sparse dask array"
                    warnings.warn(msg, UserWarning, stacklevel=2)
                if svd_solver not in {None, "covariance_eigh"}:
                    msg = f"Ignoring {svd_solver=} when using a sparse dask array"
                    warnings.warn(msg, UserWarning, stacklevel=2)
                pca_ = PCAEighDask(n_components=n_comps)
            else:
                from dask_ml.decomposition import PCA

                svd_solver = _handle_dask_ml_args(svd_solver, PCA)
                pca_ = PCA(
                    n_components=n_comps,
                    svd_solver=svd_solver,
                    random_state=random_state,
                )
            x_pca = pca_.fit_transform(x)
    else:
        if isinstance(x, DaskArray):
            if isinstance(x._meta, CSBase):
                msg = (
                    "`zero_center=False` is not supported for sparse Dask arrays (yet). "
                    "See <https://github.com/dask/dask-ml/issues/123>."
                )
                raise TypeError(msg)
            from dask_ml.decomposition import TruncatedSVD

            svd_solver = _handle_dask_ml_args(svd_solver, TruncatedSVD)
        else:
            from sklearn.decomposition import TruncatedSVD

            svd_solver = _handle_sklearn_args(svd_solver, TruncatedSVD)

        logg.debug(
            "    without zero-centering: \n"
            "    the explained variance does not correspond to the exact statistical definition\n"
            "    the first component, e.g., might be heavily influenced by different means\n"
            "    the following components often resemble the exact PCA very closely"
        )
        pca_ = TruncatedSVD(
            n_components=n_comps, random_state=random_state, algorithm=svd_solver
        )
        x_pca = pca_.fit_transform(x)

    if x_pca.dtype.descr != np.dtype(dtype).descr:
        x_pca = x_pca.astype(dtype)

    if return_anndata:
        key_obsm, key_varm, key_uns = (
            ("X_pca", "PCs", "pca") if key_added is None else [key_added] * 3
        )
        adata.obsm[key_obsm] = x_pca

        if mask_var is not None:
            adata.varm[key_varm] = np.zeros(shape=(adata.n_vars, n_comps))
            adata.varm[key_varm][mask_var] = pca_.components_.T
        else:
            adata.varm[key_varm] = pca_.components_.T

        params = dict(
            zero_center=zero_center,
            use_highly_variable=mask_var_param == "highly_variable",
            mask_var=mask_var_param,
        )
        if layer is not None:
            params["layer"] = layer
        adata.uns[key_uns] = dict(
            params=params,
            variance=pca_.explained_variance_,
            variance_ratio=pca_.explained_variance_ratio_,
        )

        logg.info("    finished", time=logg_start)
        logg.debug(
            "and added\n"
            f"    {key_obsm!r}, the PCA coordinates (adata.obs)\n"
            f"    {key_varm!r}, the loadings (adata.varm)\n"
            f"    'pca_variance', the variance / eigenvalues (adata.uns[{key_uns!r}])\n"
            f"    'pca_variance_ratio', the variance ratio (adata.uns[{key_uns!r}])"
        )
        return adata if copy else None
    else:
        logg.info("    finished", time=logg_start)
        if return_info:
            return (
                x_pca,
                pca_.components_,
                pca_.explained_variance_ratio_,
                pca_.explained_variance_,
            )
        else:
            return x_pca

valency_anndata.tools.tsne

tsne(
    adata: AnnData,
    n_pcs: int | None = None,
    *,
    use_rep: str | None = None,
    perplexity: float = 30,
    metric: str = "euclidean",
    early_exaggeration: float = 12,
    learning_rate: float = 1000,
    random_state: _LegacyRandom = 0,
    use_fast_tsne: bool = False,
    n_jobs: int | None = None,
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | None

t-SNE [van der Maaten & Hinton, 2008; Amir et al., 2013; Pedregosa et al., 2011].

t-distributed stochastic neighborhood embedding (tSNE) [van der Maaten & Hinton, 2008] was proposed for visualizing single-cell data by [Amir et al., 2013]. Here, by default, we use the implementation of scikit-learn [Pedregosa et al., 2011]. You can achieve a huge speedup and better convergence if you install Multicore-tSNE [Ulyanov, 2016] (https://github.com/DmitryUlyanov/Multicore-TSNE), which will be automatically detected by Scanpy.

Parameters:

adata : AnnData (required)
    Annotated data matrix.
perplexity : float, default 30
    The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. The choice is not extremely critical since t-SNE is quite insensitive to this parameter.
metric : str, default 'euclidean'
    Distance metric to calculate neighbors on.
early_exaggeration : float, default 12
    Controls how tight natural clusters in the original space are in the embedded space and how much space will be between them. For larger values, the space between natural clusters will be larger in the embedded space. Again, the choice of this parameter is not very critical. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high.
learning_rate : float, default 1000
    Note that the R package "Rtsne" uses a default of 200. The learning rate can be a critical parameter. It should be between 100 and 1000. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high. If the cost function gets stuck in a bad local minimum, increasing the learning rate sometimes helps.
random_state : _LegacyRandom, default 0
    Change this to use different initial states for the optimization. If None, the initial state is not reproducible.
n_jobs : int | None, default None
    Number of jobs for parallel computation. None means using scanpy.settings.n_jobs.
key_added : str | None, default None
    If not specified, the embedding is stored as .obsm['X_tsne'] and the parameters in .uns['tsne']. If specified, the embedding is stored as .obsm[key_added] and the parameters in .uns[key_added].
copy : bool, default False
    Return a copy instead of writing to adata.

Returns:

Returns None if copy=False, else returns an AnnData object. Sets the following fields:

adata.obsm['X_tsne' | key_added] : numpy.ndarray (dtype float)
    tSNE coordinates of data.
adata.uns['tsne' | key_added] : dict
    tSNE parameters.
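
A minimal sketch on toy data (the matrix is an assumption for illustration); with real vote data, point use_rep at a NaN-free representation such as the one produced by recipe_polis:

import numpy as np
import anndata as ad
import valency_anndata as val

rng = np.random.default_rng(0)
adata = ad.AnnData(rng.choice([-1.0, 0.0, 1.0], size=(60, 15)))  # toy NaN-free matrix
val.tools.pca(adata, n_comps=10)                 # writes .obsm["X_pca"]
val.tools.tsne(adata, use_rep="X_pca", perplexity=15)
adata.obsm["X_tsne"].shape                       # (60, 2)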

Source code in .venv/lib/python3.10/site-packages/scanpy/tools/_tsne.py
@old_positionals(
    "use_rep",
    "perplexity",
    "early_exaggeration",
    "learning_rate",
    "random_state",
    "use_fast_tsne",
    "n_jobs",
    "copy",
)
@_doc_params(doc_n_pcs=doc_n_pcs, use_rep=doc_use_rep)
def tsne(  # noqa: PLR0913
    adata: AnnData,
    n_pcs: int | None = None,
    *,
    use_rep: str | None = None,
    perplexity: float = 30,
    metric: str = "euclidean",
    early_exaggeration: float = 12,
    learning_rate: float = 1000,
    random_state: _LegacyRandom = 0,
    use_fast_tsne: bool = False,
    n_jobs: int | None = None,
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | None:
    r"""t-SNE :cite:p:`vanDerMaaten2008,Amir2013,Pedregosa2011`.

    t-distributed stochastic neighborhood embedding (tSNE, :cite:t:`vanDerMaaten2008`) was
    proposed for visualizating single-cell data by :cite:t:`Amir2013`. Here, by default,
    we use the implementation of *scikit-learn* :cite:p:`Pedregosa2011`. You can achieve
    a huge speedup and better convergence if you install Multicore-tSNE_
    by :cite:t:`Ulyanov2016`, which will be automatically detected by Scanpy.

    .. _multicore-tsne: https://github.com/DmitryUlyanov/Multicore-TSNE

    Parameters
    ----------
    adata
        Annotated data matrix.
    {doc_n_pcs}
    {use_rep}
    perplexity
        The perplexity is related to the number of nearest neighbors that
        is used in other manifold learning algorithms. Larger datasets
        usually require a larger perplexity. Consider selecting a value
        between 5 and 50. The choice is not extremely critical since t-SNE
        is quite insensitive to this parameter.
    metric
        Distance metric calculate neighbors on.
    early_exaggeration
        Controls how tight natural clusters in the original space are in the
        embedded space and how much space will be between them. For larger
        values, the space between natural clusters will be larger in the
        embedded space. Again, the choice of this parameter is not very
        critical. If the cost function increases during initial optimization,
        the early exaggeration factor or the learning rate might be too high.
    learning_rate
        Note that the R-package "Rtsne" uses a default of 200.
        The learning rate can be a critical parameter. It should be
        between 100 and 1000. If the cost function increases during initial
        optimization, the early exaggeration factor or the learning rate
        might be too high. If the cost function gets stuck in a bad local
        minimum increasing the learning rate helps sometimes.
    random_state
        Change this to use different intial states for the optimization.
        If `None`, the initial state is not reproducible.
    n_jobs
        Number of jobs for parallel computation.
        `None` means using :attr:`scanpy.settings.n_jobs`.
    key_added
        If not specified, the embedding is stored as
        :attr:`~anndata.AnnData.obsm`\ `['X_tsne']` and the the parameters in
        :attr:`~anndata.AnnData.uns`\ `['tsne']`.
        If specified, the embedding is stored as
        :attr:`~anndata.AnnData.obsm`\ ``[key_added]`` and the the parameters in
        :attr:`~anndata.AnnData.uns`\ ``[key_added]``.
    copy
        Return a copy instead of writing to `adata`.

    Returns
    -------
    Returns `None` if `copy=False`, else returns an `AnnData` object. Sets the following fields:

    `adata.obsm['X_tsne' | key_added]` : :class:`numpy.ndarray` (dtype `float`)
        tSNE coordinates of data.
    `adata.uns['tsne' | key_added]` : :class:`dict`
        tSNE parameters.

    """
    start = logg.info("computing tSNE")
    adata = adata.copy() if copy else adata
    x = _choose_representation(adata, use_rep=use_rep, n_pcs=n_pcs)
    raise_not_implemented_error_if_backed_type(x, "tsne")
    # params for sklearn
    n_jobs = settings.n_jobs if n_jobs is None else n_jobs
    params_sklearn = dict(
        perplexity=perplexity,
        random_state=random_state,
        verbose=settings.verbosity > 3,
        early_exaggeration=early_exaggeration,
        learning_rate=learning_rate,
        n_jobs=n_jobs,
        metric=metric,
    )
    if metric != "euclidean" and (pkg_version("scikit-learn") < Version("1.3.0rc1")):
        params_sklearn["square_distances"] = True

    # Backwards compat handling: Remove in scanpy 1.9.0
    if n_jobs != 1 and not use_fast_tsne:
        warnings.warn(
            "In previous versions of scanpy, calling tsne with n_jobs > 1 would use "
            "MulticoreTSNE. Now this uses the scikit-learn version of TSNE by default. "
            "If you'd like the old behaviour (which is deprecated), pass "
            "'use_fast_tsne=True'. Note, MulticoreTSNE is not actually faster anymore.",
            UserWarning,
            stacklevel=2,
        )
    if use_fast_tsne:
        warnings.warn(
            "Argument `use_fast_tsne` is deprecated, and support for MulticoreTSNE "
            "will be dropped in a future version of scanpy.",
            FutureWarning,
            stacklevel=2,
        )

    # deal with different tSNE implementations
    if use_fast_tsne:
        try:
            from MulticoreTSNE import MulticoreTSNE as TSNE  # noqa: N814

            tsne = TSNE(**params_sklearn)
            logg.info("    using the 'MulticoreTSNE' package by Ulyanov (2017)")
            # need to transform to float64 for MulticoreTSNE...
            x_tsne = tsne.fit_transform(x.astype("float64"))
        except ImportError:
            use_fast_tsne = False
            warnings.warn(
                "Could not import 'MulticoreTSNE'. Falling back to scikit-learn.",
                UserWarning,
                stacklevel=2,
            )
    if use_fast_tsne is False:  # In case MultiCore failed to import
        from sklearn.manifold import TSNE

        # unfortunately, sklearn does not allow to set a minimum number
        # of iterations for barnes-hut tSNE
        tsne = TSNE(**params_sklearn)
        logg.info("    using sklearn.manifold.TSNE")
        x_tsne = tsne.fit_transform(x)

    # update AnnData instance
    params = dict(
        perplexity=perplexity,
        early_exaggeration=early_exaggeration,
        learning_rate=learning_rate,
        n_jobs=n_jobs,
        metric=metric,
        use_rep=use_rep,
    )
    key_uns, key_obsm = ("tsne", "X_tsne") if key_added is None else [key_added] * 2
    adata.obsm[key_obsm] = x_tsne  # annotate samples with tSNE coordinates
    adata.uns[key_uns] = dict(params={k: v for k, v in params.items() if v is not None})

    logg.info(
        "    finished",
        time=start,
        deep=(
            f"added\n"
            f"    {key_obsm!r}, tSNE coordinates (adata.obsm)\n"
            f"    {key_uns!r}, tSNE parameters (adata.uns)"
        ),
    )

    return adata if copy else None
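
A minimal usage sketch, not taken from the package's own examples: it assumes the function is exposed as val.tools.tsne and that the dataset and plotting helpers follow the recipe_polis conventions shown earlier.

import valency_anndata as val

adata = val.datasets.aufstehen()
val.tools.recipe_polis(adata)                  # provides the X_pca_polis representation
val.tools.tsne(adata, use_rep="X_pca_polis", perplexity=30)
val.viz.embedding(adata, basis="tsne", color="kmeans_polis")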

valency_anndata.tools.umap

umap(
    adata: AnnData,
    *,
    min_dist: float = 0.5,
    spread: float = 1.0,
    n_components: int = 2,
    maxiter: int | None = None,
    alpha: float = 1.0,
    gamma: float = 1.0,
    negative_sample_rate: int = 5,
    init_pos: _InitPos | ndarray | None = "spectral",
    random_state: _LegacyRandom = 0,
    a: float | None = None,
    b: float | None = None,
    method: Literal["umap", "rapids"] = "umap",
    key_added: str | None = None,
    neighbors_key: str = "neighbors",
    copy: bool = False,
) -> AnnData | None

Embed the neighborhood graph using UMAP [McInnes et al., 2018].

UMAP (Uniform Manifold Approximation and Projection) is a manifold learning technique suitable for visualizing high-dimensional data. Besides tending to be faster than tSNE, it optimizes the embedding such that it best reflects the topology of the data, which we represent throughout Scanpy using a neighborhood graph. tSNE, by contrast, optimizes the distribution of nearest-neighbor distances in the embedding such that these best match the distribution of distances in the high-dimensional space. We use the implementation of umap-learn [McInnes et al., 2018]. For comparisons of UMAP with tSNE, see [Becht et al., 2018].

umap-learn: https://github.com/lmcinnes/umap

Parameters:

Name Type Description Default
adata AnnData

Annotated data matrix.

required
min_dist float

The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out. The default in the umap-learn package is 0.1.

0.5
spread float

The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.

1.0
n_components int

The number of dimensions of the embedding.

2
maxiter int | None

The number of iterations (epochs) of the optimization. Called n_epochs in the original UMAP.

None
alpha float

The initial learning rate for the embedding optimization.

1.0
gamma float

Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.

1.0
negative_sample_rate int

The number of negative edge/1-simplex samples to use per positive edge/1-simplex sample in optimizing the low dimensional embedding.

5
init_pos _InitPos | ndarray | None

How to initialize the low dimensional embedding. Called init in the original UMAP. Options are:

  • Any key for adata.obsm.
  • 'paga': positions from scanpy.pl.paga.
  • 'spectral': use a spectral embedding of the graph.
  • 'random': assign initial embedding positions at random.
  • A numpy array of initial embedding positions.
'spectral'
random_state _LegacyRandom

If int, random_state is the seed used by the random number generator; If RandomState or Generator, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

0
a float | None

More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.

None
b float | None

More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.

None
method Literal['umap', 'rapids']

Chosen implementation.

  • 'umap': UMAP's simplicial set embedding.
  • 'rapids': GPU-accelerated implementation. Deprecated since 1.10.0; use rapids_singlecell.tl.umap instead.

'umap'
key_added str | None

If not specified, the embedding is stored as .obsm['X_umap'] and the parameters in .uns['umap']. If specified, the embedding is stored as .obsm[key_added] and the parameters in .uns[key_added].

None
neighbors_key str

Umap looks in .uns[neighbors_key] for neighbors settings and .obsp[.uns[neighbors_key]['connectivities_key']] for connectivities.

'neighbors'
copy bool

Return a copy instead of writing to adata.

False

Returns:

Type Description
Returns `None` if `copy=False`, else returns an `AnnData` object. Sets the following fields:
`adata.obsm['X_umap' | key_added]` : :class:`numpy.ndarray` (dtype `float`)

UMAP coordinates of data.

`adata.uns['umap' | key_added]` : :class:`dict`

UMAP parameters.
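
A minimal usage sketch, not taken from the package's own examples: the neighborhood graph is built here with scanpy.pp.neighbors on the Polis PCA representation, and the dataset and plotting helpers are assumed to follow the recipe_polis conventions.

import scanpy as sc
import valency_anndata as val

adata = val.datasets.aufstehen()
val.tools.recipe_polis(adata)                  # provides the X_pca_polis representation
sc.pp.neighbors(adata, use_rep="X_pca_polis")  # UMAP embeds this neighborhood graph
val.tools.umap(adata, min_dist=0.3)
val.viz.embedding(adata, basis="umap", color="kmeans_polis")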

Source code in .venv/lib/python3.10/site-packages/scanpy/tools/_umap.py
@old_positionals(
    "min_dist",
    "spread",
    "n_components",
    "maxiter",
    "alpha",
    "gamma",
    "negative_sample_rate",
    "init_pos",
    "random_state",
    "a",
    "b",
    "copy",
    "method",
    "neighbors_key",
)
def umap(  # noqa: PLR0913, PLR0915
    adata: AnnData,
    *,
    min_dist: float = 0.5,
    spread: float = 1.0,
    n_components: int = 2,
    maxiter: int | None = None,
    alpha: float = 1.0,
    gamma: float = 1.0,
    negative_sample_rate: int = 5,
    init_pos: _InitPos | np.ndarray | None = "spectral",
    random_state: _LegacyRandom = 0,
    a: float | None = None,
    b: float | None = None,
    method: Literal["umap", "rapids"] = "umap",
    key_added: str | None = None,
    neighbors_key: str = "neighbors",
    copy: bool = False,
) -> AnnData | None:
    r"""Embed the neighborhood graph using UMAP :cite:p:`McInnes2018`.

    UMAP (Uniform Manifold Approximation and Projection) is a manifold learning
    technique suitable for visualizing high-dimensional data. Besides tending to
    be faster than tSNE, it optimizes the embedding such that it best reflects
    the topology of the data, which we represent throughout Scanpy using a
    neighborhood graph. tSNE, by contrast, optimizes the distribution of
    nearest-neighbor distances in the embedding such that these best match the
    distribution of distances in the high-dimensional space.
    We use the implementation of umap-learn_ :cite:p:`McInnes2018`.
    For a few comparisons of UMAP with tSNE, see :cite:t:`Becht2018`.

    .. _umap-learn: https://github.com/lmcinnes/umap

    Parameters
    ----------
    adata
        Annotated data matrix.
    min_dist
        The effective minimum distance between embedded points. Smaller values
        will result in a more clustered/clumped embedding where nearby points on
        the manifold are drawn closer together, while larger values will result
        in a more even dispersal of points. The value should be set relative to
        the ``spread`` value, which determines the scale at which embedded
        points will be spread out. The default in the `umap-learn` package is
        0.1.
    spread
        The effective scale of embedded points. In combination with `min_dist`
        this determines how clustered/clumped the embedded points are.
    n_components
        The number of dimensions of the embedding.
    maxiter
        The number of iterations (epochs) of the optimization. Called `n_epochs`
        in the original UMAP.
    alpha
        The initial learning rate for the embedding optimization.
    gamma
        Weighting applied to negative samples in low dimensional embedding
        optimization. Values higher than one will result in greater weight
        being given to negative samples.
    negative_sample_rate
        The number of negative edge/1-simplex samples to use per positive
        edge/1-simplex sample in optimizing the low dimensional embedding.
    init_pos
        How to initialize the low dimensional embedding. Called `init` in the
        original UMAP. Options are:

        * Any key for `adata.obsm`.
        * 'paga': positions from :func:`~scanpy.pl.paga`.
        * 'spectral': use a spectral embedding of the graph.
        * 'random': assign initial embedding positions at random.
        * A numpy array of initial embedding positions.
    random_state
        If `int`, `random_state` is the seed used by the random number generator;
        If `RandomState` or `Generator`, `random_state` is the random number generator;
        If `None`, the random number generator is the `RandomState` instance used
        by `np.random`.
    a
        More specific parameters controlling the embedding. If `None` these
        values are set automatically as determined by `min_dist` and
        `spread`.
    b
        More specific parameters controlling the embedding. If `None` these
        values are set automatically as determined by `min_dist` and
        `spread`.
    method
        Chosen implementation.

        ``'umap'``
            Umap’s simplicial set embedding.
        ``'rapids'``
            GPU accelerated implementation.

            .. deprecated:: 1.10.0
                Use :func:`rapids_singlecell.tl.umap` instead.
    key_added
        If not specified, the embedding is stored as
        :attr:`~anndata.AnnData.obsm`\ `['X_umap']` and the parameters in
        :attr:`~anndata.AnnData.uns`\ `['umap']`.
        If specified, the embedding is stored as
        :attr:`~anndata.AnnData.obsm`\ ``[key_added]`` and the parameters in
        :attr:`~anndata.AnnData.uns`\ ``[key_added]``.
    neighbors_key
        Umap looks in
        :attr:`~anndata.AnnData.uns`\ ``[neighbors_key]`` for neighbors settings and
        :attr:`~anndata.AnnData.obsp`\ ``[.uns[neighbors_key]['connectivities_key']]`` for connectivities.
    copy
        Return a copy instead of writing to adata.

    Returns
    -------
    Returns `None` if `copy=False`, else returns an `AnnData` object. Sets the following fields:

    `adata.obsm['X_umap' | key_added]` : :class:`numpy.ndarray` (dtype `float`)
        UMAP coordinates of data.
    `adata.uns['umap' | key_added]` : :class:`dict`
        UMAP parameters.

    """
    adata = adata.copy() if copy else adata

    key_obsm, key_uns = ("X_umap", "umap") if key_added is None else [key_added] * 2

    if neighbors_key is None:  # backwards compat
        neighbors_key = "neighbors"
    if neighbors_key not in adata.uns:
        msg = f"Did not find .uns[{neighbors_key!r}]. Run `sc.pp.neighbors` first."
        raise ValueError(msg)

    start = logg.info("computing UMAP")

    neighbors = NeighborsView(adata, neighbors_key)

    if "params" not in neighbors or neighbors["params"]["method"] != "umap":
        logg.warning(
            f'.obsp["{neighbors["connectivities_key"]}"] have not been computed using umap'
        )

    with warnings.catch_warnings():
        # umap 0.5.0
        warnings.filterwarnings("ignore", message=r"Tensorflow not installed")
        import umap

    from umap.umap_ import find_ab_params, simplicial_set_embedding

    if a is None or b is None:
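        # Fit the a/b curve parameters from `spread` and `min_dist` using
        # umap-learn's helper, as umap.UMAP itself does when they are not given.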
        a, b = find_ab_params(spread, min_dist)
    adata.uns[key_uns] = dict(params=dict(a=a, b=b))
    if isinstance(init_pos, str) and init_pos in adata.obsm:
        init_coords = adata.obsm[init_pos]
    elif isinstance(init_pos, str) and init_pos == "paga":
        init_coords = get_init_pos_from_paga(
            adata, random_state=random_state, neighbors_key=neighbors_key
        )
    else:
        init_coords = init_pos  # Let umap handle it
    if hasattr(init_coords, "dtype"):
        init_coords = check_array(init_coords, dtype=np.float32, accept_sparse=False)

    if random_state != 0:
        adata.uns[key_uns]["params"]["random_state"] = random_state
    random_state = check_random_state(random_state)

    neigh_params = neighbors["params"]
    x = _choose_representation(
        adata,
        use_rep=neigh_params.get("use_rep", None),
        n_pcs=neigh_params.get("n_pcs", None),
        silent=True,
    )
    if method == "umap":
        # the data matrix X is really only used for determining the number of connected components
        # for the init condition in the UMAP embedding
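        # Default epoch count follows umap-learn's heuristic: 500 for graphs
        # with up to 10,000 samples, 200 for larger ones, unless `maxiter` is set.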
        default_epochs = 500 if neighbors["connectivities"].shape[0] <= 10000 else 200
        n_epochs = default_epochs if maxiter is None else maxiter
        x_umap, _ = simplicial_set_embedding(
            data=x,
            graph=neighbors["connectivities"].tocoo(),
            n_components=n_components,
            initial_alpha=alpha,
            a=a,
            b=b,
            gamma=gamma,
            negative_sample_rate=negative_sample_rate,
            n_epochs=n_epochs,
            init=init_coords,
            random_state=random_state,
            metric=neigh_params.get("metric", "euclidean"),
            metric_kwds=neigh_params.get("metric_kwds", {}),
            densmap=False,
            densmap_kwds={},
            output_dens=False,
            verbose=settings.verbosity > 3,
        )
    elif method == "rapids":
        msg = (
            "`method='rapids'` is deprecated. "
            "Use `rapids_singlecell.tl.louvain` instead."
        )
        warnings.warn(msg, FutureWarning, stacklevel=2)
        metric = neigh_params.get("metric", "euclidean")
        if metric != "euclidean":
            msg = (
                f"`sc.pp.neighbors` was called with `metric` {metric!r}, "
                "but umap `method` 'rapids' only supports the 'euclidean' metric."
            )
            raise ValueError(msg)
        from cuml import UMAP

        n_neighbors = neighbors["params"]["n_neighbors"]
        n_epochs = (
            500 if maxiter is None else maxiter
        )  # 0 is not a valid value for rapids, unlike original umap
        x_contiguous = np.ascontiguousarray(x, dtype=np.float32)
        umap = UMAP(
            n_neighbors=n_neighbors,
            n_components=n_components,
            n_epochs=n_epochs,
            learning_rate=alpha,
            init=init_pos,
            min_dist=min_dist,
            spread=spread,
            negative_sample_rate=negative_sample_rate,
            a=a,
            b=b,
            verbose=settings.verbosity > 3,
            random_state=random_state,
        )
        x_umap = umap.fit_transform(x_contiguous)
    adata.obsm[key_obsm] = x_umap  # annotate samples with UMAP coordinates
    logg.info(
        "    finished",
        time=start,
        deep=(
            "added\n"
            f"    {key_obsm!r}, UMAP coordinates (adata.obsm)\n"
            f"    {key_uns!r}, UMAP parameters (adata.uns)"
        ),
    )
    return adata if copy else None

valency_anndata.tools.leiden

leiden(
    adata: AnnData,
    resolution: float = 1,
    *,
    restrict_to: tuple[str, Sequence[str]] | None = None,
    random_state: _LegacyRandom = 0,
    key_added: str = "leiden",
    adjacency: CSBase | None = None,
    directed: bool | None = None,
    use_weights: bool = True,
    n_iterations: int = -1,
    partition_type: type[MutableVertexPartition]
    | None = None,
    neighbors_key: str | None = None,
    obsp: str | None = None,
    copy: bool = False,
    flavor: Literal["leidenalg", "igraph"] = "leidenalg",
    **clustering_args,
) -> AnnData | None

Cluster cells into subgroups [Traag et al., 2019].

Cluster cells using the Leiden algorithm [Traag et al., 2019], an improved version of the Louvain algorithm [Blondel et al., 2008]. It was proposed for single-cell analysis by [Levine et al., 2015].

This requires having run scanpy.pp.neighbors or scanpy.external.pp.bbknn first.

Parameters:

Name Type Description Default
adata AnnData

The annotated data matrix.

required
resolution float

A parameter value controlling the coarseness of the clustering. Higher values lead to more clusters. Set to None if overriding partition_type to one that doesn’t accept a resolution_parameter.

1
random_state _LegacyRandom

Change the initialization of the optimization.

0
restrict_to tuple[str, Sequence[str]] | None

Restrict the clustering to the categories within the key for sample annotation, tuple needs to contain (obs_key, list_of_categories).

None
key_added str

adata.obs key under which to add the cluster labels.

'leiden'
adjacency CSBase | None

Sparse adjacency matrix of the graph, defaults to neighbors connectivities.

None
directed bool | None

Whether to treat the graph as directed or undirected.

None
use_weights bool

If True, edge weights from the graph are used in the computation (placing more emphasis on stronger edges).

True
n_iterations int

How many iterations of the Leiden clustering algorithm to perform. Positive values above 2 define the total number of iterations to perform, -1 has the algorithm run until it reaches its optimal clustering. 2 is faster and the default for underlying packages.

-1
partition_type type[MutableVertexPartition] | None

Type of partition to use. Defaults to leidenalg.RBConfigurationVertexPartition. For the available options, consult the documentation for leidenalg.find_partition.

None
neighbors_key str | None

Use neighbors connectivities as adjacency. If not specified, leiden looks at .obsp['connectivities'] for connectivities (default storage place for pp.neighbors). If specified, leiden looks at .obsp[.uns[neighbors_key]['connectivities_key']] for connectivities.

None
obsp str | None

Use .obsp[obsp] as adjacency. You can't specify both obsp and neighbors_key at the same time.

None
copy bool

Whether to copy adata or modify it inplace.

False
flavor Literal['leidenalg', 'igraph']

Which package's implementation to use.

'leidenalg'
**clustering_args

Any further arguments to pass to leidenalg.find_partition (which in turn passes arguments to the partition_type) or igraph.Graph.community_leiden from igraph.

{}

Returns:

Type Description
Returns `None` if `copy=False`, else returns an `AnnData` object. Sets the following fields:
`adata.obs['leiden' | key_added]` : :class:`pandas.Series` (dtype ``category``)

Array of dim (number of samples) that stores the subgroup id ('0', '1', ...) for each cell.

`adata.uns['leiden' | key_added]['params']` : :class:`dict`

A dict with the values for the parameters resolution, random_state, and n_iterations.
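
A minimal usage sketch, not taken from the package's own examples: the required neighbors step is assumed to be run with scanpy.pp.neighbors, and the dataset and plotting helpers follow the recipe_polis conventions.

import scanpy as sc
import valency_anndata as val

adata = val.datasets.aufstehen()
val.tools.recipe_polis(adata)                  # provides the X_pca_polis representation
sc.pp.neighbors(adata, use_rep="X_pca_polis")  # Leiden clusters this kNN graph
val.tools.leiden(adata, resolution=1.0, flavor="igraph", n_iterations=2)
val.viz.embedding(adata, basis="pca_polis", color="leiden")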

Source code in .venv/lib/python3.10/site-packages/scanpy/tools/_leiden.py
def leiden(  # noqa: PLR0912, PLR0913, PLR0915
    adata: AnnData,
    resolution: float = 1,
    *,
    restrict_to: tuple[str, Sequence[str]] | None = None,
    random_state: _LegacyRandom = 0,
    key_added: str = "leiden",
    adjacency: CSBase | None = None,
    directed: bool | None = None,
    use_weights: bool = True,
    n_iterations: int = -1,
    partition_type: type[MutableVertexPartition] | None = None,
    neighbors_key: str | None = None,
    obsp: str | None = None,
    copy: bool = False,
    flavor: Literal["leidenalg", "igraph"] = "leidenalg",
    **clustering_args,
) -> AnnData | None:
    """Cluster cells into subgroups :cite:p:`Traag2019`.

    Cluster cells using the Leiden algorithm :cite:p:`Traag2019`,
    an improved version of the Louvain algorithm :cite:p:`Blondel2008`.
    It was proposed for single-cell analysis by :cite:t:`Levine2015`.

    This requires having run :func:`~scanpy.pp.neighbors` or
    :func:`~scanpy.external.pp.bbknn` first.

    Parameters
    ----------
    adata
        The annotated data matrix.
    resolution
        A parameter value controlling the coarseness of the clustering.
        Higher values lead to more clusters.
        Set to `None` if overriding `partition_type`
        to one that doesn’t accept a `resolution_parameter`.
    random_state
        Change the initialization of the optimization.
    restrict_to
        Restrict the clustering to the categories within the key for sample
        annotation, tuple needs to contain `(obs_key, list_of_categories)`.
    key_added
        `adata.obs` key under which to add the cluster labels.
    adjacency
        Sparse adjacency matrix of the graph, defaults to neighbors connectivities.
    directed
        Whether to treat the graph as directed or undirected.
    use_weights
        If `True`, edge weights from the graph are used in the computation
        (placing more emphasis on stronger edges).
    n_iterations
        How many iterations of the Leiden clustering algorithm to perform.
        Positive values above 2 define the total number of iterations to perform,
        -1 has the algorithm run until it reaches its optimal clustering.
        2 is faster and the default for underlying packages.
    partition_type
        Type of partition to use.
        Defaults to :class:`~leidenalg.RBConfigurationVertexPartition`.
        For the available options, consult the documentation for
        :func:`~leidenalg.find_partition`.
    neighbors_key
        Use neighbors connectivities as adjacency.
        If not specified, leiden looks at .obsp['connectivities'] for connectivities
        (default storage place for pp.neighbors).
        If specified, leiden looks at
        .obsp[.uns[neighbors_key]['connectivities_key']] for connectivities.
    obsp
        Use .obsp[obsp] as adjacency. You can't specify both
        `obsp` and `neighbors_key` at the same time.
    copy
        Whether to copy `adata` or modify it inplace.
    flavor
        Which package's implementation to use.
    **clustering_args
        Any further arguments to pass to :func:`~leidenalg.find_partition` (which in turn passes arguments to the `partition_type`)
        or :meth:`igraph.Graph.community_leiden` from `igraph`.

    Returns
    -------
    Returns `None` if `copy=False`, else returns an `AnnData` object. Sets the following fields:

    `adata.obs['leiden' | key_added]` : :class:`pandas.Series` (dtype ``category``)
        Array of dim (number of samples) that stores the subgroup id
        (``'0'``, ``'1'``, ...) for each cell.

    `adata.uns['leiden' | key_added]['params']` : :class:`dict`
        A dict with the values for the parameters `resolution`, `random_state`,
        and `n_iterations`.

    """
    if flavor not in {"igraph", "leidenalg"}:
        msg = (
            f"flavor must be either 'igraph' or 'leidenalg', but {flavor!r} was passed"
        )
        raise ValueError(msg)
    _utils.ensure_igraph()
    if flavor == "igraph":
        if directed:
            msg = "Cannot use igraph’s leiden implementation with a directed graph."
            raise ValueError(msg)
        if partition_type is not None:
            msg = "Do not pass in partition_type argument when using igraph."
            raise ValueError(msg)
    else:
        try:
            import leidenalg

            msg = 'In the future, the default backend for leiden will be igraph instead of leidenalg.\n\n To achieve the future defaults please pass: flavor="igraph" and n_iterations=2.  directed must also be False to work with igraph\'s implementation.'
            _utils.warn_once(msg, FutureWarning, stacklevel=3)
        except ImportError as e:
            msg = "Please install the leiden algorithm: `conda install -c conda-forge leidenalg` or `pip3 install leidenalg`."
            raise ImportError(msg) from e
    clustering_args = dict(clustering_args)

    start = logg.info("running Leiden clustering")
    adata = adata.copy() if copy else adata
    # are we clustering a user-provided graph or the default AnnData one?
    if adjacency is None:
        adjacency = _utils._choose_graph(adata, obsp, neighbors_key)
    if restrict_to is not None:
        restrict_key, restrict_categories = restrict_to
        adjacency, restrict_indices = restrict_adjacency(
            adata,
            restrict_key,
            restrict_categories=restrict_categories,
            adjacency=adjacency,
        )
    # Prepare find_partition arguments as a dictionary,
    # appending to whatever the user provided. It needs to be this way
    # as this allows for the accounting of a None resolution
    # (in the case of a partition variant that doesn't take it on input)
    clustering_args["n_iterations"] = n_iterations
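    # The two backends spell the resolution keyword differently:
    # leidenalg expects `resolution_parameter`, igraph expects `resolution`.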
    if flavor == "leidenalg":
        if resolution is not None:
            clustering_args["resolution_parameter"] = resolution
        directed = True if directed is None else directed
        g = _utils.get_igraph_from_adjacency(adjacency, directed=directed)
        if partition_type is None:
            partition_type = leidenalg.RBConfigurationVertexPartition
        if use_weights:
            clustering_args["weights"] = np.array(g.es["weight"]).astype(np.float64)
        clustering_args["seed"] = random_state
        part = leidenalg.find_partition(g, partition_type, **clustering_args)
    else:
        g = _utils.get_igraph_from_adjacency(adjacency, directed=False)
        if use_weights:
            clustering_args["weights"] = "weight"
        if resolution is not None:
            clustering_args["resolution"] = resolution
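        # igraph's community_leiden supports several objectives; default to
        # modularity unless the caller overrides it via **clustering_args.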
        clustering_args.setdefault("objective_function", "modularity")
        with set_igraph_random_state(random_state):
            part = g.community_leiden(**clustering_args)
    # store output into adata.obs
    groups = np.array(part.membership)
    if restrict_to is not None:
        if key_added == "leiden":
            key_added += "_R"
        groups = rename_groups(
            adata,
            key_added=key_added,
            restrict_key=restrict_key,
            restrict_categories=restrict_categories,
            restrict_indices=restrict_indices,
            groups=groups,
        )
    adata.obs[key_added] = pd.Categorical(
        values=groups.astype("U"),
        categories=natsorted(map(str, np.unique(groups))),
    )
    # store information on the clustering parameters
    adata.uns[key_added] = {}
    adata.uns[key_added]["params"] = dict(
        resolution=resolution,
        random_state=random_state,
        n_iterations=n_iterations,
    )
    logg.info(
        "    finished",
        time=start,
        deep=(
            f"found {len(np.unique(groups))} clusters and added\n"
            f"    {key_added!r}, the cluster labels (adata.obs, categorical)"
        ),
    )
    return adata if copy else None