Tools

valency-anndata methods

valency_anndata.tools.recipe_polis

recipe_polis(
    adata: AnnData,
    *,
    participant_vote_threshold: int = 7,
    key_added_pca: str = "X_pca_polis",
    key_added_kmeans: str = "kmeans_polis",
    mask_var: str | None = None,
    inplace: bool = True,
)

Projects and clusters participants as in [Small et al., 2021].

Expects sparse vote matrix .X with {+1, 0, -1} and NaN values.
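
A minimal sketch of that encoding, using only numpy and anndata (illustrative values; real conversations typically come from the val.datasets loaders and also carry the statement metadata used by step 1 of the recipe below):

import numpy as np
import anndata as ad

# Participants × statements: +1 agree, -1 disagree, 0 pass, NaN = not voted.
votes = np.array([
    [ 1.0,   0.0,  np.nan, -1.0],
    [np.nan, 1.0,   1.0,    0.0],
    [-1.0,  np.nan, 1.0,    1.0],
])
adata = ad.AnnData(votes)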

Recipe Steps
  1. Masks out meta and moderated-out statements with zeros.
  2. Imputes missing matrix votes with statement-wise means (sketched below).
  3. Runs standard PCA on the imputed matrix.
  4. Runs sparsity-aware scaling on PCA projections.
  5. Calculates a participant mask using the participant_vote_threshold (7 votes by default).
  6. On unmasked rows, calculates k-means clustering for 2 ≤ k ≤ 5, selecting the optimal k via silhouette scores.
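
For intuition on step 2, here is a minimal numpy-only sketch of statement-wise mean imputation (not the library's implementation):

import numpy as np

# Toy vote matrix: rows are participants, columns are statements; NaN = missing vote.
votes = np.array([
    [ 1.0,   np.nan, -1.0],
    [ 1.0,    0.0,   np.nan],
    [np.nan, -1.0,    1.0],
])
statement_means = np.nanmean(votes, axis=0)                 # mean of observed votes per statement
imputed = np.where(np.isnan(votes), statement_means, votes) # fill missing votes with those means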

Parameters:

participant_vote_threshold : int, default 7
    Vote threshold at which each participant will be included in clustering.
key_added_pca : str, default 'X_pca_polis'
    If not specified, the PCA embedding is stored as .obsm['X_pca_polis'], the loadings as .varm['X_pca_polis'], and the PCA parameters in .uns['X_pca_polis']. If specified, all are stored instead at [key_added_pca].
key_added_kmeans : str, default 'kmeans_polis'
    .obs key under which to add the cluster labels.
mask_var : str | None, default None
    Column name in adata.var to use for masking statements before PCA. If provided, only statements where mask_var is True will be used. If None, uses all statements.
inplace : bool, default True
    Perform the computation in place or return the result.

Returns:

.obsm['X_pca_polis' | key_added_pca]
    PCA representation of data.
.varm['X_pca_polis' | key_added_pca]
    The principal components containing the loadings.
.uns['X_pca_polis' | key_added_pca]['variance_ratio']
    Ratio of explained variance.
.uns['X_pca_polis' | key_added_pca]['variance']
    Explained variance, equivalent to the eigenvalues of the covariance matrix.
.obs['kmeans_polis' | key_added_kmeans]
    Array of dim (number of samples) that stores the subgroup id ('0', '1', …) for each participant.
.uns['kmeans_polis' | key_added_kmeans]['params']
    A dict with the values for the k-means parameters.

Examples:

Basic usage:

import valency_anndata as val
adata = val.datasets.aufstehen()
val.tools.recipe_polis(adata)
val.viz.embedding(adata, basis="pca_polis", color="kmeans_polis")

Use with highly variable statement filtering:

import valency_anndata as val
adata = val.datasets.aufstehen()
# First identify highly variable statements
val.preprocessing.highly_variable_statements(adata, n_top_statements=100)
# Run Polis recipe using only highly variable statements for PCA
val.tools.recipe_polis(adata, mask_var="highly_variable")
# Visualize the results
val.viz.embedding(adata, basis="pca_polis", color="kmeans_polis")
Source code in src/valency_anndata/tools/_polis.py
def recipe_polis(
    adata: AnnData,
    *,
    participant_vote_threshold: int = 7,
    key_added_pca: str = "X_pca_polis",
    key_added_kmeans: str = "kmeans_polis",
    mask_var: str | None = None,
    inplace: bool = True,
):
    """
    Projects and clusters participants as of [[Small _et al._,
    2021](http://dx.doi.org/10.6035/recerca.5516)].

    Expects sparse vote matrix [`.X`][anndata.AnnData.X] with `{+1, 0, -1}`
    and `NaN` values.

    Recipe Steps
    ------------

    1. Masks out meta and moderated-out statements with zeros.
    2. Imputes missing matrix votes with statement-wise means.
    3. Runs standard PCA on the imputed matrix.
    4. Runs sparsity-aware scaling on PCA projections.
    5. Calculates a participant mask using 7-vote threshold.
    6. On unmasked rows, calculates k-means clustering for 2 ≤ k ≤ 5,
       selecting the optimal k via silhouette scores.

    Parameters
    ----------

    participant_vote_threshold
        Vote threshold at which each participant will be included in clustering.
    key_added_pca
        If not specified, the PCA embedding is stored as
        [`.obsm`][anndata.AnnData.obsm]`['X_pca_polis']`, the loadings as
        [`.varm`][anndata.AnnData.varm]`['X_pca_polis']`, and the PCA parameters
        in [`.uns`][anndata.AnnData.uns]`['X_pca_polis']`.
        If specified, all are stored instead at `[key_added_pca]`.
    key_added_kmeans
        [`.obs`][anndata.AnnData.obs] key under which to add the cluster labels.
    mask_var
        Column name in `adata.var` to use for masking statements before PCA.
        If provided, only statements where `mask_var` is True will be used.
        If None, uses all statements.
    inplace
        Perform computation inplace or return result.

    Returns
    -------

    .obsm['X_pca_polis' | key_added]
        PCA representation of data.
    .varm['X_pca_polis' | key_added]
        The principal components containing the loadings.
    .uns['X_pca_polis' | key_added]['variance_ratio']
        Ratio of explained variance.
    .uns['X_pca_polis' | key_added]['variance']
        Explained variance, equivalent to the eigenvalues of the covariance matrix.
    .obs['kmeans_polis' | key_added]
        Array of dim (number of samples) that stores the subgroup id ('0', '1', …) for each cell.
    .uns['kmeans_polis' | key_added]['params']
        A dict with the values for the k-means parameters.

    Examples
    --------
    Basic usage:

    ```py
    import valency_anndata as val
    adata = val.datasets.aufstehen()
    val.tools.recipe_polis(adata)
    val.viz.embedding(adata, basis="pca_polis", color="kmeans_polis")
    ```

    Use with highly variable statement filtering:

    ```py
    import valency_anndata as val
    adata = val.datasets.aufstehen()
    # First identify highly variable statements
    val.preprocessing.highly_variable_statements(adata, n_top_statements=100)
    # Run Polis recipe using only highly variable statements for PCA
    val.tools.recipe_polis(adata, mask_var="highly_variable")
    # Visualize the results
    val.viz.embedding(adata, basis="pca_polis", color="kmeans_polis")
    ```

    """
    if not inplace:
        adata = adata.copy()

    # Preconditions
    assert isinstance(adata.X, np.ndarray)

    # 1. Mask statements with zeros
    _zero_mask(
        adata,
        key_added_var_mask="zero_mask",
        key_added_layer="X_masked",
    )

    # 2. Impute
    val.preprocessing.impute(
        adata,
        strategy="mean",
        source_layer="X_masked",
        target_layer="X_masked_imputed_mean",
    )

    # 3. PCA (unscaled)
    pca_kwargs = {
        "layer": "X_masked_imputed_mean",
        "key_added": "X_pca_masked_unscaled",
    }
    if mask_var is None:
        # Explicitly disable highly_variable filtering so PCA doesn't silently
        # filter statements in ways the polis recipe is not expecting.
        pca_kwargs["use_highly_variable"] = False
    else:
        pca_kwargs["mask_var"] = mask_var

    val.tools.pca(adata, **pca_kwargs)

    # 4. Scale PCA using sparsity data
    _sparsity_aware_scaling(
        adata,
        use_rep="X_pca_masked_unscaled",
        key_added=key_added_pca,
    )

    # Create cluster mask for threshold
    _cluster_mask(
        adata,
        participant_vote_threshold=participant_vote_threshold,
        key_added_obs_mask="cluster_mask",
    )

    # 5. KMeans clustering
    val.tools.kmeans(
        adata,
        use_rep=key_added_pca,
        # Force kmeans to only run on first two principal components.
        n_pcs=2,
        k_bounds=(2, 5),
        init="polis",
        mask_obs="cluster_mask",
        key_added=key_added_kmeans,
        inplace=inplace,
    )

    if not inplace:
        return adata

valency_anndata.tools.recipe_polis2_statements

recipe_polis2_statements(
    adata: AnnData,
    *,
    show_progress: bool = False,
    inplace: bool = True,
) -> AnnData | None

Embed and cluster statements (the var axis) using the Polis v2 pipeline.

Reads free-text statement content from .var["content"], produces dense embeddings, projects them to 2-D with UMAP, and attaches a hierarchy of cluster labels — all stored on the var axis so that the results live alongside the statements that produced them.

Requires the optional polis2 dependency group:

    pip install valency-anndata[polis2]

Recipe Steps
  1. Embeds each statement's text into a high-dimensional vector space and stores the result in .varm["content_embedding"].
  2. Projects the embeddings to 2-D with UMAP and stores the coordinates in .varm["content_umap"].
  3. Builds a hierarchy of clustering layers (finest → coarsest) and stores them in .varm["evoc_polis2"] (shape n_var × num_layers) with the coarsest layer also surfaced as the categorical column .var["evoc_polis2_top"] (see the sketch below).
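
Once the recipe has run (as in the examples below), the layer stacking can be inspected directly; a minimal sketch using the keys documented above:

import numpy as np

layers = np.asarray(adata.varm["evoc_polis2"])   # shape (n_var, num_layers)
finest = layers[:, 0]                            # bottom layer: finest clusters
coarsest = layers[:, -1]                         # top layer, also stored in .var["evoc_polis2_top"]
n_noise = int((coarsest == -1).sum())            # -1 marks noise/unassigned statements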

Parameters:

adata : AnnData (required)
    AnnData object whose .var["content"] column contains the statement text strings.
show_progress : bool, default False
    Show embedding progress bar. When False (the default), warnings and progress output from the model-loading libraries are also suppressed.
inplace : bool, default True
    If True (default), mutate adata and return None. If False, operate on a copy and return it.

Returns:

Depending on inplace, returns None or the modified AnnData.

.varm['content_embedding']
    Dense text embeddings, shape (n_var, embed_dim).
.varm['content_umap']
    2-D UMAP projection of the embeddings, shape (n_var, 2).
.varm['evoc_polis2']
    Stacked layers of clustering labels, shape (n_var, num_layers). Column 0 is the finest/bottom; column -1 is the coarsest/top. -1 = noise.
.var['evoc_polis2_top']
    Categorical column taken from the coarsest clustering layer (i.e. evoc_polis2[:, -1]).

Examples:

adata = val.datasets.polis.chile_protests(translate_to="en")

with val.viz.schematic_diagram(diff_from=adata):
    val.tools.recipe_polis2_statements(adata)

val.viz.embedding(
    # Transpose .var and .obs axes for plotting
    adata.transpose(),
    basis="content_umap",
    color=["evoc_polis2_top", "moderation_state"],
)

Source code in src/valency_anndata/tools/_polis2.py
def recipe_polis2_statements(adata: AnnData, *, show_progress: bool = False, inplace: bool = True) -> AnnData | None:
    """Embed and cluster **statements** (the var axis) using the Polis v2 pipeline.

    Reads free-text statement content from ``.var["content"]``, produces
    dense embeddings, projects them to 2-D with UMAP, and attaches a
    hierarchy of cluster labels — all stored on the **var** axis so that
    the results live alongside the statements that produced them.

    Requires the optional ``polis2`` dependency group::

        pip install valency-anndata[polis2]

    Recipe Steps
    ------------

    1. Embeds each statement's text into a high-dimensional vector space
       and stores the result in ``.varm["content_embedding"]``.
    2. Projects the embeddings to 2-D with UMAP and stores the coordinates
       in ``.varm["content_umap"]``.
    3. Builds a hierarchy of clustering layers (finest → coarsest) and
       stores them in ``.varm["evoc_polis2"]`` (shape ``n_var × num_layers``)
       with the coarsest layer also surfaced as the categorical column
       ``.var["evoc_polis2_top"]``.

    Parameters
    ----------
    adata :
        AnnData object whose ``.var["content"]`` column contains the
        statement text strings.
    show_progress :
        Show embedding progress bar.  When ``False`` (the default),
        warnings and progress output from the model-loading libraries
        are also suppressed.
    inplace :
        If ``True`` (default), mutate *adata* and return ``None``.
        If ``False``, operate on a copy and return it.

    Returns
    -------
    Depending on *inplace*, returns ``None`` or the modified ``AnnData``.

    .varm['content_embedding']
        Dense text embeddings, shape ``(n_var, embed_dim)``.
    .varm['content_umap']
        2-D UMAP projection of the embeddings, shape ``(n_var, 2)``.
    .varm['evoc_polis2']
        Stacked layers of clustering labels, shape ``(n_var, num_layers)``.
        Column 0 is the finest/bottom; column -1 is the coarsest/top.  ``-1`` = noise.
    .var['evoc_polis2_top']
        Categorical column taken from the coarsest clustering layer
        (i.e. ``evoc_polis2[:, -1]``).

    Examples
    --------

    ```py
    adata = val.datasets.polis.chile_protests(translate_to="en")

    with val.viz.schematic_diagram(diff_from=adata):
        val.tools.recipe_polis2_statements(adata)
    ```

    <img src="../../assets/documentation-examples/tools--polis2--schematic.png">

    ```py
    val.viz.embedding(
        # Transpose .var and .obs axes for plotting
        adata.transpose(),
        basis="content_umap",
        color=["evoc_polis2_top", "moderation_state"],
    )
    ```

    <img src="../../assets/documentation-examples/tools--polis2--plot.png">
    """
    if not inplace:
        adata = adata.copy()

    texts = adata.var["content"].tolist()

    # Suppress noisy warnings / loggers from HF Hub, sentence-transformers
    # and umap during model loading, unless the caller opted into progress.
    @contextmanager
    def _no_op() -> Generator[None, None, None]:
        yield

    ctx = _quiet() if not show_progress else _no_op()
    with ctx:
        adata.varm["content_embedding"] = _embed_statements(texts, show_progress=show_progress)
        content_embedding = np.asarray(adata.varm["content_embedding"])
        adata.varm["content_umap"] = _project_umap(content_embedding)
        cluster_layers = _create_cluster_layers(content_embedding)

    adata.varm["evoc_polis2"] = np.array(cluster_layers).T
    adata.var["evoc_polis2_top"] = adata.varm["evoc_polis2"][:, -1]
    adata.var["evoc_polis2_top"] = (
        adata.var["evoc_polis2_top"]
        # -1 = noise/unassigned; convert to NA so scanpy renders as lightgray.
        .where(adata.var["evoc_polis2_top"] != -1)
        # Nullable int so NAs survive; category for discrete colormap.
        .astype("Int64")
        .astype("category")
    )

    if not inplace:
        return adata

valency_anndata.tools.kmeans

kmeans(
    adata: AnnData,
    use_rep: Optional[str] = None,
    n_pcs: Optional[int] = None,
    k_bounds: Optional[Tuple[int, int]] = None,
    init: Literal[
        "k-means++", "random", "polis"
    ] = "k-means++",
    init_centers: Optional[ndarray] = None,
    random_state: Optional[int] = None,
    mask_obs: NDArray[bool_] | str | None = None,
    key_added: str = "kmeans",
    inplace: bool = True,
) -> AnnData | None

Apply BestPolisKMeans clustering to an AnnData object.

Parameters:

adata : AnnData (required)
    Input data. Must have .X as a numpy array.
use_rep : Optional[str], default None
    Representation to use for clustering. If None, use 'X_pca' if present in adata.obsm, otherwise fall back to adata.X.
n_pcs : Optional[int], default None
    Number of dimensions to use from the selected representation. If given, only the first n_pcs columns are used.
k_bounds : Optional[Tuple[int, int]], default None
    Minimum and maximum number of clusters to try. Defaults to [2, 5].
init : Literal['k-means++', 'random', 'polis'], default 'k-means++'
    Initialization method for KMeans.
init_centers : Optional[ndarray], default None
    Initial cluster centers to use.
random_state : Optional[int], default None
    Random seed for reproducibility.
mask_obs : NDArray[bool_] | str | None, default None
    Restrict clustering to a certain set of observations. The mask is specified as a boolean array or a string referring to an array in anndata.AnnData.obs.
key_added : str, default 'kmeans'
    Name of the column to store cluster labels in adata.obs.
inplace : bool, default True
    If True, modify adata in place and return None. If False, return a copy with the clustering added.

Returns:

AnnData or None
    Returns a copy if inplace=False, otherwise modifies in place.
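
A minimal usage sketch, reusing the aufstehen dataset from the recipe_polis examples above together with the PCA representation and participant mask that recipe writes (key names are the recipe defaults documented above):

import valency_anndata as val

adata = val.datasets.aufstehen()
val.tools.recipe_polis(adata)                    # writes .obsm["X_pca_polis"] and .obs["cluster_mask"]
val.tools.kmeans(
    adata,
    use_rep="X_pca_polis",
    n_pcs=2,
    k_bounds=(2, 5),
    init="polis",
    mask_obs="cluster_mask",                     # participant mask written by recipe_polis
    key_added="kmeans_custom",
)
adata.obs["kmeans_custom"].value_counts()        # cluster sizes
adata.uns["kmeans_custom"]["params"]["best_k"]   # selected number of clusters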

Source code in src/valency_anndata/tools/_kmeans.py
def kmeans(
    adata: AnnData,
    use_rep: Optional[str] = None,
    n_pcs: Optional[int] = None,
    k_bounds: Optional[Tuple[int, int]] = None,
    init: Literal["k-means++", "random", "polis"] = "k-means++",
    init_centers: Optional[np.ndarray] = None,
    random_state: Optional[int] = None,
    mask_obs: NDArray[np.bool_] | str | None = None,
    key_added: str = "kmeans",
    inplace: bool = True,
) -> AnnData | None:
    """
    Apply BestPolisKMeans clustering to an AnnData object.

    Parameters
    ----------
    adata :
        Input data. Must have `.X` as a numpy array.
    use_rep
        Representation to use for clustering. If ``None``, use ``'X_pca'`` if
        present in ``adata.obsm``, otherwise fall back to ``adata.X``.
    n_pcs
        Number of dimensions to use from the selected representation. If given,
        only the first ``n_pcs`` columns are used.
    k_bounds :
        Minimum and maximum number of clusters to try. Defaults to [2, 5].
    init :
        Initialization method for KMeans. Defaults to 'k-means++'.
    init_centers :
        Initial cluster centers to use.
    random_state :
        Random seed for reproducibility.
    mask_obs :
        Restrict clustering to a certain set of observations. The mask is
        specified as a boolean array or a string referring to an array in
        [anndata.AnnData.obs][].
    key_added :
        Name of the column to store cluster labels in `adata.obs`.
    inplace :
        If True, modify `adata` in place and return None.
        If False, return a copy with the clustering added.

    Returns
    -------
    AnnData or None
        Returns a copy if `inplace=False`, otherwise modifies in place.
    """
    X = _choose_representation(adata, use_rep=use_rep, n_pcs=n_pcs)

    if not isinstance(X, np.ndarray):
        raise ValueError("Selected representation must be a numpy array.")

    if k_bounds is None:
        k_bounds_list = [2, 5]
    else:
        k_bounds_list = list(k_bounds)

    mask = _check_mask(adata, mask_obs, "obs")
    if mask is None:
        X_cluster = X
    else:
        X_cluster = X[mask]
        if X_cluster.shape[0] == 0:
            raise ValueError("mask_obs excludes all observations.")

    best_kmeans = BestPolisKMeans(
        k_bounds=k_bounds_list,
        init=init,
        init_centers=init_centers,
        random_state=random_state,
    )
    best_kmeans.fit(X_cluster)

    if not best_kmeans.best_estimator_:
        raise RuntimeError("BestPolisKMeans did not find a valid estimator.")

    raw_labels = best_kmeans.best_estimator_.labels_

    if mask is None:
        full_labels = raw_labels
    else:
        # dtype=object keeps labels from casting to float.
        full_labels = np.full(adata.n_obs, np.nan, dtype=object)
        full_labels[mask] = raw_labels

    labels = pd.Categorical(full_labels)

    def _write_kmeans_result(adata_out: AnnData) -> None:
        adata_out.obs[key_added] = labels

        kmeans_params = dict(
            k_bounds=k_bounds_list,
            best_k=best_kmeans.best_k_,
            best_score=best_kmeans.best_score_,
            init=init,
            random_state=random_state,
            use_rep=use_rep,
            n_pcs=n_pcs,
        )

        adata_out.uns[key_added] = {}
        adata_out.uns[key_added]["params"] = kmeans_params

    if inplace:
        _write_kmeans_result(adata)
        return None
    else:
        adata_copy = adata.copy()
        _write_kmeans_result(adata_copy)
        return adata_copy

valency_anndata.tools.pacmap

pacmap(
    adata: AnnData,
    *,
    layer: str = "X_imputed",
    n_neighbors: Optional[int] = None,
    n_components: int = 2,
    mask_var: str | None = None,
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | None

Compute PaCMAP dimensionality reduction.

Parameters:

adata : AnnData (required)
    AnnData object.
layer : str, default 'X_imputed'
    Layer to use for computation.
n_neighbors : Optional[int], default None
    Number of neighbors for PaCMAP.
n_components : int, default 2
    Number of dimensions for the embedding.
mask_var : str | None, default None
    Column name in adata.var to use for masking variables. If provided, only variables where mask_var is True will be used.
key_added : str | None, default None
    Key under which to store the embedding in adata.obsm. Default is "X_pacmap".
copy : bool, default False
    Return a copy instead of modifying adata in place.

Returns:

AnnData | None
    Returns AnnData if copy=True, otherwise returns None.
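
A minimal usage sketch. PaCMAP reads from a layer (default "X_imputed") that must be free of NaN values; the zero-fill below is only for illustration and is not the library's imputation:

import numpy as np
import valency_anndata as val

adata = val.datasets.aufstehen()
# Fill missing votes so the layer is dense and NaN-free (illustrative only).
adata.layers["X_imputed"] = np.nan_to_num(np.asarray(adata.X))
val.tools.pacmap(adata, n_components=2)
adata.obsm["X_pacmap"].shape                     # (n_obs, 2)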

Source code in src/valency_anndata/tools/_pacmap.py
def pacmap(
    adata: AnnData,
    *,
    layer: str = "X_imputed",
    n_neighbors: Optional[int] = None,
    n_components: int = 2,
    mask_var: str | None = None,
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | None:
    """
    Compute PaCMAP dimensionality reduction.

    Parameters
    ----------
    adata
        AnnData object.
    layer
        Layer to use for computation. Default is "X_imputed".
    n_neighbors
        Number of neighbors for PaCMAP.
    n_components
        Number of dimensions for the embedding. Default is 2.
    mask_var
        Column name in `adata.var` to use for masking variables.
        If provided, only variables where `mask_var` is True will be used.
    key_added
        Key under which to store the embedding in `adata.obsm`.
        Default is "X_pacmap".
    copy
        Return a copy instead of modifying adata in place.

    Returns
    -------
    AnnData | None
        Returns AnnData if `copy=True`, otherwise returns None.
    """
    adata = adata.copy() if copy else adata

    key_obsm, key_uns = ("X_pacmap", "pacmap") if key_added is None else [key_added] * 2

    start = logg.info("computing PaCMAP")

    from pacmap import PaCMAP

    estimator = PaCMAP(
        n_components=n_components,
        n_neighbors=n_neighbors,
    )

    # Get data from layer, optionally filtering by mask_var
    X = adata.layers[layer]
    if mask_var is not None:
        mask = adata.var[mask_var].values
        X = X[:, mask]

    X_reduced = estimator.fit_transform(X)

    adata.obsm[key_obsm] = X_reduced

    return adata if copy else None

valency_anndata.tools.localmap

localmap(
    adata: AnnData,
    *,
    layer: str = "X_imputed",
    n_neighbors: Optional[int] = None,
    n_components: int = 2,
    mask_var: str | None = None,
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | None

Compute LocalMAP dimensionality reduction.

Parameters:

adata : AnnData (required)
    AnnData object.
layer : str, default 'X_imputed'
    Layer to use for computation.
n_neighbors : Optional[int], default None
    Number of neighbors for LocalMAP.
n_components : int, default 2
    Number of dimensions for the embedding.
mask_var : str | None, default None
    Column name in adata.var to use for masking variables. If provided, only variables where mask_var is True will be used.
key_added : str | None, default None
    Key under which to store the embedding in adata.obsm. Default is "X_localmap".
copy : bool, default False
    Return a copy instead of modifying adata in place.

Returns:

AnnData | None
    Returns AnnData if copy=True, otherwise returns None.
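
LocalMAP follows the same pattern as pacmap above; a minimal sketch (again with an illustrative NaN fill, here under a custom key):

import numpy as np
import valency_anndata as val

adata = val.datasets.aufstehen()
adata.layers["X_imputed"] = np.nan_to_num(np.asarray(adata.X))   # illustrative NaN fill
val.tools.localmap(adata, n_components=2, key_added="X_localmap_2d")
adata.obsm["X_localmap_2d"].shape                                # (n_obs, 2)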

Source code in src/valency_anndata/tools/_pacmap.py
def localmap(
    adata: AnnData,
    *,
    layer: str = "X_imputed",
    n_neighbors: Optional[int] = None,
    n_components: int = 2,
    mask_var: str | None = None,
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | None:
    """
    Compute LocalMAP dimensionality reduction.

    Parameters
    ----------
    adata
        AnnData object.
    layer
        Layer to use for computation. Default is "X_imputed".
    n_neighbors
        Number of neighbors for LocalMAP.
    n_components
        Number of dimensions for the embedding. Default is 2.
    mask_var
        Column name in `adata.var` to use for masking variables.
        If provided, only variables where `mask_var` is True will be used.
    key_added
        Key under which to store the embedding in `adata.obsm`.
        Default is "X_localmap".
    copy
        Return a copy instead of modifying adata in place.

    Returns
    -------
    AnnData | None
        Returns AnnData if `copy=True`, otherwise returns None.
    """
    adata = adata.copy() if copy else adata

    key_obsm, key_uns = ("X_localmap", "localmap") if key_added is None else [key_added] * 2

    start = logg.info("computing LocalMAP")

    from pacmap import LocalMAP

    estimator = LocalMAP(
        n_components=n_components,
        n_neighbors=n_neighbors,
    )

    # Get data from layer, optionally filtering by mask_var
    X = adata.layers[layer]
    if mask_var is not None:
        mask = adata.var[mask_var].values
        X = X[:, mask]

    X_reduced = estimator.fit_transform(X)

    adata.obsm[key_obsm] = X_reduced

    return adata if copy else None

scanpy methods (inherited)

Note

These methods are simply quick convenience wrappers around methods in scanpy, a tool for single-cell gene expression. They will use terms like "cells", "genes" and "counts", but you can think of these as "participants", "statements" and "votes".

See scanpy.tl for more methods you can experiment with via the val.scanpy.tl namespace.

valency_anndata.tools.pca

pca(
    data: AnnData | ndarray | CSBase,
    n_comps: int | None = None,
    *,
    layer: str | None = None,
    zero_center: bool = True,
    svd_solver: SvdSolver | None = None,
    chunked: bool = False,
    chunk_size: int | None = None,
    random_state: _LegacyRandom = 0,
    return_info: bool = False,
    mask_var: NDArray[bool_] | str | None | Empty = _empty,
    use_highly_variable: bool | None = None,
    dtype: DTypeLike = "float32",
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | ndarray | CSBase | None

Principal component analysis [Pedregosa et al., 2011].

Computes PCA coordinates, loadings and variance decomposition. Uses the following implementations (and defaults for svd_solver):

  • chunked=False, zero_center=True
    • numpy.ndarray, scipy.sparse.spmatrix, or scipy.sparse.sparray: sklearn PCA ('arpack')
    • dask.array.Array, dense: dask-ml PCA ('auto'); consider svd_solver='covariance_eigh' to reduce memory usage (see dask/dask-ml#985)
    • dask.array.Array, sparse or svd_solver='covariance_eigh': custom implementation ('covariance_eigh')
  • chunked=False, zero_center=False
    • numpy.ndarray, scipy.sparse.spmatrix, or scipy.sparse.sparray: sklearn TruncatedSVD ('randomized')
    • dask.array.Array: dask-ml TruncatedSVD ('tsqr'); this implementation cannot handle sparse chunks, try manually densifying them
  • chunked=True (zero_center ignored)
    • numpy.ndarray, scipy.sparse.spmatrix, or scipy.sparse.sparray: sklearn IncrementalPCA ('auto')
    • dask.array.Array: dask-ml IncrementalPCA ('auto'); this implementation densifies sparse chunks and therefore has increased memory usage

Parameters:

data : AnnData | ndarray | CSBase (required)
    The (annotated) data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
n_comps : int | None, default None
    Number of principal components to compute. Defaults to 50, or 1 - minimum dimension size of selected representation.
layer : str | None, default None
    If provided, which element of layers to use as expression values for PCA.
zero_center : bool, default True
    If True, compute (or approximate) PCA from the covariance matrix. If False, perform a truncated SVD instead of PCA.

    Our default PCA algorithms (see svd_solver) support implicit zero-centering, and can therefore operate efficiently on sparse data.
svd_solver : SvdSolver | None, default None
    SVD solver to use. See the table above for which solver class is used based on chunked and zero_center, as well as the default solver for each class when svd_solver=None.

    Efficient computation of the principal components of a sparse matrix currently only works with the 'arpack' or 'covariance_eigh' solver.

    None
        Choose automatically based on solver class (see table above).
    'arpack'
        ARPACK wrapper in SciPy (scipy.sparse.linalg.svds). Not available for dask arrays.
    'covariance_eigh'
        Classic eigendecomposition of the covariance matrix, suited for tall-and-skinny matrices. With dask, the array must be CSR or dense and chunked as (N, adata.shape[1]).
    'randomized'
        Randomized algorithm from [Halko et al., 2009]. For dask arrays, this will use dask.array.linalg.svd_compressed.
    'auto'
        Choose automatically depending on the size of the problem: will use 'full' for small shapes and 'randomized' for large shapes.
    'tsqr'
        "Tall-and-skinny QR" algorithm from [Benson et al., 2013]. Only available for dense dask arrays.

    Changed in version 1.9.3: default value changed from 'arpack' to None.
    Changed in version 1.4.5: default value changed from 'auto' to 'arpack'.
chunked : bool, default False
    If True, perform an incremental PCA on segments of chunk_size. Automatically zero centers and ignores settings of zero_center, random_seed and svd_solver. If False, perform a full PCA/truncated SVD (see svd_solver and zero_center). See the table above for which solver class is used.
chunk_size : int | None, default None
    Number of observations to include in each chunk. Required if chunked=True was passed.
random_state : _LegacyRandom, default 0
    Change to use different initial states for the optimization.
return_info : bool, default False
    Only relevant when not passing an AnnData: see "Returns".
dtype : DTypeLike, default 'float32'
    Numpy data type string to which to convert the result.
key_added : str | None, default None
    If not specified, the embedding is stored as .obsm['X_pca'], the loadings as .varm['PCs'], and the parameters in .uns['pca']. If specified, the embedding is stored as .obsm[key_added], the loadings as .varm[key_added], and the parameters in .uns[key_added].
copy : bool, default False
    If an AnnData is passed, determines whether a copy is returned. Is ignored otherwise.

Returns:

If data is array-like and return_info=False was passed, this function returns the PCA representation of data as an array of the same type as the input array. Otherwise, it returns None if copy=False, else an updated AnnData object. Sets the following fields:

.obsm['X_pca' | key_added] : scipy.sparse.csr_matrix | scipy.sparse.csc_matrix | numpy.ndarray (shape (adata.n_obs, n_comps))
    PCA representation of data.
.varm['PCs' | key_added] : numpy.ndarray (shape (adata.n_vars, n_comps))
    The principal components containing the loadings.
.uns['pca' | key_added]['variance_ratio'] : numpy.ndarray (shape (n_comps,))
    Ratio of explained variance.
.uns['pca' | key_added]['variance'] : numpy.ndarray (shape (n_comps,))
    Explained variance, equivalent to the eigenvalues of the covariance matrix.
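
A minimal sketch of calling this wrapper under a custom key. The toy matrix is an assumption for illustration; real vote matrices contain NaN and should be imputed (or run through recipe_polis) first:

import numpy as np
import anndata as ad
import valency_anndata as val

rng = np.random.default_rng(0)
adata = ad.AnnData(rng.choice([-1.0, 0.0, 1.0], size=(50, 20)))  # toy NaN-free vote-like matrix
val.tools.pca(adata, n_comps=5, key_added="X_pca_votes")
adata.obsm["X_pca_votes"].shape                  # (50, 5)
adata.uns["X_pca_votes"]["variance_ratio"]       # explained variance per component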

Source code in .venv/lib/python3.10/site-packages/scanpy/preprocessing/_pca/__init__.py
@_doc_params(
    mask_var_hvg=doc_mask_var_hvg,
)
def pca(  # noqa: PLR0912, PLR0913, PLR0915
    data: AnnData | np.ndarray | CSBase,
    n_comps: int | None = None,
    *,
    layer: str | None = None,
    zero_center: bool = True,
    svd_solver: SvdSolver | None = None,
    chunked: bool = False,
    chunk_size: int | None = None,
    random_state: _LegacyRandom = 0,
    return_info: bool = False,
    mask_var: NDArray[np.bool_] | str | None | Empty = _empty,
    use_highly_variable: bool | None = None,
    dtype: DTypeLike = "float32",
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | np.ndarray | CSBase | None:
    r"""Principal component analysis :cite:p:`Pedregosa2011`.

    Computes PCA coordinates, loadings and variance decomposition.
    Uses the following implementations (and defaults for `svd_solver`):

    .. list-table::
       :header-rows: 1
       :stub-columns: 1

       - -
         - :class:`~numpy.ndarray`, :class:`~scipy.sparse.spmatrix`, or :class:`~scipy.sparse.sparray`
         - :class:`dask.array.Array`
       - - `chunked=False`, `zero_center=True`
         - sklearn :class:`~sklearn.decomposition.PCA` (`'arpack'`)
         - - *dense*: dask-ml :class:`~dask_ml.decomposition.PCA`\ [#high-mem]_ (`'auto'`)
           - *sparse* or `svd_solver='covariance_eigh'`: custom implementation (`'covariance_eigh'`)
       - - `chunked=False`, `zero_center=False`
         - sklearn :class:`~sklearn.decomposition.TruncatedSVD` (`'randomized'`)
         - dask-ml :class:`~dask_ml.decomposition.TruncatedSVD`\ [#dense-only]_ (`'tsqr'`)
       - - `chunked=True` (`zero_center` ignored)
         - sklearn :class:`~sklearn.decomposition.IncrementalPCA` (`'auto'`)
         - dask-ml :class:`~dask_ml.decomposition.IncrementalPCA`\ [#densifies]_ (`'auto'`)

    .. [#high-mem] Consider `svd_solver='covariance_eigh'` to reduce memory usage (see :issue:`dask/dask-ml#985`).
    .. [#dense-only] This implementation can not handle sparse chunks, try manually densifying them.
    .. [#densifies] This implementation densifies sparse chunks and therefore has increased memory usage.

    Parameters
    ----------
    data
        The (annotated) data matrix of shape `n_obs` × `n_vars`.
        Rows correspond to cells and columns to genes.
    n_comps
        Number of principal components to compute. Defaults to 50,
        or 1 - minimum dimension size of selected representation.
    layer
        If provided, which element of layers to use for PCA.
    zero_center
        If `True`, compute (or approximate) PCA from covariance matrix.
        If `False`, performa a truncated SVD instead of PCA.

        Our default PCA algorithms (see `svd_solver`) support implicit zero-centering,
        and therefore efficiently operating on sparse data.
    svd_solver
        SVD solver to use.
        See table above to see which solver class is used based on `chunked` and `zero_center`,
        as well as the default solver for each class when `svd_solver=None`.

        Efficient computation of the principal components of a sparse matrix
        currently only works with the `'arpack`' or `'covariance_eigh`' solver.

        `None`
            Choose automatically based on solver class (see table above).
        `'arpack'`
            ARPACK wrapper in SciPy (:func:`~scipy.sparse.linalg.svds`).
            Not available for *dask* arrays.
        `'covariance_eigh'`
            Classic eigendecomposition of the covariance matrix, suited for tall-and-skinny matrices.
            With dask, array must be CSR or dense and chunked as `(N, adata.shape[1])`.
        `'randomized'`
            Randomized algorithm from :cite:t:`Halko2009`.
            For *dask* arrays, this will use :func:`~dask.array.linalg.svd_compressed`.
        `'auto'`
            Choose automatically depending on the size of the problem:
            Will use `'full'` for small shapes and `'randomized'` for large shapes.
        `'tsqr'`
            “tall-and-skinny QR” algorithm from :cite:t:`Benson2013`.
            Only available for dense *dask* arrays.

        .. versionchanged:: 1.9.3
           Default value changed from `'arpack'` to None.
        .. versionchanged:: 1.4.5
           Default value changed from `'auto'` to `'arpack'`.
    chunked
        If `True`, perform an incremental PCA on segments of `chunk_size`.
        Automatically zero centers and ignores settings of `zero_center`, `random_seed` and `svd_solver`.
        If `False`, perform a full PCA/truncated SVD (see `svd_solver` and `zero_center`).
        See table above for which solver class is used.
    chunk_size
        Number of observations to include in each chunk.
        Required if `chunked=True` was passed.
    random_state
        Change to use different initial states for the optimization.
    return_info
        Only relevant when not passing an :class:`~anndata.AnnData`:
        see “Returns”.
    {mask_var_hvg}
    layer
        Layer of `adata` to use as expression values.
    dtype
        Numpy data type string to which to convert the result.
    key_added
        If not specified, the embedding is stored as
        :attr:`~anndata.AnnData.obsm`\ `['X_pca']`, the loadings as
        :attr:`~anndata.AnnData.varm`\ `['PCs']`, and the the parameters in
        :attr:`~anndata.AnnData.uns`\ `['pca']`.
        If specified, the embedding is stored as
        :attr:`~anndata.AnnData.obsm`\ ``[key_added]``, the loadings as
        :attr:`~anndata.AnnData.varm`\ ``[key_added]``, and the the parameters in
        :attr:`~anndata.AnnData.uns`\ ``[key_added]``.
    copy
        If an :class:`~anndata.AnnData` is passed, determines whether a copy
        is returned. Is ignored otherwise.

    Returns
    -------
    If `data` is array-like and `return_info=False` was passed,
    this function returns the PCA representation of `data` as an
    array of the same type as the input array.

    Otherwise, it returns `None` if `copy=False`, else an updated `AnnData` object.
    Sets the following fields:

    `.obsm['X_pca' | key_added]` : :class:`~scipy.sparse.csr_matrix` | :class:`~scipy.sparse.csc_matrix` | :class:`~numpy.ndarray` (shape `(adata.n_obs, n_comps)`)
        PCA representation of data.
    `.varm['PCs' | key_added]` : :class:`~numpy.ndarray` (shape `(adata.n_vars, n_comps)`)
        The principal components containing the loadings.
    `.uns['pca' | key_added]['variance_ratio']` : :class:`~numpy.ndarray` (shape `(n_comps,)`)
        Ratio of explained variance.
    `.uns['pca' | key_added]['variance']` : :class:`~numpy.ndarray` (shape `(n_comps,)`)
        Explained variance, equivalent to the eigenvalues of the
        covariance matrix.

    """
    logg_start = logg.info("computing PCA")
    if layer is not None and chunked:
        # Current chunking implementation relies on pca being called on X
        msg = "Cannot use `layer` and `chunked` at the same time."
        raise NotImplementedError(msg)

    # chunked calculation is not randomized, anyways
    if svd_solver in {"auto", "randomized"} and not chunked:
        logg.info(
            "Note that scikit-learn's randomized PCA might not be exactly "
            "reproducible across different computational platforms. For exact "
            "reproducibility, choose `svd_solver='arpack'`."
        )
    if return_anndata := isinstance(data, AnnData):
        if layer is None and not chunked and is_backed_type(data.X):
            msg = f"PCA is not implemented for matrices of type {type(data.X)} with chunked as False"
            raise NotImplementedError(msg)
        adata = data.copy() if copy else data
    elif pkg_version("anndata") < Version("0.8.0rc1"):
        adata = AnnData(data, dtype=data.dtype)
    else:
        adata = AnnData(data)

    # Unify new mask argument and deprecated use_highly_varible argument
    mask_var_param, mask_var = _handle_mask_var(
        adata, mask_var, use_highly_variable=use_highly_variable
    )
    del use_highly_variable
    adata_comp = adata[:, mask_var] if mask_var is not None else adata

    if n_comps is None:
        min_dim = min(adata_comp.n_vars, adata_comp.n_obs)
        n_comps = min_dim - 1 if min_dim <= settings.N_PCS else settings.N_PCS

    logg.info(f"    with {n_comps=}")

    x = _get_obs_rep(adata_comp, layer=layer)
    if is_backed_type(x) and layer is not None:
        msg = f"PCA is not implemented for matrices of type {type(x)} from layers"
        raise NotImplementedError(msg)
    # See: https://github.com/scverse/scanpy/pull/2816#issuecomment-1932650529
    if (
        pkg_version("anndata") < Version("0.9")
        and mask_var is not None
        and isinstance(x, np.ndarray)
    ):
        warnings.warn(
            "When using a mask parameter with anndata<0.9 on a dense array, the PCA"
            "can have slightly different results due the array being column major "
            "instead of row major.",
            UserWarning,
            stacklevel=2,
        )

    # check_random_state returns a numpy RandomState when passed an int but
    # dask needs an int for random state
    if not isinstance(x, DaskArray):
        random_state = check_random_state(random_state)
    elif not isinstance(random_state, int):
        msg = f"random_state needs to be an int, not a {type(random_state).__name__} when passing a dask array"
        raise TypeError(msg)

    if chunked:
        if (
            not zero_center
            or random_state
            or (svd_solver is not None and svd_solver != "arpack")
        ):
            logg.debug("Ignoring zero_center, random_state, svd_solver")

        incremental_pca_kwargs = dict()
        if isinstance(x, DaskArray):
            from dask.array import zeros
            from dask_ml.decomposition import IncrementalPCA

            incremental_pca_kwargs["svd_solver"] = _handle_dask_ml_args(
                svd_solver, IncrementalPCA
            )
        else:
            from numpy import zeros
            from sklearn.decomposition import IncrementalPCA

        x_pca = zeros((x.shape[0], n_comps), x.dtype)

        pca_ = IncrementalPCA(n_components=n_comps, **incremental_pca_kwargs)

        for chunk, _, _ in adata_comp.chunked_X(chunk_size):
            chunk_dense = chunk.toarray() if isinstance(chunk, CSBase) else chunk
            pca_.partial_fit(chunk_dense)

        for chunk, start, end in adata_comp.chunked_X(chunk_size):
            chunk_dense = chunk.toarray() if isinstance(chunk, CSBase) else chunk
            x_pca[start:end] = pca_.transform(chunk_dense)
    elif zero_center:
        if isinstance(x, CSBase) and (
            pkg_version("scikit-learn") < Version("1.4") or svd_solver == "lobpcg"
        ):
            if svd_solver not in (
                {"lobpcg"} | get_literal_vals(SvdSolvPCASparseSklearn)
            ):
                if svd_solver is not None:
                    msg = (
                        f"Ignoring {svd_solver=} and using 'arpack', "
                        "sparse PCA with sklearn < 1.4 only supports 'lobpcg' and 'arpack'."
                    )
                    warnings.warn(msg, UserWarning, stacklevel=2)
                svd_solver = "arpack"
            elif svd_solver == "lobpcg":
                msg = (
                    f"{svd_solver=} for sparse relies on legacy code and will not be supported in the future. "
                    "Also the lobpcg solver has been observed to be inaccurate. Please use 'arpack' instead."
                )
                warnings.warn(msg, FutureWarning, stacklevel=2)
            x_pca, pca_ = _pca_compat_sparse(
                x, n_comps, solver=svd_solver, random_state=random_state
            )
        else:
            if not isinstance(x, DaskArray):
                from sklearn.decomposition import PCA

                svd_solver = _handle_sklearn_args(
                    svd_solver, PCA, sparse=isinstance(x, CSBase)
                )
                pca_ = PCA(
                    n_components=n_comps,
                    svd_solver=svd_solver,
                    random_state=random_state,
                )
            elif isinstance(x._meta, CSBase) or svd_solver == "covariance_eigh":
                from ._dask import PCAEighDask

                if random_state != 0:
                    msg = f"Ignoring {random_state=} when using a sparse dask array"
                    warnings.warn(msg, UserWarning, stacklevel=2)
                if svd_solver not in {None, "covariance_eigh"}:
                    msg = f"Ignoring {svd_solver=} when using a sparse dask array"
                    warnings.warn(msg, UserWarning, stacklevel=2)
                pca_ = PCAEighDask(n_components=n_comps)
            else:
                from dask_ml.decomposition import PCA

                svd_solver = _handle_dask_ml_args(svd_solver, PCA)
                pca_ = PCA(
                    n_components=n_comps,
                    svd_solver=svd_solver,
                    random_state=random_state,
                )
            x_pca = pca_.fit_transform(x)
    else:
        if isinstance(x, DaskArray):
            if isinstance(x._meta, CSBase):
                msg = (
                    "`zero_center=False` is not supported for sparse Dask arrays (yet). "
                    "See <https://github.com/dask/dask-ml/issues/123>."
                )
                raise TypeError(msg)
            from dask_ml.decomposition import TruncatedSVD

            svd_solver = _handle_dask_ml_args(svd_solver, TruncatedSVD)
        else:
            from sklearn.decomposition import TruncatedSVD

            svd_solver = _handle_sklearn_args(svd_solver, TruncatedSVD)

        logg.debug(
            "    without zero-centering: \n"
            "    the explained variance does not correspond to the exact statistical definition\n"
            "    the first component, e.g., might be heavily influenced by different means\n"
            "    the following components often resemble the exact PCA very closely"
        )
        pca_ = TruncatedSVD(
            n_components=n_comps, random_state=random_state, algorithm=svd_solver
        )
        x_pca = pca_.fit_transform(x)

    if x_pca.dtype.descr != np.dtype(dtype).descr:
        x_pca = x_pca.astype(dtype)

    if return_anndata:
        key_obsm, key_varm, key_uns = (
            ("X_pca", "PCs", "pca") if key_added is None else [key_added] * 3
        )
        adata.obsm[key_obsm] = x_pca

        if mask_var is not None:
            adata.varm[key_varm] = np.zeros(shape=(adata.n_vars, n_comps))
            adata.varm[key_varm][mask_var] = pca_.components_.T
        else:
            adata.varm[key_varm] = pca_.components_.T

        params = dict(
            zero_center=zero_center,
            use_highly_variable=mask_var_param == "highly_variable",
            mask_var=mask_var_param,
        )
        if layer is not None:
            params["layer"] = layer
        adata.uns[key_uns] = dict(
            params=params,
            variance=pca_.explained_variance_,
            variance_ratio=pca_.explained_variance_ratio_,
        )

        logg.info("    finished", time=logg_start)
        logg.debug(
            "and added\n"
            f"    {key_obsm!r}, the PCA coordinates (adata.obs)\n"
            f"    {key_varm!r}, the loadings (adata.varm)\n"
            f"    'pca_variance', the variance / eigenvalues (adata.uns[{key_uns!r}])\n"
            f"    'pca_variance_ratio', the variance ratio (adata.uns[{key_uns!r}])"
        )
        return adata if copy else None
    else:
        logg.info("    finished", time=logg_start)
        if return_info:
            return (
                x_pca,
                pca_.components_,
                pca_.explained_variance_ratio_,
                pca_.explained_variance_,
            )
        else:
            return x_pca

valency_anndata.tools.tsne

tsne(
    adata: AnnData,
    n_pcs: int | None = None,
    *,
    use_rep: str | None = None,
    perplexity: float = 30,
    metric: str = "euclidean",
    early_exaggeration: float = 12,
    learning_rate: float = 1000,
    random_state: _LegacyRandom = 0,
    use_fast_tsne: bool = False,
    n_jobs: int | None = None,
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | None

t-SNE [van der Maaten & Hinton, 2008; Amir et al., 2013; Pedregosa et al., 2011].

t-distributed stochastic neighborhood embedding (tSNE) [van der Maaten & Hinton, 2008] was proposed for visualizing single-cell data by [Amir et al., 2013]. Here, by default, we use the implementation of scikit-learn [Pedregosa et al., 2011]. You can achieve a huge speedup and better convergence if you install Multicore-tSNE [Ulyanov, 2016] (https://github.com/DmitryUlyanov/Multicore-TSNE), which will be automatically detected by Scanpy.

Parameters:

adata : AnnData (required)
    Annotated data matrix.
perplexity : float, default 30
    The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. The choice is not extremely critical since t-SNE is quite insensitive to this parameter.
metric : str, default 'euclidean'
    Distance metric to calculate neighbors on.
early_exaggeration : float, default 12
    Controls how tight natural clusters in the original space are in the embedded space and how much space will be between them. For larger values, the space between natural clusters will be larger in the embedded space. Again, the choice of this parameter is not very critical. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high.
learning_rate : float, default 1000
    Note that the R package "Rtsne" uses a default of 200. The learning rate can be a critical parameter. It should be between 100 and 1000. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high. If the cost function gets stuck in a bad local minimum, increasing the learning rate sometimes helps.
random_state : _LegacyRandom, default 0
    Change this to use different initial states for the optimization. If None, the initial state is not reproducible.
n_jobs : int | None, default None
    Number of jobs for parallel computation. None means using scanpy.settings.n_jobs.
key_added : str | None, default None
    If not specified, the embedding is stored as .obsm['X_tsne'] and the parameters in .uns['tsne']. If specified, the embedding is stored as .obsm[key_added] and the parameters in .uns[key_added].
copy : bool, default False
    Return a copy instead of writing to adata.

Returns:

Returns None if copy=False, else returns an AnnData object. Sets the following fields:

adata.obsm['X_tsne' | key_added] : numpy.ndarray (dtype float)
    tSNE coordinates of data.
adata.uns['tsne' | key_added] : dict
    tSNE parameters.
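
A minimal sketch on toy data (the matrix is an assumption for illustration); with real vote data, point use_rep at a NaN-free representation such as the one produced by recipe_polis:

import numpy as np
import anndata as ad
import valency_anndata as val

rng = np.random.default_rng(0)
adata = ad.AnnData(rng.choice([-1.0, 0.0, 1.0], size=(60, 15)))  # toy NaN-free matrix
val.tools.pca(adata, n_comps=10)                 # writes .obsm["X_pca"]
val.tools.tsne(adata, use_rep="X_pca", perplexity=15)
adata.obsm["X_tsne"].shape                       # (60, 2)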

Source code in .venv/lib/python3.10/site-packages/scanpy/tools/_tsne.py
@old_positionals(
    "use_rep",
    "perplexity",
    "early_exaggeration",
    "learning_rate",
    "random_state",
    "use_fast_tsne",
    "n_jobs",
    "copy",
)
@_doc_params(doc_n_pcs=doc_n_pcs, use_rep=doc_use_rep)
def tsne(  # noqa: PLR0913
    adata: AnnData,
    n_pcs: int | None = None,
    *,
    use_rep: str | None = None,
    perplexity: float = 30,
    metric: str = "euclidean",
    early_exaggeration: float = 12,
    learning_rate: float = 1000,
    random_state: _LegacyRandom = 0,
    use_fast_tsne: bool = False,
    n_jobs: int | None = None,
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | None:
    r"""t-SNE :cite:p:`vanDerMaaten2008,Amir2013,Pedregosa2011`.

    t-distributed stochastic neighborhood embedding (tSNE, :cite:t:`vanDerMaaten2008`) was
    proposed for visualizating single-cell data by :cite:t:`Amir2013`. Here, by default,
    we use the implementation of *scikit-learn* :cite:p:`Pedregosa2011`. You can achieve
    a huge speedup and better convergence if you install Multicore-tSNE_
    by :cite:t:`Ulyanov2016`, which will be automatically detected by Scanpy.

    .. _multicore-tsne: https://github.com/DmitryUlyanov/Multicore-TSNE

    Parameters
    ----------
    adata
        Annotated data matrix.
    {doc_n_pcs}
    {use_rep}
    perplexity
        The perplexity is related to the number of nearest neighbors that
        is used in other manifold learning algorithms. Larger datasets
        usually require a larger perplexity. Consider selecting a value
        between 5 and 50. The choice is not extremely critical since t-SNE
        is quite insensitive to this parameter.
    metric
        Distance metric calculate neighbors on.
    early_exaggeration
        Controls how tight natural clusters in the original space are in the
        embedded space and how much space will be between them. For larger
        values, the space between natural clusters will be larger in the
        embedded space. Again, the choice of this parameter is not very
        critical. If the cost function increases during initial optimization,
        the early exaggeration factor or the learning rate might be too high.
    learning_rate
        Note that the R-package "Rtsne" uses a default of 200.
        The learning rate can be a critical parameter. It should be
        between 100 and 1000. If the cost function increases during initial
        optimization, the early exaggeration factor or the learning rate
        might be too high. If the cost function gets stuck in a bad local
        minimum increasing the learning rate helps sometimes.
    random_state
        Change this to use different intial states for the optimization.
        If `None`, the initial state is not reproducible.
    n_jobs
        Number of jobs for parallel computation.
        `None` means using :attr:`scanpy.settings.n_jobs`.
    key_added
        If not specified, the embedding is stored as
        :attr:`~anndata.AnnData.obsm`\ `['X_tsne']` and the the parameters in
        :attr:`~anndata.AnnData.uns`\ `['tsne']`.
        If specified, the embedding is stored as
        :attr:`~anndata.AnnData.obsm`\ ``[key_added]`` and the the parameters in
        :attr:`~anndata.AnnData.uns`\ ``[key_added]``.
    copy
        Return a copy instead of writing to `adata`.

    Returns
    -------
    Returns `None` if `copy=False`, else returns an `AnnData` object. Sets the following fields:

    `adata.obsm['X_tsne' | key_added]` : :class:`numpy.ndarray` (dtype `float`)
        tSNE coordinates of data.
    `adata.uns['tsne' | key_added]` : :class:`dict`
        tSNE parameters.

    """
    start = logg.info("computing tSNE")
    adata = adata.copy() if copy else adata
    x = _choose_representation(adata, use_rep=use_rep, n_pcs=n_pcs)
    raise_not_implemented_error_if_backed_type(x, "tsne")
    # params for sklearn
    n_jobs = settings.n_jobs if n_jobs is None else n_jobs
    params_sklearn = dict(
        perplexity=perplexity,
        random_state=random_state,
        verbose=settings.verbosity > 3,
        early_exaggeration=early_exaggeration,
        learning_rate=learning_rate,
        n_jobs=n_jobs,
        metric=metric,
    )
    if metric != "euclidean" and (pkg_version("scikit-learn") < Version("1.3.0rc1")):
        params_sklearn["square_distances"] = True

    # Backwards compat handling: Remove in scanpy 1.9.0
    if n_jobs != 1 and not use_fast_tsne:
        warnings.warn(
            "In previous versions of scanpy, calling tsne with n_jobs > 1 would use "
            "MulticoreTSNE. Now this uses the scikit-learn version of TSNE by default. "
            "If you'd like the old behaviour (which is deprecated), pass "
            "'use_fast_tsne=True'. Note, MulticoreTSNE is not actually faster anymore.",
            UserWarning,
            stacklevel=2,
        )
    if use_fast_tsne:
        warnings.warn(
            "Argument `use_fast_tsne` is deprecated, and support for MulticoreTSNE "
            "will be dropped in a future version of scanpy.",
            FutureWarning,
            stacklevel=2,
        )

    # deal with different tSNE implementations
    if use_fast_tsne:
        try:
            from MulticoreTSNE import MulticoreTSNE as TSNE  # noqa: N814

            tsne = TSNE(**params_sklearn)
            logg.info("    using the 'MulticoreTSNE' package by Ulyanov (2017)")
            # need to transform to float64 for MulticoreTSNE...
            x_tsne = tsne.fit_transform(x.astype("float64"))
        except ImportError:
            use_fast_tsne = False
            warnings.warn(
                "Could not import 'MulticoreTSNE'. Falling back to scikit-learn.",
                UserWarning,
                stacklevel=2,
            )
    if use_fast_tsne is False:  # In case MultiCore failed to import
        from sklearn.manifold import TSNE

        # unfortunately, sklearn does not allow to set a minimum number
        # of iterations for barnes-hut tSNE
        tsne = TSNE(**params_sklearn)
        logg.info("    using sklearn.manifold.TSNE")
        x_tsne = tsne.fit_transform(x)

    # update AnnData instance
    params = dict(
        perplexity=perplexity,
        early_exaggeration=early_exaggeration,
        learning_rate=learning_rate,
        n_jobs=n_jobs,
        metric=metric,
        use_rep=use_rep,
    )
    key_uns, key_obsm = ("tsne", "X_tsne") if key_added is None else [key_added] * 2
    adata.obsm[key_obsm] = x_tsne  # annotate samples with tSNE coordinates
    adata.uns[key_uns] = dict(params={k: v for k, v in params.items() if v is not None})

    logg.info(
        "    finished",
        time=start,
        deep=(
            f"added\n"
            f"    {key_obsm!r}, tSNE coordinates (adata.obsm)\n"
            f"    {key_uns!r}, tSNE parameters (adata.uns)"
        ),
    )

    return adata if copy else None
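
A minimal usage sketch, not taken from the package's own examples: it assumes the function is exposed as val.tools.tsne and that the dataset and plotting helpers follow the recipe_polis conventions shown earlier.

import valency_anndata as val

adata = val.datasets.aufstehen()
val.tools.recipe_polis(adata)                  # provides the X_pca_polis representation
val.tools.tsne(adata, use_rep="X_pca_polis", perplexity=30)
val.viz.embedding(adata, basis="tsne", color="kmeans_polis")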

valency_anndata.tools.umap

umap(
    adata: AnnData,
    *,
    min_dist: float = 0.5,
    spread: float = 1.0,
    n_components: int = 2,
    maxiter: int | None = None,
    alpha: float = 1.0,
    gamma: float = 1.0,
    negative_sample_rate: int = 5,
    init_pos: _InitPos | ndarray | None = "spectral",
    random_state: _LegacyRandom = 0,
    a: float | None = None,
    b: float | None = None,
    method: Literal["umap", "rapids"] = "umap",
    key_added: str | None = None,
    neighbors_key: str = "neighbors",
    copy: bool = False,
) -> AnnData | None

Embed the neighborhood graph using UMAP [McInnes et al., 2018].

UMAP (Uniform Manifold Approximation and Projection) is a manifold learning technique suitable for visualizing high-dimensional data. Besides tending to be faster than tSNE, it optimizes the embedding such that it best reflects the topology of the data, which we represent throughout Scanpy using a neighborhood graph. tSNE, by contrast, optimizes the distribution of nearest-neighbor distances in the embedding such that these best match the distribution of distances in the high-dimensional space. We use the implementation of umap-learn [McInnes et al., 2018]. For comparisons of UMAP with tSNE, see [Becht et al., 2018].

umap-learn: https://github.com/lmcinnes/umap

Parameters:

Name Type Description Default
adata AnnData

Annotated data matrix.

required
min_dist float

The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out. The default in the umap-learn package is 0.1.

0.5
spread float

The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.

1.0
n_components int

The number of dimensions of the embedding.

2
maxiter int | None

The number of iterations (epochs) of the optimization. Called n_epochs in the original UMAP.

None
alpha float

The initial learning rate for the embedding optimization.

1.0
gamma float

Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.

1.0
negative_sample_rate int

The number of negative edge/1-simplex samples to use per positive edge/1-simplex sample in optimizing the low dimensional embedding.

5
init_pos _InitPos | ndarray | None

How to initialize the low dimensional embedding. Called init in the original UMAP. Options are:

  • Any key for adata.obsm.
  • 'paga': positions from scanpy.pl.paga.
  • 'spectral': use a spectral embedding of the graph.
  • 'random': assign initial embedding positions at random.
  • A numpy array of initial embedding positions.
'spectral'
random_state _LegacyRandom

If int, random_state is the seed used by the random number generator; If RandomState or Generator, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

0
a float | None

More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.

None
b float | None

More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.

None
method Literal['umap', 'rapids']

Chosen implementation.

  • 'umap': UMAP's simplicial set embedding.
  • 'rapids': GPU-accelerated implementation. Deprecated since 1.10.0; use rapids_singlecell.tl.umap instead.

'umap'
key_added str | None

If not specified, the embedding is stored as .obsm['X_umap'] and the parameters in .uns['umap']. If specified, the embedding is stored as .obsm[key_added] and the parameters in .uns[key_added].

None
neighbors_key str

Umap looks in .uns[neighbors_key] for neighbors settings and .obsp[.uns[neighbors_key]['connectivities_key']] for connectivities.

'neighbors'
copy bool

Return a copy instead of writing to adata.

False

Returns:

Type Description
Returns `None` if `copy=False`, else returns an `AnnData` object. Sets the following fields:
`adata.obsm['X_umap' | key_added]` : :class:`numpy.ndarray` (dtype `float`)

UMAP coordinates of data.

`adata.uns['umap' | key_added]` : :class:`dict`

UMAP parameters.
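
A minimal usage sketch, not taken from the package's own examples: the neighborhood graph is built here with scanpy.pp.neighbors on the Polis PCA representation, and the dataset and plotting helpers are assumed to follow the recipe_polis conventions.

import scanpy as sc
import valency_anndata as val

adata = val.datasets.aufstehen()
val.tools.recipe_polis(adata)                  # provides the X_pca_polis representation
sc.pp.neighbors(adata, use_rep="X_pca_polis")  # UMAP embeds this neighborhood graph
val.tools.umap(adata, min_dist=0.3)
val.viz.embedding(adata, basis="umap", color="kmeans_polis")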

Source code in .venv/lib/python3.10/site-packages/scanpy/tools/_umap.py
@old_positionals(
    "min_dist",
    "spread",
    "n_components",
    "maxiter",
    "alpha",
    "gamma",
    "negative_sample_rate",
    "init_pos",
    "random_state",
    "a",
    "b",
    "copy",
    "method",
    "neighbors_key",
)
def umap(  # noqa: PLR0913, PLR0915
    adata: AnnData,
    *,
    min_dist: float = 0.5,
    spread: float = 1.0,
    n_components: int = 2,
    maxiter: int | None = None,
    alpha: float = 1.0,
    gamma: float = 1.0,
    negative_sample_rate: int = 5,
    init_pos: _InitPos | np.ndarray | None = "spectral",
    random_state: _LegacyRandom = 0,
    a: float | None = None,
    b: float | None = None,
    method: Literal["umap", "rapids"] = "umap",
    key_added: str | None = None,
    neighbors_key: str = "neighbors",
    copy: bool = False,
) -> AnnData | None:
    r"""Embed the neighborhood graph using UMAP :cite:p:`McInnes2018`.

    UMAP (Uniform Manifold Approximation and Projection) is a manifold learning
    technique suitable for visualizing high-dimensional data. Besides tending to
    be faster than tSNE, it optimizes the embedding such that it best reflects
    the topology of the data, which we represent throughout Scanpy using a
    neighborhood graph. tSNE, by contrast, optimizes the distribution of
    nearest-neighbor distances in the embedding such that these best match the
    distribution of distances in the high-dimensional space.
    We use the implementation of umap-learn_ :cite:p:`McInnes2018`.
    For a few comparisons of UMAP with tSNE, see :cite:t:`Becht2018`.

    .. _umap-learn: https://github.com/lmcinnes/umap

    Parameters
    ----------
    adata
        Annotated data matrix.
    min_dist
        The effective minimum distance between embedded points. Smaller values
        will result in a more clustered/clumped embedding where nearby points on
        the manifold are drawn closer together, while larger values will result
        in a more even dispersal of points. The value should be set relative to
        the ``spread`` value, which determines the scale at which embedded
        points will be spread out. The default in the `umap-learn` package is
        0.1.
    spread
        The effective scale of embedded points. In combination with `min_dist`
        this determines how clustered/clumped the embedded points are.
    n_components
        The number of dimensions of the embedding.
    maxiter
        The number of iterations (epochs) of the optimization. Called `n_epochs`
        in the original UMAP.
    alpha
        The initial learning rate for the embedding optimization.
    gamma
        Weighting applied to negative samples in low dimensional embedding
        optimization. Values higher than one will result in greater weight
        being given to negative samples.
    negative_sample_rate
        The number of negative edge/1-simplex samples to use per positive
        edge/1-simplex sample in optimizing the low dimensional embedding.
    init_pos
        How to initialize the low dimensional embedding. Called `init` in the
        original UMAP. Options are:

        * Any key for `adata.obsm`.
        * 'paga': positions from :func:`~scanpy.pl.paga`.
        * 'spectral': use a spectral embedding of the graph.
        * 'random': assign initial embedding positions at random.
        * A numpy array of initial embedding positions.
    random_state
        If `int`, `random_state` is the seed used by the random number generator;
        If `RandomState` or `Generator`, `random_state` is the random number generator;
        If `None`, the random number generator is the `RandomState` instance used
        by `np.random`.
    a
        More specific parameters controlling the embedding. If `None` these
        values are set automatically as determined by `min_dist` and
        `spread`.
    b
        More specific parameters controlling the embedding. If `None` these
        values are set automatically as determined by `min_dist` and
        `spread`.
    method
        Chosen implementation.

        ``'umap'``
            Umap’s simplicial set embedding.
        ``'rapids'``
            GPU accelerated implementation.

            .. deprecated:: 1.10.0
                Use :func:`rapids_singlecell.tl.umap` instead.
    key_added
        If not specified, the embedding is stored as
        :attr:`~anndata.AnnData.obsm`\ `['X_umap']` and the parameters in
        :attr:`~anndata.AnnData.uns`\ `['umap']`.
        If specified, the embedding is stored as
        :attr:`~anndata.AnnData.obsm`\ ``[key_added]`` and the parameters in
        :attr:`~anndata.AnnData.uns`\ ``[key_added]``.
    neighbors_key
        Umap looks in
        :attr:`~anndata.AnnData.uns`\ ``[neighbors_key]`` for neighbors settings and
        :attr:`~anndata.AnnData.obsp`\ ``[.uns[neighbors_key]['connectivities_key']]`` for connectivities.
    copy
        Return a copy instead of writing to adata.

    Returns
    -------
    Returns `None` if `copy=False`, else returns an `AnnData` object. Sets the following fields:

    `adata.obsm['X_umap' | key_added]` : :class:`numpy.ndarray` (dtype `float`)
        UMAP coordinates of data.
    `adata.uns['umap' | key_added]` : :class:`dict`
        UMAP parameters.

    """
    adata = adata.copy() if copy else adata

    key_obsm, key_uns = ("X_umap", "umap") if key_added is None else [key_added] * 2

    if neighbors_key is None:  # backwards compat
        neighbors_key = "neighbors"
    if neighbors_key not in adata.uns:
        msg = f"Did not find .uns[{neighbors_key!r}]. Run `sc.pp.neighbors` first."
        raise ValueError(msg)

    start = logg.info("computing UMAP")

    neighbors = NeighborsView(adata, neighbors_key)

    if "params" not in neighbors or neighbors["params"]["method"] != "umap":
        logg.warning(
            f'.obsp["{neighbors["connectivities_key"]}"] have not been computed using umap'
        )

    with warnings.catch_warnings():
        # umap 0.5.0
        warnings.filterwarnings("ignore", message=r"Tensorflow not installed")
        import umap

    from umap.umap_ import find_ab_params, simplicial_set_embedding

    if a is None or b is None:
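        # Fit the a/b curve parameters from `spread` and `min_dist` using
        # umap-learn's helper, as umap.UMAP itself does when they are not given.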
        a, b = find_ab_params(spread, min_dist)
    adata.uns[key_uns] = dict(params=dict(a=a, b=b))
    if isinstance(init_pos, str) and init_pos in adata.obsm:
        init_coords = adata.obsm[init_pos]
    elif isinstance(init_pos, str) and init_pos == "paga":
        init_coords = get_init_pos_from_paga(
            adata, random_state=random_state, neighbors_key=neighbors_key
        )
    else:
        init_coords = init_pos  # Let umap handle it
    if hasattr(init_coords, "dtype"):
        init_coords = check_array(init_coords, dtype=np.float32, accept_sparse=False)

    if random_state != 0:
        adata.uns[key_uns]["params"]["random_state"] = random_state
    random_state = check_random_state(random_state)

    neigh_params = neighbors["params"]
    x = _choose_representation(
        adata,
        use_rep=neigh_params.get("use_rep", None),
        n_pcs=neigh_params.get("n_pcs", None),
        silent=True,
    )
    if method == "umap":
        # the data matrix X is really only used for determining the number of connected components
        # for the init condition in the UMAP embedding
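        # Default epoch count follows umap-learn's heuristic: 500 for graphs
        # with up to 10,000 samples, 200 for larger ones, unless `maxiter` is set.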
        default_epochs = 500 if neighbors["connectivities"].shape[0] <= 10000 else 200
        n_epochs = default_epochs if maxiter is None else maxiter
        x_umap, _ = simplicial_set_embedding(
            data=x,
            graph=neighbors["connectivities"].tocoo(),
            n_components=n_components,
            initial_alpha=alpha,
            a=a,
            b=b,
            gamma=gamma,
            negative_sample_rate=negative_sample_rate,
            n_epochs=n_epochs,
            init=init_coords,
            random_state=random_state,
            metric=neigh_params.get("metric", "euclidean"),
            metric_kwds=neigh_params.get("metric_kwds", {}),
            densmap=False,
            densmap_kwds={},
            output_dens=False,
            verbose=settings.verbosity > 3,
        )
    elif method == "rapids":
        msg = (
            "`method='rapids'` is deprecated. "
            "Use `rapids_singlecell.tl.louvain` instead."
        )
        warnings.warn(msg, FutureWarning, stacklevel=2)
        metric = neigh_params.get("metric", "euclidean")
        if metric != "euclidean":
            msg = (
                f"`sc.pp.neighbors` was called with `metric` {metric!r}, "
                "but umap `method` 'rapids' only supports the 'euclidean' metric."
            )
            raise ValueError(msg)
        from cuml import UMAP

        n_neighbors = neighbors["params"]["n_neighbors"]
        n_epochs = (
            500 if maxiter is None else maxiter
        )  # 0 is not a valid value for rapids, unlike original umap
        x_contiguous = np.ascontiguousarray(x, dtype=np.float32)
        umap = UMAP(
            n_neighbors=n_neighbors,
            n_components=n_components,
            n_epochs=n_epochs,
            learning_rate=alpha,
            init=init_pos,
            min_dist=min_dist,
            spread=spread,
            negative_sample_rate=negative_sample_rate,
            a=a,
            b=b,
            verbose=settings.verbosity > 3,
            random_state=random_state,
        )
        x_umap = umap.fit_transform(x_contiguous)
    adata.obsm[key_obsm] = x_umap  # annotate samples with UMAP coordinates
    logg.info(
        "    finished",
        time=start,
        deep=(
            "added\n"
            f"    {key_obsm!r}, UMAP coordinates (adata.obsm)\n"
            f"    {key_uns!r}, UMAP parameters (adata.uns)"
        ),
    )
    return adata if copy else None

valency_anndata.tools.leiden

leiden(
    adata: AnnData,
    resolution: float = 1,
    *,
    restrict_to: tuple[str, Sequence[str]] | None = None,
    random_state: _LegacyRandom = 0,
    key_added: str = "leiden",
    adjacency: CSBase | None = None,
    directed: bool | None = None,
    use_weights: bool = True,
    n_iterations: int = -1,
    partition_type: type[MutableVertexPartition]
    | None = None,
    neighbors_key: str | None = None,
    obsp: str | None = None,
    copy: bool = False,
    flavor: Literal["leidenalg", "igraph"] = "leidenalg",
    **clustering_args,
) -> AnnData | None

Cluster cells into subgroups [Traag et al., 2019].

Cluster cells using the Leiden algorithm [Traag et al., 2019], an improved version of the Louvain algorithm [Blondel et al., 2008]. It was proposed for single-cell analysis by [Levine et al., 2015].

This requires having run scanpy.pp.neighbors or scanpy.external.pp.bbknn first.

Parameters:

Name Type Description Default
adata AnnData

The annotated data matrix.

required
resolution float

A parameter value controlling the coarseness of the clustering. Higher values lead to more clusters. Set to None if overriding partition_type to one that doesn’t accept a resolution_parameter.

1
random_state _LegacyRandom

Change the initialization of the optimization.

0
restrict_to tuple[str, Sequence[str]] | None

Restrict the clustering to the categories within the key for sample annotation, tuple needs to contain (obs_key, list_of_categories).

None
key_added str

adata.obs key under which to add the cluster labels.

'leiden'
adjacency CSBase | None

Sparse adjacency matrix of the graph, defaults to neighbors connectivities.

None
directed bool | None

Whether to treat the graph as directed or undirected.

None
use_weights bool

If True, edge weights from the graph are used in the computation (placing more emphasis on stronger edges).

True
n_iterations int

How many iterations of the Leiden clustering algorithm to perform. Positive values above 2 define the total number of iterations to perform, -1 has the algorithm run until it reaches its optimal clustering. 2 is faster and the default for underlying packages.

-1
partition_type type[MutableVertexPartition] | None

Type of partition to use. Defaults to leidenalg.RBConfigurationVertexPartition. For the available options, consult the documentation for leidenalg.find_partition.

None
neighbors_key str | None

Use neighbors connectivities as adjacency. If not specified, leiden looks at .obsp['connectivities'] for connectivities (default storage place for pp.neighbors). If specified, leiden looks at .obsp[.uns[neighbors_key]['connectivities_key']] for connectivities.

None
obsp str | None

Use .obsp[obsp] as adjacency. You can't specify both obsp and neighbors_key at the same time.

None
copy bool

Whether to copy adata or modify it inplace.

False
flavor Literal['leidenalg', 'igraph']

Which package's implementation to use.

'leidenalg'
**clustering_args

Any further arguments to pass to leidenalg.find_partition (which in turn passes arguments to the partition_type) or igraph.Graph.community_leiden from igraph.

{}

Returns:

Type Description
Returns `None` if `copy=False`, else returns an `AnnData` object. Sets the following fields:
`adata.obs['leiden' | key_added]` : :class:`pandas.Series` (dtype ``category``)

Array of dim (number of samples) that stores the subgroup id ('0', '1', ...) for each cell.

`adata.uns['leiden' | key_added]['params']` : :class:`dict`

A dict with the values for the parameters resolution, random_state, and n_iterations.
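
A minimal usage sketch, not taken from the package's own examples: the required neighbors step is assumed to be run with scanpy.pp.neighbors, and the dataset and plotting helpers follow the recipe_polis conventions.

import scanpy as sc
import valency_anndata as val

adata = val.datasets.aufstehen()
val.tools.recipe_polis(adata)                  # provides the X_pca_polis representation
sc.pp.neighbors(adata, use_rep="X_pca_polis")  # Leiden clusters this kNN graph
val.tools.leiden(adata, resolution=1.0, flavor="igraph", n_iterations=2)
val.viz.embedding(adata, basis="pca_polis", color="leiden")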

Source code in .venv/lib/python3.10/site-packages/scanpy/tools/_leiden.py
def leiden(  # noqa: PLR0912, PLR0913, PLR0915
    adata: AnnData,
    resolution: float = 1,
    *,
    restrict_to: tuple[str, Sequence[str]] | None = None,
    random_state: _LegacyRandom = 0,
    key_added: str = "leiden",
    adjacency: CSBase | None = None,
    directed: bool | None = None,
    use_weights: bool = True,
    n_iterations: int = -1,
    partition_type: type[MutableVertexPartition] | None = None,
    neighbors_key: str | None = None,
    obsp: str | None = None,
    copy: bool = False,
    flavor: Literal["leidenalg", "igraph"] = "leidenalg",
    **clustering_args,
) -> AnnData | None:
    """Cluster cells into subgroups :cite:p:`Traag2019`.

    Cluster cells using the Leiden algorithm :cite:p:`Traag2019`,
    an improved version of the Louvain algorithm :cite:p:`Blondel2008`.
    It was proposed for single-cell analysis by :cite:t:`Levine2015`.

    This requires having run :func:`~scanpy.pp.neighbors` or
    :func:`~scanpy.external.pp.bbknn` first.

    Parameters
    ----------
    adata
        The annotated data matrix.
    resolution
        A parameter value controlling the coarseness of the clustering.
        Higher values lead to more clusters.
        Set to `None` if overriding `partition_type`
        to one that doesn’t accept a `resolution_parameter`.
    random_state
        Change the initialization of the optimization.
    restrict_to
        Restrict the clustering to the categories within the key for sample
        annotation, tuple needs to contain `(obs_key, list_of_categories)`.
    key_added
        `adata.obs` key under which to add the cluster labels.
    adjacency
        Sparse adjacency matrix of the graph, defaults to neighbors connectivities.
    directed
        Whether to treat the graph as directed or undirected.
    use_weights
        If `True`, edge weights from the graph are used in the computation
        (placing more emphasis on stronger edges).
    n_iterations
        How many iterations of the Leiden clustering algorithm to perform.
        Positive values above 2 define the total number of iterations to perform,
        -1 has the algorithm run until it reaches its optimal clustering.
        2 is faster and the default for underlying packages.
    partition_type
        Type of partition to use.
        Defaults to :class:`~leidenalg.RBConfigurationVertexPartition`.
        For the available options, consult the documentation for
        :func:`~leidenalg.find_partition`.
    neighbors_key
        Use neighbors connectivities as adjacency.
        If not specified, leiden looks at .obsp['connectivities'] for connectivities
        (default storage place for pp.neighbors).
        If specified, leiden looks at
        .obsp[.uns[neighbors_key]['connectivities_key']] for connectivities.
    obsp
        Use .obsp[obsp] as adjacency. You can't specify both
        `obsp` and `neighbors_key` at the same time.
    copy
        Whether to copy `adata` or modify it inplace.
    flavor
        Which package's implementation to use.
    **clustering_args
        Any further arguments to pass to :func:`~leidenalg.find_partition` (which in turn passes arguments to the `partition_type`)
        or :meth:`igraph.Graph.community_leiden` from `igraph`.

    Returns
    -------
    Returns `None` if `copy=False`, else returns an `AnnData` object. Sets the following fields:

    `adata.obs['leiden' | key_added]` : :class:`pandas.Series` (dtype ``category``)
        Array of dim (number of samples) that stores the subgroup id
        (``'0'``, ``'1'``, ...) for each cell.

    `adata.uns['leiden' | key_added]['params']` : :class:`dict`
        A dict with the values for the parameters `resolution`, `random_state`,
        and `n_iterations`.

    """
    if flavor not in {"igraph", "leidenalg"}:
        msg = (
            f"flavor must be either 'igraph' or 'leidenalg', but {flavor!r} was passed"
        )
        raise ValueError(msg)
    _utils.ensure_igraph()
    if flavor == "igraph":
        if directed:
            msg = "Cannot use igraph’s leiden implementation with a directed graph."
            raise ValueError(msg)
        if partition_type is not None:
            msg = "Do not pass in partition_type argument when using igraph."
            raise ValueError(msg)
    else:
        try:
            import leidenalg

            msg = 'In the future, the default backend for leiden will be igraph instead of leidenalg.\n\n To achieve the future defaults please pass: flavor="igraph" and n_iterations=2.  directed must also be False to work with igraph\'s implementation.'
            _utils.warn_once(msg, FutureWarning, stacklevel=3)
        except ImportError as e:
            msg = "Please install the leiden algorithm: `conda install -c conda-forge leidenalg` or `pip3 install leidenalg`."
            raise ImportError(msg) from e
    clustering_args = dict(clustering_args)

    start = logg.info("running Leiden clustering")
    adata = adata.copy() if copy else adata
    # are we clustering a user-provided graph or the default AnnData one?
    if adjacency is None:
        adjacency = _utils._choose_graph(adata, obsp, neighbors_key)
    if restrict_to is not None:
        restrict_key, restrict_categories = restrict_to
        adjacency, restrict_indices = restrict_adjacency(
            adata,
            restrict_key,
            restrict_categories=restrict_categories,
            adjacency=adjacency,
        )
    # Prepare find_partition arguments as a dictionary,
    # appending to whatever the user provided. It needs to be this way
    # as this allows for the accounting of a None resolution
    # (in the case of a partition variant that doesn't take it on input)
    clustering_args["n_iterations"] = n_iterations
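    # The two backends spell the resolution keyword differently:
    # leidenalg expects `resolution_parameter`, igraph expects `resolution`.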
    if flavor == "leidenalg":
        if resolution is not None:
            clustering_args["resolution_parameter"] = resolution
        directed = True if directed is None else directed
        g = _utils.get_igraph_from_adjacency(adjacency, directed=directed)
        if partition_type is None:
            partition_type = leidenalg.RBConfigurationVertexPartition
        if use_weights:
            clustering_args["weights"] = np.array(g.es["weight"]).astype(np.float64)
        clustering_args["seed"] = random_state
        part = leidenalg.find_partition(g, partition_type, **clustering_args)
    else:
        g = _utils.get_igraph_from_adjacency(adjacency, directed=False)
        if use_weights:
            clustering_args["weights"] = "weight"
        if resolution is not None:
            clustering_args["resolution"] = resolution
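        # igraph's community_leiden supports several objectives; default to
        # modularity unless the caller overrides it via **clustering_args.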
        clustering_args.setdefault("objective_function", "modularity")
        with set_igraph_random_state(random_state):
            part = g.community_leiden(**clustering_args)
    # store output into adata.obs
    groups = np.array(part.membership)
    if restrict_to is not None:
        if key_added == "leiden":
            key_added += "_R"
        groups = rename_groups(
            adata,
            key_added=key_added,
            restrict_key=restrict_key,
            restrict_categories=restrict_categories,
            restrict_indices=restrict_indices,
            groups=groups,
        )
    adata.obs[key_added] = pd.Categorical(
        values=groups.astype("U"),
        categories=natsorted(map(str, np.unique(groups))),
    )
    # store information on the clustering parameters
    adata.uns[key_added] = {}
    adata.uns[key_added]["params"] = dict(
        resolution=resolution,
        random_state=random_state,
        n_iterations=n_iterations,
    )
    logg.info(
        "    finished",
        time=start,
        deep=(
            f"found {len(np.unique(groups))} clusters and added\n"
            f"    {key_added!r}, the cluster labels (adata.obs, categorical)"
        ),
    )
    return adata if copy else None