# Preprocessing

## valency-anndata methods

### valency_anndata.preprocessing.rebuild_vote_matrix
```python
rebuild_vote_matrix(
    data: AnnData,
    trim_rule: int | float | str | datetime = 1.0,
    time_col: str = "timestamp",
    inplace: bool = True,
) -> Optional[AnnData]
```
Rebuild a vote matrix from votes stored in `adata.uns['votes']`.

- Trims votes by time according to `trim_rule`.
- Deduplicates votes by keeping the last vote per voter-comment pair.
- Returns a new AnnData with `.obs` = voters, `.var` = comments, `.X` = vote values.
- Preserves existing `uns`, `obsm`, and `layers`.
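The trim-and-deduplicate behavior can be sketched with pandas. This is an illustrative sketch, not the library's implementation; the long-format column names (`voter`, `comment`, `vote`, `timestamp`) are assumptions for the example.

```python
import pandas as pd

# Hypothetical long-format vote log, as might be stored in adata.uns['votes'].
votes = pd.DataFrame({
    "voter": ["a", "a", "b", "b"],
    "comment": ["c1", "c1", "c1", "c2"],
    "vote": [1, -1, 1, 0],
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-03"]
    ),
})

# Keep only the last vote per voter-comment pair (the dedup rule above).
latest = (
    votes.sort_values("timestamp")
    .drop_duplicates(["voter", "comment"], keep="last")
)

# Pivot into a voters-by-comments matrix, analogous to the rebuilt .X;
# pairs with no vote become NaN.
matrix = latest.pivot(index="voter", columns="comment", values="vote")
```

Voter `a` revised their vote on `c1`, so only the later `-1` survives.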
Source code in src/valency_anndata/preprocessing/_rebuild_vote_matrix.py
### valency_anndata.preprocessing.calculate_qc_metrics
```python
calculate_qc_metrics(
    adata: AnnData, *, inplace: bool = False
) -> Optional[Tuple[DataFrame, DataFrame]]
```
Compute participant- and statement-level metrics using `describe_obs` and `describe_var`.
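As a sketch of the kind of per-participant and per-statement metrics involved (illustrative only; `describe_obs` and `describe_var` define the real columns), using numpy:

```python
import numpy as np

# Toy vote matrix: rows = participants, columns = statements;
# 1 = agree, -1 = disagree, 0 = pass, NaN = not seen.
X = np.array([
    [1.0, -1.0, np.nan],
    [1.0, 0.0, 1.0],
    [np.nan, 1.0, -1.0],
])

seen = ~np.isnan(X)

# Participant-level: how many statements each participant voted on.
n_votes_per_participant = seen.sum(axis=1)

# Statement-level: coverage and mean vote among those who saw it.
coverage = seen.sum(axis=0)
mean_vote = np.nanmean(X, axis=0)
```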
Source code in src/valency_anndata/preprocessing/_qc.py
### valency_anndata.preprocessing.impute
```python
impute(
    adata: AnnData,
    *,
    strategy: Literal["zero", "mean", "median"] = "mean",
    source_layer: Optional[str] = None,
    target_layer: Optional[str] = None,
    overwrite: bool = False,
) -> None
```
Impute NaN values in an AnnData matrix and store the result in a layer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `adata` | `AnnData` | AnnData object. | required |
| `strategy` | `Literal['zero', 'mean', 'median']` | Imputation strategy. `"zero"` replaces NaNs with 0; `"mean"` uses the column-wise mean; `"median"` uses the column-wise median. | `'mean'` |
| `source_layer` | `Optional[str]` | Layer to read from. If `None`, uses `adata.X`. | `None` |
| `target_layer` | `Optional[str]` | Layer to write to. If `None`, defaults to `"X_imputed_"` followed by the strategy name. | `None` |
| `overwrite` | `bool` | Whether to overwrite an existing target layer. | `False` |
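A minimal numpy sketch of the `"mean"` strategy (illustrative only, not the library's implementation):

```python
import numpy as np

# Column-wise mean imputation: each NaN is replaced by the mean
# of the non-NaN values in its column (statement).
X = np.array([
    [1.0, np.nan],
    [3.0, 2.0],
    [np.nan, 4.0],
])

col_means = np.nanmean(X, axis=0)          # per-statement means
filled = np.where(np.isnan(X), col_means, X)
```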
Source code in src/valency_anndata/preprocessing/_impute.py
### valency_anndata.preprocessing.highly_variable_statements
```python
highly_variable_statements(
    adata: AnnData,
    *,
    layer: str | None = None,
    n_bins: int | None = 1,
    min_disp: float | None = None,
    max_disp: float | None = None,
    min_cov: int | None = 2,
    max_cov: int | None = None,
    n_top_statements: int | None = None,
    subset: bool = False,
    inplace: bool = True,
    key_added: str = "highly_variable",
    variance_mode: str = "overall",
    bin_by: str = "coverage",
)
```
Identify highly variable statements in a vote matrix (AnnData).
Analogous to `scanpy.pp.highly_variable_genes` for single-cell data, this function identifies statements with high variability across participants. It computes various dispersion metrics, normalizes them within bins, and marks statements as highly variable based on user-defined criteria.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `adata` | `AnnData` | AnnData object containing the vote matrix. | required |
| `layer` | `str \| None` | Layer to use for computation. If `None`, uses `adata.X`. | `None` |
| `n_bins` | `int \| None` | Number of bins for dispersion normalization. Values <= 1 or `None` disable binning. | `1` |
| `min_disp` | `float \| None` | Minimum normalized dispersion threshold for selecting highly variable statements. Ignored if `n_top_statements` is set. | `None` |
| `max_disp` | `float \| None` | Maximum normalized dispersion threshold for selecting highly variable statements. Ignored if `n_top_statements` is set. | `None` |
| `min_cov` | `int \| None` | Minimum coverage (number of non-NaN votes) required for a statement. | `2` |
| `max_cov` | `int \| None` | Maximum coverage threshold for selecting highly variable statements. Ignored if `n_top_statements` is set. | `None` |
| `n_top_statements` | `int \| None` | Select this many top statements by normalized dispersion. If provided, overrides the `min_disp`/`max_disp` cutoffs. | `None` |
| `subset` | `bool` | If `True`, subset the AnnData object to highly variable statements. | `False` |
| `inplace` | `bool` | If `True`, add results to `adata.var`; otherwise return them as a DataFrame. | `True` |
| `key_added` | `str` | Key under which to store the highly variable boolean mask in `adata.var`. | `'highly_variable'` |
| `variance_mode` | `str` | Which variance metric to use for computing dispersion: `"overall"` (variance of raw votes, with NaN as missing), `"valence"` (variance of engaged votes only, excluding passes/NaN), or `"engagement"` (variance of engagement: 1 if ±1, 0 if pass). | `'overall'` |
| `bin_by` | `str` | Variable to bin on for normalization: `"coverage"` (number of non-NaN votes), `"p_engaged"` (proportion of engaged votes, ±1), `"mean_valence"` (average valence of engaged votes), or `"mean_abs_valence"` (absolute value of the mean valence). | `'coverage'` |

Returns:

| Type | Description |
|---|---|
| `DataFrame \| None` | If `inplace=False`, a DataFrame with the computed metrics; otherwise `None`, with results written to `adata.var`. |
Examples:
Select top 50 most variable statements:
```python
import valency_anndata as val

adata = val.datasets.aufstehen()
val.preprocessing.highly_variable_statements(adata, n_top_statements=50)
```
Use normalized dispersion thresholds with binning:
```python
val.preprocessing.highly_variable_statements(
    adata,
    n_bins=10,
    min_disp=0.5,
    min_cov=5,
    bin_by="coverage",
)
```
Focus on valence variance instead of overall variance:
```python
val.preprocessing.highly_variable_statements(
    adata,
    n_top_statements=100,
    variance_mode="valence",
)
```
Run multiple times with different settings using key_added:
```python
# Identify top 50 statements
val.preprocessing.highly_variable_statements(
    adata,
    n_top_statements=50,
    key_added="highly_variable_top50",
)

# Also identify top 100 statements
val.preprocessing.highly_variable_statements(
    adata,
    n_top_statements=100,
    key_added="highly_variable_top100",
)

# Now you can use either mask with recipe_polis
val.tools.recipe_polis(adata, mask_var="highly_variable_top50")
```
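The core idea behind dispersion-based selection can be sketched with numpy. This assumes a plain per-statement variance and no binning, so it illustrates the idea rather than the function's exact metric:

```python
import numpy as np

# Toy vote matrix: rows = participants, columns = statements.
X = np.array([
    [1.0, 1.0, -1.0],
    [1.0, -1.0, 1.0],
    [1.0, 1.0, -1.0],
    [1.0, -1.0, 1.0],
])

disp = np.nanvar(X, axis=0)          # per-statement dispersion
k = 2
top = np.argsort(disp)[::-1][:k]     # indices of the k most variable

# Boolean mask analogous to the one stored under key_added.
highly_variable = np.zeros(X.shape[1], dtype=bool)
highly_variable[top] = True
```

The unanimous first statement has zero variance, so it is never selected; the two contested statements are.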
Source code in src/valency_anndata/preprocessing/_highly_variable_statements.py
## scanpy methods (inherited)

> **Note:** These methods are simply quick convenience wrappers around methods in scanpy, a tool for single-cell gene expression analysis. They use terms like "cells", "genes", and "counts", but you can read these as "participants", "statements", and "votes".
> See `scanpy.pp` for more methods you can experiment with via the `val.scanpy.pp` namespace.
### valency_anndata.preprocessing.neighbors
```python
neighbors(
    adata: AnnData,
    n_neighbors: int = 15,
    n_pcs: int | None = None,
    *,
    use_rep: str | None = None,
    knn: bool = True,
    method: _Method = "umap",
    transformer: KnnTransformerLike | _KnownTransformer | None = None,
    metric: _Metric | _MetricFn = "euclidean",
    metric_kwds: Mapping[str, Any] = MappingProxyType({}),
    random_state: _LegacyRandom = 0,
    key_added: str | None = None,
    copy: bool = False,
) -> AnnData | None
```
Compute the nearest neighbors distance matrix and a neighborhood graph of observations :cite:p:`McInnes2018`.

The efficiency of the neighbor search relies heavily on UMAP :cite:p:`McInnes2018`, which also provides a method for estimating connectivities of data points, i.e. the connectivity of the manifold (`method=='umap'`). If `method=='gauss'`, connectivities are computed according to :cite:t:`Coifman2005`, in the adaptation of :cite:t:`Haghverdi2016`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `adata` | `AnnData` | Annotated data matrix. | required |
| `n_neighbors` | `int` | The size of the local neighborhood (in terms of number of neighboring data points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100. Ignored if `transformer` is an instance. | `15` |
| `knn` | `bool` | If `True`, use a hard threshold to restrict the number of neighbors to `n_neighbors`, that is, consider a knn graph. Otherwise, use a Gaussian kernel to assign low weights to neighbors more distant than the `n_neighbors` nearest neighbor. | `True` |
| `method` | `_Method` | Use `'umap'` :cite:p:`McInnes2018` or `'gauss'` (Gauss kernel following :cite:t:`Coifman2005` with adaptive width :cite:t:`Haghverdi2016`) for computing connectivities. | `'umap'` |
| `transformer` | `KnnTransformerLike \| _KnownTransformer \| None` | Approximate kNN search implementation following the API of :class:`~sklearn.neighbors.KNeighborsTransformer`. | `None` |
| `metric` | `_Metric \| _MetricFn` | A known metric's name or a callable that returns a distance. Ignored if `transformer` is an instance. | `'euclidean'` |
| `metric_kwds` | `Mapping[str, Any]` | Options for the metric. Ignored if `transformer` is an instance. | `MappingProxyType({})` |
| `random_state` | `_LegacyRandom` | A numpy random seed. Ignored if `transformer` is an instance. | `0` |
| `key_added` | `str \| None` | If not specified, the neighbors data is stored in `.uns['neighbors']`, with distances and connectivities in `.obsp['distances']` and `.obsp['connectivities']`. If specified, the neighbors data is added to `.uns[key_added]`, distances to `.obsp[key_added+'_distances']`, and connectivities to `.obsp[key_added+'_connectivities']`. | `None` |
| `copy` | `bool` | Return a copy instead of writing to `adata`. | `False` |
Returns:

Returns `None` if `copy=False`, else returns an `AnnData` object. Sets the following fields:

- `adata.obsp['distances' | key_added+'_distances']` (:class:`scipy.sparse.csr_matrix`, dtype `float`): Distance matrix of the nearest neighbors search. Each row (cell) holds the distances to its nearest neighbors.
- `adata.obsp['connectivities' | key_added+'_connectivities']` (:class:`scipy.sparse.csr_matrix`, dtype `float`): Weighted adjacency matrix of the neighborhood graph of data points. Weights should be interpreted as connectivities.
- `adata.uns['neighbors' | key_added]` (:class:`dict`): Neighbors parameters.
Examples:
```python
>>> import scanpy as sc
>>> adata = sc.datasets.pbmc68k_reduced()
>>> # Basic usage
>>> sc.pp.neighbors(adata, 20, metric="cosine")
>>> # Provide your own transformer for more control and flexibility
>>> from sklearn.neighbors import KNeighborsTransformer
>>> transformer = KNeighborsTransformer(
...     n_neighbors=10, metric="manhattan", algorithm="kd_tree"
... )
>>> sc.pp.neighbors(adata, transformer=transformer)
>>> # now you can e.g. access the index: `transformer._tree`
```
See Also
:doc:/how-to/knn-transformers
Source code in .venv/lib/python3.10/site-packages/scanpy/neighbors/__init__.py