Datasets

Reference Datasets

These datasets are provided as a starting point for exploration and experimentation.

valency_anndata.datasets.aufstehen

aufstehen(translate_to: Optional[str] = None)

Polis conversation of 33k+ Germans, run by political party Aufstehen.

As of fall 2018, this was the largest Polis conversation ever run.

See: https://compdemocracy.org/Case-studies/2018-germany-aufstehen/

The data is pulled from an archive at: https://huggingface.co/datasets/patcon/polis-aufstehen-2018

Note

This dataset has been augmented by merging is-meta and is-seed statement data (missing from the official CSV export) that were retrieved from the Polis API. Specifically, is-meta is required in order to reproduce outputs of the Polis data pipeline.

Attribution

Data was gathered using the Polis software (see: https://compdemocracy.org/polis and https://github.com/compdemocracy/polis) and is sub-licensed under CC BY 4.0 with Attribution to The Computational Democracy Project. The data and more information about how the data was collected can be found at the following link: https://pol.is/report/r6xd526vyjyjrj9navxrj
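
A minimal usage sketch, assuming the package is importable as `valency_anndata` (aliased `val`, as in the examples elsewhere on this page) and that network access to the Hugging Face archive is available. The helper name `load_aufstehen_in_english` is hypothetical; the import is deferred so that nothing is downloaded until the function is called.

```python
def load_aufstehen_in_english():
    """Download the Aufstehen archive and translate statements to English."""
    # Assumption: the package is installed and importable under this name.
    import valency_anndata as val

    # translate_to is optional; omit it (or pass None) to keep the
    # original statement text.
    adata = val.datasets.aufstehen(translate_to="en")

    # With the default build_X=True used by polis.load, rows are
    # participants and columns are statements.
    print(adata.shape)
    return adata
```

The same pattern should apply to `chile_protest`, which takes the same `translate_to` argument.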

Source code in src/valency_anndata/datasets/_load_aufstehen.py

def aufstehen(
    translate_to: Optional[str] = None,
):
    """
    Polis conversation of 33k+ Germans, run by political party Aufstehen.

    As of fall 2018, this was the largest Polis conversation ever run.

    See: <https://compdemocracy.org/Case-studies/2018-germany-aufstehen/>

    The data is pulled from an archive at:
    <https://huggingface.co/datasets/patcon/polis-aufstehen-2018>

    Note
    ----

    This dataset has been augmented by merging `is-meta` and `is-seed` statement
    data (missing from the official CSV export) that were retrieved from the
    Polis API. Specifically, `is-meta` is required in order to reproduce outputs
    of the Polis data pipeline.

    Attribution
    -----------

    Data was gathered using the Polis software (see:
    <https://compdemocracy.org/polis> and
    <https://github.com/compdemocracy/polis>) and is sub-licensed under CC BY
    4.0 with Attribution to The Computational Democracy Project. The data and
    more information about how the data was collected can be found at the
    following link: <https://pol.is/report/r6xd526vyjyjrj9navxrj>
    """
    export_dir = snapshot_download(
        repo_id="patcon/polis-aufstehen-2018",
        repo_type="dataset",
        # Suppress HF_TOKEN warning.
        token=False,
    )
    adata = val.datasets.polis.load(source=export_dir, translate_to=translate_to)

    return adata

valency_anndata.datasets.chile_protest

chile_protest(translate_to: Optional[str] = None)

Polis conversation of 2,700+ Chileans during the 2019 #ChileDesperto protests.

It was run informally by a single citizen, with minimal support infrastructure, outreach strategy, or moderation process.

See: https://en.wikipedia.org/wiki/Social_Outburst_(Chile)

Note

This dataset has been augmented by merging is-meta and is-seed statement data (missing from the official CSV export) that were retrieved from the Polis API. Specifically, is-meta is required in order to reproduce outputs of the Polis data pipeline.

Attribution

Data was gathered using the Polis software (see: https://compdemocracy.org/polis and https://github.com/compdemocracy/polis) and is sub-licensed under CC BY 4.0 with Attribution to The Computational Democracy Project. The data and more information about how the data was collected can be found at the following link: https://pol.is/report/r29kkytnipymd3exbynkd

Source code in src/valency_anndata/datasets/_load_chile_protest.py

def chile_protest(
    translate_to: Optional[str] = None,
):
    """
    Polis conversation of 2,700+ Chileans during the 2019 #ChileDesperto protests.

    It was run informally by a single citizen, with minimal support
    infrastructure, outreach strategy, or moderation process.

    See: <https://en.wikipedia.org/wiki/Social_Outburst_(Chile)>

    Note
    ----

    This dataset has been augmented by merging `is-meta` and `is-seed` statement
    data (missing from the official CSV export) that were retrieved from the
    Polis API. Specifically, `is-meta` is required in order to reproduce outputs
    of the Polis data pipeline.

    Attribution
    -----------

    Data was gathered using the Polis software (see:
    <https://compdemocracy.org/polis> and
    <https://github.com/compdemocracy/polis>) and is sub-licensed under CC BY
    4.0 with Attribution to The Computational Democracy Project. The data and
    more information about how the data was collected can be found at the
    following link: <https://pol.is/report/r29kkytnipymd3exbynkd>
    """
    adata = val.datasets.polis.load("https://pol.is/report/r29kkytnipymd3exbynkd", translate_to=translate_to)

    return adata

Polis

valency_anndata.datasets.polis.load

load(
    source: str,
    *,
    translate_to: Optional[str] = None,
    build_X: bool = True,
) -> AnnData

Load a Polis conversation or report into an AnnData object.

This function accepts either a URL or an ID for a Polis conversation or report, fetches raw vote events and statements via the Polis API or CSV export, and optionally constructs a participant × statement vote matrix in adata.X.

Parameters:

Name	Type	Description	Default
`source`	`str`	The Polis source to load. Supported formats: a full report URL (`https://pol.is/report/<report_id>`); a conversation URL (`https://pol.is/<conversation_id>`); custom host URLs (`https://<host>/report/<report_id>` or `https://<host>/<conversation_id>`); bare IDs, either a conversation ID (starts with a digit, e.g., `4asymkcrjf`) or a report ID (starts with 'r', e.g., `r4zdxrdscmukmkakmbz3k`); or a local directory containing the CSV exports votes.csv and comments.csv. The function automatically parses the source to determine whether it refers to a conversation or a report and fetches the appropriate data.	required
`translate_to`	`str or None`	Target language code (e.g., "en", "fr", "es") for translating statement text. If provided, the original statement text in `adata.uns["statements"]["comment-body"]` is translated and stored in `adata.var["content"]`. The `adata.var["language_current"]` field is updated to the target language, and `adata.var["is_translated"]` is set to True. Defaults to None (no translation).	`None`
`build_X`	`bool`	If True, constructs a participant × statement vote matrix from the raw votes using `rebuild_vote_matrix()`. This populates `adata.obs`, `adata.var`, and `adata.X` (with a copy in `adata.layers['raw_sparse']`). After the first build, a snapshot of this initial matrix is stored in `adata.raw`.	`True`
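
The formats accepted by `source` can be told apart with a few string checks. The following is a rough sketch of such a classifier, not the library's own parser: the function name `classify_polis_source` and the shape of the returned dict are assumptions made for illustration.

```python
import os
from urllib.parse import urlparse


def classify_polis_source(source: str) -> dict:
    """Classify a source string into one of the documented formats.

    Hypothetical helper for illustration; not the library's parser.
    """
    # Local directory of CSV exports.
    if os.path.isdir(source):
        return {"kind": "directory", "path": source}

    # URLs: https://<host>/report/<report_id> or https://<host>/<conversation_id>
    if source.startswith(("http://", "https://")):
        parsed = urlparse(source)
        parts = [p for p in parsed.path.split("/") if p]
        base_url = f"{parsed.scheme}://{parsed.netloc}"
        if len(parts) == 2 and parts[0] == "report":
            return {"kind": "report", "id": parts[1], "base_url": base_url}
        if len(parts) == 1:
            return {"kind": "conversation", "id": parts[0], "base_url": base_url}
        raise ValueError(f"Unrecognized Polis URL: {source}")

    # Bare IDs: report IDs start with 'r', conversation IDs with a digit.
    if source[:1] == "r":
        return {"kind": "report", "id": source, "base_url": None}
    if source[:1].isdigit():
        return {"kind": "conversation", "id": source, "base_url": None}

    raise ValueError(f"Unrecognized Polis source: {source}")


print(classify_polis_source("https://polis.tw/6rphtwwfn4"))
```

Checking for a local directory first keeps a folder whose name happens to start with `r` or a digit from being mistaken for a bare ID.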

Returns:

Name	Type	Description
`adata`	`AnnData`	An AnnData object containing the loaded Polis data.
	`DataFrame`	`adata.uns["votes"]` Raw vote events fetched from the API or CSV export.
	`dict`	`adata.uns["votes_meta"]` Metadata about the sources of votes, e.g., API vs CSV.
	`DataFrame`	`adata.uns["statements"]` Raw statements/comments for the conversation.
	`dict`	`adata.uns["statements_meta"]` Metadata about the statements source.
	`dict`	`adata.uns["source"]` Basic information about the Polis source (base URL, conversation ID, report ID).
	`dict`	`adata.uns["schema"]` High-level description of `X` and `votes`.
	`ndarray`	`adata.X` (if `build_X=True`) Participant × statement vote matrix (rows = participants, columns = statements).
	`DataFrame`	`adata.obs` (if `build_X=True`) Participant metadata (index = voter IDs).
	`DataFrame`	`adata.var` (if `build_X=True`) Statement metadata (index = statement IDs).
	`AnnData`	`adata.raw` (if `build_X=True`) Snapshot of the first vote matrix and associated metadata. This allows downstream filtering or processing without losing the original vote matrix.

Notes

- If build_X=False, only adata.uns will be populated, containing the raw votes and statements, and .X, .obs, .var, and .raw will remain empty.
- adata.raw is assigned only after the first vote matrix build and is intended to be immutable.
- If translate_to is provided, adata.var["content"] is updated with translated text and adata.var["language_current"] is set to the target language.
- The vote matrix is derived from the most recent votes per participant per statement, sorted by timestamp.
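
The "most recent vote wins" rule in the last note can be illustrated in plain Python. This sketch is not `rebuild_vote_matrix()` itself: the helper name `build_vote_matrix`, the event keys (`voter_id`, `statement_id`, `vote`, `timestamp`), and the 1 / -1 / 0 vote encoding are all assumptions chosen for the example.

```python
import math


def build_vote_matrix(vote_events):
    """Build a participant x statement matrix keeping each pair's latest vote.

    Hypothetical helper; `vote_events` is a list of dicts with the assumed
    keys "voter_id", "statement_id", "vote", and "timestamp".
    """
    # Sort by timestamp so later votes overwrite earlier ones.
    latest = {}
    for event in sorted(vote_events, key=lambda e: e["timestamp"]):
        latest[(event["voter_id"], event["statement_id"])] = event["vote"]

    voters = sorted({voter for voter, _ in latest})
    statements = sorted({statement for _, statement in latest})

    # NaN marks pairs where the participant never voted on the statement.
    matrix = [
        [latest.get((voter, statement), math.nan) for statement in statements]
        for voter in voters
    ]
    return voters, statements, matrix


events = [
    {"voter_id": 0, "statement_id": 0, "vote": 1, "timestamp": 100},
    {"voter_id": 0, "statement_id": 0, "vote": -1, "timestamp": 200},  # changed vote
    {"voter_id": 0, "statement_id": 1, "vote": 0, "timestamp": 150},
    {"voter_id": 1, "statement_id": 1, "vote": 1, "timestamp": 120},
]
voters, statements, matrix = build_vote_matrix(events)
print(matrix)  # participant 0's first vote on statement 0 is superseded
```

Rows follow the sorted participant IDs and columns the sorted statement IDs, matching the participant × statement orientation described for `adata.X`.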

Examples:

Load data from a report or conversation ID or URL.

adata = val.datasets.polis.load("https://pol.is/report/r2dfw8eambusb8buvecjt")
adata = val.datasets.polis.load("6rphtwwfn4")

Load data from an alternative Polis instance via URL.

adata = val.datasets.polis.load("https://polis.tw/6rphtwwfn4")

Load data from a path containing Polis CSV export files.

$ ls exports/my_conversation_2024_11_03
comments.csv votes.csv summary.csv ...

adata = val.datasets.polis.load("./exports/my_conversation_2024_11_03")

Source code in src/valency_anndata/datasets/polis.py

def load(source: str, *, translate_to: Optional[str] = None, build_X: bool = True) -> AnnData:
    """
    Load a Polis conversation or report into an AnnData object.

    This function accepts either a URL or an ID for a Polis conversation or report,
    fetches raw vote events and statements via the Polis API or CSV export, and
    optionally constructs a participant × statement vote matrix in `adata.X`.

    Parameters
    ----------
    source : str
        The Polis source to load. Supported formats include:

        - Full report URL: `https://pol.is/report/<report_id>`
        - Conversation URL: `https://pol.is/<conversation_id>`
        - Custom host URLs: `https://<host>/report/<report_id>` or `https://<host>/<conversation_id>`
        - Bare IDs:
            - Conversation ID (starts with a digit), e.g., `4asymkcrjf`
            - Report ID (starts with 'r'), e.g., `r4zdxrdscmukmkakmbz3k`
        - Local directory containing CSV exports:
            - *votes.csv
            - *comments.csv

        The function will automatically parse the source to determine whether
        it refers to a conversation or report and fetch the appropriate data.


    translate_to : str or None, optional
        Target language code (e.g., "en", "fr", "es") for translating statement text.
        If provided, the original statement text in `adata.uns["statements"]["comment-body"]`
        is translated and stored in `adata.var["content"]`. The `adata.var["language_current"]`
        field is updated to the target language, and `adata.var["is_translated"]` is set to True.
        Defaults to None (no translation).

    build_X : bool, default True
        If True, constructs a participant × statement vote matrix from the raw
        votes using `rebuild_vote_matrix()`. This populates `adata.obs`,
        `adata.var`, and `adata.X` (with a copy in
        `adata.layers['raw_sparse']`). After the first build, a snapshot of this
        initial matrix is stored in `adata.raw`.

    Returns
    -------
    adata : anndata.AnnData
        An AnnData object containing the loaded Polis data.


    pd.DataFrame
        `adata.uns["votes"]`  
        Raw vote events fetched from the API or CSV export.
    dict
        `adata.uns["votes_meta"]`  
        Metadata about the sources of votes, e.g., API vs CSV.
    pd.DataFrame
        `adata.uns["statements"]`  
        Raw statements/comments for the conversation.
    dict
        `adata.uns["statements_meta"]`  
        Metadata about the statements source.
    dict
        `adata.uns["source"]`  
        Basic information about the Polis source (base URL, conversation ID, report ID).
    dict
        `adata.uns["schema"]`  
        High-level description of `X` and `votes`.
    np.ndarray
        `adata.X` (if `build_X=True`)  
        Participant × statement vote matrix (rows = participants, columns = statements).
    pd.DataFrame 
        `adata.obs` (if `build_X=True`)  
        Participant metadata (index = voter IDs).
    pd.DataFrame 
        `adata.var` (if `build_X=True`)  
        Statement metadata (index = statement IDs).
    anndata.AnnData 
        `adata.raw` (if `build_X=True`)  
        Snapshot of the first vote matrix and associated metadata. This allows
        downstream filtering or processing without losing the original vote matrix.

    Notes
    -----
    - If `build_X=False`, only `adata.uns` will be populated, containing the raw
      votes and statements, and `.X`, `.obs`, `.var`, and `.raw` will remain empty.
    - `adata.raw` is assigned only after the first vote matrix build and is intended
      to be immutable.
    - If `translate_to` is provided, `adata.var["content"]` is updated with translated
      text and `adata.var["language_current"]` is set to the target language.
    - The vote matrix is derived from the most recent votes per participant per statement,
      sorted by timestamp.

    Examples
    --------

    Load data from a report or conversation ID or URL.

    ```py
    adata = val.datasets.polis.load("https://pol.is/report/r2dfw8eambusb8buvecjt")
    adata = val.datasets.polis.load("6rphtwwfn4")
    ```

    Load data from an alternative Polis instance via URL.

    ```py
    adata = val.datasets.polis.load("https://polis.tw/6rphtwwfn4")
    ```

    Load data from a path containing Polis CSV export files.

    ```sh
    $ ls exports/my_conversation_2024_11_03
    comments.csv votes.csv summary.csv ...
    ```

    ```py
    adata = val.datasets.polis.load("./exports/my_conversation_2024_11_03")
    ```
    """
    adata = _load_raw_polis_data(source)

    if build_X:
        rebuild_vote_matrix(adata, trim_rule=1.0, inplace=True)
        adata.raw = adata.copy()
        # Store a copy in case we bring something else into X workspace later.
        adata.layers["raw_sparse"] = adata.X # type: ignore[arg-type]

    _populate_var_statements(adata, translate_to=translate_to)

    # if convo_meta.conversation_id:
    #     xids = client.get_xids(conversation_id=convo_meta.conversation_id)
    #     adata.uns["xids"] = pd.DataFrame(xids)

    return adata

valency_anndata.datasets.polis.translate_statements

translate_statements(
    adata: AnnData,
    translate_to: Optional[str],
    inplace: bool = True,
) -> Optional[list[str]]

Translate statements in adata.uns['statements']['comment-body'] into another language, or copy originals if translate_to is None.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	AnnData object containing `uns['statements']` and `var_names`.	required
`translate_to`	`Optional[str]`	Target language code (e.g., "en", "fr", "es").	required
`inplace`	`bool`	If True, updates `adata.var['content']`, `adata.var['language_current']`, and `adata.var['is_translated']`. If False, returns a list of translated strings without modifying `adata`.	`True`

Returns:

Name	Type	Description
`translated_texts`	`list[str] | None`	List of translated texts if `inplace=False`, else None.
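
Before translating, the source listing for this function casts the statement index to strings and reindexes it against `adata.var_names`, so that texts line up with the matrix columns. A plain-Python sketch of that alignment step, using a dict in place of the DataFrame; the helper name `align_statement_texts` is hypothetical.

```python
def align_statement_texts(statements_by_id, var_names):
    """Return statement texts ordered to match `var_names`.

    Hypothetical helper mirroring the reindex step with plain dicts.
    """
    # Cast IDs to str so integer statement IDs match string column names.
    texts_by_str_id = {str(sid): text for sid, text in statements_by_id.items()}

    # IDs absent from the statements yield None, loosely standing in for
    # the missing value a pandas reindex would produce.
    return [texts_by_str_id.get(name) for name in var_names]


statements = {0: "First statement", 1: "Second statement", 2: "Third statement"}
aligned = align_statement_texts(statements, ["2", "0", "5"])
print(aligned)
```

The output order follows `var_names`, not the order in which statements were stored.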

Source code in src/valency_anndata/datasets/polis.py

def translate_statements(
    adata: AnnData,
    translate_to: Optional[str],
    inplace: bool = True
) -> Optional[list[str]]:
    """
    Translate statements in `adata.uns['statements']['comment-body']` into another language,
    or copy originals if translate_to is None.

    Parameters
    ----------
    adata : AnnData
        AnnData object containing `uns['statements']` and `var_names`.
    translate_to : Optional[str]
        Target language code (e.g., "en", "fr", "es").
    inplace : bool, default True
        If True, updates `adata.var['content']`, `adata.var['language_current']`,
        and `adata.var['is_translated']`.
        If False, returns a list of translated strings without modifying `adata`.

    Returns
    -------
    translated_texts : list[str] | None
        List of translated texts if `inplace=False`, else None.
    """
    statements_aligned = adata.uns["statements"].copy()
    statements_aligned.index = statements_aligned.index.astype(str)
    statements_aligned = statements_aligned.reindex(adata.var_names)

    original_texts = statements_aligned["comment-body"].tolist()

    # ───────────────────────────────────────────
    # NO-TRANSLATION PATH (explicit)
    # ───────────────────────────────────────────
    if translate_to is None:
        if inplace:
            adata.var["content"] = original_texts
            adata.var["language_current"] = adata.var["language_original"]
            adata.var["is_translated"] = False
            return None
        else:
            return original_texts


    # ───────────────────────────────────────────
    # TRANSLATION PATH
    # ───────────────────────────────────────────
    translated_texts = run_async(
        _translate_texts_async(original_texts, translate_to)
    )

    if inplace:
        adata.var["content"] = translated_texts
        adata.var["language_current"] = translate_to
        adata.var["is_translated"] = True
        return None
    else:
        return translated_texts