Datasets

Reference Datasets

These datasets are provided as a starting point for exploration and experimentation.

valency_anndata.datasets.aufstehen

aufstehen(translate_to: Optional[str] = None)

Polis conversation of 33k+ Germans, run by political party Aufstehen.

This is the largest Polis conversation run to date; it took place in fall 2018.

See: https://compdemocracy.org/Case-studies/2018-germany-aufstehen/

The data is pulled from an archive at: https://huggingface.co/datasets/patcon/polis-aufstehen-2018

Note

This dataset has been augmented by merging is-meta and is-seed statement data (missing from the official CSV export) that were retrieved from the Polis API. Specifically, is-meta is required in order to reproduce outputs of the Polis data pipeline.

Attribution

Data was gathered using the Polis software (see: https://compdemocracy.org/polis and https://github.com/compdemocracy/polis) and is sub-licensed under CC BY 4.0 with Attribution to The Computational Democracy Project. The data and more information about how the data was collected can be found at the following link: https://pol.is/report/r6xd526vyjyjrj9navxrj

Source code in src/valency_anndata/datasets/_load_aufstehen.py
def aufstehen(
    translate_to: Optional[str] = None,
):
    """
    Polis conversation of 33k+ Germans, run by political party Aufstehen.

    This is the largest Polis conversation run to date; it took place in fall 2018.

    See: <https://compdemocracy.org/Case-studies/2018-germany-aufstehen/>

    The data is pulled from an archive at:
    <https://huggingface.co/datasets/patcon/polis-aufstehen-2018>

    Note
    ----

    This dataset has been augmented by merging `is-meta` and `is-seed` statement
    data (missing from the official CSV export) that were retrieved from the
    Polis API. Specifically, `is-meta` is required in order to reproduce outputs
    of the Polis data pipeline.

    Attribution
    -----------

    Data was gathered using the Polis software (see:
    <https://compdemocracy.org/polis> and
    <https://github.com/compdemocracy/polis>) and is sub-licensed under CC BY
    4.0 with Attribution to The Computational Democracy Project. The data and
    more information about how the data was collected can be found at the
    following link: <https://pol.is/report/r6xd526vyjyjrj9navxrj>
    """
    export_dir = snapshot_download(
        repo_id="patcon/polis-aufstehen-2018",
        repo_type="dataset",
        # Suppress HF_TOKEN warning.
        token=False,
    )
    adata = val.datasets.polis.load(source=export_dir, translate_to=translate_to)

    return adata

valency_anndata.datasets.chile_protest

chile_protest(translate_to: Optional[str] = None)

Polis conversation of 2,700+ Chileans during the 2019 #ChileDesperto protests.

It was run informally by a single citizen, with minimal support infrastructure, outreach strategy, or moderation process.

See: https://en.wikipedia.org/wiki/Social_Outburst_(Chile)

Note

This dataset has been augmented by merging is-meta and is-seed statement data (missing from the official CSV export) that were retrieved from the Polis API. Specifically, is-meta is required in order to reproduce outputs of the Polis data pipeline.

Attribution

Data was gathered using the Polis software (see: https://compdemocracy.org/polis and https://github.com/compdemocracy/polis) and is sub-licensed under CC BY 4.0 with Attribution to The Computational Democracy Project. The data and more information about how the data was collected can be found at the following link: https://pol.is/report/r29kkytnipymd3exbynkd

Source code in src/valency_anndata/datasets/_load_chile_protest.py
def chile_protest(
    translate_to: Optional[str] = None,
):
    """
    Polis conversation of 2,700+ Chileans during the 2019 #ChileDesperto protests.

    It was run informally by a single citizen, with minimal support
    infrastructure, outreach strategy, or moderation process.

    See: <https://en.wikipedia.org/wiki/Social_Outburst_(Chile)>

    Note
    ----

    This dataset has been augmented by merging `is-meta` and `is-seed` statement
    data (missing from the official CSV export) that were retrieved from the
    Polis API. Specifically, `is-meta` is required in order to reproduce outputs
    of the Polis data pipeline.

    Attribution
    -----------

    Data was gathered using the Polis software (see:
    <https://compdemocracy.org/polis> and
    <https://github.com/compdemocracy/polis>) and is sub-licensed under CC BY
    4.0 with Attribution to The Computational Democracy Project. The data and
    more information about how the data was collected can be found at the
    following link: <https://pol.is/report/r29kkytnipymd3exbynkd>
    """
    adata = val.datasets.polis.load("https://pol.is/report/r29kkytnipymd3exbynkd", translate_to=translate_to)

    return adata

Polis

valency_anndata.datasets.polis.load

load(
    source: str,
    *,
    translate_to: Optional[str] = None,
    build_X: bool = True,
) -> AnnData

Load a Polis conversation or report into an AnnData object.

This function accepts either a URL or an ID for a Polis conversation or report, fetches raw vote events and statements via the Polis API or CSV export, and optionally constructs a participant × statement vote matrix in adata.X.

Parameters:

Name Type Description Default
source str

The Polis source to load. Supported formats include:

  • Full report URL: https://pol.is/report/<report_id>
  • Conversation URL: https://pol.is/<conversation_id>
  • Custom host URLs: https://<host>/report/<report_id> or https://<host>/<conversation_id>
  • Bare IDs:
    • Conversation ID (starts with a digit), e.g., 4asymkcrjf
    • Report ID (starts with 'r'), e.g., r4zdxrdscmukmkakmbz3k
  • Local directory containing CSV exports:
    • *votes.csv
    • *comments.csv

The function will automatically parse the source to determine whether it refers to a conversation or report and fetch the appropriate data.

required
translate_to str or None

Target language code (e.g., "en", "fr", "es") for translating statement text. If provided, the original statement text in adata.uns["statements"]["comment-body"] is translated and stored in adata.var["content"]. The adata.var["language_current"] field is updated to the target language, and adata.var["is_translated"] is set to True. Defaults to None (no translation).

None
build_X bool

If True, constructs a participant × statement vote matrix from the raw votes using rebuild_vote_matrix(). This populates adata.obs, adata.var, and adata.X (with a copy in adata.layers['raw_sparse']). After the first build, a snapshot of this initial matrix is stored in adata.raw.

True

Returns:

Name Type Description
adata AnnData

An AnnData object containing the loaded Polis data.

The returned adata carries the following fields:

  • adata.uns["votes"] (DataFrame): Raw vote events fetched from the API or CSV export.
  • adata.uns["votes_meta"] (dict): Metadata about the sources of votes, e.g., API vs CSV.
  • adata.uns["statements"] (DataFrame): Raw statements/comments for the conversation.
  • adata.uns["statements_meta"] (dict): Metadata about the statements source.
  • adata.uns["source"] (dict): Basic information about the Polis source (base URL, conversation ID, report ID).
  • adata.uns["schema"] (dict): High-level description of X and votes.
  • adata.X (ndarray, if build_X=True): Participant × statement vote matrix (rows = participants, columns = statements).
  • adata.obs (DataFrame, if build_X=True): Participant metadata (index = voter IDs).
  • adata.var (DataFrame, if build_X=True): Statement metadata (index = statement IDs).
  • adata.raw (AnnData, if build_X=True): Snapshot of the first vote matrix and associated metadata. This allows downstream filtering or processing without losing the original vote matrix.

Notes
  • If build_X=False, only adata.uns will be populated, containing the raw votes and statements, and .X, .obs, .var, and .raw will remain empty.
  • adata.raw is assigned only after the first vote matrix build and is intended to be immutable.
  • If translate_to is provided, adata.var["content"] is updated with translated text and adata.var["language_current"] is set to the target language.
  • The vote matrix is derived from the most recent votes per participant per statement, sorted by timestamp.
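
The last note can be illustrated with a minimal sketch in plain Python. It is not the library's rebuild_vote_matrix() implementation; the VoteEvent record, its field names, and the vote encoding (1 = agree, -1 = disagree, 0 = pass) are assumptions made for the example. Sorting by timestamp and letting later events overwrite earlier ones leaves one value per participant and statement.

```python
from typing import NamedTuple, Optional


class VoteEvent(NamedTuple):
    """Hypothetical vote record; field names are illustrative only."""
    timestamp: int
    voter_id: int
    statement_id: int
    vote: int  # assumed encoding: 1 = agree, -1 = disagree, 0 = pass


def latest_vote_matrix(events: list[VoteEvent]):
    """Build a participant x statement matrix from each participant's latest vote."""
    latest: dict[tuple[int, int], int] = {}
    # Sort by timestamp so a later vote overwrites an earlier one for the same cell.
    for ev in sorted(events, key=lambda e: e.timestamp):
        latest[(ev.voter_id, ev.statement_id)] = ev.vote

    voters = sorted({voter for voter, _ in latest})
    statements = sorted({statement for _, statement in latest})
    # None marks a statement the participant never voted on.
    matrix: list[list[Optional[int]]] = [
        [latest.get((voter, statement)) for statement in statements]
        for voter in voters
    ]
    return voters, statements, matrix


events = [
    VoteEvent(timestamp=100, voter_id=0, statement_id=0, vote=1),
    VoteEvent(timestamp=105, voter_id=1, statement_id=0, vote=-1),
    VoteEvent(timestamp=110, voter_id=0, statement_id=1, vote=0),
    VoteEvent(timestamp=120, voter_id=0, statement_id=0, vote=-1),  # voter 0 revises a vote
]
voters, statements, matrix = latest_vote_matrix(events)
print(matrix)  # [[-1, 0], [-1, None]]
```

Voter 0 voted twice on statement 0, and only the later vote (-1) ends up in the matrix.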

Examples:

Load data from a report or conversation ID or URL.

adata = val.datasets.polis.load("https://pol.is/report/r2dfw8eambusb8buvecjt")
adata = val.datasets.polis.load("6rphtwwfn4")

Load data from an alternative Polis instance via URL.

adata = val.datasets.polis.load("https://polis.tw/6rphtwwfn4")

Load data from a path containing Polis CSV export files.

$ ls exports/my_conversation_2024_11_03
comments.csv votes.csv summary.csv ...
adata = val.datasets.polis.load("./exports/my_conversation_2024_11_03")
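
The source formats shown in these examples could be told apart roughly as follows. This is a sketch under stated assumptions, not the parser used by load(): classify_polis_source is a hypothetical helper, and the returned dict layout is invented for illustration. It only encodes the documented rules (report URLs contain /report/, bare report IDs start with 'r', bare conversation IDs start with a digit, and an existing directory is treated as a CSV export).

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_polis_source(source: str) -> dict:
    """Hypothetical helper: classify a source string using the documented rules."""
    # An existing local directory is treated as a CSV export folder.
    if source and Path(source).is_dir():
        return {"kind": "directory", "path": source}

    if source.startswith(("http://", "https://")):
        parsed = urlparse(source)
        base_url = f"{parsed.scheme}://{parsed.netloc}"
        parts = [part for part in parsed.path.split("/") if part]
        # https://<host>/report/<report_id>
        if len(parts) == 2 and parts[0] == "report":
            return {"kind": "report", "base_url": base_url, "id": parts[1]}
        # https://<host>/<conversation_id>
        if len(parts) == 1:
            return {"kind": "conversation", "base_url": base_url, "id": parts[0]}
        raise ValueError(f"Unrecognized Polis URL: {source!r}")

    # Bare IDs: report IDs start with 'r', conversation IDs start with a digit.
    if source.startswith("r"):
        return {"kind": "report", "base_url": None, "id": source}
    if source[:1].isdigit():
        return {"kind": "conversation", "base_url": None, "id": source}
    raise ValueError(f"Unrecognized Polis source: {source!r}")


print(classify_polis_source("https://polis.tw/6rphtwwfn4"))
print(classify_polis_source("r4zdxrdscmukmkakmbz3k"))
```

Checking for a directory first means a local folder is never mistaken for a bare ID.
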
Source code in src/valency_anndata/datasets/polis.py
def load(source: str, *, translate_to: Optional[str] = None, build_X: bool = True) -> AnnData:
    """
    Load a Polis conversation or report into an AnnData object.

    This function accepts either a URL or an ID for a Polis conversation or report,
    fetches raw vote events and statements via the Polis API or CSV export, and
    optionally constructs a participant × statement vote matrix in `adata.X`.

    Parameters
    ----------
    source : str
        The Polis source to load. Supported formats include:

        - Full report URL: `https://pol.is/report/<report_id>`
        - Conversation URL: `https://pol.is/<conversation_id>`
        - Custom host URLs: `https://<host>/report/<report_id>` or `https://<host>/<conversation_id>`
        - Bare IDs:
            - Conversation ID (starts with a digit), e.g., `4asymkcrjf`
            - Report ID (starts with 'r'), e.g., `r4zdxrdscmukmkakmbz3k`
        - Local directory containing CSV exports:
            - *votes.csv
            - *comments.csv

        The function will automatically parse the source to determine whether
        it refers to a conversation or report and fetch the appropriate data.


    translate_to : str or None, optional
        Target language code (e.g., "en", "fr", "es") for translating statement text.
        If provided, the original statement text in `adata.uns["statements"]["comment-body"]`
        is translated and stored in `adata.var["content"]`. The `adata.var["language_current"]`
        field is updated to the target language, and `adata.var["is_translated"]` is set to True.
        Defaults to None (no translation).

    build_X : bool, default True
        If True, constructs a participant × statement vote matrix from the raw
        votes using `rebuild_vote_matrix()`. This populates `adata.obs`,
        `adata.var`, and `adata.X` (with a copy in
        `adata.layers['raw_sparse']`). After the first build, a snapshot of this
        initial matrix is stored in `adata.raw`.

    Returns
    -------
    adata : anndata.AnnData
        An AnnData object containing the loaded Polis data.


    pd.DataFrame
        `adata.uns["votes"]`  
        Raw vote events fetched from the API or CSV export.
    dict
        `adata.uns["votes_meta"]`  
        Metadata about the sources of votes, e.g., API vs CSV.
    pd.DataFrame
        `adata.uns["statements"]`  
        Raw statements/comments for the conversation.
    dict
        `adata.uns["statements_meta"]`  
        Metadata about the statements source.
    dict
        `adata.uns["source"]`  
        Basic information about the Polis source (base URL, conversation ID, report ID).
    dict
        `adata.uns["schema"]`  
        High-level description of `X` and `votes`.
    np.ndarray
        `adata.X` (if `build_X=True`)  
        Participant × statement vote matrix (rows = participants, columns = statements).
    pd.DataFrame 
        `adata.obs` (if `build_X=True`)  
        Participant metadata (index = voter IDs).
    pd.DataFrame 
        `adata.var` (if `build_X=True`)  
        Statement metadata (index = statement IDs).
    anndata.AnnData 
        `adata.raw` (if `build_X=True`)  
        Snapshot of the first vote matrix and associated metadata. This allows
        downstream filtering or processing without losing the original vote matrix.

    Notes
    -----
    - If `build_X=False`, only `adata.uns` will be populated, containing the raw
      votes and statements, and `.X`, `.obs`, `.var`, and `.raw` will remain empty.
    - `adata.raw` is assigned only after the first vote matrix build and is intended
      to be immutable.
    - If `translate_to` is provided, `adata.var["content"]` is updated with translated
      text and `adata.var["language_current"]` is set to the target language.
    - The vote matrix is derived from the most recent votes per participant per statement,
      sorted by timestamp.

    Examples
    --------

    Load data from a report or conversation ID or URL.

    ```py
    adata = val.datasets.polis.load("https://pol.is/report/r2dfw8eambusb8buvecjt")
    adata = val.datasets.polis.load("6rphtwwfn4")
    ```

    Load data from an alternative Polis instance via URL.

    ```py
    adata = val.datasets.polis.load("https://polis.tw/6rphtwwfn4")
    ```

    Load data from a path containing Polis CSV export files.

    ```sh
    $ ls exports/my_conversation_2024_11_03
    comments.csv votes.csv summary.csv ...
    ```

    ```py
    adata = val.datasets.polis.load("./exports/my_conversation_2024_11_03")
    ```
    """
    adata = _load_raw_polis_data(source)

    if build_X:
        rebuild_vote_matrix(adata, trim_rule=1.0, inplace=True)
        adata.raw = adata.copy()
        # Store a copy in case we bring something else into X workspace later.
        adata.layers["raw_sparse"] = adata.X  # type: ignore[arg-type]

    _populate_var_statements(adata, translate_to=translate_to)

    # if convo_meta.conversation_id:
    #     xids = client.get_xids(conversation_id=convo_meta.conversation_id)
    #     adata.uns["xids"] = pd.DataFrame(xids)

    return adata

valency_anndata.datasets.polis.translate_statements

translate_statements(
    adata: AnnData,
    translate_to: Optional[str],
    inplace: bool = True,
) -> Optional[list[str]]

Translate statements in adata.uns['statements']['comment-body'] into another language, or copy originals if translate_to is None.

Parameters:

Name Type Description Default
adata AnnData

AnnData object containing uns['statements'] and var_names.

required
translate_to Optional[str]

Target language code (e.g., "en", "fr", "es"). If None, the original statement text is copied without translation.

required
inplace bool

If True, updates adata.var['content'], adata.var['language_current'], and adata.var['is_translated']. If False, returns a list of translated strings without modifying adata.

True

Returns:

Name Type Description
translated_texts list[str] | None

List of translated texts if inplace=False, else None.
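
Before translating, the function aligns the statement texts to the order of adata.var_names, casting statement IDs to strings first. The sketch below mirrors that alignment step with plain dicts and lists instead of pandas. The align_texts helper and the sample statements are made up for illustration, and missing IDs become None here, whereas a pandas reindex would yield NaN.

```python
from typing import Optional


def align_texts(statements: dict[int, str], var_names: list[str]) -> list[Optional[str]]:
    """Return statement texts ordered like var_names; unknown IDs become None."""
    # var_names are strings, so normalize the statement IDs to strings as well.
    by_str_id = {str(statement_id): text for statement_id, text in statements.items()}
    return [by_str_id.get(name) for name in var_names]


# Made-up statements keyed by integer ID.
statements = {0: "statement A", 1: "statement B", 2: "statement C"}
# Column order of the vote matrix, including an ID with no statement text.
var_names = ["2", "0", "7"]

print(align_texts(statements, var_names))
# ['statement C', 'statement A', None]
```

The resulting list has one entry per column of the vote matrix, so it can be assigned to a per-statement column such as adata.var["content"].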

Source code in src/valency_anndata/datasets/polis.py
def translate_statements(
    adata: AnnData,
    translate_to: Optional[str],
    inplace: bool = True
) -> Optional[list[str]]:
    """
    Translate statements in `adata.uns['statements']['comment-body']` into another language,
    or copy originals if translate_to is None.

    Parameters
    ----------
    adata : AnnData
        AnnData object containing `uns['statements']` and `var_names`.
    translate_to : Optional[str]
        Target language code (e.g., "en", "fr", "es"). If None, the original
        statement text is copied without translation.
    inplace : bool, default True
        If True, updates `adata.var['content']`, `adata.var['language_current']`,
        and `adata.var['is_translated']`.
        If False, returns a list of translated strings without modifying `adata`.

    Returns
    -------
    translated_texts : list[str] | None
        List of translated texts if `inplace=False`, else None.
    """
    statements_aligned = adata.uns["statements"].copy()
    statements_aligned.index = statements_aligned.index.astype(str)
    statements_aligned = statements_aligned.reindex(adata.var_names)

    original_texts = statements_aligned["comment-body"].tolist()

    # ───────────────────────────────────────────
    # NO-TRANSLATION PATH (explicit)
    # ───────────────────────────────────────────
    if translate_to is None:
        if inplace:
            adata.var["content"] = original_texts
            adata.var["language_current"] = adata.var["language_original"]
            adata.var["is_translated"] = False
            return None
        else:
            return original_texts


    # ───────────────────────────────────────────
    # TRANSLATION PATH
    # ───────────────────────────────────────────
    translated_texts = run_async(
        _translate_texts_async(original_texts, translate_to)
    )

    if inplace:
        adata.var["content"] = translated_texts
        adata.var["language_current"] = translate_to
        adata.var["is_translated"] = True
        return None
    else:
        return translated_texts