
Datasets

Data Overview

| Dataset | Date | Participants | Statements | Completeness | Fingerprint |
|---|---|---|---|---|---|
| vTaiwan: Uber | Jun 2015 | 1,324 / 2,229 | 101 / 199 | 54% / 52% / 46% / 38% | 26 / 51 / 76 / 101 |
| vTaiwan: Airbnb | Aug 2015 | 864 / 1,675 | 237 / 245 | 26% / 18% / 14% / 11% | 60 / 119 / 178 / 237 |
| vTaiwan: Online Alcohol Sales | Mar 2016 | 374 / 639 | 72 / 72 | 47% / 42% / 38% / 32% | 18 / 36 / 54 / 72 |
| vTaiwan: Caning | Nov 2017 | 1,852 / 2,194 | 340 / 602 | 21% / 14% / 10% / 12% | 85 / 170 / 255 / 340 |
| American Assembly: Bowling Green | Feb 2018 | 1,583 / 2,044 | 633 / 896 | 39% / 34% / 27% / 22% | 159 / 317 / 475 / 633 |
| Aufstehen | Sep 2018 | 23,354 / 33,422 | 161 / 783 | 62% / 58% / 53% / 52% | 41 / 81 / 121 / 161 |
| American Assembly: Louisville | Mar 2019 | 1,163 / 1,398 | 603 / 877 | 27% / 25% / 21% / 18% | 151 / 302 / 453 / 603 |
| Chile Protests | Nov 2019 | 1,743 / 2,739 | 399 / 1045 | 15% / 13% / 10% / 8% | 100 / 200 / 300 / 399 |
| Cuba 15N: Before (1) | Oct 2021 | 243 / 277 | 122 / 123 | 80% / 73% / 60% / 51% | 31 / 61 / 92 / 122 |
| Cuba 15N: Before (2) | Nov 2021 | 1,276 / 1,712 | 413 / 1018 | 37% / 30% / 24% / 19% | 104 / 207 / 310 / 413 |
| Cuba 15N: After | Nov 2021 | 308 / 478 | 325 / 340 | 38% / 31% / 23% / 18% | 82 / 163 / 244 / 325 |
| Klimarat: Food & Land Use | Apr 2022 | 2,968 / 3,616 | 862 / 1452 | 26% / 17% / 13% / 10% | 216 / 431 / 647 / 862 |
| Klimarat: Mobility | Apr 2022 | 2,660 / 3,142 | 1064 / 2138 | 23% / 17% / 13% / 10% | 266 / 532 / 798 / 1064 |
| Klimarat: Energy | Apr 2022 | 1,443 / 1,765 | 625 / 1040 | 28% / 21% / 17% / 14% | 157 / 313 / 469 / 625 |
| Klimarat: Housing | Apr 2022 | 1,261 / 1,503 | 369 / 611 | 34% / 26% / 21% / 17% | 93 / 185 / 277 / 369 |
| Klimarat: Production & Consumption | Apr 2022 | 900 / 1,116 | 337 / 522 | 38% / 30% / 24% / 20% | 85 / 169 / 253 / 337 |
| BG 2050 | Feb 2025 | 6,609 / 7,890 | 3983 / 7730 | 12% / 7% / 5% / 4% | 996 / 1992 / 2988 / 3983 |
| Japan Choice: Foreign Affairs & Security (2025) | Jul 2025 | 4,016 / 4,616 | 20 / 20 | 98% / 98% / 98% / 98% | 5 / 10 / 15 / 20 |
| Japan Choice: Diversity & Human Rights (2025) | Jul 2025 | 4,001 / 4,354 | 8 / 8 | 100% / 100% / 100% / 100% | 2 / 4 / 6 / 8 |
| Japan Choice: Education, Children & Old Age Care (2025) | Jul 2025 | 4,285 / 4,723 | 13 / 13 | 99% / 99% / 99% / 99% | 4 / 7 / 10 / 13 |
| Japan Choice: Economy, Taxation & Employment (2025) | Jul 2025 | 10,560 / 12,846 | 18 / 18 | 98% / 98% / 98% / 98% | 5 / 9 / 14 / 18 |
| Japan Choice: Foreign Affairs & Security (2026) | Jan 2026 | 1,653 / 2,140 | 19 / 19 | 100% / 100% / 99% / 99% | 5 / 10 / 15 / 19 |
| Japan Choice: Diversity & Human Rights (2026) | Jan 2026 | 1,546 / 1,833 | 9 / 9 | 100% / 100% / 100% / 100% | 3 / 5 / 7 / 9 |
| Japan Choice: Education, Children & Old Age Care (2026) | Jan 2026 | 1,730 / 1,985 | 8 / 8 | 100% / 100% / 100% / 100% | 2 / 4 / 6 / 8 |
| Japan Choice: Economy, Taxation & Employment (2026) | Jan 2026 | 3,392 / 4,526 | 20 / 20 | 100% / 99% / 98% / 97% | 5 / 10 / 15 / 20 |

Reference Datasets

These datasets are provided as a starting point for exploration and experimentation.

valency_anndata.datasets.aufstehen

aufstehen(translate_to: Optional[str] = None, **kwargs)

Polis conversation of 33k+ Germans, run by political party Aufstehen.

As of fall 2018, this was the largest Polis conversation run to date.

See: https://compdemocracy.org/Case-studies/2018-germany-aufstehen/

The data is pulled from an archive at: https://huggingface.co/datasets/patcon/polis-aufstehen-2018

Note

This dataset has been augmented by merging is-meta and is-seed statement data (missing from the official CSV export) that were retrieved from the Polis API. Specifically, is-meta is required to reproduce outputs of the Polis data pipeline.

Attribution

Data was gathered using the Polis software (see: https://compdemocracy.org/polis and https://github.com/compdemocracy/polis) and is sub-licensed under CC BY 4.0 with Attribution to The Computational Democracy Project. The data and more information about how the data was collected can be found at the following link: https://pol.is/report/r6xd526vyjyjrj9navxrj
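Examples:

A minimal usage sketch, mirroring the Examples sections elsewhere on this page. The `val` alias is an assumption (i.e., `import valency_anndata as val`), matching how the other examples refer to the package:

```py
import valency_anndata as val  # assumed import alias

# Load the Aufstehen conversation into an AnnData object.
adata = val.datasets.aufstehen()

# Or load with statement text translated to English.
adata = val.datasets.aufstehen(translate_to="en")
```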

Source code in src/valency_anndata/datasets/_load_aufstehen.py
def aufstehen(
    translate_to: Optional[str] = None,
    **kwargs,
):
    """
    Polis conversation of 33k+ Germans, run by political party Aufstehen.

    As of fall 2018, this was the largest Polis conversation run to date.

    See: <https://compdemocracy.org/Case-studies/2018-germany-aufstehen/>

    The data is pulled from an archive at:
    <https://huggingface.co/datasets/patcon/polis-aufstehen-2018>

    Note
    ----

    This dataset has been augmented by merging `is-meta` and `is-seed` statement
    data (missing from the official CSV export) that were retrieved from the
    Polis API. Specifically, `is-meta` is required to reproduce outputs of the
    Polis data pipeline.

    Attribution
    -----------

    Data was gathered using the Polis software (see:
    <https://compdemocracy.org/polis> and
    <https://github.com/compdemocracy/polis>) and is sub-licensed under CC BY
    4.0 with Attribution to The Computational Democracy Project. The data and
    more information about how the data was collected can be found at the
    following link: <https://pol.is/report/r6xd526vyjyjrj9navxrj>
    """
    adata = val.datasets.polis.load(
        source="huggingface:patcon/polis-aufstehen-2018",
        translate_to=translate_to,
        **kwargs,
    )

    return adata

valency_anndata.datasets.american_assembly

american_assembly(
    city: AmericanAssemblyCity | str,
    translate_to: Optional[str] = None,
    **kwargs,
)

Polis conversations run by the American Assembly in Kentucky cities.

The American Assembly is a public affairs organization that has used Polis to facilitate civic dialogue. These conversations were run in Bowling Green and Louisville, Kentucky.

Parameters:

city : str (required)
    The city conversation to load. One of:

      • "bowling_green" — Bowling Green, KY (2018)
      • "louisville" — Louisville, KY (2019)

translate_to : str or None (default: None)
    Target language code (e.g., "en", "fr") for translating statement text. Defaults to None (no translation).

Returns:

adata : AnnData
    AnnData object containing the loaded Polis conversation.

Examples:

Load the Bowling Green conversation:

adata = val.datasets.american_assembly(city="bowling_green")

Load the Louisville conversation translated to French:

adata = val.datasets.american_assembly(city="louisville", translate_to="fr")

Attribution

Data was gathered using the Polis software (see: https://compdemocracy.org/polis and https://github.com/compdemocracy/polis) and is sub-licensed under CC BY 4.0 with Attribution to The Computational Democracy Project.

Source code in src/valency_anndata/datasets/_load_american_assembly.py
def american_assembly(
    city: AmericanAssemblyCity | str,
    translate_to: Optional[str] = None,
    **kwargs,
):
    """
    Polis conversations run by the American Assembly in Kentucky cities.

    The American Assembly is a public affairs organization that has used Polis
    to facilitate civic dialogue. These conversations were run in Bowling Green
    and Louisville, Kentucky.

    Parameters
    ----------
    city : str
        The city conversation to load. One of:

        - ``"bowling_green"`` — Bowling Green, KY (2018)
        - ``"louisville"`` — Louisville, KY (2019)

    translate_to : str or None, optional
        Target language code (e.g., ``"en"``, ``"fr"``) for translating
        statement text. Defaults to None (no translation).

    Returns
    -------
    adata : anndata.AnnData
        AnnData object containing the loaded Polis conversation.

    Examples
    --------
    Load the Bowling Green conversation:

    ```py
    adata = val.datasets.american_assembly(city="bowling_green")
    ```

    Load the Louisville conversation translated to French:

    ```py
    adata = val.datasets.american_assembly(city="louisville", translate_to="fr")
    ```

    Attribution
    -----------

    Data was gathered using the Polis software (see:
    <https://compdemocracy.org/polis> and
    <https://github.com/compdemocracy/polis>) and is sub-licensed under CC BY
    4.0 with Attribution to The Computational Democracy Project.
    """
    if city not in _CITY_URLS:
        raise ValueError(f"Unknown city {city!r}. Must be one of: {list(_CITY_URLS)}")
    url = _CITY_URLS[city]

    adata = val.datasets.polis.load(url, translate_to=translate_to, **kwargs)

    return adata

valency_anndata.datasets.bg2050

bg2050(translate_to: Optional[str] = None, **kwargs)

Polis conversation from the BG 2050 community visioning project.

A 33-day digital engagement where nearly 7,900 residents of Bowling Green and Warren County, Kentucky, shared ideas for the region's future. The project was commissioned by Warren County government in response to projections that the county will nearly double in size over 25 years, and was executed by Innovation Engine in partnership with The Computational Democracy Project and Google's Jigsaw.

See: https://whatcouldbgbe.com/about-the-project

Parameters:

translate_to : str or None (default: None)
    Target language code (e.g., "en", "fr") for translating statement text. Defaults to None (no translation).

Returns:

adata : AnnData
    AnnData object containing the loaded Polis conversation.

Examples:

Load the BG 2050 conversation:

adata = val.datasets.bg2050()

Load translated to French:

adata = val.datasets.bg2050(translate_to="fr")

Attribution

Data was gathered using the Polis software (see: https://compdemocracy.org/polis and https://github.com/compdemocracy/polis) and is sub-licensed under CC BY 4.0 with Attribution to The Computational Democracy Project. The data and more information about how the data was collected can be found at the following link: https://pol.is/report/r7wehfsmutrwndviddnii

Source code in src/valency_anndata/datasets/_load_bg2050.py
def bg2050(
    translate_to: Optional[str] = None,
    **kwargs,
):
    """
    Polis conversation from the BG 2050 community visioning project.

    A 33-day digital engagement where nearly 7,900 residents of Bowling Green
    and Warren County, Kentucky, shared ideas for the region's future. The
    project was commissioned by Warren County government in response to
    projections that the county will nearly double in size over 25 years, and
    was executed by Innovation Engine in partnership with The Computational
    Democracy Project and Google's Jigsaw.

    See: <https://whatcouldbgbe.com/about-the-project>

    Parameters
    ----------
    translate_to : str or None, optional
        Target language code (e.g., ``"en"``, ``"fr"``) for translating
        statement text. Defaults to None (no translation).

    Returns
    -------
    adata : anndata.AnnData
        AnnData object containing the loaded Polis conversation.

    Examples
    --------
    Load the BG 2050 conversation:

    ```py
    adata = val.datasets.bg2050()
    ```

    Load translated to French:

    ```py
    adata = val.datasets.bg2050(translate_to="fr")
    ```

    Attribution
    -----------

    Data was gathered using the Polis software (see:
    <https://compdemocracy.org/polis> and
    <https://github.com/compdemocracy/polis>) and is sub-licensed under CC BY
    4.0 with Attribution to The Computational Democracy Project. The data and
    more information about how the data was collected can be found at the
    following link: <https://pol.is/report/r7wehfsmutrwndviddnii>
    """
    adata = val.datasets.polis.load("https://pol.is/report/r7wehfsmutrwndviddnii", translate_to=translate_to, **kwargs)

    return adata

valency_anndata.datasets.chile_protest

chile_protest(translate_to: Optional[str] = None, **kwargs)

Polis conversation of 2,700+ Chileans during the 2019 #ChileDesperto protests.

It was run informally by a single citizen, with minimal support infrastructure, outreach strategy, or moderation process.

See: https://en.wikipedia.org/wiki/Social_Outburst_(Chile)

Note

This dataset has been augmented by merging is-meta and is-seed statement data (missing from the official CSV export) that were retrieved from the Polis API. Specifically, is-meta is required to reproduce outputs of the Polis data pipeline.

Attribution

Data was gathered using the Polis software (see: https://compdemocracy.org/polis and https://github.com/compdemocracy/polis) and is sub-licensed under CC BY 4.0 with Attribution to The Computational Democracy Project. The data and more information about how the data was collected can be found at the following link: https://pol.is/report/r29kkytnipymd3exbynkd
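Examples:

A minimal usage sketch, mirroring the Examples sections elsewhere on this page. The `val` alias is an assumption (i.e., `import valency_anndata as val`), matching how the other examples refer to the package:

```py
import valency_anndata as val  # assumed import alias

# Load the Chile Protests conversation into an AnnData object.
adata = val.datasets.chile_protest()

# Or load with statement text translated to English.
adata = val.datasets.chile_protest(translate_to="en")
```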

Source code in src/valency_anndata/datasets/_load_chile_protest.py
def chile_protest(
    translate_to: Optional[str] = None,
    **kwargs,
):
    """
    Polis conversation of 2,700+ Chileans during the 2019 #ChileDesperto protests.

    It was run informally by a single citizen, with minimal support
    infrastructure, outreach strategy, or moderation process.

    See: <https://en.wikipedia.org/wiki/Social_Outburst_(Chile)>

    Note
    ----

    This dataset has been augmented by merging `is-meta` and `is-seed` statement
    data (missing from the official CSV export) that were retrieved from the
    Polis API. Specifically, `is-meta` is required to reproduce outputs of the
    Polis data pipeline.

    Attribution
    -----------

    Data was gathered using the Polis software (see:
    <https://compdemocracy.org/polis> and
    <https://github.com/compdemocracy/polis>) and is sub-licensed under CC BY
    4.0 with Attribution to The Computational Democracy Project. The data and
    more information about how the data was collected can be found at the
    following link: <https://pol.is/report/r29kkytnipymd3exbynkd>
    """
    adata = val.datasets.polis.load("https://pol.is/report/r29kkytnipymd3exbynkd", translate_to=translate_to, **kwargs)

    return adata

valency_anndata.datasets.cuba_protest

cuba_protest(
    period: CubaProtestPeriod | str,
    translate_to: Optional[str] = None,
    **kwargs,
)

Polis conversations run around Cuba's planned 15N march (November 2021).

The 15N march was a peaceful protest planned for November 15, 2021, but was suppressed by the Cuban government before it could take place. Three conversations were run in sequence — two before the planned march and one after its suppression — allowing longitudinal comparison of public opinion around the event.

Parameters:

period : str (required)
    The conversation period to load. One of:

      • "before_1" — First conversation, before the planned march
      • "before_2" — Second conversation, before the planned march
      • "after" — Conversation run after the march was suppressed

translate_to : str or None (default: None)
    Target language code (e.g., "en", "fr") for translating statement text. Defaults to None (no translation).

Returns:

adata : AnnData
    AnnData object containing the loaded Polis conversation.

Examples:

Load the post-protest conversation:

adata = val.datasets.cuba_protest(period="after")

Load the first pre-protest conversation with English translation:

adata = val.datasets.cuba_protest(period="before_1", translate_to="en")

Attribution

Data was gathered using the Polis software (see: https://compdemocracy.org/polis and https://github.com/compdemocracy/polis) and is sub-licensed under CC BY 4.0 with Attribution to The Computational Democracy Project.

Source code in src/valency_anndata/datasets/_load_cuba_protest.py
def cuba_protest(
    period: CubaProtestPeriod | str,
    translate_to: Optional[str] = None,
    **kwargs,
):
    """
    Polis conversations run around Cuba's planned 15N march (November 2021).

    The 15N march was a peaceful protest planned for November 15, 2021, but was
    suppressed by the Cuban government before it could take place. Three
    conversations were run in sequence — two before the planned march and one
    after its suppression — allowing longitudinal comparison of public opinion
    around the event.

    Parameters
    ----------
    period : str
        The conversation period to load. One of:

        - ``"before_1"`` — First conversation, before the planned march
        - ``"before_2"`` — Second conversation, before the planned march
        - ``"after"`` — Conversation run after the march was suppressed

    translate_to : str or None, optional
        Target language code (e.g., ``"en"``, ``"fr"``) for translating
        statement text. Defaults to None (no translation).

    Returns
    -------
    adata : anndata.AnnData
        AnnData object containing the loaded Polis conversation.

    Examples
    --------
    Load the post-protest conversation:

    ```py
    adata = val.datasets.cuba_protest(period="after")
    ```

    Load the first pre-protest conversation with English translation:

    ```py
    adata = val.datasets.cuba_protest(period="before_1", translate_to="en")
    ```

    Attribution
    -----------

    Data was gathered using the Polis software (see:
    <https://compdemocracy.org/polis> and
    <https://github.com/compdemocracy/polis>) and is sub-licensed under CC BY
    4.0 with Attribution to The Computational Democracy Project.
    """
    if period not in _PERIOD_URLS:
        raise ValueError(f"Unknown period {period!r}. Must be one of: {list(_PERIOD_URLS)}")
    url = _PERIOD_URLS[period]

    adata = val.datasets.polis.load(url, translate_to=translate_to, **kwargs)

    return adata

valency_anndata.datasets.japanchoice

japanchoice(
    topic: JapanChoiceTopic | str,
    translate_to: Optional[str] = None,
    **kwargs,
)

Polis conversations from Japan Choice, a Japanese civic engagement platform.

Japan Choice runs Polis conversations on key policy topics ahead of Japanese elections, allowing citizens to share and compare their views on national issues. Conversations are in Japanese.

See: https://japanchoice.jp/polis

Parameters:

topic : str (required)
    The policy topic and year to load. One of:

      • "2025_foreign_affairs_security" — Foreign Affairs & Security (2025)
      • "2025_diversity_human_rights" — Diversity & Human Rights (2025)
      • "2025_education_children_old_age" — Education, Children & Old Age Care (2025)
      • "2025_economy_taxation_employment" — Economy, Taxation & Employment (2025)
      • "2026_foreign_affairs_security" — Foreign Affairs & Security (2026)
      • "2026_diversity_human_rights" — Diversity & Human Rights (2026)
      • "2026_education_children_old_age" — Education, Children & Old Age Care (2026)
      • "2026_economy_taxation_employment" — Economy, Taxation & Employment (2026)

translate_to : str or None (default: None)
    Target language code (e.g., "en", "fr") for translating statement text. Defaults to None (no translation).

Returns:

adata : AnnData
    AnnData object containing the loaded Polis conversation.

Examples:

Load the 2025 Economy, Taxation & Employment conversation:

adata = val.datasets.japanchoice("2025_economy_taxation_employment")

Load the 2026 Foreign Affairs & Security conversation translated to English:

adata = val.datasets.japanchoice("2026_foreign_affairs_security", translate_to="en")

Attribution

Data was gathered using the Polis software (see: https://compdemocracy.org/polis and https://github.com/compdemocracy/polis) and is sub-licensed under CC BY 4.0 with Attribution to The Computational Democracy Project.

Source code in src/valency_anndata/datasets/_load_japanchoice.py
def japanchoice(
    topic: JapanChoiceTopic | str,
    translate_to: Optional[str] = None,
    **kwargs,
):
    """
    Polis conversations from Japan Choice, a Japanese civic engagement platform.

    Japan Choice runs Polis conversations on key policy topics ahead of Japanese
    elections, allowing citizens to share and compare their views on national issues.
    Conversations are in Japanese.

    See: <https://japanchoice.jp/polis>

    Parameters
    ----------
    topic : str
        The policy topic and year to load. One of:

        - ``"2025_foreign_affairs_security"`` — Foreign Affairs & Security (2025)
        - ``"2025_diversity_human_rights"`` — Diversity & Human Rights (2025)
        - ``"2025_education_children_old_age"`` — Education, Children & Old Age Care (2025)
        - ``"2025_economy_taxation_employment"`` — Economy, Taxation & Employment (2025)
        - ``"2026_foreign_affairs_security"`` — Foreign Affairs & Security (2026)
        - ``"2026_diversity_human_rights"`` — Diversity & Human Rights (2026)
        - ``"2026_education_children_old_age"`` — Education, Children & Old Age Care (2026)
        - ``"2026_economy_taxation_employment"`` — Economy, Taxation & Employment (2026)

    translate_to : str or None, optional
        Target language code (e.g., ``"en"``, ``"fr"``) for translating
        statement text. Defaults to None (no translation).

    Returns
    -------
    adata : anndata.AnnData
        AnnData object containing the loaded Polis conversation.

    Examples
    --------
    Load the 2025 Economy, Taxation & Employment conversation:

    ```py
    adata = val.datasets.japanchoice("2025_economy_taxation_employment")
    ```

    Load the 2026 Foreign Affairs & Security conversation translated to English:

    ```py
    adata = val.datasets.japanchoice("2026_foreign_affairs_security", translate_to="en")
    ```

    Attribution
    -----------

    Data was gathered using the Polis software (see:
    <https://compdemocracy.org/polis> and
    <https://github.com/compdemocracy/polis>) and is sub-licensed under CC BY
    4.0 with Attribution to The Computational Democracy Project.
    """
    if topic not in _TOPIC_URLS:
        raise ValueError(f"Unknown topic {topic!r}. Must be one of: {list(_TOPIC_URLS)}")
    url = _TOPIC_URLS[topic]

    adata = val.datasets.polis.load(url, translate_to=translate_to, **kwargs)

    return adata

valency_anndata.datasets.klimarat

klimarat(
    topic: KlimaratTopic | str,
    translate_to: Optional[str] = None,
    **kwargs,
)

Polis conversations from Austria's Citizens' Climate Council (Klimarat).

The Klimarat der Bürgerinnen und Bürger was Austria's national citizens' assembly on climate policy, convened in 2021–2022. Polis conversations were run for each of five topic areas to gather public input.

See: https://klimarat.org/

Parameters:

topic : str (required)
    The topic area to load. One of:

      • "food_land" — Food & Land Use
      • "mobility" — Mobility
      • "energy" — Energy
      • "housing" — Housing
      • "production" — Production & Consumption

translate_to : str or None (default: None)
    Target language code (e.g., "en", "fr") for translating statement text. Defaults to None (no translation).

Returns:

adata : AnnData
    AnnData object containing the loaded Polis conversation.

Examples:

Load the Energy topic conversation:

adata = val.datasets.klimarat(topic="energy")

Load the Food & Land Use topic conversation with English translation:

adata = val.datasets.klimarat(topic="food_land", translate_to="en")

Attribution

Data was gathered using the Polis software (see: https://compdemocracy.org/polis and https://github.com/compdemocracy/polis) and is sub-licensed under CC BY 4.0 with Attribution to The Computational Democracy Project.

Source code in src/valency_anndata/datasets/_load_klimarat.py
def klimarat(
    topic: KlimaratTopic | str,
    translate_to: Optional[str] = None,
    **kwargs,
):
    """
    Polis conversations from Austria's Citizens' Climate Council (Klimarat).

    The Klimarat der Bürgerinnen und Bürger was Austria's national citizens'
    assembly on climate policy, convened in 2021–2022. Polis conversations were
    run for each of five topic areas to gather public input.

    See: <https://klimarat.org/>

    Parameters
    ----------
    topic : str
        The topic area to load. One of:

        - ``"food_land"`` — Food & Land Use
        - ``"mobility"`` — Mobility
        - ``"energy"`` — Energy
        - ``"housing"`` — Housing
        - ``"production"`` — Production & Consumption

    translate_to : str or None, optional
        Target language code (e.g., ``"en"``, ``"fr"``) for translating
        statement text. Defaults to None (no translation).

    Returns
    -------
    adata : anndata.AnnData
        AnnData object containing the loaded Polis conversation.

    Examples
    --------
    Load the Energy topic conversation:

    ```py
    adata = val.datasets.klimarat(topic="energy")
    ```

    Load the Food & Land Use topic conversation with English translation:

    ```py
    adata = val.datasets.klimarat(topic="food_land", translate_to="en")
    ```

    Attribution
    -----------

    Data was gathered using the Polis software (see:
    <https://compdemocracy.org/polis> and
    <https://github.com/compdemocracy/polis>) and is sub-licensed under CC BY
    4.0 with Attribution to The Computational Democracy Project.
    """
    if topic not in _TOPIC_URLS:
        raise ValueError(f"Unknown topic {topic!r}. Must be one of: {list(_TOPIC_URLS)}")
    url = _TOPIC_URLS[topic]

    adata = val.datasets.polis.load(url, translate_to=translate_to, **kwargs)

    return adata

valency_anndata.datasets.vtaiwan

vtaiwan(
    topic: VTaiwanTopic | str,
    translate_to: Optional[str] = None,
    **kwargs,
)

Polis conversations from the vTaiwan collaborative policymaking process.

vTaiwan is a civic deliberation process initiated in 2014 by the g0v community in Taiwan, using Polis to gather citizen perspectives on digital governance and social policy issues. These conversations are in Traditional Chinese.

See: https://info.vtaiwan.tw

Parameters:

topic : str (required)
    The policy topic to load. One of:

      • "uber" — Regulation of Uber and ride-sharing services (2015)
      • "airbnb" — Regulation of Airbnb and home-sharing services (2015)
      • "online_alcohol" — Online alcohol sales regulation (2016)
      • "caning" — Caning as a criminal punishment (2017)

translate_to : str or None (default: None)
    Target language code (e.g., "en", "fr") for translating statement text. Defaults to None (no translation).

Returns:

adata : AnnData
    AnnData object containing the loaded Polis conversation.

Examples:

Load the Uber conversation:

adata = val.datasets.vtaiwan(topic="uber")

Load the Airbnb conversation translated to English:

adata = val.datasets.vtaiwan(topic="airbnb", translate_to="en")

Attribution

Data was gathered using the Polis software (see: https://compdemocracy.org/polis and https://github.com/compdemocracy/polis) and is sub-licensed under CC BY 4.0 with Attribution to The Computational Democracy Project.

Source code in src/valency_anndata/datasets/_load_vtaiwan.py
def vtaiwan(
    topic: VTaiwanTopic | str,
    translate_to: Optional[str] = None,
    **kwargs,
):
    """
    Polis conversations from the vTaiwan collaborative policymaking process.

    vTaiwan is a civic deliberation process initiated in 2014 by the g0v
    community in Taiwan, using Polis to gather citizen perspectives on digital
    governance and social policy issues. These conversations are in Traditional
    Chinese.

    See: <https://info.vtaiwan.tw>

    Parameters
    ----------
    topic : str
        The policy topic to load. One of:

        - ``"uber"`` — Regulation of Uber and ride-sharing services (2015)
        - ``"airbnb"`` — Regulation of Airbnb and home-sharing services (2015)
        - ``"online_alcohol"`` — Online alcohol sales regulation (2016)
        - ``"caning"`` — Caning as a criminal punishment (2017)

    translate_to : str or None, optional
        Target language code (e.g., ``"en"``, ``"fr"``) for translating
        statement text. Defaults to None (no translation).

    Returns
    -------
    adata : anndata.AnnData
        AnnData object containing the loaded Polis conversation.

    Examples
    --------
    Load the Uber conversation:

    ```py
    adata = val.datasets.vtaiwan(topic="uber")
    ```

    Load the Airbnb conversation translated to English:

    ```py
    adata = val.datasets.vtaiwan(topic="airbnb", translate_to="en")
    ```

    Attribution
    -----------

    Data was gathered using the Polis software (see:
    <https://compdemocracy.org/polis> and
    <https://github.com/compdemocracy/polis>) and is sub-licensed under CC BY
    4.0 with Attribution to The Computational Democracy Project.
    """
    if topic not in _TOPIC_URLS:
        raise ValueError(f"Unknown topic {topic!r}. Must be one of: {list(_TOPIC_URLS)}")
    url = _TOPIC_URLS[topic]

    adata = val.datasets.polis.load(url, translate_to=translate_to, **kwargs)

    return adata

Polis

valency_anndata.datasets.polis.load

load(
    source: str,
    *,
    translate_to: Optional[str] = None,
    build_X: bool = True,
    trim_rule: int | float | str = 1.0,
    skip_cache: bool = False,
    show_progress: bool = True,
    include_precomputed_groups: bool = False,
) -> AnnData

Load a Polis conversation or report into an AnnData object.

This function accepts either a URL or an ID for a Polis conversation or report, fetches raw vote events and statements via the Polis API or CSV export, and optionally constructs a participant × statement vote matrix in adata.X.

Parameters:

source : str (required)
    The Polis source to load. Supported formats include:

      • Full report URL: https://pol.is/report/<report_id>
      • Conversation URL: https://pol.is/<conversation_id>
      • Custom host URLs: https://<host>/report/<report_id> or https://<host>/<conversation_id>
      • Bare IDs:
          • Conversation ID (starts with a digit), e.g., 4asymkcrjf
          • Report ID (starts with 'r'), e.g., r4zdxrdscmukmkakmbz3k
      • HuggingFace dataset slug: hf:<user>/<dataset> or huggingface:<user>/<dataset>
      • Local directory containing CSV exports (*votes.csv and *comments.csv)

    The function automatically parses the source to determine whether it refers to a conversation or a report and fetches the appropriate data.

translate_to : str or None (default: None)
    Target language code (e.g., "en", "fr", "es") for translating statement text. If provided, the original statement text in adata.uns["statements"]["comment-body"] is translated and stored in adata.var["content"]. The adata.var["language_current"] field is updated to the target language, and adata.var["is_translated"] is set to True. Defaults to None (no translation).

build_X : bool (default: True)
    If True, constructs a participant × statement vote matrix from the raw votes using rebuild_vote_matrix(). This populates adata.obs, adata.var, and adata.X (with a copy in adata.layers['raw_sparse']). After the first build, a snapshot of this initial matrix is stored in adata.raw.

trim_rule : int or float or str (default: 1.0)
    Controls how votes are trimmed by timestamp before building the vote matrix. Passed directly to valency_anndata.preprocessing.rebuild_vote_matrix. The default 1.0 keeps all votes. Examples:

      • 0.75 — keep the first 75% of votes by timestamp
      • 50 — keep the first 50% of votes (integer percent)
      • 1_700_000_000 — keep votes up to this Unix timestamp cutoff
      • "mean-2std" — keep votes within mean − 2 × std of timestamps

    Only has an effect when build_X=True.

skip_cache : bool (default: False)
    If True, bypass the local file cache and always fetch fresh data from the network. Cached files expire automatically after 24 hours.

show_progress : bool (default: True)
    If True, display a progress bar when fetching votes from the API (conversation URL/ID only). Uses tqdm, which auto-detects notebooks vs. terminals. Has no effect when loading from a report URL or local directory.

include_precomputed_groups : bool (default: False)
    If True, fetch the Polis math endpoint and store the precomputed group cluster assignments produced by the Polis server in adata.obs["kmeans_polis_precomputed"] (nullable Int64). The raw math dict is also stored in adata.uns["polis_math"]. Only supported for API/report sources; raises ValueError for local directory sources.

Returns:

Name Type Description
adata AnnData

An AnnData object containing the loaded Polis data.

DataFrame

adata.uns["votes"]
Raw vote events fetched from the API or CSV export.

dict

adata.uns["votes_meta"]
Metadata about the sources of votes, e.g., API vs CSV.

DataFrame

adata.uns["statements"]
Raw statements/comments for the conversation.

dict

adata.uns["statements_meta"]
Metadata about the statements source.

dict

adata.uns["source"]
Basic information about the Polis source (base URL, conversation ID, report ID).

dict

adata.uns["schema"]
High-level description of X and votes.

ndarray

adata.X (if build_X=True)
Participant × statement vote matrix (rows = participants, columns = statements).

DataFrame

adata.obs (if build_X=True)
Participant metadata (index = voter IDs).

DataFrame

adata.var (if build_X=True)
Statement metadata (index = statement IDs).

AnnData

adata.raw (if build_X=True)
Snapshot of the first vote matrix and associated metadata. This allows downstream filtering or processing without losing the original vote matrix.

Series

adata.obs["kmeans_polis_precomputed"] (if include_precomputed_groups=True)
Nullable Int64 cluster labels from Polis's precomputed grouping.

str

adata.uns["polis_math"] (if include_precomputed_groups=True)
Raw math from the Polis API serialized as a JSON string (use json.loads(adata.uns["polis_math"]) to get the dict), stored this way because h5py cannot serialize deeply nested list-of-dict structures.

Notes
  • If build_X=False, only adata.uns will be populated, containing the raw votes and statements, and .X, .obs, .var, and .raw will remain empty.
  • adata.raw is assigned only after the first vote matrix build and is intended to be immutable.
  • If translate_to is provided, adata.var["content"] is updated with translated text and adata.var["language_current"] is set to the target language.
  • The vote matrix is derived from the most recent votes per participant per statement, sorted by timestamp.
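The trim_rule values listed under Parameters can be sketched as timestamp-cutoff computations over a vote log. The snippet below is an illustration of the documented rule semantics only, not the library's rebuild_vote_matrix() implementation; the votes frame and the percent-vs-timestamp heuristic are assumptions for the sake of the example (the "mean-2std" rule is omitted).

```python
import numpy as np
import pandas as pd

# Hypothetical vote log with Unix-second timestamps.
votes = pd.DataFrame({"timestamp": np.arange(1_700_000_000, 1_700_000_100)})
ts = votes["timestamp"].sort_values()

def cutoff(trim_rule):
    """Map a trim_rule to an inclusive timestamp cutoff, mirroring the documented rules."""
    if isinstance(trim_rule, float):      # fraction of votes, e.g. 0.75 (1.0 keeps all)
        return ts.quantile(trim_rule, interpolation="lower")
    if isinstance(trim_rule, int):
        if trim_rule <= 100:              # integer percent, e.g. 50
            return ts.quantile(trim_rule / 100, interpolation="lower")
        return trim_rule                  # raw Unix timestamp cutoff
    raise ValueError(trim_rule)

kept = ts[ts <= cutoff(0.75)]
# 75 of the 100 votes are kept: the first 75% by timestamp.
```
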

Examples:

Load data from a report or conversation ID or URL.

adata = val.datasets.polis.load("https://pol.is/report/r2dfw8eambusb8buvecjt")
adata = val.datasets.polis.load("6rphtwwfn4")

Load data from an alternative Polis instance via URL.

adata = val.datasets.polis.load("https://polis.tw/6rphtwwfn4")

Load data from a HuggingFace dataset.

adata = val.datasets.polis.load("hf:patcon/polis-aufstehen-2018")

Load data from a path containing Polis CSV export files.

$ ls exports/my_conversation_2024_11_03
comments.csv votes.csv summary.csv ...
adata = val.datasets.polis.load("./exports/my_conversation_2024_11_03")
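When loaded with include_precomputed_groups=True, adata.uns["polis_math"] holds a JSON string rather than a dict. A minimal sketch of decoding it follows; the payload here is a made-up stand-in, not real Polis API output, and the "group-clusters" key is shown only as an illustration.

```python
import json

# Stand-in for adata.uns["polis_math"]; the real value comes from the Polis API.
polis_math_json = '{"group-clusters": [{"id": 0, "members": [1, 2]}, {"id": 1, "members": [3]}]}'

math = json.loads(polis_math_json)  # JSON string -> nested dict
group_sizes = {g["id"]: len(g["members"]) for g in math["group-clusters"]}
# group_sizes == {0: 2, 1: 1}
```
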
Source code in src/valency_anndata/datasets/polis.py
def load(source: str, *, translate_to: Optional[str] = None, build_X: bool = True, trim_rule: int | float | str = 1.0, skip_cache: bool = False, show_progress: bool = True, include_precomputed_groups: bool = False) -> AnnData:
    """
    Load a Polis conversation or report into an AnnData object.

    This function accepts either a URL or an ID for a Polis conversation or report,
    fetches raw vote events and statements via the Polis API or CSV export, and
    optionally constructs a participant × statement vote matrix in `adata.X`.

    Parameters
    ----------
    source : str
        The Polis source to load. Supported formats include:

        - Full report URL: `https://pol.is/report/<report_id>`
        - Conversation URL: `https://pol.is/<conversation_id>`
        - Custom host URLs: `https://<host>/report/<report_id>` or `https://<host>/<conversation_id>`
        - Bare IDs:
            - Conversation ID (starts with a digit), e.g., `4asymkcrjf`
            - Report ID (starts with 'r'), e.g., `r4zdxrdscmukmkakmbz3k`
        - HuggingFace dataset slug: ``hf:<user>/<dataset>`` or ``huggingface:<user>/<dataset>``
        - Local directory containing CSV exports:
            - *votes.csv
            - *comments.csv

        The function will automatically parse the source to determine whether
        it refers to a conversation or report and fetch the appropriate data.


    translate_to : str or None, optional
        Target language code (e.g., "en", "fr", "es") for translating statement text.
        If provided, the original statement text in `adata.uns["statements"]["comment-body"]`
        is translated and stored in `adata.var["content"]`. The `adata.var["language_current"]`
        field is updated to the target language, and `adata.var["is_translated"]` is set to True.
        Defaults to None (no translation).

    build_X : bool, default True
        If True, constructs a participant × statement vote matrix from the raw
        votes using `rebuild_vote_matrix()`. This populates `adata.obs`,
        `adata.var`, and `adata.X` (with a copy in
        `adata.layers['raw_sparse']`). After the first build, a snapshot of this
        initial matrix is stored in `adata.raw`.

    trim_rule : int or float or str, default 1.0
        Controls how votes are trimmed by timestamp before building the vote
        matrix. Passed directly to :func:`valency_anndata.preprocessing.rebuild_vote_matrix`.
        The default ``1.0`` keeps all votes. Examples:

        - ``0.75`` — keep the first 75% of votes by timestamp
        - ``50`` — keep the first 50% of votes (integer percent)
        - ``1_700_000_000`` — keep votes up to this Unix timestamp cutoff
        - ``"mean-2std"`` — keep votes within mean − 2 × std of timestamps

        Only has effect when ``build_X=True``.

    skip_cache : bool, default False
        If True, bypass the local file cache and always fetch fresh data from
        the network.  Cached files expire automatically after 24 hours.

    show_progress : bool, default True
        If True, display a progress bar when fetching votes from the API
        (conversation URL/ID only). Uses tqdm, which auto-detects notebooks
        vs terminal. Has no effect when loading from a report URL or local
        directory.

    include_precomputed_groups : bool, default False
        If True, fetch the Polis math endpoint and store the precomputed
        group cluster assignments produced by the Polis server in
        ``adata.obs["kmeans_polis_precomputed"]`` (nullable ``Int64``).
        The raw math dict is also stored in ``adata.uns["polis_math"]``.
        Only supported for API/report sources; raises ``ValueError`` for
        local directory sources.

    Returns
    -------
    adata : anndata.AnnData
        An AnnData object containing the loaded Polis data.


    pd.DataFrame
        `adata.uns["votes"]`  
        Raw vote events fetched from the API or CSV export.
    dict
        `adata.uns["votes_meta"]`  
        Metadata about the sources of votes, e.g., API vs CSV.
    pd.DataFrame
        `adata.uns["statements"]`  
        Raw statements/comments for the conversation.
    dict
        `adata.uns["statements_meta"]`  
        Metadata about the statements source.
    dict
        `adata.uns["source"]`  
        Basic information about the Polis source (base URL, conversation ID, report ID).
    dict
        `adata.uns["schema"]`  
        High-level description of `X` and `votes`.
    np.ndarray
        `adata.X` (if `build_X=True`)  
        Participant × statement vote matrix (rows = participants, columns = statements).
    pd.DataFrame 
        `adata.obs` (if `build_X=True`)  
        Participant metadata (index = voter IDs).
    pd.DataFrame 
        `adata.var` (if `build_X=True`)  
        Statement metadata (index = statement IDs).
    anndata.AnnData
        `adata.raw` (if `build_X=True`)
        Snapshot of the first vote matrix and associated metadata. This allows
        downstream filtering or processing without losing the original vote matrix.
    pd.Series
        `adata.obs["kmeans_polis_precomputed"]` (if `include_precomputed_groups=True`)
        Nullable ``Int64`` cluster labels from Polis's precomputed grouping.
    str
        `adata.uns["polis_math"]` (if `include_precomputed_groups=True`)
        Raw math from the Polis API serialized as a JSON string (use
        ``json.loads(adata.uns["polis_math"])`` to get the dict), stored this
        way because h5py cannot serialize deeply nested list-of-dict structures.

    Notes
    -----
    - If `build_X=False`, only `adata.uns` will be populated, containing the raw
      votes and statements, and `.X`, `.obs`, `.var`, and `.raw` will remain empty.
    - `adata.raw` is assigned only after the first vote matrix build and is intended
      to be immutable.
    - If `translate_to` is provided, `adata.var["content"]` is updated with translated
      text and `adata.var["language_current"]` is set to the target language.
    - The vote matrix is derived from the most recent votes per participant per statement,
      sorted by timestamp.

    Examples
    --------

    Load data from a report or conversation ID or URL.

    ```py
    adata = val.datasets.polis.load("https://pol.is/report/r2dfw8eambusb8buvecjt")
    adata = val.datasets.polis.load("6rphtwwfn4")
    ```

    Load data from an alternative Polis instance via URL.

    ```py
    adata = val.datasets.polis.load("https://polis.tw/6rphtwwfn4")
    ```

    Load data from a HuggingFace dataset.

    ```py
    adata = val.datasets.polis.load("hf:patcon/polis-aufstehen-2018")
    ```

    Load data from a path containing Polis CSV export files.

    ```sh
    $ ls exports/my_conversation_2024_11_03
    comments.csv votes.csv summary.csv ...
    ```

    ```py
    adata = val.datasets.polis.load("./exports/my_conversation_2024_11_03")
    ```
    """
    adata = _load_raw_polis_data(source, skip_cache=skip_cache, show_progress=show_progress)

    if build_X:
        rebuild_vote_matrix(adata, trim_rule=trim_rule, inplace=True)
        adata.raw = adata.copy()
        # Store a copy in case we bring something else into the X workspace later.
        adata.layers["raw_sparse"] = adata.X.copy()  # type: ignore[arg-type]

    _populate_var_statements(adata, translate_to=translate_to)

    if include_precomputed_groups:
        _add_precomputed_groups(adata)

    # if convo_meta.conversation_id:
    #     xids = client.get_xids(conversation_id=convo_meta.conversation_id)
    #     adata.uns["xids"] = pd.DataFrame(xids)

    return adata

valency_anndata.datasets.polis.export_csv

export_csv(
    adata: AnnData,
    path: str,
    *,
    include_huggingface_metadata: bool = False,
) -> None

Export an AnnData object to Polis CSV format (votes.csv + comments.csv).

Writes two of the five files from a full Polis data export:

  • votes.csv — vote event log (timestamp, datetime, comment-id, voter-id, vote)
  • comments.csv — statement metadata (timestamp, datetime, comment-id, author-id, agrees, disagrees, moderated, comment-body)

The remaining three export files are not yet supported: summary.csv, participant-votes.csv (vote matrix), and comment-groups.csv.

Agrees/disagrees are computed from the vote matrix in adata.X.

Parameters:

Name Type Description Default
adata AnnData

AnnData object produced by :func:load. Must have adata.uns["votes"] and adata.uns["statements"] populated, and adata.X built (i.e. loaded with build_X=True).

required
path str

Directory to write the CSV files into. Created if it does not exist.

required
include_huggingface_metadata bool

If True, write a README.md with YAML frontmatter suitable for uploading the export directory as a HuggingFace dataset.

False

Examples:

>>> adata = val.datasets.polis.load("5huyhtuvrm")
>>> val.datasets.polis.export_csv(adata, "./my_export")
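The agrees/disagrees computation described above can be sketched on a toy vote matrix (1 = agree, -1 = disagree, 0 = pass, NaN = no vote). This is an illustration of the counting logic, not the function's own code.

```python
import numpy as np

# Toy participant × statement vote matrix.
X = np.array([
    [ 1.0, -1.0, np.nan],
    [ 1.0,  0.0,  1.0],
    [-1.0,  1.0, np.nan],
])

# NaN compares unequal to everything, so missing votes count toward neither total.
agrees = np.nansum(X == 1, axis=0).astype(int)      # per-statement agree counts
disagrees = np.nansum(X == -1, axis=0).astype(int)  # per-statement disagree counts
# agrees == [2, 1, 1], disagrees == [1, 1, 0]
```
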
Source code in src/valency_anndata/datasets/polis.py
def export_csv(adata: AnnData, path: str, *, include_huggingface_metadata: bool = False) -> None:
    """
    Export an AnnData object to Polis CSV format (votes.csv + comments.csv).

    Writes two of the five files from a full Polis data export:

    - ``votes.csv`` — vote event log (timestamp, datetime, comment-id,
      voter-id, vote)
    - ``comments.csv`` — statement metadata (timestamp, datetime, comment-id,
      author-id, agrees, disagrees, moderated, comment-body)

    The remaining three export files are not yet supported:
    ``summary.csv``, ``participant-votes.csv`` (vote matrix),
    and ``comment-groups.csv``.

    Agrees/disagrees are computed from the vote matrix in ``adata.X``.

    Parameters
    ----------
    adata : anndata.AnnData
        AnnData object produced by :func:`load`.  Must have ``adata.uns["votes"]``
        and ``adata.uns["statements"]`` populated, and ``adata.X`` built
        (i.e. loaded with ``build_X=True``).
    path : str
        Directory to write the CSV files into.  Created if it does not exist.
    include_huggingface_metadata : bool, default False
        If True, write a ``README.md`` with YAML frontmatter suitable for
        uploading the export directory as a HuggingFace dataset.

    Examples
    --------
    >>> adata = val.datasets.polis.load("5huyhtuvrm")
    >>> val.datasets.polis.export_csv(adata, "./my_export")
    """
    import numpy as np

    output_dir = Path(path)
    output_dir.mkdir(parents=True, exist_ok=True)

    # ── votes.csv ──
    votes = adata.uns["votes"].copy()
    votes["timestamp"] = _to_seconds(votes["timestamp"])

    if "datetime" not in votes.columns:
        votes["datetime"] = pd.to_datetime(votes["timestamp"], unit="s").dt.strftime(
            "%a %b %d %Y %H:%M:%S GMT+0000 (Coordinated Universal Time)"
        )

    votes.sort_values(["comment-id", "voter-id"], inplace=True)

    vote_cols = ["timestamp", "datetime", "comment-id", "voter-id", "vote"]
    vote_cols = [c for c in vote_cols if c in votes.columns]
    votes_path = output_dir / "votes.csv"
    votes[vote_cols].to_csv(votes_path, index=False)
    print(f"Wrote {len(votes)} vote rows to {votes_path}")

    # ── comments.csv ──
    statements = adata.uns["statements"].copy()
    if statements.index.name == "comment-id":
        statements = statements.reset_index()

    # Compute agrees/disagrees from the vote matrix, aligned by comment-id
    X = adata.X
    vote_counts = pd.DataFrame(
        {
            "agrees": np.nansum(X == 1, axis=0).astype(int),
            "disagrees": np.nansum(X == -1, axis=0).astype(int),
        },
        index=adata.var_names.astype(int),
    )
    vote_counts.index.name = "comment-id"
    statements = statements.merge(vote_counts, on="comment-id", how="left")
    statements["agrees"] = statements["agrees"].fillna(0).astype(int)
    statements["disagrees"] = statements["disagrees"].fillna(0).astype(int)

    if "timestamp" in statements.columns:
        statements["timestamp"] = _to_seconds(statements["timestamp"])

    if "datetime" not in statements.columns and "timestamp" in statements.columns:
        statements["datetime"] = pd.to_datetime(
            statements["timestamp"], unit="s"
        ).dt.strftime("%a %b %d %Y %H:%M:%S GMT+0000 (Coordinated Universal Time)")

    comment_cols = [
        "timestamp", "datetime", "comment-id", "author-id",
        "agrees", "disagrees", "moderated", "comment-body",
        "is-seed", "is-meta",
    ]
    comment_cols = [c for c in comment_cols if c in statements.columns]
    comments_path = output_dir / "comments.csv"
    statements[comment_cols].to_csv(comments_path, index=False)
    print(f"Wrote {len(statements)} statement rows to {comments_path}")

    if include_huggingface_metadata:
        _write_huggingface_readme(adata, output_dir)

valency_anndata.datasets.polis.translate_statements

translate_statements(
    adata: AnnData,
    translate_to: Optional[str],
    inplace: bool = True,
) -> Optional[list[str]]

Translate statements in adata.uns['statements']['comment-body'] into another language, or copy originals if translate_to is None.

Parameters:

Name Type Description Default
adata AnnData

AnnData object containing uns['statements'] and var_names.

required
translate_to Optional[str]

Target language code (e.g., "en", "fr", "es").

required
inplace bool

If True, updates adata.var['content'] and adata.var['language_current']. If False, returns a list of translated strings without modifying adata.

True

Returns:

Name Type Description
translated_texts list[str] | None

List of translated texts if inplace=False, else None.
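Before any translation, the statements table is aligned to adata.var_names by reindexing (see the source below). That alignment step can be sketched in isolation; the frame and index here are stand-ins for adata.uns["statements"] and adata.var_names.

```python
import pandas as pd

# Stand-in for adata.uns["statements"]: indexed by statement ID, possibly unordered.
statements = pd.DataFrame(
    {"comment-body": ["c", "a", "b"]},
    index=["2", "0", "1"],
)

# Stand-in for adata.var_names: the statement order used by the vote matrix.
var_names = pd.Index(["0", "1", "2"])

# Reindex reorders rows to match var_names; IDs absent from statements become NaN rows.
aligned = statements.reindex(var_names)
# aligned["comment-body"].tolist() == ["a", "b", "c"]
```
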

Source code in src/valency_anndata/datasets/polis.py
def translate_statements(
    adata: AnnData,
    translate_to: Optional[str],
    inplace: bool = True
) -> Optional[list[str]]:
    """
    Translate statements in `adata.uns['statements']['comment-body']` into another language,
    or copy originals if translate_to is None.

    Parameters
    ----------
    adata : AnnData
        AnnData object containing `uns['statements']` and `var_names`.
    translate_to : Optional[str]
        Target language code (e.g., "en", "fr", "es").
    inplace : bool, default True
        If True, updates `adata.var['content']` and `adata.var['language_current']`.
        If False, returns a list of translated strings without modifying `adata`.

    Returns
    -------
    translated_texts : list[str] | None
        List of translated texts if `inplace=False`, else None.
    """
    statements_aligned = adata.uns["statements"].copy()
    statements_aligned.index = statements_aligned.index.astype(str)
    statements_aligned = statements_aligned.reindex(adata.var_names)

    original_texts = statements_aligned["comment-body"].tolist()

    # ───────────────────────────────────────────
    # NO-TRANSLATION PATH (explicit)
    # ───────────────────────────────────────────
    if translate_to is None:
        if inplace:
            adata.var["content"] = original_texts
            adata.var["language_current"] = adata.var["language_original"]
            adata.var["is_translated"] = False
            return None
        else:
            return original_texts


    # ───────────────────────────────────────────
    # TRANSLATION PATH
    # ───────────────────────────────────────────
    translated_texts = run_async(
        _translate_texts_async(original_texts, translate_to)
    )

    if inplace:
        adata.var["content"] = translated_texts
        adata.var["language_current"] = translate_to
        adata.var["is_translated"] = True
        return None
    else:
        return translated_texts

  1. Kept / total participants. Participants with fewer than 7 votes are excluded. 

  2. Kept / total statements. Statements with fewer than 2 votes are excluded. 

  3. Vote matrix completeness at each quartile of statements (25% / 50% / 75% / 100%), ordered by statement ID. Each value is the % of non-missing votes across all kept participants × the first N statements. Statement counts per quartile are shown in parentheses.
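The completeness metric above can be sketched as the fraction of non-missing entries in the submatrix of all kept participants × the first N statements. This toy computation illustrates the metric only; it is not the script that produced the table.

```python
import numpy as np

# Toy kept-participant × statement matrix; NaN marks a missing vote.
X = np.array([
    [1.0, np.nan, -1.0, np.nan],
    [1.0, 1.0, np.nan, np.nan],
])

def completeness(X, n_statements):
    sub = X[:, :n_statements]          # first N statements only
    return 1.0 - np.isnan(sub).mean()  # fraction of non-missing votes

# Completeness at N = 1, 2, 4 statements: 1.0, 0.75, 0.5.
results = [round(completeness(X, n), 2) for n in (1, 2, 4)]
```
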