Bringing the Tidyverse to Python: plotnine, siuba, and pyjanitor

R shines in data analysis thanks to the tidyverse for intuitive data wrangling and ggplot2 for high-quality visualization, though very large datasets often require tools like DelayedArray or DuckDB. Python, meanwhile, leads in deep learning and scales naturally for pipelines and application development, but its data manipulation can feel less fluid and its plotting options less cohesive. In today’s post, we’ll explore how packages like plotnine, siuba, and pyjanitor help close the gap between the two ecosystems.

As described in my last post, Use uv for python and forget about anything else!, I used uv to create an environment and install all the Python packages on my Mac laptop in less than a second. Just create a new folder for the project, open a terminal, and run

uv init
uv add pandas siuba pyjanitor plotnine

This will create a new Python virtual environment and install the required packages.

While writing the post I needed additional packages; it was as easy as opening a terminal in the project folder and running uv add again.

uv add ipykernel
uv add setuptools
uv add scanpy
uv add certifi
uv add itertools

Now, in a Jupyter notebook or wherever you like to write Python, just load the packages as usual.

import pandas as pd
from siuba import _, filter, mutate, summarize, group_by, arrange
import janitor
from plotnine import *

We will start by building a ggplot with the classic R dataset mtcars. Load the dataset as follows.

from siuba.data import mtcars
/Users/alfonsosaeravila/Documents/plotnine test/.venv/lib/python3.13/site-packages/siuba/data/__init__.py:21: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
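By the way, siuba currently prints a few deprecation warnings like the one above (and some FutureWarnings later on). If they clutter your notebook you can silence them; this is optional and purely cosmetic:

import warnings

# Optional: hide the deprecation chatter from siuba / pkg_resources.
# Purely cosmetic; the warnings do not change any of the results in this post.
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)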

Some data manipulation to filter and add new variables.

df = (
    mtcars
    >> filter(_.mpg > 20)                            # Equivalent to filter(mpg > 20)
    >> mutate(power_to_weight = _.hp / _.wt)         # Equivalent to mutate(power_to_weight = hp / wt)
    >> group_by(_.cyl)
    >> summarize(avg_mpg = _.mpg.mean(),
                 avg_power_to_weight = _.power_to_weight.mean())
)

print(df)
   cyl    avg_mpg  avg_power_to_weight
0    4  26.663636            37.925332
1    6  21.133333            38.153407


/Users/alfonsosaeravila/Documents/plotnine test/.venv/lib/python3.13/site-packages/siuba/dply/verbs.py:581: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
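For comparison, here is roughly the same pipeline written in plain pandas (my own sketch, not part of the original workflow; the siuba verbs map onto query, assign, groupby, and agg):

# Rough pandas equivalent of the siuba pipeline above (illustrative variable name).
df_pandas = (
    mtcars
    .query("mpg > 20")
    .assign(power_to_weight=lambda d: d.hp / d.wt)
    .groupby("cyl", as_index=False)
    .agg(avg_mpg=("mpg", "mean"),
         avg_power_to_weight=("power_to_weight", "mean"))
)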

ggplot needs the data in long format. Let's prep the data for the plot as follows.

df_long = (
    df
    .pivot_longer(
        index="cyl",
        names_to="variable",
        values_to="value"
    )
)
print(df_long)
   cyl             variable      value
0    4              avg_mpg  26.663636
1    6              avg_mpg  21.133333
2    4  avg_power_to_weight  37.925332
3    6  avg_power_to_weight  38.153407
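pivot_longer here is pyjanitor's take on tidyr::pivot_longer; in plain pandas the same reshape would be a melt (again, just a sketch for comparison):

# Pandas equivalent of the pivot_longer call above (illustrative variable name).
df_long_pd = df.melt(id_vars="cyl", var_name="variable", value_name="value")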

Now just make the plot using ggplot; you can see how similar the syntax is to R.

(
    ggplot(df_long, aes(x="factor(cyl)", y="value", fill="variable"))
    + geom_col(position="dodge")
    + labs(
        title="Fuel efficiency and power-to-weight ratio by cylinder count",
        x="Cylinders",
        y="Average value",
        fill="Metric"
    )
    + theme_bw()
)
Matplotlib is building the font cache; this may take a moment.

[Figure: bar chart of average mpg and power-to-weight ratio by cylinder count]

And that is it! We have created our first ggplot plot in Python.
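If you want to keep the figure as a file, plotnine plots have a save() method; a minimal sketch (the file name and dimensions below are arbitrary):

# Save the plot to disk; file name, size, and resolution are placeholders.
p = (
    ggplot(df_long, aes(x="factor(cyl)", y="value", fill="variable"))
    + geom_col(position="dodge")
    + theme_bw()
)
p.save("mtcars_by_cyl.png", width=6, height=4, dpi=300)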

Now let’s apply it to a single-cell dataset. As a test, we will create a bar plot of the mean expression of several genes by cluster, with each Louvain cluster shown in its own panel (ggplot facet style).

I will use the pbmc3k dataset that comes already pre-processed with scanpy.

import scanpy as sc
adata = sc.datasets.pbmc3k_processed()
WARNING: Failed to open the url with default certificates. Trying to use certifi.


100%|██████████| 23.5M/23.5M [00:04<00:00, 5.70MB/s]

Processed 3k PBMCs from 10x Genomics.
Processed using the basic tutorial Preprocessing and clustering 3k PBMCs (legacy workflow).
For preprocessing, cells are filtered out that have few gene counts or too high a percent_mito. The counts are logarithmized and only genes marked by highly_variable_genes() are retained. The obs variables n_counts and percent_mito are corrected for using regress_out(), and values are scaled and clipped by scale(). Finally, pca() and neighbors() are calculated.

We can explore the adata object.

adata
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

and the gene expression.

adata.to_df()

index TNFRSF4 CPSF3L ATAD3C C1orf86 RER1 TNFRSF25 TNFRSF9 CTNNBIP1 SRM UBIAD1 ... DSCR3 BRWD1 BACE2 SIK1 C21orf33 ICOSLG SUMO3 SLC19A1 S100B PRMT2
index
AAACATACAACCAC-1 -0.171470 -0.280812 -0.046677 -0.475169 -0.544024 4.928495 -0.038028 -0.280573 -0.341788 -0.195361 ... -0.226570 -0.236269 -0.102943 -0.222116 -0.312401 -0.121678 -0.521229 -0.098269 -0.209095 -0.531203
AAACATTGAGCTAC-1 -0.214582 -0.372653 -0.054804 -0.683391 0.633951 -0.334837 -0.045589 -0.498264 -0.541914 -0.209017 ... -0.317530 2.568866 0.007155 -0.445372 1.629285 -0.058662 -0.857164 -0.266844 -0.313146 -0.596654
AAACATTGATCAGC-1 -0.376887 -0.295084 -0.057528 -0.520972 1.332647 -0.309362 -0.103108 -0.272526 -0.500798 -0.220228 ... -0.302938 -0.239801 -0.071774 -0.297857 -0.410920 -0.070431 -0.590721 -0.158656 -0.170876 1.379000
AAACCGTGCTTCCG-1 -0.285241 -0.281735 -0.052227 -0.484929 1.572679 -0.271825 -0.074552 -0.258876 -0.416752 -0.208471 ... -0.262978 -0.231807 -0.093818 -0.247770 2.552078 -0.097402 1.631685 -0.119462 -0.179120 -0.505670
AAACCGTGTATGCG-1 -0.256483 -0.220394 -0.046800 -0.345859 -0.333409 -0.208122 -0.069514 5.806442 -0.283112 -0.199355 ... -0.202237 -0.176765 -0.167350 -0.098665 -0.275836 -0.139482 -0.310096 -0.006877 -0.109614 -0.461946
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
TTTCGAACTCTCAT-1 -0.290368 2.638305 -0.054510 -0.554385 -0.666646 -0.301333 -0.074079 -0.334131 -0.478076 -0.211993 ... 2.865903 -0.259984 -0.057421 -0.320982 -0.396917 -0.078147 1.232168 -0.174593 -0.216714 -0.529870
TTTCTACTGAGGCA-1 -0.386343 2.652696 -0.058686 -0.545443 1.201865 -0.321670 -0.105418 -0.296857 1.803535 -0.222323 ... -0.314890 -0.249159 -0.058679 -0.324692 -0.427120 -0.062188 -0.630103 -0.178990 -0.181736 -0.502022
TTTCTACTTCCTCG-1 -0.207089 -0.250464 -0.046397 -0.409737 2.193953 -0.221747 -0.051566 -0.198130 -0.307756 -0.196557 ... -0.212130 -0.206711 10.000000 -0.158643 3.308512 -0.132098 2.264174 -0.051144 -0.161064 2.041497
TTTGCATGAGAGGC-1 -0.190328 -0.226334 -0.043999 -0.354661 -0.350005 -0.195177 -0.047832 -0.142079 -0.251677 -0.192347 ... -0.186529 -0.185312 -0.165108 -0.098862 -0.256393 -0.149789 -0.325824 -0.005918 -0.135213 -0.482111
TTTGCATGCCTCAC-1 -0.333789 -0.253588 -0.052716 -0.425292 -0.457937 5.322871 -0.092097 -0.179143 -0.395172 -0.211355 ... -0.254442 -0.203669 -0.123191 -0.192104 -0.345354 -0.103831 -0.436880 -0.078424 -0.130327 -0.471338

2638 rows × 1838 columns

Some expression values are negative because the data were scaled (z-scored) during preprocessing, as mentioned in the dataset description above. In any case, it will not affect the test, so I just move on.
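If you want to convince yourself, a quick check of the value range works (assuming adata.X is a dense array, as it is for this processed dataset):

# Sanity check: z-scored data is centred near zero and clipped at a maximum
# (the 10.000000 entries in the table above suggest a clip at 10).
print(float(adata.X.min()), float(adata.X.max()))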

We define the top expressed genes (the six with the highest total scaled expression across cells) as follows.

import numpy as np

scores = adata.X.sum(axis=0)                   # total (scaled) expression per gene
indexes = np.argsort(-scores)[:6]              # positions of the six largest totals
genes = [adata.var_names[i] for i in indexes]

and extract gene expression values

expr_df = adata[:, genes].to_df()        # cells x selected genes
expr_df["louvain"] = adata.obs["louvain"].astype(str).values
expr_df["cell_id"] = expr_df.index

Now we will calculate the mean expression of each gene within each Louvain cluster as follows:

plot_df = (
    expr_df
    .pivot_longer(
        index=["cell_id", "louvain"],
        names_to="gene",
        values_to="expr"
    )
    >> group_by(_.louvain, _.gene)
    >> summarize(mean_expr = _.expr.mean())
    >> arrange(_.louvain, _.gene)
)
/Users/alfonsosaeravila/Documents/plotnine test/.venv/lib/python3.13/site-packages/siuba/dply/verbs.py:581: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.

The last step is plotting the data.

(
    ggplot(plot_df, aes(x="gene", y="mean_expr", fill="gene"))
    + geom_col()
    + facet_wrap("~ louvain")
    + theme(axis_text_x=element_text(rotation=90, ha="center"), legend_position="none")
    + labs(
        x="Gene",
        y="Mean expression",
        title="Mean expression per gene, faceted by Louvain cluster"
    )
)

[Figure: mean expression per gene, faceted by Louvain cluster]

Take-home messages

  • Python can mimic the tidyverse workflow using siuba for dplyr-style data manipulation, pyjanitor for clean and expressive pipelines, and plotnine for ggplot2-like graphics.
  • The transition from R to Python feels natural when using these packages: you keep the grammar, readability, and composability of R’s ecosystem while working inside Python’s scalable, production-oriented environment.
  • These tools scale to real bioinformatics datasets, allowing tidyverse-style wrangling and ggplot-style visualization directly on objects like AnnData—making Python a stronger end-to-end option for single-cell workflows.
