BiocPy: Bioconductor in Python

Bioinformatics in R has long benefited from the Bioconductor ecosystem, which offers rigorously designed data structures like SummarizedExperiment and GRanges, along with a rich library of domain-specific tools. Python, on the other hand, dominates machine learning and general-purpose development, yet historically lacked a unified framework for genomic data representation and analysis. biocpy aims to bridge this divide, bringing Bioconductor-style data models, interoperability, and workflows into the Python world. In today’s post, we’ll look at how biocpy continues this theme, extending the effort to close the gap between R and Python for genomic analysis. You can also check out my earlier exploration of tidy-style tools in Python here.

As described in my previous uv post, I used uv to create a fresh environment and install the Python packages needed for this walkthrough. Setup is as simple as initializing a project and adding packages. Here, I’m installing singlecellexperiment package to test a quick single-cell workflow..

1uv init biocpy_test
2cd biocpy_test
3uv add singlecellexperiment

this installs everything in less than 2 seconds

 1Using CPython 3.11.14
 2Creating virtual environment at: .venv
 3Resolved 12 packages in 913ms
 4Prepared 4 packages in 421ms
 5warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
 6         If the cache and target directories are on different filesystems, hardlinking may not be supported.
 7         If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
 8Installed 11 packages in 79ms
 9 + biocframe==0.6.3
10 + biocutils==0.2.3
11 + delayedarray==0.6.1
12 + genomicranges==0.7.3
13 + iranges==0.5.4
14 + knncolle==0.2.0
15 + mattress==0.3.1
16 + numpy==2.3.5
17 + scranpy==0.2.2
18 + singlecellexperiment==0.5.9
19 + summarizedexperiment==0.5.5

I keep seeing those warnings but everything works fine so I do not worry too much.

For this example, we’ll use the same blood-cell dataset from the CellxGene database that I previously explored in my scanpy demo post.

1wget https://datasets.cellxgene.cziscience.com/582149d8-2a8f-44cf-9605-337b8ca8d518.h5ad

It is an .h5ad file so we will also need the anndata package

1uv add anndata

again, installation is nearly instantaneous. Resolved 30 packages in 320ms and Installed 18 packages in 76ms

we start importing the packages

1import anndata as ad

then read the data

1adata = ad.read_h5ad('582149d8-2a8f-44cf-9605-337b8ca8d518.h5ad')

and finally check the loaded object.

1adata
AnnData object with n_obs × n_vars = 85233 × 61759
    obs: 'donor_id', 'tissue_in_publication', 'anatomical_position', 'method', 'cdna_plate', 'library_plate', 'notes', 'cdna_well', 'assay_ontology_term_id', 'sample_id', 'replicate', '10X_run', 'ambient_removal', 'donor_method', 'donor_assay', 'donor_tissue', 'donor_tissue_assay', 'cell_type_ontology_term_id', 'compartment', 'broad_cell_class', 'free_annotation', 'manually_annotated', 'published_2022', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ercc', 'pct_counts_ercc', '_scvi_batch', '_scvi_labels', 'scvi_leiden_donorassay_full', 'ethnicity_original', 'scvi_leiden_res05_tissue', 'sample_number', 'organism_ontology_term_id', 'suspension_type', 'tissue_type', 'disease_ontology_term_id', 'is_primary_data', 'tissue_ontology_term_id', 'sex_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'development_stage_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'ensembl_id', 'genome', 'mt', 'ercc', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'mean', 'std', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
    uns: '_scvi_manager_uuid', '_scvi_uuid', '_training_mode', 'assay_ontology_term_id_colors', 'citation', 'compartment_colors', 'donor_id_colors', 'leiden', 'method_colors', 'neighbors', 'pca', 'schema_reference', 'schema_version', 'sex_ontology_term_id_colors', 'tissue_in_publication_colors', 'title', 'umap'
    obsm: 'X_pca', 'X_scvi', 'X_tissue_uncorrected_umap', 'X_umap', 'X_umap_scvi_full_donorassay', 'X_umap_tissue_scvi_donorassay', 'X_uncorrected_umap'
    varm: 'PCs'
    layers: 'decontXcounts', 'scale_data'
    obsp: 'connectivities', 'distances'

now we import SCE and convert the anndata object

1from singlecellexperiment import SingleCellExperiment
2
3sce = SingleCellExperiment.from_anndata(adata)
4print(sce)
class: SingleCellExperiment
dimensions: (61759, 85233)
assays(3): ['decontXcounts', 'scale_data', 'X']
row_data columns(16): ['ensembl_id', 'genome', 'mt', ..., 'feature_biotype', 'feature_length', 'feature_type']
row_names(61759): ['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419', ..., 'ENSG00000290164', 'ENSG00000290165', 'ENSG00000290166']
column_data columns(53): ['donor_id', 'tissue_in_publication', 'anatomical_position', ..., 'self_reported_ethnicity', 'development_stage', 'observation_joinid']
column_names(85233): ['TSP2_Blood_NA_SS2_B113459_B133094_LinNeg_G15', 'TSP2_Blood_NA_SS2_B113459_B133094_LinNeg_L11', 'TSP2_Blood_NA_SS2_B113459_B133094_LinNeg_J16', ..., 'TSP10_Blood_NA_10X_1_1_Enriched_TTTGTTGAGGGAGGTG', 'TSP10_Blood_NA_10X_1_1_Enriched_TTTGTTGTCCAAGCCG', 'TSP10_Blood_NA_10X_1_1_Enriched_TTTGTTGTCCGCGATG']
main_experiment_name:  
reduced_dims(0): []
alternative_experiments(0): []
row_pairs(0): []
column_pairs(0): []
metadata(1): uns

From here, we can follow the Orchestrating Single-Cell Analysis (OSCA) workflow using scranpy, the Python version of the scran package.

1import scranpy

we would start defining the mitochondrial genes, but this was already defined in the row_data of the object

1sce.row_data.get_column_names()
Names(['ensembl_id', 'genome', 'mt', 'ercc', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'mean', 'std', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'])

Let's check the number of mitochondrial genes.

1from collections import Counter
2
3Counter(sce.row_data['mt'])
Counter({False: 61722, True: 37})

Interestingly, the full single cell analysis from QC and datafiltering to dimensional reduction and clustering can be run with a single command.

1results = scranpy.analyze(
2    sce,
3    rna_subsets = {
4        "mito": sce.row_data['mt']
5    }
6)

On my machine, the full analysis completes in just 2 minutes and 34 seconds!! which is impressively fast compared with the equivalent R pipeline.

The documentation is still evolvin so I will use dir to explore the contents of the generated object.

1dir(results)
['__annotations__',
 '__class__',
 '__dataclass_fields__',
 '__dataclass_params__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__match_args__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'adt_filtered',
 'adt_markers',
 'adt_normalized',
 'adt_pca',
 'adt_qc_filter',
 'adt_qc_metrics',
 'adt_qc_thresholds',
 'adt_row_names',
 'adt_size_factors',
 'block',
 'clusters',
 'column_names',
 'combined_pca',
 'combined_qc_filter',
 'crispr_filtered',
 'crispr_markers',
 'crispr_normalized',
 'crispr_pca',
 'crispr_qc_filter',
 'crispr_qc_metrics',
 'crispr_qc_thresholds',
 'crispr_row_names',
 'crispr_size_factors',
 'graph_clusters',
 'kmeans_clusters',
 'mnn_corrected',
 'rna_filtered',
 'rna_gene_variances',
 'rna_highly_variable_genes',
 'rna_markers',
 'rna_normalized',
 'rna_pca',
 'rna_qc_filter',
 'rna_qc_metrics',
 'rna_qc_thresholds',
 'rna_row_names',
 'rna_size_factors',
 'snn_graph',
 'to_singlecellexperiment',
 'tsne',
 'umap']

Let's start with the QC metrics

1results.rna_qc_metrics
ComputeRnaQcMetricsResults(sum=array([ 159702.,   31056., 3325490., ...,    5083.,    8590.,    6460.],
      shape=(85233,)), detected=array([1662,  946,  903, ..., 2188, 3395, 2313],
      shape=(85233,), dtype=uint32), subset_proportion=NamedList(data=[array([0.42016381, 0.04202087, 0.03486043, ..., 0.15542003, 0.02782305,
       0.04628483], shape=(85233,))], names=Names(['mito'])))

I will use plotnine, a ggplot2 implementation for Python to do some exploratory plots. If you want a deeper dive into tidyverse-style analysis in Python, don’t miss my previous post.

First we install the package and dependencies with uv add.

1uv add plotnine
Resolved 41 packages in 683ms
Prepared 8 packages in 529ms
░░░░░░░░░░░░░░░░░░░░ [0/11] Installing wheels...                                              
warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
         If the cache and target directories are on different filesystems, hardlinking may not be supported.
         If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
Installed 11 packages in 65ms
 + contourpy==1.3.3
 + cycler==0.12.1
 + fonttools==4.60.1
 + kiwisolver==1.4.9
 + matplotlib==3.10.7
 + mizani==0.14.3
 + patsy==1.0.2
 + pillow==12.0.0
 + plotnine==0.15.1
 + pyparsing==3.2.5
 + statsmodels==0.14.5

import the required libraries and prepare the data for the plot. I will start with the number of detected genes per cell.

1import pandas as pd
2df = pd.DataFrame({
3    "sum": results.rna_qc_metrics.sum,
4    "detected": results.rna_qc_metrics.detected
5})

make the plot.

1from plotnine import ggplot, aes, geom_point
2
3p = (
4    ggplot(df, aes(x="sum", y="detected")) +
5    geom_point(alpha=0.6)
6)
7
8p.draw()

png

weird plot, let's try log scale

1from plotnine import scale_x_log10, scale_y_log10
2
3p_log = p + scale_x_log10() + scale_y_log10()
4p_log.draw()
/media/alfonso/data/biocpy_test/.venv/lib/python3.11/site-packages/plotnine/scales/scales.py:48: PlotnineWarning: Scale for 'x' is already present.
Adding another scale for 'x',
which will replace the existing scale.

/media/alfonso/data/biocpy_test/.venv/lib/python3.11/site-packages/plotnine/scales/scales.py:48: PlotnineWarning: Scale for 'y' is already present.
Adding another scale for 'y',
which will replace the existing scale.

png

Now let's print the authomatic QC thresholds

1print( results.rna_qc_thresholds )
SuggestRnaQcThresholdsResults(sum=578.778090441809, detected=267.1819648487058, subset_proportion=NamedList(data=[np.float64(0.09147777129294579)], names=Names(['mito'])), block=None)

and create some violin plots to evaluate them. Let's start with detected genes.

 1from plotnine import ggplot, aes, geom_violin, geom_hline, theme_bw, labs
 2
 3thr = results.rna_qc_thresholds
 4
 5p_sum = (
 6    ggplot(df, aes(y="sum", x="''")) +
 7    geom_violin() + 
 8    scale_y_log10() +
 9    geom_hline(yintercept=thr.sum, linetype="dashed", color="red") +
10    theme_bw() +
11    labs(title="Violin plot: SUM", x="", y="sum")
12)
13
14p_sum.draw()

png

Now number of UMIs per cell.

 1p_sum = (
 2    ggplot(df, aes(y="detected", x="''")) +
 3    geom_violin() + 
 4    scale_y_log10() +
 5    geom_hline(yintercept=thr.detected, linetype="dashed", color="red") +
 6    theme_bw() +
 7    labs(title="Violin plot: detected", x="", y="detected")
 8)
 9
10p_sum.draw()

png

and finally the percentage of mitochondrial reads.

 1df["mito"] = results.rna_qc_metrics.subset_proportion["mito"]
 2
 3p_sum = (
 4    ggplot(df, aes(y="mito", x="''")) +
 5    geom_violin() + 
 6    geom_hline(yintercept=thr.subset_proportion["mito"], linetype="dashed", color="red") +
 7    theme_bw() +
 8    labs(title="Violin plot: Mito proportion", x="", y="mito proportion")
 9)
10
11p_sum.draw()

png

Now lets take a look to the cells passing the filters

1filtered = results.to_singlecellexperiment()
2print(filtered)
class: SingleCellExperiment
dimensions: (61759, 80268)
assays(2): ['filtered', 'normalized']
row_data columns(5): ['mean', 'variance', 'fitted', 'residual', 'is_highly_variable']
row_names(61759): ['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419', ..., 'ENSG00000290164', 'ENSG00000290165', 'ENSG00000290166']
column_data columns(5): ['sum', 'detected', 'subset_proportion_mito', 'size_factors', 'clusters']
column_names(80268): ['TSP2_Blood_NA_SS2_B113459_B133094_LinNeg_L11', 'TSP2_Blood_NA_SS2_B113459_B133094_LinNeg_J16', 'TSP2_Blood_NA_SS2_B113459_B133094_LinNeg_F5', ..., 'TSP10_Blood_NA_10X_1_1_Enriched_TTTGGTTAGTACTGGG', 'TSP10_Blood_NA_10X_1_1_Enriched_TTTGTTGTCCAAGCCG', 'TSP10_Blood_NA_10X_1_1_Enriched_TTTGTTGTCCGCGATG']
main_experiment_name:  
reduced_dims(3): ['pca', 'tsne', 'umap']
alternative_experiments(0): []
row_pairs(0): []
column_pairs(0): []
metadata(0): 

we can see that around 5000 cells were filtered out.

Let's plot the UMAP to take a look. We start exploring the names of the reduced dims.

1print(filtered.get_reduced_dim_names())
['pca', 'tsne', 'umap']

Now check the UMAP coordinate values.

1filtered.reduced_dim(2)
array([[ 13.37984943,  -9.00475979],
       [ 11.55536652,  -2.08098984],
       [ 11.56075191,  -2.07389855],
       ...,
       [  0.04678072, -15.85924625],
       [  4.65971947,  -5.07757521],
       [  1.22210228, -15.72425938]], shape=(80268, 2))

and prepare the data for the plot.

1plot_df = pd.DataFrame(filtered.reduced_dim(2), columns=[ 'UMAP ' + str(i) for i in range(1,3) ])
2
3plot_df.head()

UMAP 1 UMAP 2
0 13.379849 -9.004760
1 11.555367 -2.080990
2 11.560752 -2.073899
3 11.531227 -1.666889
4 12.428975 3.263102

Now let's plot the UMAP.

1p = (
2    ggplot(plot_df, aes(x='UMAP 1', y='UMAP 2')) +
3    geom_point(alpha=0.6)
4)
1p.draw()

png

Now add the clustering information, but keeping the ordering of the clusters. Let's prep the data

1import numpy as np
2ordered_labels = [str(i) for i in np.unique(filtered.column_data['clusters'])]
3plot_df['clusters'] = pd.Categorical(
4    filtered.column_data['clusters'].astype(str),
5    categories=ordered_labels,
6    ordered=True
7)

and check the plot

1p = (
2    ggplot(plot_df, aes(x='UMAP 1', y='UMAP 2', color = 'clusters')) +
3    geom_point(alpha=0.6)
4)
5
6p.draw()

png

most likely the object had multiple donors and we are seeing batch effect. Let's check the original adata object

1adata.obs['donor_id']
TSP2_Blood_NA_SS2_B113459_B133094_LinNeg_G15         TSP2
TSP2_Blood_NA_SS2_B113459_B133094_LinNeg_L11         TSP2
TSP2_Blood_NA_SS2_B113459_B133094_LinNeg_J16         TSP2
TSP2_Blood_NA_SS2_B113459_B133094_LinNeg_F5          TSP2
TSP2_Blood_NA_SS2_B113459_B133094_LinNeg_N22         TSP2
                                                    ...  
TSP10_Blood_NA_10X_1_1_Enriched_TTTGGAGGTCGGTGTC    TSP10
TSP10_Blood_NA_10X_1_1_Enriched_TTTGGTTAGTACTGGG    TSP10
TSP10_Blood_NA_10X_1_1_Enriched_TTTGTTGAGGGAGGTG    TSP10
TSP10_Blood_NA_10X_1_1_Enriched_TTTGTTGTCCAAGCCG    TSP10
TSP10_Blood_NA_10X_1_1_Enriched_TTTGTTGTCCGCGATG    TSP10
Name: donor_id, Length: 85233, dtype: category
Categories (9, object): ['TSP1', 'TSP2', 'TSP7', 'TSP8', ..., 'TSP14', 'TSP21', 'TSP25', 'TSP27']

Of course, most parameters can be changed by modifying the relevant arguments in analyze() like below

 1results.custom = scranpy.analyze(
 2    sce,
 3    rna_subsets = {
 4        "mito": is_mito
 5    },
 6    build_snn_graph_options = {
 7        "num_neighbors": 10
 8    },
 9    cluster_graph_options = {
10        "multilevel_resolution": 2
11    },
12    run_pca_options = {
13        "number": 15
14    },
15    run_tsne_options = {
16        "perplexity": 25
17    },
18    run_umap_options = {
19        "min_dist": 0.05
20    }
21)

And for more advanced pipelines, each step can be run independently, just as in the R scran workflow. The scranpy documentation is a good starting point if you need more control.

Take home messages

  • biocpy and related packagesbring Bioconductor-style data structures and workflows into Python.
  • The analyses run extremely fast, noticeably faster than the equivalent R workflows.
  • The ecosystem is still developing but looks promising.

Further reading

comments powered by Disqus