Pairwise Controlled Manifold Approximation (PaCMAP) for Single Cell

While preparing my posts on t-SNE vs UMAP and Phate I found this paper, which introduced a new dimensional reduction method called PaCMAP (Pairwise Controlled Manifold Approximation) and compared it with tSNE, UMAP, and other methods I did not know.

PaCMAP paper

According to the paper, PaCMAP is a new method preserves both local and global structure in high-dimensional data, making it particularly suitable for single-cell RNA-seq data visualization. It is designed to overcome some of the limitations of existing methods like t-SNE and UMAP, such as sensitivity to parameter settings and computational efficiency. After reading this, I decided to give it a try.

It is a python package so I installed in a dedicated virtual environment to avoid issues with other packages.

1mkvirtualenv pacmap
2pip install pacmap

I use reticulate R pacakge to manage the communication of R and the Python virtual environment. Run this chunk at the begining to avoid issues.

1reticulate::use_virtualenv("~/.virtualenvs/pacmap", required = TRUE)
2reticulate::py_config()
1## python:         /home/alfonso/.virtualenvs/pacmap/bin/python
2## libpython:      /usr/lib/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so
3## pythonhome:     /home/alfonso/.virtualenvs/pacmap:/home/alfonso/.virtualenvs/pacmap
4## virtualenv:     /home/alfonso/.virtualenvs/pacmap/bin/activate_this.py
5## version:        3.10.12 (main, May 27 2025, 17:12:29) [GCC 11.4.0]
6## numpy:          /home/alfonso/.virtualenvs/pacmap/lib/python3.10/site-packages/numpy
7## numpy_version:  2.2.6
8## 
9## NOTE: Python version was forced by use_python() function

I will use our good old friend, the dataset of blood cells from the cellxgene database that I used in many of my previous posts such as Scanpy, Seurat or SCE.

The following code reads and filters the dataset. See my previous posts for more details.

1library(Seurat)
2rds.file <- file.path( '582149d8-2a8f-44cf-9605-337b8ca8d518.rds' )
3seurat <- readRDS(rds.file)
 1library(dplyr)
 2library(tidyr)
 3library(tidylog)
 4keep.cells <- seurat[[]] %>%
 5  tibble::rownames_to_column( 'Cell' ) %>%
 6  filter( assay == "10x 3' v3", donor_id == 'TSP1' ) %>%
 7  group_by( cell_type ) %>%
 8  mutate( tot = n() ) %>%
 9  filter( tot >= 20 )
10
11pbmc <- subset(seurat, cells = keep.cells$Cell)
12dim(pbmc)
1## [1] 61759  4215

Since the PaCMAP algorithm is implemented in Python, I will use an R wrapper, part of the SeuratWrappers package so I will preprocess the object using the Seurat workflow. If you are not familiar with Seurat, I recommend you to read my previous post on Seurat first.

Before starting, we must convert the Seurat object to the Seurat v5.2.1 structure.

1pbmc <- UpdateSeuratObject(pbmc)
1## Validating object structure
1## Updating object slots
1## Ensuring keys are in the proper structure
1## Updating matrix keys for DimReduc 'pca'
1## Updating matrix keys for DimReduc 'scvi'
1## Updating matrix keys for DimReduc 'tissue_uncorrected_umap'
1## Updating matrix keys for DimReduc 'umap'
1## Updating matrix keys for DimReduc 'umap_scvi_full_donorassay'
1## Updating matrix keys for DimReduc 'umap_tissue_scvi_donorassay'
1## Updating matrix keys for DimReduc 'uncorrected_umap'
1## Ensuring keys are in the proper structure
1## Ensuring feature names don't have underscores or pipes
1## Updating slots in RNA
1## Updating slots in pca
1## Updating slots in scvi
1## Updating slots in tissue_uncorrected_umap
1## Setting tissue_uncorrected_umap DimReduc to global
1## Updating slots in umap
1## Setting umap DimReduc to global
1## Updating slots in umap_scvi_full_donorassay
1## Setting umap_scvi_full_donorassay DimReduc to global
1## Updating slots in umap_tissue_scvi_donorassay
1## Setting umap_tissue_scvi_donorassay DimReduc to global
1## Updating slots in uncorrected_umap
1## Setting uncorrected_umap DimReduc to global
1## Validating object structure for Assay 'RNA'
1## Validating object structure for DimReduc 'pca'
1## Validating object structure for DimReduc 'scvi'
1## Validating object structure for DimReduc 'tissue_uncorrected_umap'
1## Validating object structure for DimReduc 'umap'
1## Validating object structure for DimReduc 'umap_scvi_full_donorassay'
1## Validating object structure for DimReduc 'umap_tissue_scvi_donorassay'
1## Validating object structure for DimReduc 'uncorrected_umap'
1## Object representation is consistent with the most current Seurat version
1pbmc
1## An object of class Seurat 
2## 61759 features across 4215 samples within 1 assay 
3## Active assay: RNA (61759 features, 0 variable features)
4##  2 layers present: counts, data
5##  7 dimensional reductions calculated: pca, scvi, tissue_uncorrected_umap, umap, umap_scvi_full_donorassay, umap_tissue_scvi_donorassay, uncorrected_umap

QC evaluation

1mito.genes <- pbmc[['RNA']][[]] %>%
2  as.data.frame() %>%
3  filter( mt == T ) %>%
4  tibble::rownames_to_column( 'gene' ) %>%
5  pull('gene')
1## filter: removed 61,722 rows (>99%), 37 rows remaining
1pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, features = mito.genes)
2# Visualize QC metrics as a violin plot
3VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
1## Warning: The `slot` argument of `FetchData()` is deprecated as of SeuratObject 5.0.0.
2## ℹ Please use the `layer` argument instead.
3## ℹ The deprecated feature was likely used in the Seurat package.
4##   Please report the issue at <https://github.com/satijalab/seurat/issues>.
5## This warning is displayed once every 8 hours.
6## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
7## generated.
1## Warning: `PackageCheck()` was deprecated in SeuratObject 5.0.0.
2## ℹ Please use `rlang::check_installed()` instead.
3## ℹ The deprecated feature was likely used in the Seurat package.
4##   Please report the issue at <https://github.com/satijalab/seurat/issues>.
5## This warning is displayed once every 8 hours.
6## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
7## generated.
1FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", group.by = 'cell_type' )

As expected, red blood cells have low number of features and counts.

I would like to evaluate the percent mito in the previous plot. By default, FeatureScatter() does not support a color.by argument for numerial values, we can work around this easily by extracting the data and using ggplot2.

1library(ggplot2)
2plot_df <- FetchData(pbmc, vars = c("nCount_RNA", "nFeature_RNA", "percent.mt"))
3ggplot(plot_df, aes(x = nCount_RNA, y = nFeature_RNA, color = percent.mt)) +
4  geom_point(size = 1) +
5  scale_color_viridis_c() +  
6  theme_bw() +
7  labs(color = "% mito", x = "nCount_RNA", y = "nFeature_RNA")
1ggplot(plot_df, aes(x = nCount_RNA)) +
2  geom_histogram(fill = "#2c7fb8", color = "white") +
3  theme_bw() +
4  labs(title = "Histogram of nCount_RNA", x = "nCount_RNA", y = "Cell count")
1## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
1ggplot(plot_df, aes(x = nFeature_RNA)) +
2  geom_histogram(fill = "#2c7fb8", color = "white") +
3  theme_bw() +
4  labs(title = "Histogram of nFeature_RNA", x = "nFeature_RNA", y = "Cell count")
1## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
1ggplot(plot_df, aes(x = percent.mt)) +
2  geom_histogram(fill = "#2c7fb8", color = "white") +
3  theme_bw() +
4  labs(title = "Histogram of percent.mt", x = "percent.mt", y = "Cell count")
1## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Usually we would filter based on genes, umis and mt percentage, depending on the experimental settings. Since we are using cellxgene data, and to keep the red blood cells, I will use only a soft mitochondrial threshold.

1pbmc <- subset(pbmc, subset = percent.mt < 15)
2dim(pbmc)
1## [1] 61759  4207

Normalize data

1pbmc <- NormalizeData(pbmc)
1## Warning: The `slot` argument of `SetAssayData()` is deprecated as of SeuratObject 5.0.0.
2## ℹ Please use the `layer` argument instead.
3## ℹ The deprecated feature was likely used in the Seurat package.
4##   Please report the issue at <https://github.com/satijalab/seurat/issues>.
5## This warning is displayed once every 8 hours.
6## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
7## generated.
1## Warning: The `slot` argument of `GetAssayData()` is deprecated as of SeuratObject 5.0.0.
2## ℹ Please use the `layer` argument instead.
3## ℹ The deprecated feature was likely used in the Seurat package.
4##   Please report the issue at <https://github.com/satijalab/seurat/issues>.
5## This warning is displayed once every 8 hours.
6## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
7## generated.

Find high variable genes for PCA

1pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

Scale data

1pbmc <- ScaleData(pbmc, features = rownames(pbmc) )
1## Centering and scaling data matrix

Run PCA

1pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
 1## PC_ 1 
 2## Positive:  ENSG00000085265, ENSG00000163131, ENSG00000120594, ENSG00000197249, ENSG00000087086, ENSG00000011600, ENSG00000090382, ENSG00000148655, ENSG00000101439, ENSG00000172322 
 3## 	   ENSG00000030582, ENSG00000204472, ENSG00000158869, ENSG00000025708, ENSG00000170458, ENSG00000197746, ENSG00000163220, ENSG00000038427, ENSG00000172243, ENSG00000121552 
 4## 	   ENSG00000170345, ENSG00000123384, ENSG00000127951, ENSG00000257764, ENSG00000165168, ENSG00000090376, ENSG00000163563, ENSG00000104093, ENSG00000197629, ENSG00000113269 
 5## Negative:  ENSG00000127152, ENSG00000198821, ENSG00000008517, ENSG00000165929, ENSG00000113263, ENSG00000211772, ENSG00000277734, ENSG00000227507, ENSG00000171791, ENSG00000153283 
 6## 	   ENSG00000069667, ENSG00000168685, ENSG00000161405, ENSG00000172673, ENSG00000152495, ENSG00000151150, ENSG00000109452, ENSG00000205045, ENSG00000081059, ENSG00000237943 
 7## 	   ENSG00000289740, ENSG00000138378, ENSG00000138795, ENSG00000112182, ENSG00000151623, ENSG00000126353, ENSG00000173762, ENSG00000135426, ENSG00000163519, ENSG00000099204 
 8## PC_ 2 
 9## Positive:  ENSG00000153064, ENSG00000105369, ENSG00000156738, ENSG00000247982, ENSG00000144218, ENSG00000196092, ENSG00000167483, ENSG00000144645, ENSG00000211899, ENSG00000012124 
10## 	   ENSG00000007312, ENSG00000164330, ENSG00000095585, ENSG00000136573, ENSG00000204287, ENSG00000163534, ENSG00000019582, ENSG00000196735, ENSG00000211898, ENSG00000119866 
11## 	   ENSG00000179344, ENSG00000082293, ENSG00000112232, ENSG00000042980, ENSG00000176533, ENSG00000116191, ENSG00000231389, ENSG00000289963, ENSG00000081189, ENSG00000116194 
12## Negative:  ENSG00000008517, ENSG00000198821, ENSG00000077984, ENSG00000127152, ENSG00000113263, ENSG00000165929, ENSG00000205045, ENSG00000277734, ENSG00000153283, ENSG00000172673 
13## 	   ENSG00000138378, ENSG00000122862, ENSG00000168685, ENSG00000237943, ENSG00000109452, ENSG00000152495, ENSG00000069667, ENSG00000152270, ENSG00000173762, ENSG00000059804 
14## 	   ENSG00000164483, ENSG00000138795, ENSG00000211751, ENSG00000163421, ENSG00000145819, ENSG00000074966, ENSG00000172543, ENSG00000163519, ENSG00000135636, ENSG00000123146 
15## PC_ 3 
16## Positive:  ENSG00000197747, ENSG00000084207, ENSG00000124942, ENSG00000182718, ENSG00000213145, ENSG00000197471, ENSG00000108798, ENSG00000168329, ENSG00000125148, ENSG00000156299 
17## 	   ENSG00000182578, ENSG00000137486, ENSG00000175198, ENSG00000281103, ENSG00000198821, ENSG00000100097, ENSG00000119655, ENSG00000106066, ENSG00000008517, ENSG00000196230 
18## 	   ENSG00000115738, ENSG00000166927, ENSG00000182621, ENSG00000104972, ENSG00000198879, ENSG00000163590, ENSG00000105374, ENSG00000114013, ENSG00000155366, ENSG00000170315 
19## Negative:  ENSG00000162747, ENSG00000162551, ENSG00000163421, ENSG00000257335, ENSG00000140932, ENSG00000157551, ENSG00000196549, ENSG00000115590, ENSG00000151726, ENSG00000163993 
20## 	   ENSG00000135636, ENSG00000100985, ENSG00000248323, ENSG00000286076, ENSG00000203710, ENSG00000204936, ENSG00000059804, ENSG00000115594, ENSG00000143797, ENSG00000183762 
21## 	   ENSG00000105835, ENSG00000234426, ENSG00000127954, ENSG00000291237, ENSG00000183019, ENSG00000163221, ENSG00000146592, ENSG00000009694, ENSG00000111052, ENSG00000159339 
22## PC_ 4 
23## Positive:  ENSG00000172005, ENSG00000138795, ENSG00000186854, ENSG00000126353, ENSG00000152495, ENSG00000109452, ENSG00000163519, ENSG00000081059, ENSG00000249806, ENSG00000182463 
24## 	   ENSG00000141576, ENSG00000290067, ENSG00000227507, ENSG00000168685, ENSG00000249859, ENSG00000091409, ENSG00000165591, ENSG00000107104, ENSG00000189283, ENSG00000165272 
25## 	   ENSG00000246223, ENSG00000184613, ENSG00000111863, ENSG00000197635, ENSG00000151150, ENSG00000170074, ENSG00000135426, ENSG00000271774, ENSG00000232021, ENSG00000237943 
26## Negative:  ENSG00000105374, ENSG00000115523, ENSG00000134539, ENSG00000100453, ENSG00000180644, ENSG00000145649, ENSG00000137441, ENSG00000100450, ENSG00000275302, ENSG00000116667 
27## 	   ENSG00000271503, ENSG00000069702, ENSG00000107719, ENSG00000150045, ENSG00000171476, ENSG00000181036, ENSG00000159618, ENSG00000205336, ENSG00000170962, ENSG00000156475 
28## 	   ENSG00000077984, ENSG00000159674, ENSG00000168229, ENSG00000251301, ENSG00000185697, ENSG00000150687, ENSG00000104490, ENSG00000235162, ENSG00000169583, ENSG00000172543 
29## PC_ 5 
30## Positive:  ENSG00000129757, ENSG00000287682, ENSG00000188290, ENSG00000107954, ENSG00000157827, ENSG00000203747, ENSG00000180229, ENSG00000142089, ENSG00000233392, ENSG00000148737 
31## 	   ENSG00000214407, ENSG00000183779, ENSG00000166165, ENSG00000166068, ENSG00000135047, ENSG00000259225, ENSG00000224397, ENSG00000258732, ENSG00000155366, ENSG00000123358 
32## 	   ENSG00000229140, ENSG00000166927, ENSG00000182578, ENSG00000170873, ENSG00000142512, ENSG00000120949, ENSG00000074416, ENSG00000106341, ENSG00000107736, ENSG00000226321 
33## Negative:  ENSG00000120885, ENSG00000008394, ENSG00000138061, ENSG00000135218, ENSG00000038427, ENSG00000121807, ENSG00000169385, ENSG00000257764, ENSG00000164125, ENSG00000182621 
34## 	   ENSG00000136830, ENSG00000135046, ENSG00000161944, ENSG00000164124, ENSG00000090382, ENSG00000120738, ENSG00000121316, ENSG00000105967, ENSG00000010327, ENSG00000226822 
35## 	   ENSG00000186815, ENSG00000157168, ENSG00000110077, ENSG00000104918, ENSG00000257017, ENSG00000125740, ENSG00000177575, ENSG00000128487, ENSG00000170458, ENSG00000178695
1DimPlot(pbmc, reduction = "pca", group.by = 'cell_type')

Run UMAP, for this quick test I will use all PCs

1pbmc <- RunUMAP(pbmc, dims = 1:50)
1## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
2## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
3## This message will be shown once per session
1## 17:13:37 UMAP embedding parameters a = 0.9922 b = 1.112
1## 17:13:37 Read 4207 rows and found 50 numeric columns
1## 17:13:37 Using Annoy for neighbor search, n_neighbors = 30
1## 17:13:37 Building Annoy index with metric = cosine, n_trees = 50
1## 0%   10   20   30   40   50   60   70   80   90   100%
1## [----|----|----|----|----|----|----|----|----|----|
1## **************************************************|
2## 17:13:37 Writing NN index file to temp file /tmp/RtmpRhKGc1/file14a1d62c832e8c
3## 17:13:37 Searching Annoy index using 1 thread, search_k = 3000
4## 17:13:38 Annoy recall = 100%
5## 17:13:38 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
6## 17:13:38 Initializing from normalized Laplacian + noise (using RSpectra)
7## 17:13:39 Commencing optimization for 500 epochs, with 175432 positive edges
8## 17:13:39 Using rng type: pcg
9## 17:13:43 Optimization finished

Plot UMAP

1DimPlot(pbmc, reduction = "umap", label = TRUE, repel = TRUE, group.by = 'cell_type' ) + NoLegend()

Run PaCMAP using the SeuratWrappers package.

1# remotes::install_github('satijalab/seurat-wrappers')
2library(SeuratWrappers)
3pbmc <- RunPaCMAP(object = pbmc, features = Seurat::VariableFeatures(pbmc))
 1## Applied PCA, the dimensionality becomes 100
 2## PaCMAP(n_neighbors=10, n_MN=5, n_FP=20, distance=euclidean, lr=1.0, n_iters=(100, 100, 250), apply_pca=True, opt_method='adam', verbose=True, intermediate=False, seed=11)
 3## Finding pairs
 4## Found nearest neighbor
 5## Calculated sigma
 6## Found scaled dist
 7## Pairs sampled successfully.
 8## ((42070, 2), (21035, 2), (84140, 2))
 9## Initial Loss: 51819.17578125
10## Iteration:   10, Loss: 39982.265625
11## Iteration:   20, Loss: 31825.072266
12## Iteration:   30, Loss: 29064.273438
13## Iteration:   40, Loss: 27609.316406
14## Iteration:   50, Loss: 26209.269531
15## Iteration:   60, Loss: 24828.828125
16## Iteration:   70, Loss: 23298.683594
17## Iteration:   80, Loss: 21438.656250
18## Iteration:   90, Loss: 19014.822266
19## Iteration:  100, Loss: 15215.458984
20## Iteration:  110, Loss: 18434.947266
21## Iteration:  120, Loss: 18176.066406
22## Iteration:  130, Loss: 18079.007812
23## Iteration:  140, Loss: 18039.667969
24## Iteration:  150, Loss: 18020.625000
25## Iteration:  160, Loss: 18009.695312
26## Iteration:  170, Loss: 18001.855469
27## Iteration:  180, Loss: 17995.763672
28## Iteration:  190, Loss: 17990.417969
29## Iteration:  200, Loss: 17984.171875
30## Iteration:  210, Loss: 8186.588379
31## Iteration:  220, Loss: 8046.067871
32## Iteration:  230, Loss: 7987.240234
33## Iteration:  240, Loss: 7946.288086
34## Iteration:  250, Loss: 7918.241211
35## Iteration:  260, Loss: 7897.881348
36## Iteration:  270, Loss: 7881.938477
37## Iteration:  280, Loss: 7868.776367
38## Iteration:  290, Loss: 7857.554199
39## Iteration:  300, Loss: 7847.858398
40## Iteration:  310, Loss: 7839.208008
41## Iteration:  320, Loss: 7831.520020
42## Iteration:  330, Loss: 7824.563477
43## Iteration:  340, Loss: 7818.220215
44## Iteration:  350, Loss: 7812.399414
45## Iteration:  360, Loss: 7806.992188
46## Iteration:  370, Loss: 7802.010742
47## Iteration:  380, Loss: 7797.365723
48## Iteration:  390, Loss: 7793.033691
49## Iteration:  400, Loss: 7788.949219
50## Iteration:  410, Loss: 7785.119141
51## Iteration:  420, Loss: 7781.532227
52## Iteration:  430, Loss: 7778.101562
53## Iteration:  440, Loss: 7774.856445
54## Iteration:  450, Loss: 7771.775879
55## Elapsed time: 0.86s

Plot PaCMAP

1DimPlot(pbmc, reduction="pacmap", label = TRUE, repel = TRUE, group.by = 'cell_type' ) + NoLegend()

Other datasets

I will use 2 datasets from the Human Cell Atlas, retrieved using the scRNAseq package.

1library(scRNAseq)
2library(SingleCellExperiment)

Liver

Download and normalize SingleCellExperiment object usign a custom function. I challenge you to create such a function (tip: you can find the solution at the end of the post)

1sce.liver <- fetchDataset("he-organs-2020","2023-12-21","liver")
2sce.liver <- normalizeSCE(sce.liver)

Prepare the object defining the format of the data to avoid issues converting the object.

1library(Matrix)
2counts(sce.liver) <- as(counts(sce.liver), "dgCMatrix")
3logcounts(sce.liver) <- as(logcounts(sce.liver), "dgCMatrix")
1liver <- as.Seurat(sce.liver, counts = "counts", data = "logcounts")
2liver <- RenameAssays(liver, originalexp = "RNA")
1## Renaming default assay from originalexp to RNA
1## Warning: Key 'originalexp_' taken, using 'rna_' instead
1DefaultAssay(liver) <- "RNA"

As in previous posts, as with normalization I use a custom function for the Seurat processing. I challenge you again to create a similar function.

1liver <- seuratProcessing(liver)
1## Centering and scaling data matrix
 1## PC_ 1 
 2## Positive:  SRGN, ACTB, TYROBP, CD52, MT-CO1, FCER1G, CTSS, CCL4, AIF1, BCL2A1 
 3## 	   CYBB, LST1, COTL1, LYZ, IFI30, CCL3, SPI1, CTSC, CD69, FTH1 
 4## 	   NKG7, FCN1, C5AR1, FCGR3A, CCL5, LGALS1, CD83, FTL, S100A9, HLA-DRA 
 5## Negative:  TM4SF1, TINAGL1, IFI27, PLPP3, HSPG2, APP, IGFBP4, IL33, CD59, PDLIM1 
 6## 	   SLC9A3R2, IGFBP7, RAMP2, CLEC3B, CRIP2, GNG11, A2M, SPARC, HYAL2, LDB2 
 7## 	   CCL14, CALCRL, NPDC1, S100A16, ID1, TFPI, PTPRB, AKAP12, EGFL7, ADGRL4 
 8## PC_ 2 
 9## Positive:  FTL, TIMP1, PSAP, NPC2, CTSB, CTSL, CD14, MS4A6A, GPX1, FCGRT 
10## 	   AIF1, CTSS, CTSD, FKBP1A, HLA-DRA, CD68, HLA-DRB5, CST3, FCER1G, NEAT1 
11## 	   SGK1, ASAH1, IFI30, CYBB, TYROBP, MRC1, MARCKS, HLA-DRB1, SAT1, RNASE1 
12## Negative:  KRT8, DEFB1, FXYD2, KRT7, VTN, KRT18, TACSTD2, CLDN1, TM4SF4, SOX9 
13## 	   GC, AMBP, CD24, KRT19, CLDN10, ELF3, AGT, CLDN7, CLDN3, CLDN4 
14## 	   ALB, TSPAN8, PAH, ANXA4, LGALS4, SPP1, SERPINA5, PRSS22, EPCAM, CFTR 
15## PC_ 3 
16## Positive:  CST3, SERPINA1, PSAP, FTL, CTSS, AIF1, CTSB, FTH1, CD68, IFI30 
17## 	   CYBB, FCER1G, SAT1, NPC2, TYROBP, LYZ, GPX1, LST1, HLA-DRA, MS4A7 
18## 	   ASAH1, SPI1, SOD2, C1QB, MAFB, LIPA, C1QA, C5AR1, GLUL, MS4A6A 
19## Negative:  CCL5, IL32, CD69, CST7, NKG7, GZMA, CD7, KLRB1, KLRD1, CD8A 
20## 	   PRF1, TRBC1, IFNG, SH2D2A, CCL4, XCL2, GZMH, CD247, XCL1, GZMK 
21## 	   IL2RB, LINC00861, SH2D1A, GZMB, AC092580.4, CD8B, HOPX, STMN1, KLRF1, NCR3 
22## PC_ 4 
23## Positive:  FCN2, FCN3, CLEC4G, CRHBP, OIT3, DNASE1L3, CLEC1B, STAB2, FAM167B, CLEC4M 
24## 	   PLTP, ACP5, RELN, KDR, FCGR2B, GPR182, HECW2, NID1, TSPAN7, LYVE1 
25## 	   F8, MRC1, PRCP, CCL14, MAN1C1, ALPL, STAB1, RAMP3, FZD4, ADGRL4 
26## Negative:  MYL9, TPM2, MYH11, TAGLN, ACTA2, MYLK, SOD3, MFGE8, TPM1, SPARCL1 
27## 	   CNN1, HEYL, CCDC3, HSPB6, FRZB, PTP4A3, MT1M, SLIT3, CRISPLD2, CALD1 
28## 	   PRKCDBP, WFDC1, RERGL, ADIRF, RCAN2, PLN, PLAC9, C11orf96, EDNRA, LMOD1 
29## PC_ 5 
30## Positive:  LGMN, C1QC, C1QB, C1QA, MRC1, SEPP1, SLC40A1, MAF, LYVE1, CTSL 
31## 	   VSIG4, FOLR2, FPR3, HSP90B1, SDC3, GPNMB, LIPA, TYMS, CALR, APOC1 
32## 	   SLC7A8, FCGR3A, MSR1, TUBB, KIAA0101, SMPDL3A, CD163, TUBA1B, RGL1, PDK4 
33## Negative:  S100A8, VCAN, S100A12, S100A6, S100A9, LYZ, MNDA, FCN1, S100A10, CXCL8 
34## 	   EREG, VIM, CYP1B1, SELL, AQP9, LGALS2, CLU, EMP1, PLBD1, NRG1 
35## 	   CD300E, CXCL2, CAPG, VNN2, PLAUR, PTGS2, PECAM1, AREG, LILRA5, MGST1
1## 17:14:07 UMAP embedding parameters a = 0.9922 b = 1.112
1## 17:14:07 Read 2839 rows and found 50 numeric columns
1## 17:14:07 Using Annoy for neighbor search, n_neighbors = 30
1## 17:14:07 Building Annoy index with metric = cosine, n_trees = 50
1## 0%   10   20   30   40   50   60   70   80   90   100%
1## [----|----|----|----|----|----|----|----|----|----|
1## **************************************************|
2## 17:14:07 Writing NN index file to temp file /tmp/RtmpRhKGc1/file14a1d64fd33fdf
3## 17:14:07 Searching Annoy index using 1 thread, search_k = 3000
4## 17:14:08 Annoy recall = 100%
5## 17:14:08 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
6## 17:14:10 Initializing from normalized Laplacian + noise (using RSpectra)
7## 17:14:10 Commencing optimization for 500 epochs, with 118312 positive edges
8## 17:14:10 Using rng type: pcg
9## 17:14:12 Optimization finished
 1## Applied PCA, the dimensionality becomes 100
 2## PaCMAP(n_neighbors=10, n_MN=5, n_FP=20, distance=euclidean, lr=1.0, n_iters=(100, 100, 250), apply_pca=True, opt_method='adam', verbose=True, intermediate=False, seed=11)
 3## Finding pairs
 4## Found nearest neighbor
 5## Calculated sigma
 6## Found scaled dist
 7## Pairs sampled successfully.
 8## ((28390, 2), (14195, 2), (56780, 2))
 9## Initial Loss: 34970.43359375
10## Iteration:   10, Loss: 27314.640625
11## Iteration:   20, Loss: 22465.218750
12## Iteration:   30, Loss: 20762.496094
13## Iteration:   40, Loss: 19591.761719
14## Iteration:   50, Loss: 18551.546875
15## Iteration:   60, Loss: 17575.416016
16## Iteration:   70, Loss: 16500.394531
17## Iteration:   80, Loss: 15241.031250
18## Iteration:   90, Loss: 13608.322266
19## Iteration:  100, Loss: 11090.785156
20## Iteration:  110, Loss: 13570.200195
21## Iteration:  120, Loss: 13434.894531
22## Iteration:  130, Loss: 13375.796875
23## Iteration:  140, Loss: 13353.393555
24## Iteration:  150, Loss: 13341.698242
25## Iteration:  160, Loss: 13335.852539
26## Iteration:  170, Loss: 13332.177734
27## Iteration:  180, Loss: 13329.578125
28## Iteration:  190, Loss: 13327.824219
29## Iteration:  200, Loss: 13326.364258
30## Iteration:  210, Loss: 6296.515625
31## Iteration:  220, Loss: 6198.218750
32## Iteration:  230, Loss: 6157.759766
33## Iteration:  240, Loss: 6132.411621
34## Iteration:  250, Loss: 6116.173340
35## Iteration:  260, Loss: 6104.250977
36## Iteration:  270, Loss: 6095.444824
37## Iteration:  280, Loss: 6088.108398
38## Iteration:  290, Loss: 6081.856445
39## Iteration:  300, Loss: 6076.359375
40## Iteration:  310, Loss: 6071.420898
41## Iteration:  320, Loss: 6066.969727
42## Iteration:  330, Loss: 6062.935059
43## Iteration:  340, Loss: 6059.186523
44## Iteration:  350, Loss: 6055.759766
45## Iteration:  360, Loss: 6052.583496
46## Iteration:  370, Loss: 6049.616699
47## Iteration:  380, Loss: 6046.854492
48## Iteration:  390, Loss: 6044.269043
49## Iteration:  400, Loss: 6041.814941
50## Iteration:  410, Loss: 6039.538086
51## Iteration:  420, Loss: 6037.386719
52## Iteration:  430, Loss: 6035.338379
53## Iteration:  440, Loss: 6033.396973
54## Iteration:  450, Loss: 6031.556641
55## Elapsed time: 0.58s

Plot PCA

1DimPlot(liver, reduction = "pca", group.by = 'Cell_type_in_merged_data')

Plot UMAP

1DimPlot(liver, reduction = "umap", label = TRUE, repel = TRUE, group.by = 'Cell_type_in_merged_data' ) + NoLegend()

Plot PaCMAP

1DimPlot(liver, reduction = "pacmap", label = TRUE, repel = TRUE, group.by = 'Cell_type_in_merged_data' ) + NoLegend()

Bone Marrow

Download and normalize

1sce.marrow <- fetchDataset("he-organs-2020","2023-12-21","marrow")
2sce.marrow <- normalizeSCE(sce.marrow)

Prepare the object defining the format of the data to avoid issues converting the object.

1library(Matrix)
2counts(sce.marrow) <- as(counts(sce.marrow), "dgCMatrix")
3logcounts(sce.marrow) <- as(logcounts(sce.marrow), "dgCMatrix")
1marrow <- as.Seurat(sce.marrow, counts = "counts", data = "logcounts")
2marrow <- RenameAssays(marrow, originalexp = "RNA")
1## Renaming default assay from originalexp to RNA
1## Warning: Key 'originalexp_' taken, using 'rna_' instead
1DefaultAssay(marrow) <- "RNA"

Process using custom function

1marrow <- seuratProcessing(marrow)
1## Centering and scaling data matrix
 1## PC_ 1 
 2## Positive:  CCL4, RPS6, RP11-291B21.2, PRDM1, XCL2, CD27, GZMB, LEF1, RGS1, XCL1 
 3## 	   CD40LG, RPLP0, TRGV9, CCL4L2, RPS2, RPS8, FGFBP2, TRGV8, PIM2, FKBP11 
 4## 	   ERN1, RPL7A, MIF, TRGV2, CXCR6, GNLY, ITGA6, CLIC3, CCR7, TRGV10 
 5## Negative:  CST3, FCN1, SERPINA1, CD68, IFI30, AIF1, CD14, TMEM176B, MAFB, LST1 
 6## 	   SPI1, CSF1R, MS4A7, LILRA5, LYZ, CYBB, LILRB2, IGSF6, CD302, CFP 
 7## 	   LRRC25, MS4A6A, LINC01272, PILRA, DMXL2, CPVL, CFD, FTL, SLC7A7, MPEG1 
 8## PC_ 2 
 9## Positive:  CD68, SERPINA1, CTSS, MS4A7, FCN1, IFI30, CYBB, MAFB, AIF1, LILRA5 
10## 	   LST1, LILRB2, CSF1R, TMEM176B, CD14, SPI1, LINC01272, CST3, PILRA, IGSF6 
11## 	   TIMP1, CFP, MPEG1, CD74, LRRC25, SLC7A7, CLEC7A, DMXL2, HMOX1, C1QA 
12## Negative:  CA1, AHSP, HBB, HBA2, HBA1, HBD, HBM, SLC4A1, GYPA, CA2 
13## 	   RHAG, GYPB, ALAS2, SPTA1, HMBS, HEMGN, TMEM56, SLC25A37, EPB42, CR1L 
14## 	   KLF1, MKI67, SPTB, SNCA, TFRC, NUSAP1, ANK1, GYPE, RHD, CENPF 
15## PC_ 3 
16## Positive:  CD79A, MS4A1, HLA-DRA, BANK1, MEF2C, RALGPS2, CD74, LINC00926, FAM129C, CD79B 
17## 	   POU2AF1, IGHM, IGKC, HLA-DQA1, TCF4, VPREB3, ADAM28, HLA-DQB1, MZB1, AFF3 
18## 	   SPIB, RP11-693J15.5, HLA-DRB1, FCRL5, RPS8, CD22, JCHAIN, FCRLA, SYK, IGHA1 
19## Negative:  CCL4, GZMB, GNLY, FCGR3A, TYROBP, FGFBP2, KLRF1, MT2A, CX3CR1, SAMHD1 
20## 	   CLIC3, SPON2, CTSD, FCER1G, XCL2, TRDC, KLRC3, LGALS1, RP11-291B21.2, PRDM1 
21## 	   IFITM3, CCL3, SLC4A1, SH2D1B, HBM, HBA1, GAPDH, CEBPD, HBA2, KLRC2 
22## PC_ 4 
23## Positive:  STMN1, TYMS, GAPDH, CNRIP1, CYTL1, KIAA0101, MYB, CDT1, TK1, TXNDC5 
24## 	   CMBL, EGFL7, PRSS57, ITGA2B, GINS2, MCM10, SMIM10, MAP7, NREP, APOC1 
25## 	   SOX4, SDPR, ALDH1A1, GATA2, PPIB, CDK6, RP11-620J15.3, XBP1, MCM4, CDC25A 
26## Negative:  SLC4A1, MS4A1, HBM, ALAS2, HBA1, GYPA, HBA2, RHAG, SLC25A37, GYPB 
27## 	   CD79A, HLA-DRA, BANK1, EPB42, LINC00926, AHSP, HEMGN, CD74, SPTA1, HBD 
28## 	   CA2, CR1L, FAM129C, GYPE, TSPO2, HLA-DQA1, HLA-DRB1, HBB, RHD, TMEM56 
29## PC_ 5 
30## Positive:  CNRIP1, CYTL1, MYB, EGFL7, CMBL, ITGA2B, SMIM10, APOC1, SDPR, ALDH1A1 
31## 	   SOX4, MAP7, RPS8, GATA2, RP11-354E11.2, NREP, RPS6, PRKAR2B, MYC, RP11-620J15.3 
32## 	   CDC42BPA, ZNF521, IL1B, SLC40A1, FAM178B, PRSS57, RPS2, DPPA4, CD34, HACD1 
33## Negative:  TXNDC5, JCHAIN, MZB1, TNFRSF17, DERL3, GAS6, ITM2C, IGLL5, TRIB1, HSP90B1 
34## 	   IGKC, XBP1, SEC11C, RRM2, ELL2, SDF2L1, MYDGF, FKBP11, PPIB, RRBP1 
35## 	   IGHG1, BMP8B, PRDX4, POU2AF1, PDIA4, FNDC3B, SPCS3, IGHA1, IRF4, SSR3
1## 17:14:29 UMAP embedding parameters a = 0.9922 b = 1.112
1## 17:14:29 Read 3230 rows and found 50 numeric columns
1## 17:14:29 Using Annoy for neighbor search, n_neighbors = 30
1## 17:14:29 Building Annoy index with metric = cosine, n_trees = 50
1## 0%   10   20   30   40   50   60   70   80   90   100%
1## [----|----|----|----|----|----|----|----|----|----|
1## **************************************************|
2## 17:14:29 Writing NN index file to temp file /tmp/RtmpRhKGc1/file14a1d668374e00
3## 17:14:29 Searching Annoy index using 1 thread, search_k = 3000
4## 17:14:30 Annoy recall = 100%
5## 17:14:30 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
6## 17:14:31 Initializing from normalized Laplacian + noise (using RSpectra)
7## 17:14:31 Commencing optimization for 500 epochs, with 141670 positive edges
8## 17:14:31 Using rng type: pcg
9## 17:14:34 Optimization finished
 1## Applied PCA, the dimensionality becomes 100
 2## PaCMAP(n_neighbors=10, n_MN=5, n_FP=20, distance=euclidean, lr=1.0, n_iters=(100, 100, 250), apply_pca=True, opt_method='adam', verbose=True, intermediate=False, seed=11)
 3## Finding pairs
 4## Found nearest neighbor
 5## Calculated sigma
 6## Found scaled dist
 7## Pairs sampled successfully.
 8## ((32300, 2), (16150, 2), (64600, 2))
 9## Initial Loss: 39786.1015625
10## Iteration:   10, Loss: 31564.384766
11## Iteration:   20, Loss: 26498.765625
12## Iteration:   30, Loss: 24599.064453
13## Iteration:   40, Loss: 23357.765625
14## Iteration:   50, Loss: 22296.308594
15## Iteration:   60, Loss: 21254.789062
16## Iteration:   70, Loss: 20105.675781
17## Iteration:   80, Loss: 18757.550781
18## Iteration:   90, Loss: 17056.492188
19## Iteration:  100, Loss: 14561.528320
20## Iteration:  110, Loss: 18082.763672
21## Iteration:  120, Loss: 17965.183594
22## Iteration:  130, Loss: 17916.142578
23## Iteration:  140, Loss: 17902.347656
24## Iteration:  150, Loss: 17892.806641
25## Iteration:  160, Loss: 17889.906250
26## Iteration:  170, Loss: 17887.958984
27## Iteration:  180, Loss: 17886.531250
28## Iteration:  190, Loss: 17885.644531
29## Iteration:  200, Loss: 17885.613281
30## Iteration:  210, Loss: 8883.780273
31## Iteration:  220, Loss: 8748.736328
32## Iteration:  230, Loss: 8681.777344
33## Iteration:  240, Loss: 8642.035156
34## Iteration:  250, Loss: 8625.769531
35## Iteration:  260, Loss: 8614.654297
36## Iteration:  270, Loss: 8606.169922
37## Iteration:  280, Loss: 8599.128906
38## Iteration:  290, Loss: 8592.977539
39## Iteration:  300, Loss: 8587.457031
40## Iteration:  310, Loss: 8582.447266
41## Iteration:  320, Loss: 8577.856445
42## Iteration:  330, Loss: 8573.578125
43## Iteration:  340, Loss: 8569.664062
44## Iteration:  350, Loss: 8565.967773
45## Iteration:  360, Loss: 8562.587891
46## Iteration:  370, Loss: 8559.376953
47## Iteration:  380, Loss: 8556.396484
48## Iteration:  390, Loss: 8553.584961
49## Iteration:  400, Loss: 8550.892578
50## Iteration:  410, Loss: 8548.406250
51## Iteration:  420, Loss: 8546.052734
52## Iteration:  430, Loss: 8543.774414
53## Iteration:  440, Loss: 8541.660156
54## Iteration:  450, Loss: 8539.642578
55## Elapsed time: 0.68s

plot PCA

1DimPlot(marrow, reduction = "pca", group.by = 'Cell_type_in_merged_data')

UMAP

1DimPlot(marrow, reduction = "umap", label = TRUE, repel = TRUE, group.by = 'Cell_type_in_merged_data' ) + NoLegend()

PaCMAP

1DimPlot(marrow, reduction = "pacmap", label = TRUE, repel = TRUE, group.by = 'Cell_type_in_merged_data' ) + NoLegend()

Conclusions

  • Main cell type relationship does not change between PaCMAP and UMAP
  • The only difference between PaCMAP and UMAP is the location of cell type clusters with very small number of cells

Take away

  • I think I will include PaCMAP in my future single cell analysis, it is fast and easy to use.
  • It would be interesting to compare with UMAP in a real life test

Further reading

Bonus I: Animated GIF

Code to generate the GIF

 1library(magick)
 2
 3umap_plot <- DimPlot(marrow, reduction = "umap", label = TRUE, repel = TRUE, group.by = 'Cell_type_in_merged_data') + 
 4  ggtitle("UMAP") + NoLegend()
 5
 6pacmap_plot <- DimPlot(marrow, reduction = "pacmap", label = TRUE, repel = TRUE, group.by = 'Cell_type_in_merged_data') + 
 7  ggtitle("PaCMAP") + NoLegend()
 8
 9umap_plot_file <- tempfile(fileext = ".png")
10pacmap_plot_file <- tempfile(fileext = ".png")
11ggsave(umap_plot_file, plot = umap_plot, width = 6, height = 6)
12ggsave(pacmap_plot_file, plot = pacmap_plot, width = 6, height = 6)
13
14umap_img <- image_read(umap_plot_file)
15pacmap_img <- image_read(pacmap_plot_file)
16
17frames <- c(umap_img, pacmap_img)
18gif <- image_animate(image_join(frames), fps = 1)
19
20image_write(gif, "UMAP_PaCMAP.gif")

Bonus II: Code solution

Here the code I used for the two extra datasets.

 1normalizeSCE <- function(sce) {
 2  require(scran)
 3  sce.clusters <- quickCluster(sce, BPPARAM=BiocParallel::MulticoreParam(8))
 4  sce <- computeSumFactors( sce, 
 5                            cluster=sce.clusters, 
 6                            min.mean=0.1, 
 7                            BPPARAM=BiocParallel::MulticoreParam(8)
 8                          )
 9  sce <- logNormCounts( sce, BPPARAM=BiocParallel::MulticoreParam(8) )
10  return(sce)
11}
12
13seuratProcessing <- function(seurat){
14  seurat <- FindVariableFeatures( seurat, selection.method = "vst", nfeatures = 2000) %>%
15    ScaleData( features = rownames(.) ) %>%
16    RunPCA( features = VariableFeatures(object = .)) %>%
17    RunUMAP( dims = 1:50) %>%
18    RunPaCMAP(features = Seurat::VariableFeatures(.))
19 return(seurat)
20}
comments powered by Disqus