Pseudotime Trajectories with Slingshot

Single cell trajectory analysis can be defined as a technique to determine the pattern of a dynamic process, such as differentiation, experienced by cells and then arrange them based on their progression through that process.

Dynamic biological process (Nguyen et al., Front. Cell Dev. Biol. 2019, 7: 254)

Over 50 methods developed since 2014. However, the core workflow is similar for all of them:

  1. Represent Cells by Gene Expression. This generates a high-dimensional space where each dimension is the expression of a specific gene.
  2. Dimensional Reduction
  3. Calculate a Minimum Spanning Tree (MST)
  4. Compute Principal Curves
  5. Project Cells onto Principal Curves
  6. Calculate Pseudotime

Although there is no best method, according to Saelens et al. (Nat. Biotechnol. 2019), Slingshot is one of the top ranked methods and that is why I selected it for this post.

Trajectory analysis benchmark

Slingshot (Saelens et al., Nat Biotechnol 2019) can work with multiple branches/lineages and has tow main stages:

  1. Global lineage structure using MST on clustered data points
  2. Inference of pseudotime variables for cells along each lineage by fitting simultaneous ‘principal curves’ across multiple lineages
Slingshot

We will use two different datasets to reconstruct trajectories using slingshot, make some plots and find genes changing along the trajectories.

Pancreas dataset

Dataset from Bastidas-Ponce et al. (2018) downloaded from scVelo github (https://github.com/theislab/scvelo_notebooks/tree/master/data/Pancreas/endocrinogenesis_day15.h5ad).

Convert the h5ad file to an RDS file using the zellkonverter package. This is a Bioconductor package that allows you to read and write AnnData files in R.

1library(zellkonverter)
2sce <- readH5AD('endocrinogenesis_day15.h5ad')

Inspect the object

1library(SingleCellExperiment)
2library(scater)
3sce
## class: SingleCellExperiment 
## dim: 27998 3696 
## metadata(5): clusters_coarse_colors clusters_colors day_colors
##   neighbors pca
## assays(3): X spliced unspliced
## rownames(27998): Xkr4 Gm37381 ... Gm20837 Erdr1
## rowData names(1): highly_variable_genes
## colnames(3696): AAACCTGAGAGGGATA AAACCTGAGCCTTGAT ... TTTGTCATCGAATGCT
##   TTTGTCATCTGTTTGT
## colData names(4): clusters_coarse clusters S_score G2M_score
## reducedDimNames(2): X_pca X_umap
## mainExpName: NULL
## altExpNames(0):

Filter non expressed genes

1sce <- sce[rowSums(assay(sce, "X"))>0,]
2dim(sce)
## [1] 17750  3696

Visualize the cell types on a UMAP plot

1plotReducedDim(sce, 'X_umap', colour_by = 'clusters', text_by = 'clusters') +
2  theme_bw() 

Calculate trajectories using slingshot from the UMAP coordinates.

1library(slingshot)
2pto <- slingshot(reducedDim(sce,'X_umap'), clusterLabels = sce$clusters,
3                 start.clus = 'Ductal', end.clus = c('Alpha', 'Beta', 'Delta', 'Epsilon'), spar = 1.1)

We start checking the Minimum spanning tree (MST) that connects all individual cells or cell states with the minimum total distance, reflecting the shortest path through cellular transitions.

1slsMST <- slingMST(pto, as.df = TRUE)
2plotReducedDim(sce, dimred='X_umap', colour_by = 'clusters') + 
3  geom_point(data = slsMST, aes(Dim.1, Dim.2), size = 3, color = "grey15") +
4  geom_path(data = slsMST %>% arrange(Order), 
5              aes(Dim.1, Dim.2, group = Lineage)) +
6  labs(title = "Slingshot MST on UMAP", x = "UMAP 1", y = "UMAP 2")
1emb <- embedCurves(pto, reducedDim(sce,'X_umap'))
2emb <- slingCurves(emb)
1g <- plotReducedDim(sce, 'X_umap', text_by = 'clusters', colour_by='clusters')
2for (path in emb) {
3    embedded_curves.df <- data.frame(path$s[path$ord,])
4    g <- g + geom_path(data=embedded_curves.df, aes(x=Dim.1, y=Dim.2), linewidth=1.2)
5}
6print(g)

calculate the UMAP coordinates in 3D for later

1sce <- runUMAP( sce, dimred = 'X_pca', 
2                BPPARAM=BiocParallel::MulticoreParam(6), 
3                ncomponents = 3, name = 'UMAP_3D'
4              )

Although this works, it is recommended to calculate the trajectories on the PCA coordinates instead of UMAP. This is because UMAP is a non-linear method and the distances between points are not preserved.

1sce.sling <- slingshot(sce, reducedDim='X_pca', clusterLabels = sce$clusters,
2                     start.clus = 'Ductal', end.clus = c('Alpha', 'Beta', 'Delta', 'Epsilon'), spar = 1.1)

Check MST on PCA coordinates

1slsMST <- slingMST(sce.sling, as.df = TRUE)
2plotReducedDim(sce.sling, dimred='X_pca', colour_by = 'clusters') + 
3  geom_point(data = slsMST, aes(Dim.1, Dim.2), size = 3, color = "grey15") +
4  geom_path(data = slsMST %>% arrange(Order), 
5              aes(Dim.1, Dim.2, group = Lineage)) +
6  labs(title = "Slingshot MST on PCA", x = "PCA 1", y = "PCA 2")

Calculate the fitted curve for PCA coordinates in the UMAP space

1embedded <- embedCurves(sce.sling, "X_umap")
2embedded <- slingCurves(embedded)

Trajectories on cell types

1g <- plotReducedDim(sce.sling, 'X_umap', text_by = 'clusters', colour_by='clusters')
2for (path in embedded) {
3    embedded_curves.df <- data.frame(path$s[path$ord,])
4    g <- g + geom_path(data=embedded_curves.df, aes(x=Dim.1, y=Dim.2), linewidth=1.2)
5}
6print(g)

Trajectories with common pseudotime. Each lineage has its own pseudotime, we can calculate a common pseudotime for all lineages as the mean of the specific pseudotimes.

1sce.sling$Pseudotime <- rowMeans( slingPseudotime(sce.sling), na.rm=TRUE)

Plot pseudotime in UMAP coordinates.

1g <- plotReducedDim(sce.sling, 'X_umap', text_by = 'clusters', colour_by='Pseudotime')
2for (path in embedded) {
3    embedded_curves.df <- data.frame(path$s[path$ord,])
4    g <- g + geom_path(data=embedded_curves.df, aes(x=Dim.1, y=Dim.2), linewidth=1.2)
5}
6print(g)

Let’s plot the trajectories on 3D. First we must calculate the embedded curves

1embedded <- embedCurves(sce.sling, 'UMAP_3D')

and plot as an interactive plot to visualize the 3D structure

 1library(dplyr)
 2library(plotly)
 3
 4plot.df <- as.data.frame( reducedDim( sce.sling, 'UMAP_3D' ) )
 5plot.df$cluster <- sce.sling[['clusters']]
 6colnames(plot.df) <- c('UMAP1', 'UMAP2', 'UMAP3', 'cluster')
 7redDim <- 'UMAP'
 8
 9p <- plot_ly( ) %>% # colors = trace.colors ) %>%
10  add_trace( data = plot.df,
11             x = ~UMAP1, y = ~UMAP2, z = ~UMAP3,
12             color = ~cluster,
13             type = 'scatter3d',
14             mode = 'markers',
15             marker = list( size = 3, opacity = 1 )
16           )
17
18for (i in 1:4) {
19  embedded_curves.tmp <- slingCurves(embedded)[[i]] # only 1 path
20  embedded_curves.tmp.df <- data.frame(embedded_curves.tmp$s[embedded_curves.tmp$ord,])
21  p <- p %>% 
22    add_trace( data = embedded_curves.tmp.df,
23               x = ~UMAP1, y = ~UMAP2, z = ~UMAP3, 
24               type = 'scatter3d', mode = "lines",
25               name = paste0("Lineage ", i),
26               line = list( width = 5, opacity = 1 )
27             )
28}
29
30p %>% 
31  layout( scene = list( xaxis = list(title = paste0( redDim, '1') ),
32                        yaxis = list(title = paste0( redDim, '2') ),
33                        zaxis = list(title = paste0( redDim, '3') )
34                      )
35        )
 1plot.df$Pseudotime <- sce.sling$Pseudotime
 2
 3p <- plot_ly( ) %>% # colors = trace.colors ) %>%
 4  add_trace( data = plot.df,
 5             x = ~UMAP1, y = ~UMAP2, z = ~UMAP3,
 6             type = 'scatter3d',
 7             mode = 'markers',
 8             showlegend = FALSE,
 9             marker = list( color = ~Pseudotime,
10                            size = 3,
11                            colorscale = 'Viridis', # Viridis color scale
12                            colorbar = list( # Add color scale bar,
13                              title = 'Pseudotime',
14                              x = 1.01,       # horizontal position
15                              y = 0.4,        # vertical position
16                              len = 0.8,      # length (0 to 1, relative to plot height)
17                              thickness = 30  # width in pixels
18                            ), 
19                            opacity = 1
20                          )
21           )
22
23for (i in 1:4) {
24  embedded_curves.tmp <- slingCurves(embedded)[[i]] # only 1 path
25  embedded_curves.tmp.df <- data.frame(embedded_curves.tmp$s[embedded_curves.tmp$ord,])
26  p <- p %>% 
27    add_trace( data = embedded_curves.tmp.df,
28               x = ~UMAP1, y = ~UMAP2, z = ~UMAP3, 
29               type = 'scatter3d', mode = "lines",
30               name = paste0("Lineage ", i),
31               line = list( width = 5, opacity = 1 )
32             )
33}
34
35p %>% 
36  layout( scene = list( xaxis = list(title = paste0( redDim, '1') ),
37                        yaxis = list(title = paste0( redDim, '2') ),
38                        zaxis = list(title = paste0( redDim, '3') )
39                      )
40        )

Genes changing along trajectories

First we must normalize the gene expressin matrix

1sce.sling <- logNormCounts(sce.sling, assay.type="X")
2sce.sling
## class: SingleCellExperiment 
## dim: 17750 3696 
## metadata(5): clusters_coarse_colors clusters_colors day_colors
##   neighbors pca
## assays(4): X spliced unspliced logcounts
## rownames(17750): Xkr4 Mrpl15 ... Gm29650 Erdr1
## rowData names(1): highly_variable_genes
## colnames(3696): AAACCTGAGAGGGATA AAACCTGAGCCTTGAT ... TTTGTCATCGAATGCT
##   TTTGTCATCTGTTTGT
## colData names(11): clusters_coarse clusters ... Pseudotime sizeFactor
## reducedDimNames(3): X_pca X_umap UMAP_3D
## mainExpName: NULL
## altExpNames(0):

Get the genes changing along the pseudotime using splines with the TSCAN package.

1library(TSCAN)
2# scran::testPseudotime(...)
3pseudo <- TSCAN::testPseudotime( sce.sling,  
4                          pseudotime=sce.sling$Pseudotime, 
5                          BPPARAM=BiocParallel::MulticoreParam(6)
6                        )

Select significant genes.

1# ensembldb::filter(., FDR < 0.05)
2pseudo.sig <- pseudo %>%
3  as.data.frame() %>%
4  dplyr::filter( FDR < 0.05 ) %>%
5  arrange( FDR ) 

Visualize the top 100 genes in a heatmap.

1topgenes <- rownames(pseudo.sig)[1:100]
2
3plotHeatmap(sce.sling, 
4    order_columns_by='Pseudotime', 
5    colour_columns_by="clusters", 
6    features = topgenes,
7    center=TRUE,
8    scale=TRUE)

Spermatogenesis Data

Dataset from Hermann et al. (2018) downloaded from the scRNAseq package.

1library(scRNAseq)
2sce.sperm <- HermannSpermatogenesisData(strip=TRUE, location=TRUE)
## loading from cache

## require("ensembldb")
1assayNames(sce.sperm)
## [1] "spliced"   "unspliced"

Check the object.

1sce.sperm
## class: SingleCellExperiment 
## dim: 54448 2325 
## metadata(0):
## assays(2): spliced unspliced
## rownames(54448): ENSMUSG00000102693 ENSMUSG00000064842 ...
##   ENSMUSG00000064369 ENSMUSG00000064372
## rowData names(0):
## colnames(2325): CCCATACTCCGAAGAG AATCCAGTCATCTGCC ... ATCCACCCACCACCAG
##   ATTGGTGGTTACCGAT
## colData names(1): celltype
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

This dataset was originally used for RNA velocity analyses, which uses spliced/unspliced matrices for the velocity calculations. In this analysis we will use the spliced counts as an equivalent to the standard exonic gene counts used in scRNAseq analyses.

Filter non expressed genes

1sce.sperm <- sce.sperm[rowSums(assay(sce.sperm, "spliced"))>0,]
2dim(sce.sperm)
## [1] 29826  2325

Quick single cell preprocessing including 3D UMAP before pseudotime analysis

 1# Quality control:
 2library(scuttle)
 3is.mito <- which(seqnames(sce.sperm)=="MT")
 4sce.sperm <- addPerCellQC(sce.sperm, subsets=list(Mt=is.mito), assay.type="spliced")
 5qc <- quickPerCellQC(colData(sce.sperm), sub.fields=TRUE)
 6sce.sperm <- sce.sperm[,!qc$discard]
 7
 8# Normalization:
 9set.seed(10000)
10library(scran)
## 
## Attaching package: 'scran'

## The following objects are masked from 'package:TSCAN':
## 
##     createClusterMST, quickPseudotime, testPseudotime

## The following object is masked from 'package:TrajectoryUtils':
## 
##     createClusterMST
 1sce.sperm <- logNormCounts(sce.sperm, assay.type="spliced")
 2dec <- modelGeneVarByPoisson(sce.sperm, assay.type="spliced")
 3hvgs <- getTopHVGs(dec, n=2500)
 4
 5# Dimensionality reduction:
 6set.seed(1000101)
 7library(scater)
 8sce.sperm <- runPCA(sce.sperm, subset_row=hvgs, 
 9                BPPARAM=BiocParallel::MulticoreParam(6))
10sce.sperm <- runUMAP(sce.sperm, dimred="PCA", 
11                BPPARAM=BiocParallel::MulticoreParam(6))
12sce.sperm <- runUMAP( sce.sperm, dimred="PCA", 
13                BPPARAM=BiocParallel::MulticoreParam(6), 
14                ncomponents = 3, name = 'UMAP_3D'
15              )

Plot UMAP

1plotUMAP(sce.sperm, colour_by="celltype", text_by="celltype") +
2  theme_bw()

For this test, we will remove the cell without cell type.

1sce.sperm <- sce.sperm[,!is.na(sce.sperm$celltype)]

Check filter in UMAP

1plotUMAP(sce.sperm, colour_by="celltype", text_by="celltype") +
2  theme_bw()

run slingshot on the PCA

1sce.sling <- slingshot(sce.sperm, reducedDim='PCA', clusterLabels = sce.sperm$celltype)

We will skip the MST plot and go directly to embed into UMAP coordinates

1embedded <- embedCurves(sce.sling, "UMAP")
2embedded <- slingCurves(embedded)

Check trajectory in UMAP

1g <- plotReducedDim(sce.sling, 'UMAP', text_by = 'celltype', colour_by='celltype')
2for (path in embedded) {
3    embedded_curves.df <- data.frame(path$s[path$ord,])
4    g <- g + geom_path(data=embedded_curves.df, aes(x=UMAP1, y=UMAP2), linewidth=1.2)
5}
6print(g)

Trajectories with pseudotime. Since there are no multiple lineages there si no need to calculate a common pseudotime.

1g <- plotReducedDim(sce.sling, 'UMAP', text_by = 'celltype', colour_by='slingPseudotime_1')
2for (path in embedded) {
3    embedded_curves.df <- data.frame(path$s[path$ord,])
4    g <- g + geom_path(data=embedded_curves.df, aes(x=UMAP1, y=UMAP2), linewidth=1.2)
5}
6print(g)

Now we can check the trajectories in 3D UMAP coordinates. First step is embedding into 3D UMAP coordinates

1embedded <- embedCurves(sce.sling, "UMAP_3D")
2embedded <- slingCurves(embedded)

Plot in 3D plot

 1plot.df <- as.data.frame( reducedDim( sce.sling, 'UMAP_3D' ) )
 2plot.df$cluster <- sce.sling[['celltype']]
 3colnames(plot.df) <- c('UMAP1', 'UMAP2', 'UMAP3', 'celltype')
 4
 5redDim <- 'UMAP'
 6
 7trace.colors <- c('black', 'blue', 'red', 'orange')
 8names(trace.colors) <- paste0('Trajectory ', 1:4)
 9
10p <- plot_ly( ) %>% # colors = trace.colors ) %>%
11  add_trace( data = plot.df,
12             x = ~UMAP1, y = ~UMAP2, z = ~UMAP3,
13             color = ~celltype,
14             # text = ~paste('Sample:', sample,
15             #               '<br>Cluster:', cluster),
16             # hoverinfo = 'text',
17             type = 'scatter3d',
18             mode = 'markers',
19             marker = list( size = 3, opacity = 1 )
20           )
21
22for (i in 1:length(embedded) ) {
23  path <- embedded[[i]]
24  embedded_curves.tmp.df <- data.frame(path$s[path$ord,])
25  # embedded_curves.tmp.df$cluster <- paste0('Trajectory ', i)
26  p <- p %>% 
27    add_trace( data = embedded_curves.tmp.df,
28               x = ~UMAP1, y = ~UMAP2, z = ~UMAP3, 
29               type = 'scatter3d', mode = "lines",
30               name = paste0("Trajectory ", i),
31               line = list( width = 5, opacity = 1 , color = paste0('Lineage ', i) )
32             )
33}
34
35p %>% 
36  layout( scene = list( xaxis = list(title = paste0( redDim, '1') ),
37                        yaxis = list(title = paste0( redDim, '2') ),
38                        zaxis = list(title = paste0( redDim, '3') )
39                      )
40        )

Now we can check the pseudotime in the 3D UMAP coordinates. We will use the slingPseudotime_1 pseudotime.

 1plot.df$Pseudotime <- sce.sling$slingPseudotime_1
 2p <- plot_ly( ) %>% # colors = trace.colors ) %>%
 3  add_trace( data = plot.df,
 4             x = ~UMAP1, y = ~UMAP2, z = ~UMAP3,
 5             # text = ~paste('Sample:', sample,
 6             #               '<br>Cluster:', cluster),
 7             # hoverinfo = 'text',
 8             type = 'scatter3d',
 9             mode = 'markers',
10             showlegend = FALSE,
11             marker = list( color = ~Pseudotime,
12                            size = 3,
13                            colorscale = 'Viridis', # Viridis color scale
14                            colorbar = list( # Add color scale bar,
15                              title = 'Pseudotime'
16                            ), 
17                            opacity = 1
18                          )
19           )
20
21for (i in 1:length(embedded) ) {
22  path <- embedded[[i]]
23  embedded_curves.tmp.df <- data.frame(path$s[path$ord,])
24  p <- p %>% 
25    add_trace( data = embedded_curves.tmp.df,
26               x = ~UMAP1, y = ~UMAP2, z = ~UMAP3, 
27               type = 'scatter3d', mode = "lines",
28               name = paste0("Trajectory ", i),
29               showlegend = TRUE,
30               line = list( width = 5, opacity = 1 , color = 'black' )
31             )
32}
33
34p %>% 
35  layout( scene = list( xaxis = list(title = paste0( redDim, '1') ),
36                        yaxis = list(title = paste0( redDim, '2') ),
37                        zaxis = list(title = paste0( redDim, '3') )
38                      )
39        )

Genes changing along trajectories

As before, we use TSCAN to find genes changing along the pseudotime

1pseudo <- TSCAN::testPseudotime( logcounts(sce.sling), 
2                          pseudotime=sce.sling$slingPseudotime_1, 
3                          BPPARAM=BiocParallel::MulticoreParam(6)
4                        )

Select significant genes.

1pseudo.sig <- pseudo %>%
2  as.data.frame() %>%
3  dplyr::filter( FDR < 0.05 ) %>%
4  arrange( FDR ) 

Visualize the top 100 genes in a heatmap.

1topgenes <- rownames(pseudo.sig)[1:100]
2
3plotHeatmap(sce.sling, 
4    order_columns_by='slingPseudotime_1', 
5    colour_columns_by="celltype", 
6    features = topgenes,
7    center=TRUE,
8    scale=TRUE)

Conclusions

  • Slingshot can reconstruct pseudotime trajectories for both datasets.
  • Fitting linear models using splines is a good method to detect genes changing along the pseudotime.

Take away

  • It is advised to calculate trajectories on the PCA coordinates and then project them onto tSNE/UMAP coordinates.
  • No “one size fits all” method to calculate pseudotime trajectories.
  • Multiple methods to evaluate differential gene expression along pseudotime.

Further reading

comments powered by Disqus