Mapping gene expression in space#

This tutorial shows how to use the MappingProblem for transferring gene expression from single-cell data to spatial data.

Mapping single-cell atlases to spatial transcriptomics data is a crucial analysis steps to integrate cell type annotation across technologies. Furthermore, it is often the case that imaging-based spatial transcriptomics technologies only measure a limit set of transcripts, which could be nonetheless used as anchor features to map the gene expression of unobserved genes from single-cell data.

For this purpose, we will use a preprocessed and normalized dataset of the Drosophila embryo, originally used in NovoSpaRc [Nitzan et al., 2019]. In fact, the MappingProblem is essentially a reimplementation of the original Gromov-Wasserstein formulation from NovoSpaRc, yet with a some of technical innovations that makes it more scalable.

Note

For this tutorial, Squidpy is needed for spatial data plotting. You can either install it with:

  • pip install squidpy or

  • pip install moscot[spatial] as an optional dependency of moscot

See also

  • TODO: link to other tutorials

Imports and data loading#

import warnings

import moscot as mt
from moscot import datasets
from moscot.problems.space import MappingProblem

import seaborn as sns

import scanpy as sc
import squidpy as sq

warnings.simplefilter("ignore", UserWarning)

Load the drosophila() dataset.

adata_sc = datasets.drosophila(spatial=False)
adata_sp = datasets.drosophila(spatial=True)
adata_sc, adata_sp
(AnnData object with n_obs × n_vars = 1297 × 2000
     obs: 'n_counts'
     var: 'n_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
     uns: 'hvg', 'log1p', 'pca'
     obsm: 'X_pca'
     varm: 'PCs'
     layers: 'counts',
 AnnData object with n_obs × n_vars = 3039 × 82
     obs: 'n_counts'
     var: 'n_counts'
     uns: 'log1p', 'pca'
     obsm: 'X_pca', 'spatial'
     varm: 'PCs'
     layers: 'counts')

Let’s take a look at the data by plotting couple of summary statistics.

sc.pl.pca(adata_sc, color="n_counts")
sc.pl.pca(adata_sp, color="n_counts")
../../_images/2cc2d598663ffcf16c0fafb0416453d3ab3cd4e3305ff5ddb4dcba14b506b38d.png ../../_images/238a0824df717cde990ceb33e94e16a213f03e0c32dbef1e80d77d8eb94a9f05.png

We can also visualize the spatial dataset in spatial coordinates. Here, it’s important to consider that the spatial datasets actually consists of measurements in 3D. We can visualize them with scanpy.pl.embedding(), as well as with squidpy.pl.spatial_scatter().

sc.pl.embedding(
    adata_sp,
    basis="spatial",
    color="n_counts",
    projection="3d",
    na_color=(1, 1, 1, 0),
)
adata_sp.obsm["spatial_xy"] = adata_sp.obsm["spatial"][:, [0, 2]]
sq.pl.spatial_scatter(
    adata_sp,
    spatial_key="spatial_xy",
    color="n_counts",
    shape=None,
    size=20,
    figsize=(5, 3),
)
../../_images/59e85856db551aa45dfe31bec4e4660a6cad39a52a441471c8bbd04f7e35594c.png
WARNING: Please specify a valid `library_id` or set it permanently in `adata.uns['spatial']`
../../_images/271757cf4cc3cc56dde8c0106ca1030c854558b13931497c2ee360c5fe66888c.png

Mapping single-cell gene expression in space#

With moscot, it is possible to learn a single-cell-to-spatial mapping by leveraging Fused Gromov-Wasserstein (FGW) optimal transport [Vayer et al., 2018]. A basic description of the algorithm is the following:

Given a set of observations that share some features in the metric space, and some other features in different metric spaces, the FGW method aims at finding the optimal matching between these two set of observations, based both on shared and unique features. In our case:

  • the “shared” metric space could be the one defined by a set of genes that were measured in both single-cell and spatial transcriptomics data.

  • the “unique” metric spaces are the genes only measured in single-cell data and the spatial coordinates for the spatial transcriptomics data.

Prepare the MappingProblem#

The moscot MappingProblem interfaces the FGW algorithm in a user-friendly API. First, let’s initialize the MappingProblem by passing single-cell and spatial AnnData objects.

mp = MappingProblem(adata_sc=adata_sc, adata_sp=adata_sp)

After initialization, we need to prepare() the problem. In this particular case, we need to pay attention to 3 parameters:

  • sc_attr: specify the attribute in AnnData that we want to use for the features (genes) in the unique space. Usually, it’s the X attribute, which contains normalized counts, but a pre-computed PCA could also be used. Furthermore, for the unique features it might be desirable to use some other type of modality, for instance protein expression or ATAC-seq measurements, that could be stored for example in obsm.

  • var_names: specify the set of genes that we desire to be using for the shared space, that is the common set of genes in both dataset. If set to None, it will try to compute the intersection between the single-cell and spatial gene names.

  • joint_attr: it is possible to also specify some other attributes that might be used for the “shared” features space, for instance, an embedding such as PCA.

If the number of shared features is large enough, it is also possible to compute a PCA on the “shared” gene space (by passing the callback_kwargs). We advise this if we are in the settings of \(>50\) genes shared between the two datasets.

mp = mp.prepare(
    sc_attr={"attr": "obsm", "key": "X_pca"}, kwargs={"callback_kwargs": "local_pca"}
)
INFO     Computing pca with `n_comps=30` for `xy` using `adata.X`                                                  

Solve the MappingProblem#

We are now ready to solve() the problem. The most important parameter to take into account is the alpha value, which balances the weight of each loss (“unique” vs. “shared” spaces). With alpha close to \(0\), the “shared” space loss is weighted more, with alpha close to \(1\), the “unique” space loss is balanced more. For the purpose of this example, we’ll use alpha=0.5 but we suggest to increase if only a few features are present. It should take maximum one minute on a laptop.

mp = mp.solve()
INFO     Solving problem OTProblem[stage='prepared', shape=(3039, 1297)].                                          

Analysis of the transport plan#

The first thing we might be interested in doing is “imputing” genes that were not present in the spatial data. We can do so by using the impute() method. If you are working on a GPU with large data, you might want to pass device='cpu' to the method in order to use the CPU memory for this task. For the purpose of the tutorial, we will select few genes from the single-cell data.

genes = ["cic", "jigr1", "nuse", "scb", "chrb"]
adata_imputed = mp.impute(var_names=genes)
sq.pl.spatial_scatter(
    adata_imputed,
    spatial_key="spatial_xy",
    color=genes,
    shape=None,
    size=20,
    figsize=(5, 3),
)
WARNING: Please specify a valid `library_id` or set it permanently in `adata.uns['spatial']`
../../_images/b4a69eddd8f54b905979f456901399408abaa85c998ca5fbd84b2ca77f84c690.png

Another analysis function that we provide is computing gene expression correlation between inferred and ground-truth spatial genes. We can do this with the correlate() method.

corr = mp.correlate()
corr["src", "tgt"]
Dfd        0.784703
twi        0.798351
mfas       0.203456
zen        0.449593
D          0.401366
             ...   
CG17724    0.304835
htl        0.778408
Kr         0.727521
dpn        0.554816
ImpL2      0.522811
Length: 82, dtype: float64

Finally, we provide a helper function to compute “spatial correspondence”, that is, the average expression distance at increasing spatial distance, as originally proposed by NovoSpaRc [Nitzan et al., 2019]. If a strong spatial correspondence is observed (as in this case) it can be useful to increase the alpha value.

df = mp.spatial_correspondence(max_dist=400)
sns.boxplot(x="index_interval", y="features_distance", data=df)
<AxesSubplot: xlabel='index_interval', ylabel='features_distance'>
../../_images/aaa79bd60d742634608bd3d4bafe355e3b76a12fd02ad673280c8e9f58790767.png