Mapping gene expression in space#
This tutorial shows how to use the MappingProblem
for transferring gene expression from single-cell data to spatial data.
Mapping single-cell atlases to spatial transcriptomics data is a crucial analysis steps to integrate cell type annotation across technologies. Furthermore, it is often the case that imaging-based spatial transcriptomics technologies only measure a limit set of transcripts, which could be nonetheless used as anchor features to map the gene expression of unobserved genes from single-cell data.
For this purpose, we will use a preprocessed and normalized dataset of the Drosophila embryo, originally used in NovoSpaRc [Nitzan et al., 2019]. In fact, the MappingProblem
is essentially a reimplementation of the original Gromov-Wasserstein formulation from NovoSpaRc, yet with a some of technical innovations that makes it more scalable.
Note
For this tutorial, Squidpy is needed for spatial data plotting. You can either install it with:
pip install squidpy
orpip install moscot[spatial]
as an optional dependency ofmoscot
See also
TODO: link to other tutorials
Imports and data loading#
import warnings
import moscot as mt
from moscot import datasets
from moscot.problems.space import MappingProblem
import seaborn as sns
import scanpy as sc
import squidpy as sq
warnings.simplefilter("ignore", UserWarning)
Load the drosophila()
dataset.
adata_sc = datasets.drosophila(spatial=False)
adata_sp = datasets.drosophila(spatial=True)
adata_sc, adata_sp
(AnnData object with n_obs × n_vars = 1297 × 2000
obs: 'n_counts'
var: 'n_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
uns: 'hvg', 'log1p', 'pca'
obsm: 'X_pca'
varm: 'PCs'
layers: 'counts',
AnnData object with n_obs × n_vars = 3039 × 82
obs: 'n_counts'
var: 'n_counts'
uns: 'log1p', 'pca'
obsm: 'X_pca', 'spatial'
varm: 'PCs'
layers: 'counts')
Let’s take a look at the data by plotting couple of summary statistics.
sc.pl.pca(adata_sc, color="n_counts")
sc.pl.pca(adata_sp, color="n_counts")


We can also visualize the spatial dataset in spatial coordinates. Here, it’s important to consider that the spatial datasets actually consists of measurements in 3D. We can visualize them with scanpy.pl.embedding()
, as well as with squidpy.pl.spatial_scatter()
.
sc.pl.embedding(
adata_sp,
basis="spatial",
color="n_counts",
projection="3d",
na_color=(1, 1, 1, 0),
)
adata_sp.obsm["spatial_xy"] = adata_sp.obsm["spatial"][:, [0, 2]]
sq.pl.spatial_scatter(
adata_sp,
spatial_key="spatial_xy",
color="n_counts",
shape=None,
size=20,
figsize=(5, 3),
)

WARNING: Please specify a valid `library_id` or set it permanently in `adata.uns['spatial']`

Mapping single-cell gene expression in space#
With moscot
, it is possible to learn a single-cell-to-spatial mapping by leveraging Fused Gromov-Wasserstein (FGW) optimal transport [Vayer et al., 2018]. A basic description of the algorithm is the following:
Given a set of observations that share some features in the metric space, and some other features in different metric spaces, the FGW method aims at finding the optimal matching between these two set of observations, based both on shared and unique features. In our case:
the “shared” metric space could be the one defined by a set of genes that were measured in both single-cell and spatial transcriptomics data.
the “unique” metric spaces are the genes only measured in single-cell data and the spatial coordinates for the spatial transcriptomics data.
Prepare the MappingProblem
#
The moscot MappingProblem
interfaces the FGW algorithm in a user-friendly API. First, let’s initialize the MappingProblem
by passing single-cell and spatial AnnData
objects.
mp = MappingProblem(adata_sc=adata_sc, adata_sp=adata_sp)
After initialization, we need to prepare()
the problem. In this particular case, we need to pay attention to 3 parameters:
sc_attr
: specify the attribute inAnnData
that we want to use for the features (genes) in the unique space. Usually, it’s theX
attribute, which contains normalized counts, but a pre-computed PCA could also be used. Furthermore, for theunique
features it might be desirable to use some other type of modality, for instance protein expression or ATAC-seq measurements, that could be stored for example inobsm
.var_names
: specify the set of genes that we desire to be using for theshared
space, that is the common set of genes in both dataset. If set toNone
, it will try to compute the intersection between the single-cell and spatial gene names.joint_attr
: it is possible to also specify some other attributes that might be used for the “shared” features space, for instance, an embedding such as PCA.
If the number of shared features is large enough, it is also possible to compute a PCA on the “shared” gene space (by passing the callback_kwargs
). We advise this if we are in the settings of \(>50\) genes shared between the two datasets.
mp = mp.prepare(
sc_attr={"attr": "obsm", "key": "X_pca"}, kwargs={"callback_kwargs": "local_pca"}
)
INFO Computing pca with `n_comps=30` for `xy` using `adata.X`
Solve the MappingProblem
#
We are now ready to solve()
the problem. The most important parameter to take into account is the alpha
value, which balances the weight of each loss (“unique” vs. “shared” spaces). With alpha close to \(0\), the “shared” space loss is weighted more, with alpha close to \(1\), the “unique” space loss is balanced more. For the purpose of this example, we’ll use alpha=0.5
but we suggest to increase if only a few features are present. It should take maximum one minute on a laptop.
mp = mp.solve()
INFO Solving problem OTProblem[stage='prepared', shape=(3039, 1297)].
Analysis of the transport plan#
The first thing we might be interested in doing is “imputing” genes that were not present in the spatial data. We can do so by using the impute()
method. If you are working on a GPU with large data, you might want to pass device='cpu'
to the method in order to use the CPU memory for this task. For the purpose of the tutorial, we will select few genes from the single-cell data.
genes = ["cic", "jigr1", "nuse", "scb", "chrb"]
adata_imputed = mp.impute(var_names=genes)
sq.pl.spatial_scatter(
adata_imputed,
spatial_key="spatial_xy",
color=genes,
shape=None,
size=20,
figsize=(5, 3),
)
WARNING: Please specify a valid `library_id` or set it permanently in `adata.uns['spatial']`

Another analysis function that we provide is computing gene expression correlation between inferred and ground-truth spatial genes. We can do this with the correlate()
method.
corr = mp.correlate()
corr["src", "tgt"]
Dfd 0.784703
twi 0.798351
mfas 0.203456
zen 0.449593
D 0.401366
...
CG17724 0.304835
htl 0.778408
Kr 0.727521
dpn 0.554816
ImpL2 0.522811
Length: 82, dtype: float64
Finally, we provide a helper function to compute “spatial correspondence”, that is, the average expression distance at increasing spatial distance, as originally proposed by NovoSpaRc [Nitzan et al., 2019].
If a strong spatial correspondence is observed (as in this case) it can be useful to increase the alpha
value.
df = mp.spatial_correspondence(max_dist=400)
sns.boxplot(x="index_interval", y="features_distance", data=df)
<AxesSubplot: xlabel='index_interval', ylabel='features_distance'>
