Including barcode information via barcode distances

This example shows how to incorporate lineage information obtained from barcodes in the LineageProblem. Check out moslin [Lange et al., 2023] for examples on real-world data.

Imports and data loading

from moscot import datasets
from moscot.problems.time import LineageProblem

Simulate data using simulate_data().

adata = datasets.simulate_data(n_distributions=3, key="day", quad_term="barcode")
adata

AnnData object with n_obs × n_vars = 60 × 60
    obs: 'day', 'celltype'
    obsm: 'barcode'

We assume barcodes are saved in obsm.

adata.obsm["barcode"][:10, :]

array([[ 1,  8,  0, 12, 18, 15, 11, 16, 13,  9],
       [ 7, 19,  6,  1, 11,  8,  4, 15, 19,  1],
       [10,  7,  3, 14, 15,  4, 11,  4,  0,  7],
       [10, 19,  1, 18,  0, 14, 13,  5,  2, 12],
       [ 2, 18,  1, 14, 17,  9, 12,  7,  3, 15],
       [ 1,  7, 11,  7, 10,  8, 14, 19,  9, 16],
       [ 9, 13,  5,  5, 13,  9,  2, 15,  0,  4],
       [ 8,  6,  1,  7, 10, 12, 13,  8, 12, 16],
       [ 3,  5, 10,  8,  5,  0,  1,  2,  9, 14],
       [ 7, 11,  5,  2,  2,  4, 14,  0, 10,  5]])

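Barcode distance

Next, we instantiate the LineageProblem and prepare it. The lineage_attr argument tells the problem where the barcodes are stored, and cost selects barcode_distance for the two quadratic terms (x and y) and sq_euclidean for the linear gene expression term (xy).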

lp = LineageProblem(adata)
lp = lp.prepare(
    time_key="day",
    lineage_attr={"attr": "obsm", "key": "barcode"},
    cost={"x": "barcode_distance", "y": "barcode_distance", "xy": "sq_euclidean"},
)

INFO     Computing pca with `n_comps=30` for `xy` using `adata.X`                                                  
INFO     Computing pca with `n_comps=30` for `xy` using `adata.X`                                                  

Internally, pairwise cost matrices have been computed directly from the barcodes using a scaled Hamming distance. Let us inspect the first few entries of the cost matrix computed from the barcodes at time point 0.

lp[0, 1].x.data_src[:5, :5]

array([[0. , 1.9, 1.6, 1.8, 1.9],
       [1.9, 0. , 1.9, 1.7, 2. ],
       [1.6, 1.9, 0. , 1.6, 1.7],
       [1.8, 1.7, 1.6, 0. , 1.7],
       [1.9, 2. , 1.7, 1.7, 0. ]])
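
The values above are consistent with a scaled Hamming distance in which every differing site counts once, a site that differs while both entries are non-zero counts twice, and the total is divided by the number of sites. The following is a minimal sketch of that idea, not moscot's implementation: the function name scaled_hamming is hypothetical, and the conventions that 0 marks an unmutated site and that negative entries mark missing sites are assumptions. It uses the first five barcode rows printed earlier, assuming they belong to time point 0.

```python
import numpy as np


def scaled_hamming(x: np.ndarray, y: np.ndarray) -> float:
    """Scaled Hamming distance between two integer barcodes (illustrative sketch)."""
    # Assumption: negative entries denote missing sites and are ignored.
    shared = (x >= 0) & (y >= 0)
    if not shared.any():
        raise ValueError("Barcodes share no observed sites.")
    bx, by = x[shared], y[shared]
    differ = bx != by
    # Assumption: 0 denotes an unmutated site, so a mismatch between two
    # non-zero entries is counted a second time.
    double = differ & (bx != 0) & (by != 0)
    return float(differ.sum() + double.sum()) / int(shared.sum())


# First five rows of adata.obsm["barcode"] as printed earlier.
barcodes = np.array(
    [
        [1, 8, 0, 12, 18, 15, 11, 16, 13, 9],
        [7, 19, 6, 1, 11, 8, 4, 15, 19, 1],
        [10, 7, 3, 14, 15, 4, 11, 4, 0, 7],
        [10, 19, 1, 18, 0, 14, 13, 5, 2, 12],
        [2, 18, 1, 14, 17, 9, 12, 7, 3, 15],
    ]
)

# Pairwise distances between all five barcodes.
dist = np.array([[scaled_hamming(a, b) for b in barcodes] for a in barcodes])
print(dist)
```

Under these assumptions, the printed matrix reproduces the 5 × 5 block shown above; for example, the first two barcodes differ at all 10 sites, 9 of which are non-zero in both, giving (10 + 9) / 10 = 1.9.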

Similarly, we inspect part of the cost matrix computed from the barcodes at time point 1.

lp[0, 1].y.data_src[:5, :5]

array([[0. , 1.8, 2. , 1.8, 2. ],
       [1.8, 0. , 2. , 1.6, 1.8],
       [2. , 2. , 0. , 1.8, 2. ],
       [1.8, 1.6, 1.8, 0. , 2. ],
       [2. , 1.8, 2. , 2. , 0. ]])

Note that the gene expression term is still stored as two point clouds, one for the source and one for the target time point. The corresponding cost matrix will be computed by the backend.

lp[0, 1].xy.data_src.shape, lp[0, 1].xy.data_tgt.shape

((20, 30), (20, 30))
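
To illustrate what computing the cost from two point clouds means, here is a minimal NumPy sketch of a squared Euclidean cost matrix between a source and a target point cloud of the shapes shown above. It is not the backend's code: the random arrays stand in for the PCA embeddings, and the helper name sq_euclidean_cost is hypothetical.

```python
import numpy as np

# Stand-ins for the two point clouds: 20 cells each, 30 PCA components.
rng = np.random.default_rng(0)
src = rng.normal(size=(20, 30))
tgt = rng.normal(size=(20, 30))


def sq_euclidean_cost(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    a_sq = (a**2).sum(axis=1)[:, None]  # shape (n, 1)
    b_sq = (b**2).sum(axis=1)[None, :]  # shape (1, m)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 <a, b>; clip tiny negative round-off.
    return np.maximum(a_sq + b_sq - 2.0 * a @ b.T, 0.0)


cost = sq_euclidean_cost(src, tgt)
print(cost.shape)
```

Keeping the point clouds rather than the full matrix lets a backend evaluate the cost on demand, which is one plausible reason for storing the term this way.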