Generate sc.AnnData from the gene expression file and the spatial coordinates
stCluster requires its input to be an sc.AnnData object. This section shows how to generate one from CSV files.
Data generation
First, we save the gene expression, spatial coordinates, and spot metadata of DLPFC slice 151507 to CSV files. To keep the example small, we save only 300 highly variable genes (HVGs) for each spot.
[1]:
import scanpy as sc
import pandas as pd
from st_datasets.dataset import get_data, get_dlpfc_data
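# Load DLPFC slice 151507.
adata, n_cluster = get_data(dataset_func=get_dlpfc_data, id='151507', top_genes=300)
# Keep only the highly variable genes.
adata = adata[:, adata.var.highly_variable]
# Densify the sparse matrix and write it as spots x genes; to_csv() returns None here, so nothing is assigned.
pd.DataFrame(adata.X.toarray(), index=adata.obs.index, columns=adata.var.index).to_csv('gene_exp.csv')
# Coordinates are written without an index, so they match spots by row order.
pd.DataFrame(adata.obsm['spatial']).to_csv('coors.csv', index=False)
adata.obs.to_csv('metadata.csv')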
>>> INFO: Use local data.
>>> INFO: dataset name: dorsolateral prefrontal cortex (DLPFC), slice: 151507, size: (4226, 33538), cluster: 7.(0.381s)
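The export step turns a sparse expression matrix into a dense table before writing it to CSV. The following is a minimal sketch of that round trip on a toy matrix, not the tutorial's actual data: the spot names, gene names, and values are made up for illustration, and an in-memory buffer stands in for gene_exp.csv.

```python
import io

import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-in for adata.X: 3 spots x 4 genes, stored sparsely.
X = sparse.csr_matrix(np.array([
    [0.0, 1.5, 0.0, 0.0],
    [0.0, 0.0, 2.25, 0.0],
    [3.0, 0.0, 0.0, 0.5],
]))
spots = ["spot_a", "spot_b", "spot_c"]            # hypothetical spot names
genes = ["gene_1", "gene_2", "gene_3", "gene_4"]  # hypothetical gene names

# toarray() returns a plain NumPy array that pandas can wrap directly.
dense = pd.DataFrame(X.toarray(), index=spots, columns=genes)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
dense.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the spot names as the row index.
restored = pd.read_csv(buffer, index_col=0)
round_trip_ok = np.allclose(restored.to_numpy(), X.toarray())
```

Reading with index_col=0 matters: without it, the spot names would come back as an ordinary data column rather than the row index.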
Load the files
Next, we load the data from these files.
In the gene expression file, each row is a spot and each column is a gene.
[2]:
gene_exp_file = pd.read_csv('gene_exp.csv', index_col=0)
gene_exp_file
[2]:
| | AL357140.1 | EPHA2 | C1QC | AL009181.1 | TEKT2 | NT5C1A | FAM183A | KDM4A-AS1 | AL158840.1 | AC092813.2 | ... | YWHAH | C22orf42 | RFPL2 | PVALB | Z82188.2 | FBLN1 | CPT1B | PCP4 | TFF1 | LINC01678 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAACAACGAATAGTTC-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 2.446557 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| AAACAAGTATCTCCCA-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 3.450282 | 0.0 | 0.0 | 0.000000 | 0.0 | 1.208025 | 0.0 | 0.000000 | 1.739366 | 0.0 |
| AAACAATCTACTAGCA-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| AAACACCAATAACTGC-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.937048 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 1.378545 | 0.000000 | 0.0 |
| AAACAGCTTTCAGAAG-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 2.403673 | 0.0 | 0.0 | 1.471228 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTGTTGTGTGTCAAGA-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 2.896793 | 0.0 | 0.0 | 2.257376 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| TTGTTTCACATCCAGG-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 2.259678 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| TTGTTTCATTAGTCTA-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 2.580975 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| TTGTTTCCATACAACT-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 3.162901 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| TTGTTTGTGTAAATTC-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.645596 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
4226 rows × 300 columns
In the spatial coordinate file, each row is a spot and each column is an axis.
[3]:
coors_file = pd.read_csv('coors.csv')
coors_file
[3]:
| | 0 | 1 |
|---|---|---|
| 0 | 3276 | 2514 |
| 1 | 9178 | 8520 |
| 2 | 5133 | 2878 |
| 3 | 3462 | 9581 |
| 4 | 2779 | 7663 |
| ... | ... | ... |
| 4221 | 7464 | 6239 |
| 4222 | 5045 | 9466 |
| 4223 | 4218 | 9703 |
| 4224 | 4017 | 7906 |
| 4225 | 5683 | 3359 |
4226 rows × 2 columns
In the metadata file, each row is a spot and each column is a metadata field.
[4]:
spot_metadata_file = pd.read_csv('metadata.csv', index_col=0)
spot_metadata_file
[4]:
| | in_tissue | array_row | array_col | cluster |
|---|---|---|---|---|
| AAACAACGAATAGTTC-1 | 1 | 0 | 16 | Layer_1 |
| AAACAAGTATCTCCCA-1 | 1 | 50 | 102 | Layer_3 |
| AAACAATCTACTAGCA-1 | 1 | 3 | 43 | Layer_1 |
| AAACACCAATAACTGC-1 | 1 | 59 | 19 | WM |
| AAACAGCTTTCAGAAG-1 | 1 | 43 | 9 | Layer_6 |
| ... | ... | ... | ... | ... |
| TTGTTGTGTGTCAAGA-1 | 1 | 31 | 77 | Layer_3 |
| TTGTTTCACATCCAGG-1 | 1 | 58 | 42 | Layer_6 |
| TTGTTTCATTAGTCTA-1 | 1 | 60 | 30 | WM |
| TTGTTTCCATACAACT-1 | 1 | 45 | 27 | Layer_6 |
| TTGTTTGTGTAAATTC-1 | 1 | 7 | 51 | Layer_1 |
4226 rows × 4 columns
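Because the coordinate file carries no spot names, it can only be matched to the other two tables by row order. Before building the AnnData object, it may be worth checking that the three tables line up. The sketch below is one way to do that; check_alignment is a hypothetical helper (not part of stCluster), and the tiny in-memory tables with made-up spot and gene names stand in for the real CSV files.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for gene_exp.csv, coors.csv, and metadata.csv.
gene_exp_csv = ",gene_1,gene_2\nspot_a,0.0,1.5\nspot_b,2.0,0.0\nspot_c,0.0,0.0\n"
coors_csv = "0,1\n3276,2514\n9178,8520\n5133,2878\n"
metadata_csv = ",in_tissue,cluster\nspot_a,1,Layer_1\nspot_b,1,Layer_3\nspot_c,1,WM\n"

gene_exp = pd.read_csv(io.StringIO(gene_exp_csv), index_col=0)
coors = pd.read_csv(io.StringIO(coors_csv))
metadata = pd.read_csv(io.StringIO(metadata_csv), index_col=0)


def check_alignment(gene_exp, coors, metadata):
    """Hypothetical helper: return a list of problems (empty if the tables line up)."""
    problems = []
    # Coordinates have no spot names, so only the row count can be compared.
    if len(coors) != len(gene_exp):
        problems.append("coordinate rows != expression rows")
    if coors.shape[1] != 2:
        problems.append("expected two coordinate columns")
    # Expression and metadata both carry spot names; they should match in order.
    if not gene_exp.index.equals(metadata.index):
        problems.append("expression and metadata spot names differ")
    return problems


problems = check_alignment(gene_exp, coors, metadata)
```

With the files loaded earlier, the same call would be check_alignment(gene_exp_file, coors_file, spot_metadata_file).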
Generate adata
Next, we generate the sc.AnnData object with stCluster's gen_adata function.
[5]:
from stCluster.utils import gen_adata
adata = gen_adata(gene_exp_file, coors_file, spot_metadata_file, gene_exp_file.columns.to_list(), gene_exp_file.index.to_list())
adata
[5]:
AnnData object with n_obs × n_vars = 4226 × 300
obs: 'in_tissue', 'array_row', 'array_col', 'cluster'
obsm: 'spatial'
The gene expression matrix is stored in adata.X as a sparse matrix. The spatial coordinates can be accessed via adata.obsm['spatial'].