Cell-type Annotation for Single-cell Transcriptomics using Deep Learning with a Weighted Graph Neural Network
Recent advances in single-cell RNA sequencing (scRNA-seq) have enabled large-scale transcriptional characterization of thousands of cells across multiple complex tissues, making accurate cell-type identification a prerequisite and vital step for scRNA-seq studies.
To address this challenge, we developed a pre-trained cell-type annotation method, scDeepSort, using a state-of-the-art deep learning algorithm, i.e. a modified graph neural network (GNN) model. This is the first time that a GNN has been introduced into scRNA-seq studies, and it demonstrates ground-breaking performance in this application scenario. In brief, scDeepSort was constructed based on our weighted GNN framework and was then trained on two embedded high-quality scRNA-seq atlases containing 764,741 cells across 88 tissues of human and mouse, which are the most comprehensive multi-organ scRNA-seq data resources to date. For more information, please refer to the preprint in bioRxiv 2020.05.13.094953.
Download scDeepSort-v1.0-cu102.tar.gz from the release page and execute the following command:
pip install scDeepSort-v1.0-cu102.tar.gz
The test single-cell transcriptomics csv data file should be pre-processed by first revising gene symbols according to the NCBI Gene database updated on Jan. 10, 2020, wherein unmatched genes and duplicated genes will be removed. Then the data should be normalized with the default `LogNormalize` method in `Seurat` (R package), as detailed in `pre-process.R`. In the final test data, each column represents a cell and each row represents a gene, as shown below.
| |Cell 1|Cell 2|Cell 3|... |
| :---: |:---: | :---:| :---:|:---:|
|__Gene 1__| 0 | 2.4 | 5.0 |... |
|__Gene 2__| 0.8 | 1.1 | 4.3 |... |
|__Gene 3__|1.8 | 0 | 0 |... |
| ... | ... | ... | ... |... |
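For readers who want to see what the normalization step computes, the following is a minimal Python sketch of the formula behind Seurat's `LogNormalize` with its default scale factor of 10,000: each count is divided by the total counts of its cell, multiplied by the scale factor, and transformed with the natural log of one plus the value. It is only an illustration and not a replacement for `pre-process.R`; the function name `log_normalize` and the example counts are hypothetical.

```python
import math


def log_normalize(matrix, scale_factor=10_000.0):
    """Log-normalize a genes x cells count matrix given as a list of rows.

    Rows are genes and columns are cells, matching the layout of the
    test data table. `log_normalize` is a hypothetical helper name.
    """
    n_cells = len(matrix[0]) if matrix else 0
    # Total raw counts per cell, i.e. per column.
    totals = [sum(row[j] for row in matrix) for j in range(n_cells)]
    normalized = []
    for row in matrix:
        normalized.append([
            # Cells with no counts at all are left at 0 to avoid dividing by zero.
            math.log1p(value / totals[j] * scale_factor) if totals[j] > 0 else 0.0
            for j, value in enumerate(row)
        ])
    return normalized


# Made-up raw counts: three genes (rows) by three cells (columns).
raw = [
    [0, 3, 10],  # Gene 1
    [1, 1, 5],   # Gene 2
    [3, 0, 0],   # Gene 3
]
normalized = log_normalize(raw)
```

Zero counts stay at zero after the transform, which is why the example table can contain exact `0` entries next to positive normalized values.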
- The test data file should be named in this format: `species_TissueNumber_data.csv`. For example, `human_Pancreas11_data.csv` is a data file containing 11 human pancreas cells.
- All the test data should be included under the `test` directory. Human datasets should be under `./test/human` and mouse datasets should be under `./test/mouse`.
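The naming and directory conventions above can be checked before running a prediction. The sketch below is one possible way to do this with the Python standard library; the helper names `parse_test_filename` and `expected_path` and the regular expression are assumptions made for illustration and are not part of scDeepSort.

```python
import re
from pathlib import Path

# Assumed pattern for species_TissueNumber_data.csv,
# e.g. human_Pancreas11_data.csv -> ("human", "Pancreas", 11).
FILENAME_PATTERN = re.compile(r"^(human|mouse)_(.+?)(\d+)_data\.csv$")


def parse_test_filename(name):
    """Split a test data file name into (species, tissue, number of cells)."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected test data file name: {name!r}")
    species, tissue, number = match.groups()
    return species, tissue, int(number)


def expected_path(name, root="./test"):
    """Return where the file should live: ./test/human or ./test/mouse."""
    species, _, _ = parse_test_filename(name)
    return Path(root) / species / name


species, tissue, n_cells = parse_test_filename("human_Pancreas11_data.csv")
path = expected_path("human_Pancreas11_data.csv")
```

The three parsed values correspond to the `--species`, `--tissue`, and `--test_dataset` options of `predict.py`.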
Use `--evaluate` to reproduce the results shown in our paper. For example, to evaluate the data `mouse_Testis199_data.csv`, you should execute the following command:
python predict.py --species mouse --tissue Testis --test_dataset 199 --gpu -1 --evaluate --filetype gz --unsure_rate 2
- `--species` The species of cells, `human` or `mouse`.
- `--tissue` The tissue of cells. See wiki page.
- `--test_dataset` The number of cells in the test data.
- `--gpu` Specify the GPU to use, `0` for gpu, `-1` for cpu.
- `--filetype` The format of the data file, `csv` for `.csv` files and `gz` for `.gz` files. See `pre-process.R`.
- `--unsure_rate` The threshold to define the unsure type, default is 2. Set it as 0 to exclude the unsure type.
Output: the output file, named `species_Tissue_Number.csv`, will be under the automatically generated `result` directory. It contains four columns: the cell id, the original cell type, the predicted main type, and the predicted subtype if applicable.
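As a rough illustration of how the four-column output could be inspected, the sketch below computes the fraction of cells whose predicted main type equals the original cell type. It assumes the column order described above and no header row (skip the first row if your file has one). The function name `accuracy`, the cell ids, and the cell type labels are made up, and the metric may differ from the evaluation used in the paper.

```python
import csv
import io


def accuracy(rows):
    """Fraction of rows where the original type equals the predicted main type.

    Each row is assumed to be
    [cell_id, original_type, predicted_main_type, predicted_subtype].
    """
    total = 0
    correct = 0
    for row in rows:
        total += 1
        if row[1].strip() == row[2].strip():
            correct += 1
    return correct / total if total else 0.0


# In-memory stand-in for a result file; real files live under the result directory.
example = io.StringIO(
    "cell_1,Type A,Type A,\n"
    "cell_2,Type B,Type A,\n"
    "cell_3,Type B,Type B,\n"
    "cell_4,Type C,Type C,Subtype C1\n"
)
rows = list(csv.reader(example))
score = accuracy(rows)  # 3 of the 4 example rows agree
```

To use it on a real file, replace the `io.StringIO` object with an opened result file.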
Note: to evaluate all testing datasets in our paper, please download them from the release page.
Use `--test` to test your own datasets. For example, to test the data `human_Pancreas11_data.csv`, you should execute the following command:
python predict.py --species human --tissue Pancreas --test_dataset 11 --gpu -1 --test --filetype csv --unsure_rate 2
- `--species` The species of cells, `human` or `mouse`.
- `--tissue` The tissue of cells. See wiki page.
- `--test_dataset` The number of cells in the test data.
- `--gpu` Specify the GPU to use, `0` for gpu, `-1` for cpu.
- `--filetype` The format of the data file, `csv` for `.csv` files and `gz` for `.gz` files. See `pre-process.R`.
- `--unsure_rate` The threshold to define the unsure type, default is 2. Set it as 0 to exclude the unsure type.
Output: the output file, named `species_Tissue_Number.csv`, will be under the automatically generated `result` directory. It contains three columns: the cell id, the predicted main type, and the predicted subtype if applicable.
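One simple way to summarize the three-column output is to tally how many cells were assigned to each predicted main type. The sketch below does this with the standard library, assuming the column order described above and no header row; the function name `count_predicted_types`, the cell ids, and the labels are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_predicted_types(rows):
    """Count cells per predicted main type.

    Each row is assumed to be [cell_id, predicted_main_type, predicted_subtype].
    """
    return Counter(row[1].strip() for row in rows if len(row) >= 2)


# In-memory stand-in for a result file; real files live under the result directory.
example = io.StringIO(
    "cell_1,Type A,\n"
    "cell_2,Type A,Subtype A1\n"
    "cell_3,Type B,\n"
)
counts = count_predicted_types(csv.reader(example))
for cell_type, n in counts.most_common():
    print(f"{cell_type}\t{n}")
```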
To train your own model, you should prepare two files under the `./train` directory, following the example files there: a data file as described above and a cell annotation file. Then execute a command such as one of the following:
python train.py --species human --tissue Adipose --gpu -1 --filetype gz
python train.py --species mouse --tissue Muscle --gpu -1 --filetype gz
- `--species` The species of cells, `human` or `mouse`.
- `--tissue` The tissue of cells.
- `--gpu` Specify the GPU to use, `0` for gpu, `-1` for cpu.
- `--filetype` The format of the data file, `csv` for `.csv` files and `gz` for `.gz` files. See `pre-process.R`.
Output: the trained model will be saved under the `pretrained` directory and can be used to test new datasets on the same tissue using `predict.py` as described above.
The scDeepSort manuscript is under major revision. For more information, please refer to the preprint in bioRxiv 2020.05.13.094953. Should you have any questions, please contact Xin Shao at xin_shao@zju.edu.cn, Haihong Yang at capriceyhh@zju.edu.cn, or Xiang Zhuang at 3160105000@zju.edu.cn.