Introduction

Reliability in cell type annotation is challenging in single-cell RNA-sequencing data analysis because both expert-driven and automated methods can be biased or constrained by their training data, especially for novel or rare cell types. Although large language models (LLMs) are useful, our evaluation found that only a few matched expert annotations due to biased data sources and inflexible training inputs. To overcome these limitations, we developed the LICT (Large language model-based Identifier for Cell Types) software package using a multi-model integration and “talk-to-machine” strategy. Tested across various single-cell RNA sequencing datasets, our approach significantly improved annotation reliability, especially in datasets with low cellular heterogeneity. Notably, we established an objective framework to assess annotation reliability using the “talk-to-machine” approach, which addresses discrepancies between our annotations and expert ones, enabling reliable evaluation even without reference data. This strategy enhances annotation credibility and sets the stage for advancing future LLM-based cell type annotation methods.

Installation

LICT can be installed from GitHub (https://github.com/Glowworm-cell/LICT):

remotes::install_github("Glowworm-cell/LICT")

Preparation before using LICT

To facilitate the use of large language models for cell annotation, the LICT framework has integrated five large language models, interacting with the models through APIs. Notably, the Llama3 70B model utilizes an API hosted on Baidu Cloud servers.

Within LLMCellIdentifier, interaction functions execute Python code in the R environment via the reticulate package (version 1.28), which provides interoperability between R and Python. The internal functions of the LLMCellIdentifier package automatically generate prompts that ask the large language models to identify the cell types associated with a specified set of differentially expressed genes. These prompts are sent to the models via Python, which makes the API calls to the five language models.

Before running LICT, install Python (version >= 3.9.13) and the required modules, and set up the API keys as system environment variables.

Install Python (version >= 3.9.13) and necessary modules

We recommend using the Python distribution that ships with Miniconda. Users can refer to this website for installing Miniconda: https://docs.anaconda.com/miniconda/miniconda-install/. After installation, add Python to the Windows system environment variables.

Subsequently, set the Python path within R through the RETICULATE_PYTHON environment variable. The following retrieves the Python path automatically:

path <- Sys.which("python")
Sys.setenv(RETICULATE_PYTHON = path)

Alternatively, directly set the path to Python:

# Replace the placeholder with the actual path to python.exe in your Miniconda installation
Sys.setenv(RETICULATE_PYTHON = "~/MINICONDA/python.exe")

Next, open python.exe within Miniconda and run the following code to install the necessary modules:

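import importlib
import subprocess
import sys

def install_module(module_name):
    try:
        # Try to import the specified module; pinned names such as
        # "openai==0.28.1" are not importable, so they always go to pip
        importlib.import_module(module_name)
    except ImportError:
        # If import fails, install with the pip of the running interpreter
        subprocess.check_call([sys.executable, "-m", "pip", "install", module_name])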

# List of modules to install
modules_to_install = ["anthropic==0.25.8", "openai==0.28.1", "pathlib", "textwrap", "ipython","google-generativeai"]

# Install each module
for module in modules_to_install:
    install_module(module)

Set up API key as an environment variable

To avoid the risk of exposing the API keys, users need to set them up as system environment variables before running LICT. Users can obtain the API keys for ERNIE-4.0 and Llama3 70B from https://console.bce.baidu.com/qianfan/ais/console/applicationConsole/application/v1,
for Claude 3 Opus from https://console.anthropic.com/settings/keys,
for ChatGPT-4 from https://platform.openai.com/api-keys,
and for Gemini 1.5 Pro from https://aistudio.google.com/app/apikey.

It is not necessary to provide API keys for all five large language models; keys for one or more are sufficient. The analysis will be conducted with the large language models whose API keys were entered. Please delete the code for any large language models that are not in use.

####loading package
library(reticulate)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(scales)
library(Seurat)

## Attaching SeuratObject

#### Retrieve the API key from the environment variables in the Windows operating system.
reticulate::py_run_string("
import os
import openai
ERNIE_api_key = os.getenv('ERNIE_api_key')
ERNIE_secret_key = os.getenv('ERNIE_secret_key')
GEMINI_api_key = os.getenv('GEMINI_api_key')
openai.api_key = os.getenv('openai_api_key')
Llama3_api_key = os.getenv('Llama3_api_key')
Llama3_secret_key = os.getenv('Llama3_secret_key')
ANTHROPIC_API_KEY = os.getenv('ANTHROPIC_API_KEY')
")
####Here, we have pre-configured the API key in the Windows environment variables.
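Because keys for only some of the models need to be present, it can be useful to check which models are fully configured before starting an analysis. The following is a minimal sketch of such a check, not part of LICT: the variable names are copied from the snippet above, while the grouping by model and the function name `configured_models` are illustrative assumptions.

```python
import os

# Environment variable names taken from the setup code above; the grouping
# by model is an illustrative assumption, not LICT's internal logic.
REQUIRED_KEYS = {
    "ERNIE": ["ERNIE_api_key", "ERNIE_secret_key"],
    "Llama3": ["Llama3_api_key", "Llama3_secret_key"],
    "Gemini": ["GEMINI_api_key"],
    "ChatGPT": ["openai_api_key"],
    "Claude": ["ANTHROPIC_API_KEY"],
}


def configured_models(env=None):
    """Return the models whose required variables are all set and non-empty."""
    env = os.environ if env is None else env
    return [
        model
        for model, names in REQUIRED_KEYS.items()
        if all(env.get(name) for name in names)
    ]


if __name__ == "__main__":
    print("Models with keys configured:", configured_models())
```

A model that needs two values (an API key and a secret key) is only reported when both are present.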

Alternatively, you can execute the following to set a temporary environment variable for running LICT:

####Replace it with your API key and secret_key
Sys.setenv(Llama3_api_key = 'Replace_your_key')
Sys.setenv(Llama3_secret_key = 'Replace_your_key')
Sys.setenv(ERNIE_api_key = 'Replace_your_key')
Sys.setenv(ERNIE_secret_key = 'Replace_your_key')
Sys.setenv(GEMINI_api_key = 'Replace_your_key')
Sys.setenv(openai.api_key = 'Replace_your_key')
Sys.setenv(ANTHROPIC_API_KEY = "Replace_your_key")
reticulate::py_run_string("
import os
import openai
ERNIE_api_key = os.environ['ERNIE_api_key']
ERNIE_secret_key = os.environ['ERNIE_secret_key']
GEMINI_api_key = os.environ['GEMINI_api_key']
openai.api_key = os.environ['openai.api_key']
Llama3_api_key = os.environ['Llama3_api_key']
Llama3_secret_key = os.environ['Llama3_secret_key']
ANTHROPIC_API_KEY = os.environ['ANTHROPIC_API_KEY']
")

Run LICT

In this tutorial, we analyze a gastric tumor dataset available at http://ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE206785. The data, processed and packaged into a Seurat object, can be accessed here. This object comprises 2000 single cells randomly subset from the gastric tumor dataset.

Seurat Data processing

Single-cell data with cluster information is required for downstream analysis. Users can follow the instructions listed here. In this example, we use a gastric tumor dataset with cell cluster information:

####loading gastric tumor dataset
seurat_obj = readRDS('../../gc.rds')
seurat_obj = FindVariableFeatures(seurat_obj, selection.method = "vst", nfeatures = 2000)
seurat_obj = NormalizeData(seurat_obj)
seurat_obj = ScaleData(seurat_obj)
seurat_obj = RunPCA(seurat_obj, features = VariableFeatures(object = seurat_obj))
seurat_obj = FindNeighbors(seurat_obj, dims = 1:10)
### We recommend keeping the number of clusters per object to no more than 15. If FindClusters() returns more than 15 clusters, use subset() to split the data into several Seurat objects, each containing fewer than 15 clusters, before annotating cell types with LICT.
seurat_obj = FindClusters(seurat_obj, resolution = 0.6)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2000
## Number of edges: 59094
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8814
## Number of communities: 12
## Elapsed time: 0 seconds

###check current object has annotation
unique(Idents(seurat_obj))

##  [1] 6  2  1  3  8  0  11 10 9  7  4  5 
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11
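The recommendation to keep each object to a limited number of clusters amounts to a simple batching step over the cluster IDs. The sketch below only illustrates that grouping logic; `batch_clusters` and its `max_per_batch` parameter are hypothetical names, and each resulting batch would correspond to one subset() call on the Seurat object in R.

```python
def batch_clusters(cluster_ids, max_per_batch=15):
    """Split cluster IDs into consecutive groups of at most max_per_batch."""
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    ids = list(cluster_ids)
    return [ids[i:i + max_per_batch] for i in range(0, len(ids), max_per_batch)]


# Example: 32 hypothetical clusters labelled 0..31
batches = batch_clusters(range(32))
print([len(b) for b in batches])
```

With the 12 clusters found in this tutorial, a single batch is returned and no splitting is needed.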

Make sure that seurat_obj has character or numeric cluster annotations.

####running Seurat workflow
markers <- FindAllMarkers(object = seurat_obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

Direct LLMs for cell type annotation

We use the LLMCellType() function to direct all large language models to annotate cell types simultaneously.

####loading package
library(LICT)
####input FindAllMarkers() function result, species, top gene number, and tissuename
LLMCelltype_res = LLMCellType(FindAllMarkersResult = markers,
                             species = 'human',
                             topgenenumber = 10,
                             tissuename = 'gastric tumor')

## [1] "ERNIE is analyzing"
## [1] "You have already given the answer in the required format; I will reorganize it once more to ensure clarity:\n\n1: Cytotoxic T cells\n2: T cells\n3: Macrophages\n4: B cells\n5: B cells\n6: Fibroblasts\n7: B cells\n8: Gastric epithelial cells\n9: Fibroblasts\n10: Endothelial cells\n11: Smooth muscle cells\n12: Enterocytes"
## [1] "Gemini is analyzing"
## [1] "> 1: Cytotoxic T cells \n> 2: T cells \n> 3: Macrophages \n> 4: Plasma cells \n> 5: B cells \n> 6: Fibroblasts \n> 7: Plasma cells \n> 8: Gastric chief cells\n> 9: Fibroblasts \n> 10: Endothelial cells \n> 11: Smooth muscle cells \n> 12:  undefined \n"
## [1] "ChatGPT is analyzing"
## [1] "1: Cytotoxic T cells\n2: T cells\n3: Myeloid cells\n4: B cells\n5: B cells\n6: Fibroblasts\n7: Plasma cells\n8: Epithelial cells\n9: Fibroblasts\n10: Endothelial cells\n11: Smooth muscle cells\n12: Enterocytes"
## [1] "Llama is analyzing"
## [1] "1: Cytotoxic T cell\n2: T cell\n3: Myeloid cell\n4: Plasma cell\n5: B cell\n6: Fibroblast\n7: Plasma cell\n8: Gastric epithelial cell\n9: Fibroblast\n10: Endothelial cell\n11: Smooth muscle cell\n12: Enterocyte"
## [1] "Claude is analyzing"
## [1] "1: T cells\n2: T cells\n3: Macrophages\n4: Plasma cells\n5: B cells\n6: Fibroblasts\n7: Plasma cells\n8: Epithelial cells\n9: Fibroblasts\n10: Endothelial cells\n11: Smooth muscle cells\n12: Enterocytes"

The results of each LLM are stored in a list under the names ERNIE, Gemini, GPT, Llama, and Claude, respectively. Users can access these results by name, e.g. LLMCelltype_res$ERNIE:

rownames(LLMCelltype_res$ERNIE) = seq(1,nrow(LLMCelltype_res$ERNIE))
head(LLMCelltype_res$ERNIE)

Alternatively, users can invoke individual large language models with GPTCellType(), Llama3CellType(), ERNIECellType(), ClaudeCellType(), and GeminiCellType().
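The raw responses printed by the models follow a "number: cell type" layout, sometimes with a leading ">" or an introductory sentence. As a rough illustration of how such text can be turned into a lookup table, here is a small parser; it is a sketch written for this tutorial's output format, and `parse_annotations` is a hypothetical helper rather than a LICT function.

```python
import re

# Matches lines such as "3: Macrophages" or "> 12:  undefined"
LINE_PATTERN = re.compile(r"^\s*>?\s*(\d+)\s*:\s*(.*?)\s*$")


def parse_annotations(response):
    """Map 1-based row number -> cell type label, ignoring other lines."""
    annotations = {}
    for line in response.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            annotations[int(match.group(1))] = match.group(2)
    return annotations


gemini_style = "> 1: Cytotoxic T cells \n> 2: T cells \n> 12:  undefined \n"
print(parse_annotations(gemini_style))
```

Lines that do not start with a row number, such as an introductory sentence, are skipped.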

Evaluation of LLM’s cell annotation results

To check whether the LLMs returned reliable results, we can use the Validate() function to evaluate them. The following parameters are needed:
1 The LLM_res parameter, the results generated by LLMCellType();
2 The previously loaded Seurat object;
3 The threshold for defining positive genes. Here, we use 0.8 as the threshold, as stated in the article. This means that cell type annotations with four or more positive marker genes (expressed in over 80% of cells) are considered validated;
4 The species of the dataset, here 'human'.
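The validation rule (a marker counts as positive when it is expressed in over 80% of cells, and an annotation passes with four or more positive markers) can be sketched as follows. This illustrates the rule as described, not LICT's implementation. It assumes that "expressed" means a non-zero count, that the fraction is computed within one cluster, and that genes absent from the data count as negative; all function names and the toy counts are hypothetical.

```python
def fraction_expressing(counts):
    """Fraction of cells with a non-zero count for one gene."""
    return sum(1 for c in counts if c > 0) / len(counts)


def classify_markers(cluster_counts, marker_genes, percent=0.8):
    """Label each marker positive or negative for one cluster.

    cluster_counts maps gene -> list of per-cell counts within the cluster.
    Genes absent from the data are treated as negative (an assumption).
    """
    labels = {}
    for gene in marker_genes:
        counts = cluster_counts.get(gene)
        is_positive = bool(counts) and fraction_expressing(counts) > percent
        labels[gene] = "positive_marker" if is_positive else "negative_marker"
    return labels


def is_validated(labels, min_positive=4):
    """An annotation passes when at least min_positive markers are positive."""
    return sum(v == "positive_marker" for v in labels.values()) >= min_positive


# Toy cluster of 5 cells (hypothetical counts)
toy = {
    "CD3D": [2, 1, 3, 1, 4],
    "CD3E": [1, 1, 2, 1, 1],
    "CD7": [1, 2, 1, 1, 3],
    "IL32": [5, 4, 6, 3, 2],
    "FOXP3": [0, 0, 1, 0, 0],
}
labels = classify_markers(toy, ["CD3D", "CD3E", "CD7", "IL32", "FOXP3", "CD4"])
print(labels, is_validated(labels))
```

In the toy data, four genes are detected in every cell, FOXP3 in one cell of five, and CD4 is missing, so the annotation just reaches the four-marker requirement.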

####input LLMCelltype() function result, seurat_obj, and Percent
Validate_res = Validate(LLM_res = LLMCelltype_res, seurat_obj = seurat_obj, Percent = 0.8, species = 'human')

## [1] "list"
## [1] "Provide key marker genes for the following human cell types, with 15 key marker genes per cell type. Provide only the abbreviated gene names of key marker genes, full names are not required:\nrow 1 : Cytotoxic T cells\nrow 2 : T cells\nrow 3 : Macrophages\nrow 4 : B cells\nrow 5 : B cells\nrow 6 : Fibroblasts\nrow 7 : B cells\nrow 8 : Gastric epithelial cells\nrow 9 : Fibroblasts\nrow 10 : Endothelial cells\nrow 11 : Smooth muscle cells\nrow 12 : Enterocytes\nThe format of the final response should be:\n\row1: gene1, gene2, gene3\nrow2: gene1, gene2, gene3\nrowN: gene1, gene2, gene3\n\n...where rowN represents the row number and gene1, gene2, gene3 represent key marker genes.Do genes that start with \"CD\" have alternative names? If they do, please use the aliases. For example, CD105 should be displayed only as ENG, not as CD105."
## [1] "row1: GZMB, PRF1, CD8A, IFNG, TBX21, EOMES, CD3E, CD3D, CD3G, CD247, NKG7, HAVCR2, KLRD1, KLRC1, KLRC2\nrow2: CD3E, CD3D, CD3G, CD4, CD8A, CTLA4, FOXP3, GZMB, IFNG, TBX21, IL2, CCR7, SELL, CD28, ICOS\nrow3: CD68, CD163, CD64 (FCGR1A), CD14, CD16 (FCGR3A), ITGAM, MRC1, EMR1, AIF1, MS4A4A, FCGR2A, CCR2, IL10, TNF, CX3CR1\nrow4: CD19, CD20 (MS4A1), CD79A, CD79B, CD22, CD21 (CR2), BANK1, BLNK, MS4A1, IGHD, IGHM, IGLL1, CD38, CD24, CD27\nrow5: CD19, CD20 (MS4A1), CD79A, CD79B, CD22, CD21 (CR2), BANK1, BLNK, MS4A1, IGHD, IGHM, IGLL1, CD38, CD24, CD27\nrow6: COL1A1, COL1A2, COL3A1, COL5A2, FAP, ACTA2 (ACT2), PDGFRB, VIM, THY1 (CD90), FBN1, ZEB1, S100A4, FN1, SPARC, LOX\nrow7: CD19, CD20 (MS4A1), CD79A, CD79B, CD22, CD21 (CR2), BANK1, BLNK, MS4A1, IGHD, IGHM, IGLL1, CD38, CD24, CD27\nrow8: ATP4A, ATP4B, GKN1, GKN2, MUC5AC, MUC6, TFF1, TFF2, PGA5, PEPSINOGEN C, GHRL, SST, LIPF, CCKBR, CTSE\nrow9: COL1A1, COL1A2, COL3A1, COL5A2, FAP, ACTA2 (ACT2), PDGFRB, VIM, THY1 (CD90), FBN1, ZEB1, S100A4, FN1, SPARC, LOX\nrow10: CD31 (PECAM1), VWF, CD34, CDH5, ENG, F8, KDR, TEK, CLDN5, OCLN, FLI1, ERG, RAMP2, NRP1, GATA2\nrow11: ACTA2 (ACT2), CNN1, TAGLN, MYH11, CALD1, MYLK, TPM2, SMTN, ACTG2, LMOD1, MYL9, RGS5, ITGA8, MYOCD, GUCY1A3\nrow12: FABP2, VIL1, SI, LCT, SLC5A1, ALPI, GLUT2, CDH17, MUC2, XPNPEP2, DPP4, PGA3, AQP8, MRP2, GSTA"

## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
## ℹ The deprecated feature was likely used in the LICT package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## [1] "Provide key marker genes for the following human cell types, with 15 key marker genes per cell type. Provide only the abbreviated gene names of key marker genes, full names are not required:\nrow 1 : Cytotoxic T cells \nrow 2 : T cells \nrow 3 : Macrophages \nrow 4 : Plasma cells \nrow 5 : B cells \nrow 6 : Fibroblasts \nrow 7 : Plasma cells \nrow 8 : Gastric chief cells\nrow 9 : Fibroblasts \nrow 10 : Endothelial cells \nrow 11 : Smooth muscle cells \nrow 12 :  undefined \nThe format of the final response should be:\n\row1: gene1, gene2, gene3\nrow2: gene1, gene2, gene3\nrowN: gene1, gene2, gene3\n\n...where rowN represents the row number and gene1, gene2, gene3 represent key marker genes.Do genes that start with \"CD\" have alternative names? If they do, please use the aliases. For example, CD105 should be displayed only as ENG, not as CD105."
## [1] "row1: GZMB, PRF1, CD8A, IFNG, TBX21, EOMES, CD3E, CD3D, CD3G, CD247, NKG7, HAVCR2, KLRD1, KLRC1, KLRC2  \nrow2: CD3E, CD3D, CD3G, CD4, CD8A, CTLA4, FOXP3, GZMB, IFNG, TBX21, IL2, CCR7, SELL, CD28, ICOS  \nrow3: CD68, CD163, CD64, CD14, CD16, ITGAM, MRC1, EMR1, AIF1, MS4A4A, FCGR2A, CCR2, IL10, TNF, CX3CR1  \nrow4: CD19, CD20, CD79A, CD79B, CD22, CD21, BANK1, BLNK, MS4A1, IGHD, IGHM, IGLL1, CD38, CD24, CD27  \nrow5: CD19, CD20, CD79A, CD79B, CD22, CD21, BANK1, BLNK, MS4A1, IGHD, IGHM, IGLL1, CD38, CD24, CD27  \nrow6: COL1A1, COL1A2, COL3A1, COL5A2, FAP, ACTA2, PDGFRB, VIM, THY1, FBN1, ZEB1, S100A4, FN1, SPARC, LOX  \nrow7: CD19, CD20, CD79A, CD79B, CD22, CD21, BANK1, BLNK, MS4A1, IGHD, IGHM, IGLL1, CD38, CD24, CD27  \nrow8: ATP4A, ATP4B, GKN1, GKN2, MUC5AC, MUC6, TFF1, TFF2, PGA5, PEPSINOGEN C, GHRL, SST, LIPF, CCKBR, CTSE  \nrow9: COL1A1, COL1A2, COL3A1, COL5A2, FAP, ACTA2, PDGFRB, VIM, THY1, FBN1, ZEB1, S100A4, FN1, SPARC, LOX  \nrow10: CD31, VWF, CD34, CDH5, ENG, F8, KDR, TEK, CLDN5, OCLN, FLI1, ERG, RAMP2, NRP1, GATA2  \nrow11: ACTA2, CNN1, TAGLN, MYH11, CALD1, MYLK, TPM2, SMTN, ACTG2, LMOD1, MYL9, RGS5, ITGA8, MYOCD, GUCY1A3  \nrow12: FABP2, VIL1, SI, LCT, SLC5A1, ALPI, GLUT2, CDH17, MUC2, XPNPEP2, DPP4, PGA3, AQP8, MRP2, GSTA  "
## [1] "Provide key marker genes for the following human cell types, with 15 key marker genes per cell type. Provide only the abbreviated gene names of key marker genes, full names are not required:\nrow 1 : Cytotoxic T cells\nrow 2 : T cells\nrow 3 : Myeloid cells\nrow 4 : B cells\nrow 5 : B cells\nrow 6 : Fibroblasts\nrow 7 : Plasma cells\nrow 8 : Epithelial cells\nrow 9 : Fibroblasts\nrow 10 : Endothelial cells\nrow 11 : Smooth muscle cells\nrow 12 : Enterocytes\nThe format of the final response should be:\n\row1: gene1, gene2, gene3\nrow2: gene1, gene2, gene3\nrowN: gene1, gene2, gene3\n\n...where rowN represents the row number and gene1, gene2, gene3 represent key marker genes.Do genes that start with \"CD\" have alternative names? If they do, please use the aliases. For example, CD105 should be displayed only as ENG, not as CD105."
## [1] "row1: GZMB, PRF1, CD8A, IFNG, TBX21, EOMES, CD3E, CD3D, CD3G, CD247, NKG7, HAVCR2, KLRD1, KLRC1, KLRC2  \nrow2: CD3E, CD3D, CD3G, CD4, CD8A, CTLA4, FOXP3, GZMB, IFNG, TBX21, IL2, CCR7, SELL, CD28, ICOS  \nrow3: CD68, CD163, CD64, CD14, CD16, ITGAM, MRC1, EMR1, AIF1, MS4A4A, FCGR2A, CCR2, IL10, TNF, CX3CR1  \nrow4: CD19, MS4A1, CD79A, CD79B, CD22, CD21, BANK1, BLNK, MS4A1, IGHD, IGHM, IGLL1, CD38, CD24, CD27  \nrow5: CD19, MS4A1, CD79A, CD79B, CD22, CD21, BANK1, BLNK, MS4A1, IGHD, IGHM, IGLL1, CD38, CD24, CD27  \nrow6: COL1A1, COL1A2, COL3A1, COL5A2, FAP, ACTA2, PDGFRB, VIM, THY1, FBN1, ZEB1, S100A4, FN1, SPARC, LOX  \nrow7: CD19, MS4A1, CD79A, CD79B, CD22, CD21, BANK1, BLNK, MS4A1, IGHD, IGHM, IGLL1, CD38, CD24, CD27  \nrow8: ATP4A, ATP4B, GKN1, GKN2, MUC5AC, MUC6, TFF1, TFF2, PGA5, PEPSINOGEN C, GHRL, SST, LIPF, CCKBR, CTSE  \nrow9: COL1A1, COL1A2, COL3A1, COL5A2, FAP, ACTA2, PDGFRB, VIM, THY1, FBN1, ZEB1, S100A4, FN1, SPARC, LOX  \nrow10: CD31, VWF, CD34, CDH5, ENG, F8, KDR, TEK, CLDN5, OCLN, FLI1, ERG, RAMP2, NRP1, GATA2  \nrow11: ACTA2, CNN1, TAGLN, MYH11, CALD1, MYLK, TPM2, SMTN, ACTG2, LMOD1, MYL9, RGS5, ITGA8, MYOCD, GUCY1A3  \nrow12: FABP2, VIL1, SI, LCT, SLC5A1, ALPI, GLUT2, CDH17, MUC2, XPNPEP2, DPP4, PGA3, AQP8, MRP2, GSTA  \n\nFor genes that start with \"CD\", I've replaced those for which there is a common alias that does not include \"CD\" in the listing (e.g., ENG for CD105, if it were listed). If any CD markers mentioned do not have commonly used alternative names outside of the CD nomenclature, they have been left as-is."
## [1] "Provide key marker genes for the following human cell types, with 15 key marker genes per cell type. Provide only the abbreviated gene names of key marker genes, full names are not required:\nrow 1 : Cytotoxic T cell\nrow 2 : T cell\nrow 3 : Myeloid cell\nrow 4 : Plasma cell\nrow 5 : B cell\nrow 6 : Fibroblast\nrow 7 : Plasma cell\nrow 8 : Gastric epithelial cell\nrow 9 : Fibroblast\nrow 10 : Endothelial cell\nrow 11 : Smooth muscle cell\nrow 12 : Enterocyte\nThe format of the final response should be:\n\row1: gene1, gene2, gene3\nrow2: gene1, gene2, gene3\nrowN: gene1, gene2, gene3\n\n...where rowN represents the row number and gene1, gene2, gene3 represent key marker genes.Do genes that start with \"CD\" have alternative names? If they do, please use the aliases. For example, CD105 should be displayed only as ENG, not as CD105."
## [1] "row1: GZMB, PRF1, CD8A, IFNG, TBX21, EOMES, CD3E, CD3D, CD3G, CD247, NKG7, HAVCR2, KLRD1, KLRC1, KLRC2\nrow2: CD3E, CD3D, CD3G, CD4, CD8A, CTLA4, FOXP3, GZMB, IFNG, TBX21, IL2, CCR7, SELL, CD28, ICOS\nrow3: CD68, CD163, CD64, CD14, CD16, ITGAM, MRC1, EMR1, AIF1, MS4A4A, FCGR2A, CCR2, IL10, TNF, CX3CR1\nrow4: CD19, MS4A1, CD79A, CD79B, CD22, CD21, BANK1, BLNK, MS4A1, IGHD, IGHM, IGLL1, CD38, CD24, CD27\nrow5: CD19, MS4A1, CD79A, CD79B, CD22, CD21, BANK1, BLNK, MS4A1, IGHD, IGHM, IGLL1, CD38, CD24, CD27\nrow6: COL1A1, COL1A2, COL3A1, COL5A2, FAP, ACTA2, PDGFRB, VIM, THY1, FBN1, ZEB1, S100A4, FN1, SPARC, LOX\nrow7: CD19, MS4A1, CD79A, CD79B, CD22, CD21, BANK1, BLNK, MS4A1, IGHD, IGHM, IGLL1, CD38, CD24, CD27\nrow8: ATP4A, ATP4B, GKN1, GKN2, MUC5AC, MUC6, TFF1, TFF2, PGA5, PEPSINOGEN C, GHRL, SST, LIPF, CCKBR, CTSE\nrow9: COL1A1, COL1A2, COL3A1, COL5A2, FAP, ACTA2, PDGFRB, VIM, THY1, FBN1, ZEB1, S100A4, FN1, SPARC, LOX\nrow10: CD31, VWF, CD34, CDH5, ENG, F8, KDR, TEK, CLDN5, OCLN, FLI1, ERG, RAMP2, NRP1, GATA2\nrow11: ACTA2, CNN1, TAGLN, MYH11, CALD1, MYLK, TPM2, SMTN, ACTG2, LMOD1, MYL9, RGS5, ITGA8, MYOCD, GUCY1A3\nrow12: FABP2, VIL1, SI, LCT, SLC5A1, ALPI, GLUT2, CDH17, MUC2, XPNPEP2, DPP4, PGA3, AQP8, MRP2, GSTA"
## [1] "Provide key marker genes for the following human cell types, with 15 key marker genes per cell type. Provide only the abbreviated gene names of key marker genes, full names are not required:\nrow 1 : T cells\nrow 2 : T cells\nrow 3 : Macrophages\nrow 4 : Plasma cells\nrow 5 : B cells\nrow 6 : Fibroblasts\nrow 7 : Plasma cells\nrow 8 : Epithelial cells\nrow 9 : Fibroblasts\nrow 10 : Endothelial cells\nrow 11 : Smooth muscle cells\nrow 12 : Enterocytes\nThe format of the final response should be:\n\row1: gene1, gene2, gene3\nrow2: gene1, gene2, gene3\nrowN: gene1, gene2, gene3\n\n...where rowN represents the row number and gene1, gene2, gene3 represent key marker genes.Do genes that start with \"CD\" have alternative names? If they do, please use the aliases. For example, CD105 should be displayed only as ENG, not as CD105."
## [1] "row1: GZMB, PRF1, CD8A, IFNG, TBX21, EOMES, CD3E, CD3D, CD3G, CD247, NKG7, HAVCR2, KLRD1, KLRC1, KLRC2\nrow2: CD3E, CD3D, CD3G, CD4, CD8A, CTLA4, FOXP3, GZMB, IFNG, TBX21, IL2, CCR7, SELL, CD28, ICOS\nrow3: CD68, CD163, CD64, CD14, CD16, ITGAM, MRC1, EMR1, AIF1, MS4A4A, FCGR2A, CCR2, IL10, TNF, CX3CR1\nrow4: CD19, MS4A1, CD79A, CD79B, CD22, CD21, BANK1, BLNK, MS4A1, IGHD, IGHM, IGLL1, CD38, CD24, CD27\nrow5: CD19, MS4A1, CD79A, CD79B, CD22, CD21, BANK1, BLNK, MS4A1, IGHD, IGHM, IGLL1, CD38, CD24, CD27\nrow6: COL1A1, COL1A2, COL3A1, COL5A2, FAP, ACTA2, PDGFRB, VIM, THY1, FBN1, ZEB1, S100A4, FN1, SPARC, LOX\nrow7: CD19, MS4A1, CD79A, CD79B, CD22, CD21, BANK1, BLNK, MS4A1, IGHD, IGHM, IGLL1, CD38, CD24, CD27\nrow8: ATP4A, ATP4B, GKN1, GKN2, MUC5AC, MUC6, TFF1, TFF2, PGA5, PEPSINOGEN C, GHRL, SST, LIPF, CCKBR, CTSE\nrow9: COL1A1, COL1A2, COL3A1, COL5A2, FAP, ACTA2, PDGFRB, VIM, THY1, FBN1, ZEB1, S100A4, FN1, SPARC, LOX\nrow10: CD31, VWF, CD34, CDH5, ENG, F8, KDR, TEK, CLDN5, OCLN, FLI1, ERG, RAMP2, NRP1, GATA2\nrow11: ACTA2, CNN1, TAGLN, MYH11, CALD1, MYLK, TPM2, SMTN, ACTG2, LMOD1, MYL9, RGS5, ITGA8, MYOCD, GUCY1A3\nrow12: FABP2, VIL1, SI, LCT, SLC5A1, ALPI, GLUT2, CDH17, MUC2, XPNPEP2, DPP4, PGA3, AQP8, MRP2, GSTA"

Validate() automatically labels each input gene as positive_marker or negative_marker. The results for each LLM are gathered and stored in a list, and users can access them by LLM name (ERNIE, Gemini, GPT, Llama, Claude). For example, for the ERNIE validation result:

head(Validate_res$ERNIE[,c(1,2,4,5)])

Calculating the reliability of LLMs annotation results

Cell type annotations with four or more positive marker genes (expressed in over 80% of cells) are considered reliable. In the “Total reliable” column, a LICT annotation is considered ‘reliable’ if more than one of the five LLMs' cell type annotations is ‘reliable’.

Reliable_Df = Reliable_Df(Validate_res)
Reliable_Df$Clusters = seq(0,nrow(Reliable_Df)-1)
Reliable_Df = Reliable_Df[,c(7,1:6)]
Reliable_Df
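The aggregation into a "Total reliable" column can be pictured as a per-cluster count over the models. The sketch below follows the wording above ("more than one" reliable model) and exposes that cutoff as the hypothetical parameter `min_models`; the per-model flags are made-up toy values, not results from this dataset.

```python
def total_reliable(per_model, min_models=2):
    """per_model maps model name -> list of booleans, one per cluster.

    min_models=2 encodes "more than one" reliable model; treat it as an
    adjustable assumption rather than LICT's confirmed rule.
    """
    n_clusters = len(next(iter(per_model.values())))
    return [
        sum(per_model[m][i] for m in per_model) >= min_models
        for i in range(n_clusters)
    ]


# Toy reliability flags for three clusters (hypothetical values)
toy_flags = {
    "ERNIE": [True, False, False],
    "Gemini": [True, False, True],
    "GPT": [True, False, False],
    "Llama": [True, True, False],
    "Claude": [False, False, True],
}
print(total_reliable(toy_flags))
```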

Talk-to-machine

If most cell annotations fail to be reliable, LICT applies another strategy, ‘talk-to-machine’, to refine the LLMs' responses. The positive and negative marker genes, together with additional differentially expressed genes from the original dataset, are provided to each LLM along with a request to update its cell annotation. This strategy is carried out through Feedback_Info(). Users need to provide:
1. the validation result from Validate(), converted with Validate_Result_to_Df(),
2. the differential expression gene table generated by Seurat::FindAllMarkers().

Validate_Result_to_Df = Validate_Result_to_Df(Validate_res)
####input the Validate_Result_to_Df() result, the range of ranked DEGs from the FindAllMarkers() result to pass to the LLMs as additional genes, and the DEG table. Here we use the DEGs ranked 11 to 20.
interacted_res = Feedback_Info(Validate_Result_to_Df, 11, 20, markers)
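The two numeric arguments select which ranked DEGs are passed on as additional evidence. A minimal sketch of that selection is shown below. It assumes the DEG table is already ordered within each cluster as FindAllMarkers() returned it and is reduced here to (cluster, gene) pairs; `next_top_degs` and the placeholder gene names are hypothetical.

```python
from collections import defaultdict


def next_top_degs(marker_rows, start=11, end=20):
    """Return genes ranked start..end (1-based, inclusive) for each cluster.

    marker_rows is a list of (cluster, gene) pairs, assumed to be in the
    order in which the DEG table lists them.
    """
    by_cluster = defaultdict(list)
    for cluster, gene in marker_rows:
        by_cluster[cluster].append(gene)
    return {c: genes[start - 1:end] for c, genes in by_cluster.items()}


# Hypothetical table: 25 placeholder genes for cluster "0", 12 for cluster "1"
rows = [("0", f"geneA{i}") for i in range(1, 26)]
rows += [("1", f"geneB{i}") for i in range(1, 13)]
extra = next_top_degs(rows)
print(extra["0"])  # geneA11 .. geneA20
print(extra["1"])  # geneB11, geneB12 (fewer than 10 remain)
```

Clusters with fewer than 20 DEGs simply contribute whatever genes remain after rank 10.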

## [1] "more Differential expressed gene were extracted from provided DEG table.."
## [1] "positive_marker:"
##  [1] ""                                                                                    
##  [2] "CD69,TRBC2,GIMAP7,RP11.138A9.2,ID2,CD52,CD3E,CD7,IL32,EVL,CD3D"                      
##  [3] "MS4A6A,LYZ,PLAUR,BCL2A1,G0S2,LST1,MS4A7,C5AR1,C15orf48,CXCL2,AIF1"                   
##  [4] "TXNDC5,RP11.290F5.1,IGKV1.12,IGHA2,SEC11C,JSRP1,CD79A,PRDX4,IGKV1.5,UBE2J1"          
##  [5] "HLA.DRA,FCRLA,HLA.DMB,CD22,CD37,LINC00926,HLA.DQA2,HLA.DPA1,SMIM14,STAG3,CD79A,MS4A1"
##  [6] ""                                                                                    
##  [7] "XBP1,CD79A,IGHA1,JCHAIN,FKBP11,IGKC,FKBP2,IGHM,SEC11C,TNFRSF17"                      
##  [8] "SPINK1,MUC1,KRT8,GKN2,LCN2,CA2,KRT18,KRT19,PIGR,SULT1C2,TFF1,TFF2,CTSE"              
##  [9] ""                                                                                    
## [10] ""                                                                                    
## [11] ""                                                                                    
## [12] "SMIM24,FABP2,CLDN3,APOB,SERPINA1,PIGR,PCK1,CLDN4,ADIRF,CES2"                         
## [1] "negative_marker:"
##  [1] ""                                                                 
##  [2] "CD3G,CD4,CD8A,CTLA4,FOXP3,GZMB,IFNG,TBX21,IL2,CCR7,SELL,CD28,ICOS"
##  [3] "CD68,CD163,CD14,ITGAM,MRC1,MS4A4A,FCGR2A,CCR2,IL10,TNF,CX3CR1"    
##  [4] "CD19,CD79B,CD22,BANK1,BLNK,IGHM,CD38,CD27"                        
##  [5] "CD19,CD79B,CD22,BANK1,BLNK,IGHD,IGHM,IGLL1,CD38,CD24,CD27"        
##  [6] ""                                                                 
##  [7] "CD19,CD79B,CD22,BANK1,BLNK,MS4A1,IGHD,IGHM,IGLL1,CD38,CD24,CD27"  
##  [8] "ATP4A,ATP4B,GKN1,GKN2,MUC5AC,MUC6,PGA5,LIPF,CCKBR"                
##  [9] ""                                                                 
## [10] ""                                                                 
## [11] ""                                                                 
## [12] "VIL1,SI,LCT,SLC5A1,ALPI,CDH17,XPNPEP2,DPP4"                       
## [1] "ERNIE is analyzing"
## [1] "Here is the extracted list:\n\n1: Cytotoxic T cells\n2: T cells\n3: Neutrophils\n4: Plasma cells\n5: B cells\n6: Fibroblasts\n7: Plasma cells\n8: Gastric epithelial cells\n9: Fibroblasts\n10: Endothelial cells\n11: Smooth muscle cells\n12: Enterocytes"
## [1] "Llama is analyzing"
## [1] "1: Cytotoxic T cell\n2: T cell\n3: Myeloid cell\n4: Plasma cell\n5: B cell\n6: Fibroblast\n7: Plasma cell\n8: Gastric epithelial cell\n9: Fibroblast\n10: Endothelial cell\n11: Smooth muscle cell\n12: Enterocyte"
## [1] "Gemini is analyzing"
## [1] "> 1: Cytotoxic T cells \n> 2:  T cells\n> 3: Macrophages\n> 4: Plasma cells\n> 5:  B cells\n> 6: Fibroblasts \n> 7: Plasma cells\n> 8:  Mucous neck cells\n> 9: Fibroblasts \n> 10: Endothelial cells \n> 11: Smooth muscle cells \n> 12:  Hepatocytes \n"
## [1] "ChatGPT is analyzing"
## [1] "1: Cytotoxic T cells\n2: Naive T cells\n3: Neutrophils\n4: Plasmablasts\n5: Memory B cells\n6: Fibroblasts\n7: Plasma cells\n8: Gastric gland mucous cells\n9: Fibroblasts\n10: Endothelial cells\n11: Smooth muscle cells\n12: Intestinal epithelial cells"
## [1] "Claude is analyzing"
## [1] "1: CD8+ T cells\n2: CD4+ T cells\n3: Monocytes\n4: Plasma cells\n5: Naive B cells\n6: Fibroblasts\n7: Plasma cells\n8: Epithelial cells\n9: Fibroblasts\n10: Endothelial cells\n11: Smooth muscle cells\n12: Goblet cells"

Feedback_Info() automatically feeds this information back to the LLMs and asks each LLM to update its cell type annotation. The results for each LLM are gathered and stored in a list, and users can access them by LLM name (ERNIE, Gemini, GPT, Llama, Claude). For example, for the updated ERNIE annotation:

rownames(interacted_res$ERNIE) = seq(1,nrow(interacted_res$ERNIE))
head(interacted_res$ERNIE)

If users already have a desired marker gene list, they can manually supply positive_gene and negative_gene for the feedback interaction via LLM_interect(), using the following format and substituting the quoted gene symbols with their desired genes.

inter_res = LLM_interect(positive_gene = list(
  cluster0 = c(),
  cluster1 = c(strsplit(c('CD69,TRBC2,GIMAP7,RP11.138A9.2,ID2,CD52,CD3E,CD7,IL32,EVL,CD3D'),',')),
  cluster2 = c(strsplit(c('MS4A6A,LYZ,PLAUR,BCL2A1,G0S2,LST1,MS4A7,C5AR1,C15orf48,CXCL2,AIF1'),',')),
  cluster3 = c(strsplit(c('TXNDC5,RP11.290F5.1,IGKV1.12,IGHA2,SEC11C,JSRP1,CD79A,PRDX4,IGKV1.5,UBE2J1'),',')),
  cluster4 = c(strsplit(c('HLA.DRA,FCRLA,HLA.DMB,CD22,CD37,LINC00926,HLA.DQA2,HLA.DPA1,SMIM14,STAG3,CD79A,MS4A1'),',')),
  cluster5 = c(),
  cluster6 = c(strsplit(c('XBP1,CD79A,IGHA1,JCHAIN,FKBP11,IGKC,FKBP2,IGHM,SEC11C,TNFRSF17'),',')),
  cluster7 = c(strsplit(c('SPINK1,MUC1,KRT8,GKN2,LCN2,CA2,KRT18,KRT19,PIGR,SULT1C2,TFF1,TFF2,CTSE'),',')),
  cluster8 = c(),
  cluster9 = c(),
  cluster10 = c(),
  cluster11 = c(strsplit(c('SMIM24,FABP2,CLDN3,APOB,SERPINA1,PIGR,PCK1,CLDN4,ADIRF,CES2'),','))
  ),
  negative_gene = list(
    cluster0 = c(),
    cluster1 = c(strsplit(c('CD3G,CD4,CD8A,CTLA4,FOXP3,GZMB,IFNG,TBX21,IL2,CCR7,SELL,CD28,ICOS'),',')),
    cluster2 = c(strsplit(c('CD68,CD163,CD14,ITGAM,MRC1,MS4A4A,FCGR2A,CCR2,IL10,TNF,CX3CR1'),',')),
    cluster3 = c(strsplit(c('CD19,CD79B,CD22,BANK1,BLNK,IGHM,CD38,CD27'),',')),
    cluster4 = c(strsplit(c('CD19,CD79B,CD22,BANK1,BLNK,IGHD,IGHM,IGLL1,CD38,CD24,CD27'),',')),
    cluster5 = c(),
    cluster6 = c(strsplit(c('CD19,CD79B,CD22,BANK1,BLNK,MS4A1,IGHD,IGHM,IGLL1,CD38,CD24,CD27'),',')),
    cluster7 = c(strsplit(c('ATP4A,ATP4B,GKN1,GKN2,MUC5AC,MUC6,PGA5,LIPF,CCKBR'),',')),
    cluster8 = c(),
    cluster9 = c(),
    cluster10 = c(),
    cluster11 = c(strsplit(c('VIL1,SI,LCT,SLC5A1,ALPI,CDH17,XPNPEP2,DPP4'),','))
    ))

Evaluation of LLM’s cell annotation results with feedback

feedback_Validate = Validate(LLM_res = interacted_res, seurat_obj = seurat_obj, Percent = 0.8, species = 'human')

## [1] "list"
## [1] "Provide key marker genes for the following human cell types, with 15 key marker genes per cell type. Provide only the abbreviated gene names of key marker genes, full names are not required:\nrow 1 : Cytotoxic T cells\nrow 2 : T cells\nrow 3 : Neutrophils\nrow 4 : Plasma cells\nrow 5 : B cells\nrow 6 : Fibroblasts\nrow 7 : Plasma cells\nrow 8 : Gastric epithelial cells\nrow 9 : Fibroblasts\nrow 10 : Endothelial cells\nrow 11 : Smooth muscle cells\nrow 12 : Enterocytes\nThe format of the final response should be:\n\row1: gene1, gene2, gene3\nrow2: gene1, gene2, gene3\nrowN: gene1, gene2, gene3\n\n...where rowN represents the row number and gene1, gene2, gene3 represent key marker genes.Do genes that start with \"CD\" have alternative names? If they do, please use the aliases. For example, CD105 should be displayed only as ENG, not as CD105."
## [1] "row1: GZMB, PRF1, CD8A, CD8B, IFNG, TBX21, EOMES, CCL5, CXCL9, CXCL10, NKG7, HAVCR2, KLRC1, KLRD1, KLRF1\n\nrow2: CD3D, CD3E, CD3G, CD4, CD5, CD2, CD28, CTLA4, ICOS, GZMK, CCR7, LCK, ZAP70, FOXP3, IL2RA\n\nrow3: S100A8, S100A9, CD11b (ITGAM), CD16 (FCGR3B), CEACAM8, CXCR1, CXCR2, MPO, LTF, ELANE, PRTN3, AZU1, CTSG, NCF2, FPR1\n\nrow4: MZB1, CD38, CD138 (SDC1), XBP1, BLIMP1 (PRDM1), JCHAIN, IRF4, CD27, CD319 (SLAMF7), FCRL5, TNFRSF17, CXCR4, IGHG1, IGKC, IGLC2\n\nrow5: CD19, CD20 (MS4A1), CD79A, CD79B, CD9, CD24, CD37, CD40, BANK1, MS4A1, PAX5, BLK, HLA-DRA, HLA-DRB1, IGHD\n\nrow6: COL1A1, COL1A2, COL3A1, FAP, FN1, PDGFRB, S100A4, ACTA2, VIM, DDR2, SPARC, FBN1, MMP1, MMP2, MMP3\n\nrow7: MZB1, CD38, CD138 (SDC1), XBP1, BLIMP1 (PRDM1), JCHAIN, IRF4, CD27, CD319 (SLAMF7), FCRL5, TNFRSF17, CXCR4, IGHG1, IGKC, IGLC2\n\nrow8: TFF1, TFF2, MUC5AC, MUC1, GAST, PGA5, ATP4A, ATP4B, GHRL, SST, LGR5, CDX2, PGC, KRT20, GKN1\n\nrow9: COL1A1, COL1A2, COL3A1, FAP, FN1, PDGFRB, S100A4, ACTA2, VIM, DDR2, SPARC, FBN1, MMP1, MMP2, MMP3\n\nrow10: PECAM1, VWF, CDH5, KDR, TIE1, ENG, CLDN5, TEK, FLT1, FLT4, CLEC14A, ROBO4, CD34, SELE, SELP\n\nrow11: ACTA2, MYH11, CNN1, TAGLN, SM22, CALD1, LMOD1, MYLK, TPM1, TPM2, DES, CALM1, CALM2, SMTN, ITGA8\n\nrow12: FABP2, SI, SLC5A1, SLC2A5, VIL1, ALPI, CDX2, MUC2, LYZ, OLFM4, GUCA2A, GUCA2B, REG4, AQP8, CCKBR"
## [1] "Provide key marker genes for the following human cell types, with 15 key marker genes per cell type. Provide only the abbreviated gene names of key marker genes, full names are not required:\nrow 1 : Cytotoxic T cells \nrow 2 :  T cells\nrow 3 : Macrophages\nrow 4 : Plasma cells\nrow 5 :  B cells\nrow 6 : Fibroblasts \nrow 7 : Plasma cells\nrow 8 :  Mucous neck cells\nrow 9 : Fibroblasts \nrow 10 : Endothelial cells \nrow 11 : Smooth muscle cells \nrow 12 :  Hepatocytes \nThe format of the final response should be:\n\row1: gene1, gene2, gene3\nrow2: gene1, gene2, gene3\nrowN: gene1, gene2, gene3\n\n...where rowN represents the row number and gene1, gene2, gene3 represent key marker genes.Do genes that start with \"CD\" have alternative names? If they do, please use the aliases. For example, CD105 should be displayed only as ENG, not as CD105."
## [1] "row1: GZMB, PRF1, CD8A, CD8B, IFNG, TBX21, EOMES, CCL5, CXCL9, CXCL10, NKG7, HAVCR2, KLRC1, KLRD1, KLRF1  \nrow2: CD3D, CD3E, CD3G, CD4, CD5, CD2, CD28, CTLA4, ICOS, GZMK, CCR7, LCK, ZAP70, FOXP3, IL2RA  \nrow3: S100A8, S100A9, ITGAM, FCGR3B, CEACAM8, CXCR1, CXCR2, MPO, LTF, ELANE, PRTN3, AZU1, CTSG, NCF2, FPR1  \nrow4: MZB1, CD38, SDC1, XBP1, PRDM1, JCHAIN, IRF4, CD27, SLAMF7, FCRL5, TNFRSF17, CXCR4, IGHG1, IGKC, IGLC2  \nrow5: CD19, MS4A1, CD79A, CD79B, CD9, CD24, CD37, CD40, BANK1, MS4A1, PAX5, BLK, HLA-DRA, HLA-DRB1, IGHD  \nrow6: COL1A1, COL1A2, COL3A1, FAP, FN1, PDGFRB, S100A4, ACTA2, VIM, DDR2, SPARC, FBN1, MMP1, MMP2, MMP3  \nrow7: MZB1, CD38, SDC1, XBP1, PRDM1, JCHAIN, IRF4, CD27, SLAMF7, FCRL5, TNFRSF17, CXCR4, IGHG1, IGKC, IGLC2  \nrow8: TFF1, TFF2, MUC5AC, MUC1, GAST, PGA5, ATP4A, ATP4B, GHRL, SST, LGR5, CDX2, PGC, KRT20, GKN1  \nrow9: COL1A1, COL1A2, COL3A1, FAP, FN1, PDGFRB, S100A4, ACTA2, VIM, DDR2, SPARC, FBN1, MMP1, MMP2, MMP3  \nrow10: PECAM1, VWF, CDH5, KDR, TIE1, ENG, CLDN5, TEK, FLT1, FLT4, CLEC14A, ROBO4, CD34, SELE, SELP  \nrow11: ACTA2, MYH11, CNN1, TAGLN, CALD1, LMOD1, MYLK, TPM1, TPM2, DES, CALM1, CALM2, SMTN, ITGA8  \nrow12: FABP2, SI, SLC5A1, SLC2A5, VIL1, ALPI, CDX2, MUC2, LYZ, OLFM4, GUCA2A, GUCA2B, REG4, AQP8, CCKBR  "
## [1] "Provide key marker genes for the following human cell types, with 15 key marker genes per cell type. Provide only the abbreviated gene names of key marker genes, full names are not required:\nrow 1 : Cytotoxic T cells\nrow 2 : Naive T cells\nrow 3 : Neutrophils\nrow 4 : Plasmablasts\nrow 5 : Memory B cells\nrow 6 : Fibroblasts\nrow 7 : Plasma cells\nrow 8 : Gastric gland mucous cells\nrow 9 : Fibroblasts\nrow 10 : Endothelial cells\nrow 11 : Smooth muscle cells\nrow 12 : Intestinal epithelial cells\nThe format of the final response should be:\n\row1: gene1, gene2, gene3\nrow2: gene1, gene2, gene3\nrowN: gene1, gene2, gene3\n\n...where rowN represents the row number and gene1, gene2, gene3 represent key marker genes.Do genes that start with \"CD\" have alternative names? If they do, please use the aliases. For example, CD105 should be displayed only as ENG, not as CD105."
## [1] "row1: GZMB, PRF1, CD8A, CD8B, IFNG, TBX21, EOMES, CCL5, CXCL9, CXCL10, NKG7, HAVCR2, KLRC1, KLRD1, KLRF1  \nrow2: CD3D, CD3E, CD3G, CD4, CD5, CD2, CD28, CTLA4, ICOS, GZMK, CCR7, LCK, ZAP70, FOXP3, IL2RA  \nrow3: S100A8, S100A9, ITGAM, FCGR3B, CEACAM8, CXCR1, CXCR2, MPO, LTF, ELANE, PRTN3, AZU1, CTSG, NCF2, FPR1  \nrow4: MZB1, CD38, SDC1, XBP1, PRDM1, JCHAIN, IRF4, CD27, SLAMF7, FCRL5, TNFRSF17, CXCR4, IGHG1, IGKC, IGLC2  \nrow5: CD19, MS4A1, CD79A, CD79B, CD9, CD24, CD37, CD40, BANK1, MS4A1, PAX5, BLK, HLA-DRA, HLA-DRB1, IGHD  \nrow6: COL1A1, COL1A2, COL3A1, FAP, FN1, PDGFRB, S100A4, ACTA2, VIM, DDR2, SPARC, FBN1, MMP1, MMP2, MMP3  \nrow7: MZB1, CD38, SDC1, XBP1, PRDM1, JCHAIN, IRF4, CD27, SLAMF7, FCRL5, TNFRSF17, CXCR4, IGHG1, IGKC, IGLC2  \nrow8: TFF1, TFF2, MUC5AC, MUC1, GAST, PGA5, ATP4A, ATP4B, GHRL, SST, LGR5, CDX2, PGC, KRT20, GKN1  \nrow9: COL1A1, COL1A2, COL3A1, FAP, FN1, PDGFRB, S100A4, ACTA2, VIM, DDR2, SPARC, FBN1, MMP1, MMP2, MMP3  \nrow10: PECAM1, VWF, CDH5, KDR, TIE1, ENG, CLDN5, TEK, FLT1, FLT4, CLEC14A, ROBO4, CD34, SELE, SELP  \nrow11: ACTA2, MYH11, CNN1, TAGLN, CALD1, LMOD1, MYLK, TPM1, TPM2, DES, CALM1, CALM2, SMTN, ITGA8  \nrow12: FABP2, SI, SLC5A1, SLC2A5, VIL1, ALPI, CDX2, MUC2, LYZ, OLFM4, GUCA2A, GUCA2B, REG4, AQP8, CCKBR  "
## [1] "Provide key marker genes for the following human cell types, with 15 key marker genes per cell type. Provide only the abbreviated gene names of key marker genes, full names are not required:\nrow 1 : Cytotoxic T cell\nrow 2 : T cell\nrow 3 : Myeloid cell\nrow 4 : Plasma cell\nrow 5 : B cell\nrow 6 : Fibroblast\nrow 7 : Plasma cell\nrow 8 : Gastric epithelial cell\nrow 9 : Fibroblast\nrow 10 : Endothelial cell\nrow 11 : Smooth muscle cell\nrow 12 : Enterocyte\nThe format of the final response should be:\n\row1: gene1, gene2, gene3\nrow2: gene1, gene2, gene3\nrowN: gene1, gene2, gene3\n\n...where rowN represents the row number and gene1, gene2, gene3 represent key marker genes.Do genes that start with \"CD\" have alternative names? If they do, please use the aliases. For example, CD105 should be displayed only as ENG, not as CD105."
## [1] "row1: GZMB, PRF1, CD8A, CD8B, IFNG, TBX21, EOMES, CCL5, CXCL9, CXCL10, NKG7, HAVCR2, KLRC1, KLRD1, KLRF1\nrow2: CD3D, CD3E, CD3G, CD4, CD5, CD2, CD28, CTLA4, ICOS, GZMK, CCR7, LCK, ZAP70, FOXP3, IL2RA\nrow3: S100A8, S100A9, ITGAM, FCGR3B, CEACAM8, CXCR1, CXCR2, MPO, LTF, ELANE, PRTN3, AZU1, CTSG, NCF2, FPR1\nrow4: MZB1, CD38, SDC1, XBP1, PRDM1, JCHAIN, IRF4, CD27, SLAMF7, FCRL5, TNFRSF17, CXCR4, IGHG1, IGKC, IGLC2\nrow5: CD19, MS4A1, CD79A, CD79B, CD9, CD24, CD37, CD40, BANK1, MS4A1, PAX5, BLK, HLA-DRA, HLA-DRB1, IGHD\nrow6: COL1A1, COL1A2, COL3A1, FAP, FN1, PDGFRB, S100A4, ACTA2, VIM, DDR2, SPARC, FBN1, MMP1, MMP2, MMP3\nrow7: MZB1, CD38, SDC1, XBP1, PRDM1, JCHAIN, IRF4, CD27, SLAMF7, FCRL5, TNFRSF17, CXCR4, IGHG1, IGKC, IGLC2\nrow8: TFF1, TFF2, MUC5AC, MUC1, GAST, PGA5, ATP4A, ATP4B, GHRL, SST, LGR5, CDX2, PGC, KRT20, GKN1\nrow9: COL1A1, COL1A2, COL3A1, FAP, FN1, PDGFRB, S100A4, ACTA2, VIM, DDR2, SPARC, FBN1, MMP1, MMP2, MMP3\nrow10: PECAM1, VWF, CDH5, KDR, TIE1, ENG, CLDN5, TEK, FLT1, FLT4, CLEC14A, ROBO4, CD34, SELE, SELP\nrow11: ACTA2, MYH11, CNN1, TAGLN, CALD1, LMOD1, MYLK, TPM1, TPM2, DES, CALM1, CALM2, SMTN, ITGA8\nrow12: FABP2, SI, SLC5A1, SLC2A5, VIL1, ALPI, CDX2, MUC2, LYZ, OLFM4, GUCA2A, GUCA2B, REG4, AQP8, CCKBR"
## [1] "Provide key marker genes for the following human cell types, with 15 key marker genes per cell type. Provide only the abbreviated gene names of key marker genes, full names are not required:\nrow 1 : CD8+ T cells\nrow 2 : CD4+ T cells\nrow 3 : Monocytes\nrow 4 : Plasma cells\nrow 5 : Naive B cells\nrow 6 : Fibroblasts\nrow 7 : Plasma cells\nrow 8 : Epithelial cells\nrow 9 : Fibroblasts\nrow 10 : Endothelial cells\nrow 11 : Smooth muscle cells\nrow 12 : Goblet cells\nThe format of the final response should be:\n\row1: gene1, gene2, gene3\nrow2: gene1, gene2, gene3\nrowN: gene1, gene2, gene3\n\n...where rowN represents the row number and gene1, gene2, gene3 represent key marker genes.Do genes that start with \"CD\" have alternative names? If they do, please use the aliases. For example, CD105 should be displayed only as ENG, not as CD105."
## [1] "row1: GZMB, PRF1, CD8A, CD8B, IFNG, TBX21, EOMES, CCL5, CXCL9, CXCL10, NKG7, HAVCR2, KLRC1, KLRD1, KLRF1  \nrow2: CD3D, CD3E, CD3G, CD4, CD5, CD2, CD28, CTLA4, ICOS, GZMK, CCR7, LCK, ZAP70, FOXP3, IL2RA  \nrow3: S100A8, S100A9, ITGAM, FCGR3B, CEACAM8, CXCR1, CXCR2, MPO, LTF, ELANE, PRTN3, AZU1, CTSG, NCF2, FPR1  \nrow4: MZB1, CD38, SDC1, XBP1, PRDM1, JCHAIN, IRF4, CD27, SLAMF7, FCRL5, TNFRSF17, CXCR4, IGHG1, IGKC, IGLC2  \nrow5: CD19, MS4A1, CD79A, CD79B, CD9, CD24, CD37, CD40, BANK1, MS4A1, PAX5, BLK, HLA-DRA, HLA-DRB1, IGHD  \nrow6: COL1A1, COL1A2, COL3A1, FAP, FN1, PDGFRB, S100A4, ACTA2, VIM, DDR2, SPARC, FBN1, MMP1, MMP2, MMP3  \nrow7: MZB1, CD38, SDC1, XBP1, PRDM1, JCHAIN, IRF4, CD27, SLAMF7, FCRL5, TNFRSF17, CXCR4, IGHG1, IGKC, IGLC2  \nrow8: TFF1, TFF2, MUC5AC, MUC1, GAST, PGA5, ATP4A, ATP4B, GHRL, SST, LGR5, CDX2, PGC, KRT20, GKN1  \nrow9: COL1A1, COL1A2, COL3A1, FAP, FN1, PDGFRB, S100A4, ACTA2, VIM, DDR2, SPARC, FBN1, MMP1, MMP2, MMP3  \nrow10: PECAM1, VWF, CDH5, KDR, TIE1, ENG, CLDN5, TEK, FLT1, FLT4, CLEC14A, ROBO4, CD34, SELE, SELP  \nrow11: ACTA2, MYH11, CNN1, TAGLN, CALD1, LMOD1, MYLK, TPM1, TPM2, DES, CALM1, CALM2, SMTN, ITGA8  \nrow12: FABP2, SI, SLC5A1, SLC2A5, VIL1, ALPI, CDX2, MUC2, LYZ, OLFM4, GUCA2A, GUCA2B, REG4, AQP8, CCKBR"

Users can also access these results by LLM name (ERNIE, Gemini, GPT, Llama, Claude). For example, to view the ERNIE validation result:

head(feedback_Validate$ERNIE[,c(1,2,4,5)])
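
The per-model access shown above can be extended to all models at once. The following is a minimal sketch, assuming the validation result is a named list holding one data frame per model; `toy_validate` and its column names are hypothetical stand-ins, not the actual LICT output.

```r
# Toy stand-in for feedback_Validate: a named list of data frames, one per LLM.
# The column names are illustrative assumptions, not the real LICT columns.
toy_validate <- list(
  ERNIE = data.frame(row = 1:2, cell_type = c("T cells", "B cells"),
                     positive_marker = c("CD3D,CD3E", "MS4A1"),
                     stringsAsFactors = FALSE),
  GPT = data.frame(row = 1:2, cell_type = c("T cells", "Plasma cells"),
                   positive_marker = c("CD3D", "MZB1,JCHAIN"),
                   stringsAsFactors = FALSE)
)

# Access one model's table by name, as in feedback_Validate$ERNIE.
head(toy_validate$ERNIE[, c(1, 2)])

# Collect the annotated cell type from every model side by side
# (one column per model, one row per cluster).
cell_types_by_model <- sapply(toy_validate, function(df) df$cell_type)
cell_types_by_model
```

Placing the models in adjacent columns makes it easy to see at a glance where they disagree on a cluster.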

To fully leverage both sets of results, users can use the following functions to obtain the optimal result:

Feedback_Reliable_Df = Reliable_Df(feedback_Validate)
Feedback_Reliable_Df$Clusters = seq(0,nrow(Feedback_Reliable_Df)-1)
Feedback_Reliable_Df = Feedback_Reliable_Df[,c(7,1:6)]
intersect_res = intersect_result(Reliable_Df,Feedback_Reliable_Df)

## [1] "analysis..."
## [1] "analysis..."
## [1] "analysis..."
## [1] "analysis..."
## [1] "analysis..."
## [1] "analysis..."
## [1] "analysis..."
## [1] "analysis..."
## [1] "analysis..."
## [1] "analysis..."
## [1] "analysis..."
## [1] "analysis..."

intersect_res

Evaluation of manual cell annotation

Users can also create a list of manual cell type annotations in the following format and pass it to Validate() for evaluation. The row number corresponds to the factor level of the annotation.

# Type is the manual annotation column in the metadata
Idents(seurat_obj) = factor(seurat_obj@meta.data$Type)
manual_annotation = list(manual = data.frame(row = seq_along(levels(Idents(seurat_obj))), cell_type = levels(Idents(seurat_obj))))
Validate_manual = Validate(LLM_res = manual_annotation, seurat_obj = seurat_obj, Percent = 0.8, species = 'human')

## [1] "list"
## [1] "Provide key marker genes for the following human cell types, with 15 key marker genes per cell type. Provide only the abbreviated gene names of key marker genes, full names are not required:\nrow 1 : B\nrow 2 : CD4+ T\nrow 3 : CD8+ T\nrow 4 : Endothelial\nrow 5 : Epithelial\nrow 6 : Fibroblast\nrow 7 : Glial\nrow 8 : Innate lymphoid\nrow 9 : Mast\nrow 10 : Mural\nrow 11 : Myeloid\nrow 12 : Plasma\nThe format of the final response should be:\n\row1: gene1, gene2, gene3\nrow2: gene1, gene2, gene3\nrowN: gene1, gene2, gene3\n\n...where rowN represents the row number and gene1, gene2, gene3 represent key marker genes.Do genes that start with \"CD\" have alternative names? If they do, please use the aliases. For example, CD105 should be displayed only as ENG, not as CD105."
## [1] "row1: CD19, CD20, CD79A, MS4A1, PAX5, BLNK, CD79B, BANK1, CD22, CD24, IGHD, IGHM, IGKC, CD38, CD27  \nrow2: CD4, GATA3, IL7R, CCR7, SELL, IL2RA, FOXP3, CTLA4, CXCR4, CD40LG, ICOS, SATB1, IL4, IL5, IL13  \nrow3: CD8A, CD8B, PRF1, GZMB, IFNG, TBX21, EOMES, CCL5, CXCR3, GZMH, 2B4 (CD244), CD69, TIGIT, KLRG1, CTLA4  \nrow4: PECAM1, VWF, CDH5, ENG, FLT1, KDR, TEK, CD34, VCAM1, CLDN5, ESAM, ERG, SELE, ICAM1, CD36  \nrow5: CDH1, KRT18, KRT8, KRT19, OCLN, CDH2, MUC1, TJP1, TJP3, CLDN1, EZR, LGALS3, CD24, SNAI1, SNAI2  \nrow6: VIM, FSP1, FAP, PDGFRB, THY1, COL1A1, COL1A2, ACTA2, MMP2, MMP9, S100A4, CD90, PDPN, ENG, ANTXR2  \nrow7: GFAP, S100B, MBP, PLP1, AQP4, MOG, CNP, OLIG2, OLIG1, MAG, SOX10, GALC, CSPG4, CD44, NCAN  \nrow8: RORC, T-BET, GATA3, EOMES, AHR, ID2, IL7R, IL18R1, IL1RL1, IL23R, CCR6, NKX2-3, KLRG1, CCR5, IL22  \nrow9: KIT, FCER1A, CPA3, MS4A2, HDC, IGHE, FcεRI, CD117, TPSAB1, TPSB2, CMA1, GATA2, IL6, IL33, TNF  \nrow10: ACTA2, TAGLN, CNN1, MYH11, DES, CALD1, PDGFRB, NG2 (CSPG4), SM22α, SMMHC, RGS5, MCAM, AQP1, LUM, ELN  \nrow11: CD14, CD11B (ITGAM), CD68, LYZ, CSF1R, CD16 (FCGR3A), CD64 (FCGR1A), CD32 (FCGR2A), CCR2, CX3CR1, ITGAX, MPO, FPR1, S100A8, S100A9  \nrow12: CD38, CD138 (SDC1), MZB1, XBP1, BLIMP1 (PRDM1), IRF4, CD27, CD319 (SLAMF7), JCHAIN, IGHG1, IGKC, IGLC2, CD19, BCL6, TCL1A"
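
The link between row number and factor level can be checked without a Seurat object. This is a small sketch under the assumption that the manual labels are a plain character or factor vector; `labels` is a hypothetical stand-in for `seurat_obj@meta.data$Type`.

```r
# Hypothetical per-cell manual labels, standing in for seurat_obj@meta.data$Type.
labels <- factor(c("B", "Plasma", "B", "Mast", "Plasma"))

# One row per factor level; row i corresponds to level i.
manual_annotation <- list(
  manual = data.frame(
    row = seq_along(levels(labels)),
    cell_type = levels(labels),
    stringsAsFactors = FALSE
  )
)
manual_annotation$manual
```

Because `factor()` sorts its levels alphabetically by default, the rows follow alphabetical order of the labels, which matches the row order in the prompt printed above.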

Calculating the reliability of manual annotation results

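# Column 3 holds a comma-separated gene string for each annotation
df = Validate_manual$manual[,3]
for(i in seq_along(df)){
  n = unlist(strsplit(df[i],','))
  # Mark the annotation as reliable when it lists at least 4 genes
  if(length(n)>=4){
    Validate_manual$manual$reliable[i] = 'YES'
  }else{
    Validate_manual$manual$reliable[i] = 'NO'
  }
}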
Validate_manual$manual
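
The same threshold rule can also be written without a loop. The sketch below uses a toy data frame in place of `Validate_manual$manual`; the column names and gene strings are hypothetical, and only the rule itself (at least 4 comma-separated genes in column 3) is taken from the code above.

```r
# Toy stand-in for Validate_manual$manual; column 3 holds comma-separated
# marker genes (hypothetical values and column names).
toy_manual <- data.frame(
  row = 1:3,
  cell_type = c("B", "Mast", "Glial"),
  positive_marker = c("CD79A,MS4A1,CD79B,BANK1,CD19", "KIT,CPA3", ""),
  stringsAsFactors = FALSE
)

# Count the comma-separated genes per row; an empty string counts as zero.
n_markers <- lengths(strsplit(toy_manual[, 3], ",", fixed = TRUE))

# Same threshold as the loop: at least 4 genes means reliable.
toy_manual$reliable <- ifelse(n_markers >= 4, "YES", "NO")
toy_manual
```

Keeping the gene count in its own vector also makes it straightforward to try a different cutoff than 4.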

LICT: Large language model-based Identifier for Cell Types

Wenjin Ye\(^{1,*}\), Yuanchen Ma\(^{2,*}\)

\(^1\)Center for Stem Cell Biology and Tissue Engineering, Key Laboratory for Stem Cells and Tissue Engineering, Ministry of Education, Sun Yat-Sen University

\(^2\)Department of Gastrointestinal Surgery, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, Guangdong

\(^*\) corresponding authors

Email: yewj27@mail2.sysu.edu.cn
