PIC: Protein Importance Calculator

Welcome to PIC!

Introduction

Human essential proteins (HEPs) are indispensable for individual viability and development. However, experimental methods to identify HEPs are often costly, time-consuming, and labor-intensive. In addition, existing computational methods predict HEPs only at the cell line level, but HEPs vary across living human, cell line, and animal models.

Here we develop a sequence-based deep learning model, Protein Importance Calculator (PIC), by fine-tuning a pretrained protein language model. PIC not only substantially outperforms existing methods for predicting HEPs but also provides comprehensive prediction results across three levels: human, cell line, and mouse.

History

November, 2024: PIC was published online in Nature Computational Science
March, 2024: PIC web server was released
November, 2023: 325 PIC models were trained
June, 2023: PIC algorithm was built
February, 2023: PIC original idea was proposed

Contact us
Dr. Qinghua Cui
38 Xueyuan Rd, Department of Biomedical Informatics,
Peking University Health Science Center,
Beijing 100191, China
Email: cuiqinghua@bjmu.edu.cn

Citation: Kang, B., Fan, R., Cui, C. et al. Comprehensive prediction and analysis of human protein essentiality based on a pretrained large language model. Nat Comput Sci (2024). https://doi.org/10.1038/s43588-024-00733-1

Please input your protein sequences in FASTA format

The general prediction results of essential scores for all human proteins in the Swiss-Prot database

all_general_prediction_results.csv

The specific prediction results of essential scores for all human proteins in each cell line, based on the Swiss-Prot database

all_specific_prediction_results.csv

The specific information of 323 human cell lines

all_cell_line_info.csv

Detailed PES scores of all microproteins predicted by PIC models at different levels

all_microproteins_predicted_PES.csv

The source code of the PIC algorithm is available both here and on GitHub

PIC source code

GitHub: https://github.com/KangBoming/PIC

The pretrained ESM-2 model

esm2_t33_650M_UR50D.pt

esm2_t33_650M_UR50D-contact-regression.pt

Introduction

Essential proteins are the expression products of essential genes and perform diverse biological functions closely associated with organismal growth and development. However, prevailing computational methods predict essential proteins at only a single fixed level and therefore lack a comprehensive approach to prediction and analysis. Here, we propose Protein Importance Calculator (PIC), a sequence-based model designed for multi-level essential protein prediction. The overall workflow of our paper is shown below:

Visit PIC

PIC is available at http://www.cuilab.cn/pic.

Predict

1. Input the protein sequences of interest in FASTA format.
2. Click the 'example' button to load example sequences.
3. Click 'Run' to submit your protein sequences.
4. Wait a few minutes for the prediction result page to appear.
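
Before submitting, it can help to check that the input really is well-formed FASTA. The following is a minimal sketch in Python, not part of PIC itself: the function names `parse_fasta` and `nonstandard_residues`, the demo sequences, and the choice to flag anything outside the 20 standard amino acids are all assumptions made for illustration. The server may accept or reject other symbols.

```python
# Assumption: only the 20 standard amino acids are treated as "standard" here;
# the PIC server's own validation rules are not documented on this page.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before the first '>' header")
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def nonstandard_residues(sequence):
    """Return the sorted symbols in `sequence` that are not standard amino acids."""
    return sorted(set(sequence) - STANDARD_AA)


# Hypothetical input: two made-up sequences, the second containing an 'X'.
example = """>demo_protein_1
MKTAYIAKQR
QISFVKSHFS
>demo_protein_2
MXVLSPADKT
"""

for name, seq in parse_fasta(example):
    print(name, len(seq), nonstandard_residues(seq))
```

Multi-line sequences are joined into one string per record, so the first demo record is reported with its full length rather than the length of a single line.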

Result

The prediction results will be shown in a few minutes. The general prediction results are presented as a table with the columns Protein ID, Human-level essential score, Mouse-level essential score, Cell-level essential score and Average essential score. Please refer to the Metric interpretation section for the meaning of each column. Both the general and the specific prediction results can be downloaded from the result page. Result file 1 contains the overall prediction results for the input protein sequences at the human, mouse and cell levels. Result file 2 contains the specific prediction results for the input protein sequences in each of the 322 human cell lines.

Download

If you want to use the multi-level essential scores of all known human protein-coding genes for large-scale data analysis, we provide precomputed essential scores for approximately 20000 known proteins at multiple levels as CSV files. We also provide detailed information on the 322 human cell lines, including their tissue source and disease source. Finally, we have open-sourced the PIC code on GitHub, so that anyone can access it.

Metric interpretation

In our paper, we propose that the essentiality of human proteins can be evaluated at three levels: (i) human-level: essential proteins are defined as proteins that are rarely disrupted or truncated in the general population, based on large-scale human genome sequencing projects. (ii) mouse-level: essential proteins are defined as proteins whose knockout causes embryonic death in mice. (iii) cell-level: essential proteins are defined as proteins whose knockout causes a decrease in cell viability.

Therefore, a total of 324 basic PIC models were trained across three levels: one human-level model (PIC-human), one mouse-level model (PIC-mouse), and 322 cell-level models, one per cell line. Each cell-level model was treated as a base classifier, and a soft voting strategy from ensemble learning was employed: the probability values output by the PIC models for the 322 cell lines were averaged. The resulting high-performance ensemble classifier is referred to as PIC-cell.
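
The soft voting step can be illustrated with a short Python sketch. It is only an illustration of averaging per-model probabilities, not the PIC implementation: the function name `soft_vote`, the four made-up probabilities (standing in for the outputs of the cell-level models), and the 0.5 decision threshold are all assumptions that do not come from this page.

```python
from statistics import mean


def soft_vote(probabilities, threshold=0.5):
    """Average per-model probabilities and compare the mean with a threshold.

    The threshold of 0.5 is an assumption for illustration only.
    """
    if not probabilities:
        raise ValueError("need at least one probability")
    score = mean(probabilities)
    return score, score >= threshold


# Hypothetical outputs of four cell-level models for a single protein.
cell_line_probs = [0.875, 0.75, 0.25, 0.625]
score, is_essential = soft_vote(cell_line_probs)
print(score, is_essential)
```

Unlike hard voting, which would count how many models individually exceed the threshold, soft voting keeps each model's confidence, so one strongly confident model can outweigh a weakly dissenting one.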

Here, we define the probability values output by the PIC-human, PIC-mouse and PIC-cell models as the essential scores of a human protein at the human, mouse and cell levels, respectively. To create a comprehensive metric of protein essentiality, we take the mean of the three level-specific essential scores as the average essential score of a protein, which evaluates its essentiality across all three levels: human, mouse and cell.
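
As a worked example of the average essential score, the sketch below takes the arithmetic mean of three level-specific scores. The function name `average_essential_score`, the range check, and the example values are assumptions made for illustration and are not taken from real PIC output.

```python
def average_essential_score(human, mouse, cell):
    """Mean of the three level-specific essential scores, each a probability."""
    for name, value in (("human", human), ("mouse", mouse), ("cell", cell)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be a probability in [0, 1]")
    return (human + mouse + cell) / 3


# Hypothetical scores for one protein at the three levels.
scores = {"human": 0.75, "mouse": 0.5, "cell": 0.25}
avg = average_essential_score(**scores)
print(avg)
```

Because each level contributes equally, a protein that scores high at only one level ends up with a moderate average, which is why the level-specific scores remain worth inspecting alongside it.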