PROTREC is an algorithm for predicting and validating missing proteins in proteomics data, based on Kong W, Wong B J H, Gao H, et al. PROTREC: A probability-based approach for recovering missing proteins based on biological networks. Journal of Proteomics, 2022, 250: 104392. For comparison, we also include three other commonly used network-based methods (the same ones used in the paper).
PROTREC is a novel probability-based scoring scheme that estimates the probability of a protein being present in a screen.
Specifically, to calculate the PROTREC probability for a protein \(x\), we first find all the protein complexes \(z_i\) containing protein \(x\). Not every complex is considered, because some complexes contain too few proteins. The default complex size threshold is 5; you may change it via `Min size of protein complex' in the parameter settings under the `Run your own datasets' section.
Then, we calculate the probability of a complex \(z_i\in z\) being present, where \(z\) is the set of complexes containing \(x\):\[p(z_i)=\frac{\sum\limits_{x_i\in z_i\cap L}(1-FDR)}{|z_i|}\]Here \(x_i\) denotes a protein inside the complex \(z_i\), and \(L\) denotes the set of proteins reported by the proteomic screen. If \(x_i\) is reported in \(L\), then its prior probability is \((1-FDR)\), where \(FDR\) is the false discovery rate of \(L\); members of \(z_i\) that are not in \(L\) add nothing to the sum. The default \(FDR\) is 0.01; you may change it via `FDR of proteomic screen' in the parameter settings under the `Run your own datasets' section.
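The complex probability can be illustrated with a minimal Python sketch. The function name `complex_probability` and the toy protein identifiers are hypothetical and are not part of the PROTREC tool itself; the sketch only assumes that each reported member contributes \((1-FDR)\) and each unreported member contributes 0.

```python
def complex_probability(complex_members, reported, fdr=0.01):
    """Return p(z_i): summed prior (1 - FDR) of reported members over |z_i|."""
    members = set(complex_members)
    if not members:
        raise ValueError("complex must contain at least one protein")
    # Only members that appear in the reported set L contribute to the sum.
    observed = members & set(reported)
    return len(observed) * (1.0 - fdr) / len(members)


# Hypothetical toy data: a 5-protein complex with 4 members reported.
z_i = ["P1", "P2", "P3", "P4", "P5"]
L = {"P1", "P2", "P3", "P4", "Q9"}
p_z = complex_probability(z_i, L)  # 4 * 0.99 / 5
```

With the default \(FDR\) of 0.01, four reported members out of five give \(p(z_i)=4\times0.99/5=0.792\).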
PROTREC assumes that the probability of protein \(x\) being present in a sample combines two cases: the probability that it is present when its complex is formed, and the probability that it is present when its complex is not formed. Since \(x\) may belong to multiple protein complexes, PROTREC computes this probability using each of the complexes that \(x\) is a member of and returns the maximum:\[p(x)=\max_{z_i\in z}\{p(x|z_i)p(z_i)+p(x|\overline{z_i})p(\overline{z_i})\}\]where \(p(\overline{z_i})=1-p(z_i)\). In this way, we can calculate the PROTREC score of every protein. We then sort the proteins by score and report unreported proteins whose score is above a given PROTREC score threshold as predicted missing proteins. By default, the cutoff is 0.95. You may set your own cutoff via `PROTREC score threshold' in the parameter settings under the `Run your own datasets' section.
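The scoring step can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's actual implementation: the function name `protrec_score` is hypothetical, the conditional probabilities \(p(x|z_i)\) and \(p(x|\overline{z_i})\) are passed in as arguments because their values are not given here, and the size filter is assumed to keep complexes with at least `min_size` members.

```python
def protrec_score(protein, complexes, reported, p_x_given_z, p_x_given_not_z,
                  fdr=0.01, min_size=5):
    """Return max over qualifying complexes of p(x|z)p(z) + p(x|not z)(1 - p(z)).

    Returns None when the protein belongs to no complex that passes the
    size filter.
    """
    reported = set(reported)
    best = None
    for members in complexes:
        members = set(members)
        # Assumption: complexes smaller than min_size are skipped.
        if protein not in members or len(members) < min_size:
            continue
        p_z = len(members & reported) * (1.0 - fdr) / len(members)
        score = p_x_given_z * p_z + p_x_given_not_z * (1.0 - p_z)
        best = score if best is None else max(best, score)
    return best


# Hypothetical toy data; the conditional probabilities 1.0 and 0.0 are
# placeholder values chosen only to make the arithmetic easy to follow.
complexes = [
    ["P1", "P2", "P3", "P4", "P5"],
    ["P5", "P6", "P7", "P8", "P9", "P10"],
    ["P5", "P6"],  # too small, ignored by the size filter
]
reported = {"P1", "P2", "P3", "P4", "P6"}
score_p5 = protrec_score("P5", complexes, reported, 1.0, 0.0)
```

Here the first complex gives \(0.792\) and the second gives \(0.165\), so the score of `P5` is the maximum, \(0.792\). Under the default cutoff of 0.95 this protein would not be predicted as a missing protein.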
Functional Class Scoring (FCS) tests whether a network is significantly enriched given the observed proteins. Given a set of observed proteins \(S\) from a proteomics screen and the list of component proteins \(M\) of a protein complex \(C\), the observed overlap \(O\) is expressed as:\[O=\frac{|S\cap M|}{|M|}\]To determine whether the overlap \(O\) is significant, \(N\) randomized complexes of size \(|M|\) are generated using a reference pool of unique proteins drawn from the complexes. By default \(N\) is 1000; you may change it via `Number of iterations' in the parameter settings under the `Run your own datasets' section. From the randomized complexes, a vector of null overlaps is generated. For the \(j^{th}\) randomized complex, which comprises the set of proteins \(K_j\), the null overlap \(N_j\) is defined as follows:\[N_j=\frac{|S\cap K_j|}{|K_j|}\]The empirical p-value is the proportion of null overlaps \(N_j\) greater than or equal to the observed overlap \(O\). For the \(i^{th}\) complex \(C_i\) in the complex vector, its p-value \(pval_i\) is:\[pval_i=\frac{\sum\limits_{j=1}^{N}[N_j\ge O]}{N}\]where \([\cdot]\) equals 1 if the condition holds and 0 otherwise. If the FCS p-value falls below the significance threshold, then all member proteins of the complex, including the unobserved ones, are predicted as present.
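The permutation procedure can be sketched in Python as below. The function name `fcs_pvalue`, the `seed` argument and the toy data are hypothetical; the sketch also assumes that each randomized complex has the same size as the tested complex and is sampled without replacement from the reference pool.

```python
import random


def fcs_pvalue(observed, complex_members, reference_pool, n_iter=1000, seed=0):
    """Empirical FCS p-value for one complex (illustrative sketch)."""
    observed = set(observed)
    members = set(complex_members)
    pool = sorted(set(reference_pool))  # sorted so a fixed seed is reproducible
    rng = random.Random(seed)

    # Observed overlap O = |S ∩ M| / |M|
    overlap = len(observed & members) / len(members)

    hits = 0
    for _ in range(n_iter):
        # Assumption: each randomized complex K_j has the same size as M.
        k_j = rng.sample(pool, len(members))
        null_overlap = len(observed.intersection(k_j)) / len(k_j)
        if null_overlap >= overlap:
            hits += 1
    return hits / n_iter


# Hypothetical toy data: 100 pool proteins, of which the first 10 are observed.
pool = [f"P{i}" for i in range(100)]
S = set(pool[:10])
p_enriched = fcs_pvalue(S, pool[:5], pool)    # every member is observed
p_absent = fcs_pvalue(S, pool[50:55], pool)   # no member is observed
```

A complex whose members are all observed should receive a very small p-value, while a complex with no observed members has \(O=0\), so every null overlap satisfies \(N_j\ge O\) and the p-value is 1.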
In hypergeometric enrichment (HE), the set of observed proteins is compared against a vector of protein complexes. Given a total number of proteins \(N\), with \(M\) of these belonging to a complex and \(n\) of these proteins in the differential set, the probability \(P\) that \(b\) or more proteins from the differential set are associated by chance with the complex is given by:\[P(X\ge b)=\sum\limits_{i=b}^{\min(n,M)}\frac{C_n^iC_{N-n}^{M-i}}{C_N^M}\]\(P(X\ge b)\) is the HE p-value. A complex is declared significant if the HE p-value falls below the threshold.
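The hypergeometric tail sum can be evaluated directly with exact binomial coefficients. The sketch below is a minimal illustration using Python's standard `math.comb`; the function name `he_pvalue` and the numbers in the example are hypothetical.

```python
from math import comb


def he_pvalue(N, M, n, b):
    """P(X >= b) for N proteins in total, M in the complex, n in the set."""
    upper = min(n, M)
    total = comb(N, M)
    # Sum the hypergeometric probabilities for i = b, ..., min(n, M).
    tail = sum(comb(n, i) * comb(N - n, M - i) for i in range(b, upper + 1))
    return tail / total


# Hypothetical example: 10 proteins, a 3-protein complex, 4 proteins in the set.
p_all_three = he_pvalue(N=10, M=3, n=4, b=3)  # C(4,3) * C(6,0) / C(10,3)
p_any = he_pvalue(N=10, M=3, n=4, b=0)        # sums to 1 over the full range
```

For these numbers the chance that all three complex members fall in the set is \(4/120\approx0.033\), which would fall below a 0.05 threshold.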
Gene Set Enrichment Analysis (GSEA) uses a Kolmogorov-Smirnov (KS) statistic. Here, the two-sample KS test is used to evaluate if the distribution of ranks based on the t-statistic of proteins in a complex differs from that of proteins outside the complex. Denoting proteins in the complex as the set \(C\) and proteins outside the complex as the set \(C'\), the KS-statistic \(KS_{C,C'}\) is expressed as:\[KS_{C,C'}=\max\limits_x|F_{1,C}(x)-F_{2,C'}(x)|\]where \(F_{1,C}(x)\) and \(F_{2,C'}(x)\) are respectively the fraction of proteins in \(C\) and \(C'\) whose rank is higher than the rank \(x\). The null hypothesis is rejected at a significance threshold if\[KS_{C,C'}\ge c(\alpha)\sqrt{\frac{|C|+|C'|}{|C|\cdot|C'|}}\]where \(c(\alpha)\) is the critical value at a given alpha level.
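A minimal Python sketch of this test follows. The function names and toy ranks are hypothetical, and the critical value is an assumption: it uses the common large-sample approximation \(c(\alpha)=\sqrt{-\tfrac{1}{2}\ln(\alpha/2)}\), which may differ from what the tool uses.

```python
from math import log, sqrt


def ks_statistic(ranks_in, ranks_out):
    """Largest gap between the fractions of ranks above x in the two groups."""
    points = sorted(set(ranks_in) | set(ranks_out))
    best = 0.0
    for x in points:
        f_in = sum(r > x for r in ranks_in) / len(ranks_in)
        f_out = sum(r > x for r in ranks_out) / len(ranks_out)
        best = max(best, abs(f_in - f_out))
    return best


def ks_rejects(ranks_in, ranks_out, alpha=0.05):
    """True if the KS statistic reaches the critical threshold."""
    # Assumption: large-sample approximation of the critical value c(alpha).
    c_alpha = sqrt(-log(alpha / 2.0) / 2.0)
    n_in, n_out = len(ranks_in), len(ranks_out)
    threshold = c_alpha * sqrt((n_in + n_out) / (n_in * n_out))
    return ks_statistic(ranks_in, ranks_out) >= threshold


# Hypothetical toy ranks: the complex occupies the five top-ranked positions.
in_complex = [1, 2, 3, 4, 5]
outside = list(range(6, 21))
separated = ks_rejects(in_complex, outside)
interleaved = ks_rejects([1, 3, 5, 7], [2, 4, 6, 8])
```

When the complex members occupy one end of the ranking the statistic reaches 1 and the null hypothesis is rejected; when the two groups interleave evenly the statistic stays small and it is not.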
For FCS, HE and GSEA, the default p-value threshold is 0.05. You may set your own cutoff via `p-val cutoff for FCS, HE and GSEA' in the parameter settings under the `Run your own datasets' section.
One proteomics expression dataset is provided. The renal cancer dataset (RC) comprises 12 normal (RC_N) and 12 cancer (RC_C) samples. We provide an RC_N sample as an example. Press the button to see what the sample looks like.
Protein complexes are obtained from CORUM. We use the CORUM 2018 release human complex dataset as our default complex database. You may check what the complexes look like by clicking here.
You can submit your own dataset and protein complex file to get the protein inference result. If this is your first time using the tool, you may refer to the Example section to see how it works. If no file is submitted, the program will use the default files. We encourage users to use the default protein complex database.
Dataset (Default: example dataset)
Protein Complex (Default: CORUM database 2018 release)
Click here to get your verification key after putting in your email address.
Click here to send your data to us. We will process your data and send the results to your email address.
We are a research group comprised of biodata scientists, computational biologists and education technologists in the School of Biological Sciences and Lee Kong Chian School of Medicine, Nanyang Technological University.
Our lab is focused on the development of statistical approaches for analysing and resolving platform-specific idiosyncrasies in multi-omics data; identifying and resolving confounding issues such as batch effects, technical bias and missing values in high-dimensional data; and developing robust biomarker and drug target prediction techniques using a combination of machine learning and enhanced in silico validation techniques. Our lab is also interested in Bio-education, with an emphasis on the use of new AI-based technologies, text-mining and high-impact pedagogical practices (experiential learning), to enhance the quality of biological and biotechnological education. Here is more information about our lab.