Novel Sequence Searching Algorithm Significantly Boosts the Power of AI-based Protein-Protein Complex Structure Prediction

Over the past couple of years, artificial intelligence (AI) and deep-learning techniques have revolutionised protein structure prediction, and as such have accelerated our understanding of how biological functions correlate with protein/RNA structures. In AI-based protein structure prediction, a crucial issue is how to construct informative multiple sequence alignments (MSAs), because the co-evolutionary information of genes and proteins contained in the MSAs serves as a characteristic input from which deep learning frameworks derive spatial constraints and predict 3D structures.

Specifically, during the process of evolution, when a mutation occurs at one amino acid site of a protein and disrupts the interaction between it and other amino acid residues, the protein may become unstable, making it difficult for species with this mutation to survive. However, if the amino acids with which the mutated residue spatially interacts also mutate, and if the two mutations interact well enough to stabilize the protein structure, such species can continue to survive. This phenomenon is known as protein and gene co-evolution. Since the proteins in currently existing organisms have all undergone the rigors of co-evolution over hundreds of millions of years, aligning a vast array of current protein sequences in MSAs makes it possible to deduce information about protein co-evolution and the spatial distances between amino acids. Therefore, MSAs and their co-evolutionary information are widely applied in AI and deep learning-based 3D protein structure prediction.
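
To make the idea of co-evolution concrete, the following minimal sketch measures how strongly two alignment columns vary together using mutual information. This is a deliberately simple, classical statistic chosen for illustration and is not the method used by DeepMSA2 or by deep learning predictors; the toy alignment and the function name are hypothetical.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an MSA.

    `msa` is a list of equal-length aligned sequences (illustrative input).
    High values mean the two positions tend to change together.
    """
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)

    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment (made up): position 1 (K/R) always changes together with
# position 3 (E/D), while position 2 (L/M) varies independently of position 1.
toy_msa = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
    "AKME",
    "ARMD",
]

coupled = mutual_information(toy_msa, 1, 3)      # co-varying pair
independent = mutual_information(toy_msa, 1, 2)  # unrelated pair
print(f"MI(1,3) = {coupled:.3f} bits, MI(1,2) = {independent:.3f} bits")
```

In this toy case the co-varying pair carries one full bit of shared information, while the unrelated pair carries none. A residue pair with a strong co-variation signal is a candidate for being close in 3D space, which is the kind of signal that deeper MSAs make easier to detect.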

In this study, a group of researchers led by CSI Singapore’s Professor Yang Zhang, who also has a joint appointment with the School of Computing and the School of Medicine at NUS, together with colleagues at the University of Michigan, found that deep MSAs can significantly improve AI-based protein structure prediction, especially for modelling protein-protein interaction (PPI) complexes, which is one of the most challenging problems in computational structural biology. Specifically, the group has developed a new MSA construction method, called DeepMSA2, which iteratively searches the target sequence against different genome and metagenome sequence databases to collect homologous sequences. The MSAs collected by DeepMSA2 are then used as input features for the AI models that produce the final protein structure predictions. In the most recent community-wide CASP15 experiment, the DeepMSA2-enhanced AI structure prediction method, named DMFold-Multimer, achieved a cumulative Z-score nearly three times higher than that of the standard AlphaFold2-Multimer program for protein-protein complex structure prediction. The findings were published in the prestigious journal Nature Methods on 2 January 2024.

The current state of the art for protein MSA construction, DeepMSA, was developed four years ago by the Zhang lab for iterative MSA collection for single-chain proteins. Compared to DeepMSA, the novelty of DeepMSA2 lies in three areas:

First, DeepMSA2 introduces a new deep learning model-based MSA ranking strategy for selecting the best MSA for tertiary structure prediction. Traditional methods have typically relied on inherent MSA parameters, such as the number of effective sequences (called Neff) in the MSA. However, the group found that Neff cannot correctly reflect the co-evolutionary information of an MSA, which is essential to AI-based structure prediction, and that in many cases MSAs with higher Neff result in worse structural models than MSAs with lower Neff. By evaluating the quality and relevance of the generated MSAs using deep learning models, DeepMSA2 can prioritize the alignments that are most likely to contribute to accurate structure prediction. This approach ensures that the final MSA used for structure prediction includes the most informative sequences, balancing the diversity of sequences against the reliability of the alignments.
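
For readers unfamiliar with Neff, the sketch below shows one common way of computing it: each sequence is down-weighted by the number of near-identical sequences in the alignment, and the weights are summed. The 80% identity cut-off and the absence of any length normalisation are assumptions made for illustration; the exact definition used in DeepMSA2 is not given in this article.

```python
def identity(a, b):
    """Fraction of aligned positions at which two sequences agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def neff(msa, threshold=0.8):
    """Number of effective sequences under a simple clustering weight.

    Each sequence contributes 1 / (number of sequences, itself included,
    that share at least `threshold` identity with it). The threshold is
    an assumed, commonly used value.
    """
    total = 0.0
    for s in msa:
        cluster_size = sum(1 for t in msa if identity(s, t) >= threshold)
        total += 1.0 / cluster_size
    return total


# Four copies of one sequence count as a single effective sequence ...
redundant = ["ACDEFGHIKL"] * 4
# ... whereas three unrelated sequences count as three.
diverse = ["ACDEFGHIKL", "MNPQRSTVWY", "LLLLLLLLLL"]

print(neff(redundant), neff(diverse))
```

Because this statistic only counts redundancy-corrected depth, it says nothing about whether the aligned sequences are correctly matched homologs or carry useful co-variation, which is consistent with the observation above that a higher Neff does not guarantee a better structural model.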

Second, in the MSA construction step, an important feature of DeepMSA2 is the integration of multiple MSA construction tools, such as HHblits and HMMER, into a unified pipeline. While most traditional methods use a single homologous sequence collection tool, the group found that combining multiple MSA tools leverages the strengths of each, potentially capturing a broader and more accurate representation of sequence diversity and evolutionary relationships than any single tool could achieve. By coupling the capabilities of several diverse alignment algorithms, DeepMSA2 can generate MSAs with enhanced depth and quality, which are crucial for accurate protein structure prediction.
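
As a rough illustration of what combining the output of several search tools involves, the sketch below merges alignments from two hypothetical tool runs into one non-redundant MSA. It assumes that every hit has already been aligned to the same query columns, and it does not call HHblits or HMMER or reproduce the actual DeepMSA2 pipeline; all names and sequences are made up.

```python
def merge_msas(query, *msas):
    """Merge several MSAs built for the same query into one alignment.

    Assumes each sequence is already aligned to the query (same length).
    The query is kept first and exact duplicates are dropped, preserving
    the order in which new sequences are first seen.
    """
    merged = [query]
    seen = {query}
    for msa in msas:
        for seq in msa:
            if len(seq) != len(query):
                raise ValueError("sequence is not aligned to the query")
            if seq not in seen:
                seen.add(seq)
                merged.append(seq)
    return merged


query = "ACDE"
hits_tool_a = ["ACDE", "ACDF", "A-DE"]  # hypothetical output of one tool
hits_tool_b = ["ACDF", "GCDE"]          # hypothetical output of another

combined = merge_msas(query, hits_tool_a, hits_tool_b)
print(combined)
```

The merged alignment contains the sequences found by either tool, so a homolog missed by one search can still be contributed by the other.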

Third, for PPI structure prediction, a critical step of MSA construction is the pairing of the MSAs of the individual component chains. While most existing deep learning methods (e.g., AlphaFold2-Multimer) rely on a single paired MSA, the DeepMSA2/DMFold pipeline uses multiple alignment pairing and selection strategies, built on the assumption that different sequence pairings can capture different aspects of sequence diversity and evolutionary signals, potentially offering complementary insights into protein-protein complex structure prediction.
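
The pairing problem can be illustrated with one generic strategy: linking the sequences of two chains that come from the same organism, so that each row of the paired MSA represents a plausible interacting pair. The sketch below is only an example of a single pairing rule under simplifying assumptions (hits are sorted best first and labelled with a species name); it is not the set of pairing strategies implemented in DeepMSA2, and the data are invented.

```python
def pair_by_species(msa_a, msa_b):
    """Pair two per-chain MSAs by organism.

    Each MSA is a list of (species, aligned_sequence) tuples, assumed to be
    ordered from best to worst hit. For every species present in both MSAs,
    the top hit of chain A is concatenated with the top hit of chain B.
    """
    best_b = {}
    for species, seq in msa_b:
        best_b.setdefault(species, seq)  # keep only the first (best) hit

    paired = []
    used = set()
    for species, seq in msa_a:
        if species in best_b and species not in used:
            paired.append((species, seq + best_b[species]))
            used.add(species)
    return paired


chain_a = [("human", "AAAA"), ("mouse", "AAAT"), ("human", "AATA"), ("yeast", "TTAA")]
chain_b = [("mouse", "CCC"), ("human", "CCG"), ("fly", "GGG")]

for species, row in pair_by_species(chain_a, chain_b):
    print(species, row)
```

Different rules for choosing which hits to link (for example, which of several same-species copies to keep) yield different paired MSAs, which is why a pipeline may generate several candidates and select among them.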

When asked about the relevance of their research, Prof. Zhang said, “The major motivation for computational protein structure prediction is to bridge the gap between the low availability of experimental protein structures and the high demand from the fields of protein biology and the drug industry. AI-based high-quality protein monomer and complex structure prediction, as boosted by the proposed DeepMSA2 method, has several potential applications across different fields, including biomedicine, drug discovery, and bioengineering.”

The next steps for the DeepMSA2 research could open new avenues in both methodology and application, broadening its impact on the fields of bioinformatics and computational biology. Building on the advances and findings of DeepMSA2, future directions might include: 1) extending the method to RNA structure prediction; 2) developing a new MSA pairing and sequence linking method for PPIs; and 3) combining the method with protein language models. The team is excited about the potential of these new methods and techniques to significantly improve the accuracy of protein/RNA structure prediction and to enhance the power of AI and computational approaches in drug discovery and human health more generally.
