Robotics and Biology Laboratory

Predicting protein contacts by combining information from sequence and physicochemistry

Motivation

Contact prediction has shown immense potential to assist protein structure prediction. Despite recent successes, it remains a hard problem, mainly because of the size of the solution space: additional information is needed to pinpoint the correct contact pairs in that space. The most promising route to improved contact prediction is therefore to identify, leverage, and combine information sources that are indicative of contacts. We extend current methods by combining three orthogonal sources of information: evolutionary, sequence-based, and physicochemical.

Description of work

Meta approaches generally improve results: combining multiple sources of information can mitigate the drawbacks of individual methods, similar to ensembling in machine learning. However, adding information inevitably increases the dimensionality of the feature space, which introduces two challenges. First, a high-dimensional feature space increases learning complexity, the amount of training data required, and training time, and it promotes overfitting. Second, most commonly used feature sets were devised to be used on their own, not in the context of meta approaches. Different information sources may therefore overlap, and overlapping features contribute nothing to learning. It thus makes sense to re-evaluate them.

We conducted a feature importance analysis, which revealed that amino acid composition, a widely used feature, can be removed without harming performance. Our assumption is that it has been rendered redundant by the introduction of evolutionary methods that pursue a similar goal.
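The source does not state which importance measure was used. As a minimal sketch, assuming a permutation-based analysis, the idea can be illustrated by shuffling all columns of one feature group together and measuring the drop in accuracy. The function name `group_permutation_importance`, the toy predictor, and the toy data are hypothetical and chosen only for illustration.

```python
import random

def group_permutation_importance(predict, X, y, groups, n_repeats=5, seed=0):
    """Mean drop in accuracy when the columns of a feature group are shuffled together."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(row) == target for row, target in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importance = {}
    for name, cols in groups.items():
        drops = []
        for _ in range(n_repeats):
            order = list(range(len(X)))
            rng.shuffle(order)
            shuffled = [list(row) for row in X]
            # Row i receives the group's values from row order[i];
            # all other columns stay untouched.
            for i, j in enumerate(order):
                for c in cols:
                    shuffled[i][c] = X[j][c]
            drops.append(baseline - accuracy(shuffled))
        importance[name] = sum(drops) / n_repeats
    return importance

# Toy data (hypothetical): column 0 stands in for an evolutionary coupling
# score, column 1 for an amino acid composition value.
X = [[i / 19.0, (i * 7 % 20) / 19.0] for i in range(20)]
y = [1 if row[0] > 0.5 else 0 for row in X]

def predict(row):
    # Toy predictor that relies on the evolutionary column only.
    return 1 if row[0] > 0.5 else 0

groups = {"evolutionary": [0], "composition": [1]}
importance = group_permutation_importance(predict, X, y, groups)
```

In this toy setup the predictor ignores the composition column, so shuffling it leaves the accuracy unchanged and its importance is zero. A feature group with that profile is a candidate for removal.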

Based on the new feature set, we developed a new contact predictor: a neural network with four hidden layers.
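To make the architecture concrete, the following is a minimal sketch of the forward pass of a fully connected network with four hidden layers. The source gives only the number of hidden layers. The layer widths, ReLU activations, sigmoid output, and weight initialization are assumptions made for illustration, and training is omitted.

```python
import math
import random

def init_mlp(layer_sizes, seed=0):
    """Create random weights and zero biases for consecutive layer pairs."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        scale = math.sqrt(2.0 / n_in)  # assumed He-style scaling
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers

def forward(layers, features):
    """Return a contact probability for the feature vector of one residue pair."""
    activation = list(features)
    for index, (weights, biases) in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, activation)) + b
               for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            activation = [max(0.0, z) for z in pre]  # ReLU in hidden layers
        else:
            activation = [1.0 / (1.0 + math.exp(-z)) for z in pre]  # sigmoid output
    return activation[0]

# Hypothetical sizes: 8 input features, four hidden layers of 16 units, 1 output.
layer_sizes = [8, 16, 16, 16, 16, 1]
network = init_mlp(layer_sizes)
probability = forward(network, [0.1] * 8)
```

Each residue pair is scored independently here. Whether the actual predictor uses a window of neighbouring residues as input is not stated in the source.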

Results

We developed a new contact predictor called S\P-CP that combines evolutionary, physicochemical, and sequence-based information. S\P-CP improves the mean precision on 1.5L for CASP11 hard free-modeling (FM) targets by 16% over MetaPSICOV, the current state of the art. The new, refined feature set has drastically reduced dimensionality (by 75%).
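The evaluation metric can be sketched as follows, assuming that "precision on 1.5L" means the fraction of true contacts among the 1.5 * L highest-scoring residue pairs, where L is the sequence length. The function name `precision_at_top` and the toy scores are hypothetical. The mean precision would then be the average of this value over all targets.

```python
def precision_at_top(scores, true_contacts, seq_len, factor=1.5):
    """Precision over the int(factor * seq_len) highest-scoring residue pairs."""
    k = int(factor * seq_len)
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(pair in true_contacts for pair in ranked)
    return hits / len(ranked)

# Toy target with 5 residues, so the top int(1.5 * 5) = 7 pairs are evaluated.
scores = {
    (0, 2): 0.95, (0, 3): 0.90, (0, 4): 0.85, (1, 3): 0.80, (1, 4): 0.75,
    (2, 4): 0.70, (0, 1): 0.65, (1, 2): 0.30, (2, 3): 0.20, (3, 4): 0.10,
}
true_contacts = {(0, 2), (0, 4), (1, 3), (2, 4), (3, 4)}

precision = precision_at_top(scores, true_contacts, seq_len=5)
```

In this toy example four of the seven top-ranked pairs are true contacts, so the precision is 4/7. The true contact (3, 4) is ranked last and therefore does not count.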