Combining evolutionary and assay-labelled data for protein fitness prediction


Predictive modelling of protein properties has become increasingly important to the field of machine-learning guided protein engineering. In one of the two existing approaches, evolutionarily-related sequences to a query protein drive the modelling process, without any property measurements from the laboratory. In the other, a set of protein variants of interest are assayed, and then a supervised regression model is estimated with the assay-labelled data. Although a handful of recent methods have shown promise in combining the evolutionary and supervised approaches, this hybrid problem has not been examined in depth, leaving it unclear how practitioners should proceed, and how method developers should build on existing work. Herein, we present a systematic assessment of methods for protein fitness prediction when evolutionary and assay-labelled data are available. We find that a simple baseline approach we introduce is competitive with and often outperforms more sophisticated methods. Moreover, our simple baseline is plug-and-play with a wide variety of established methods, and does not add any substantial computational burden. Our analysis highlights the importance of systematic evaluations and sufficient baselines.