Protein Engineering
This is one of the two projects I am currently working on as a postdoctoral researcher at the Saito Laboratory in Tokyo, Japan, in the field of protein engineering.
Proteins are essential molecules that play a crucial role in the functioning of the body. Before being nutrients in food, they are vital components of the body itself, made up of building blocks called amino acids. These amino acids are arranged in long sequences, and the sequence gives each protein its unique properties, which in turn affect the body’s overall health and proper functioning. Proteins are involved in nearly every biological process, including cell structure, immune response, and the regulation of enzymes and hormones. Without proteins, our bodies would not be able to grow, repair themselves, or maintain a healthy balance of physiological functions.

Research on protein variants and mutations consists of changing one or several amino acids of a known protein to improve its properties, which is critical for the development of new medicines and vaccines. By understanding how specific changes in a protein sequence affect its function, it is possible to design proteins with improved therapeutic properties. However, choosing the right amino acids to change is a very challenging task, and producing the new protein requires a labor-intensive laboratory procedure, called mutagenesis, with no guarantee that the change will deliver the intended improvement.
In recent years, with the rise of artificial intelligence and especially Large Language Models (LLMs), which have shown impressive results in language-related applications, the field of protein engineering has taken advantage of the sequential nature of proteins to develop Protein Language Models. With the hope of decoding the “language” of proteins, this approach treats amino acids as “words” and proteins as “sentences”. Because of its effectiveness, it has become a standard approach in protein engineering.
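To make the “words” and “sentences” analogy concrete, here is a minimal sketch of how a protein sequence can be turned into the integer tokens a language model consumes. It assumes the common one-letter codes for the 20 standard amino acids; the names `AMINO_ACIDS`, `UNK_ID`, and `tokenize` are illustrative and do not belong to any specific Protein Language Model, whose actual vocabulary and special tokens may differ.

```python
# Illustrative vocabulary: one token per standard amino acid (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Hypothetical extra id for any symbol outside the 20 standard amino acids.
UNK_ID = len(AMINO_ACIDS)


def tokenize(sequence: str) -> list[int]:
    """Map a protein "sentence" to a list of integer ids, one per amino-acid "word"."""
    return [TOKEN_TO_ID.get(aa, UNK_ID) for aa in sequence.upper()]


if __name__ == "__main__":
    # A short toy sequence, not a real protein.
    print(tokenize("MKV"))
```

In this picture, a single mutation is simply the replacement of one token in the sequence by another.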
In my postdoctoral research, I use these models to develop new artificial intelligence methods that predict which amino acid variations could lead to an actual improvement in protein properties. More specifically, I developed a data augmentation method that makes it possible to reuse data from past protein engineering efforts, without relying on synthetic data. Protein datasets are usually limited in size, and I showed that increasing the diversity of a dataset with data from closely related proteins significantly improves the predictive performance of machine learning models.
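The general idea of enlarging a small dataset with measurements from closely related proteins can be sketched as follows. This is only an illustration under stated assumptions, not the method developed in this project: it assumes that sequences are already aligned to the same length, that relatedness is measured by simple sequence identity, and that a fixed threshold decides which proteins count as closely related. The names `sequence_identity`, `augment_with_relatives`, and `min_identity` are hypothetical.

```python
def sequence_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def augment_with_relatives(
    target_wild_type: str,
    target_data: list[tuple[str, float]],
    related: dict[str, list[tuple[str, float]]],
    min_identity: float = 0.8,  # assumed threshold, chosen only for illustration
) -> list[tuple[str, float]]:
    """Add real (variant, measurement) pairs from sufficiently similar proteins.

    `related` maps the wild-type sequence of each related protein to its own
    list of (variant sequence, measured property) pairs from past experiments.
    """
    augmented = list(target_data)
    for wild_type, measurements in related.items():
        if sequence_identity(target_wild_type, wild_type) >= min_identity:
            augmented.extend(measurements)
    return augmented
```

Because every added pair comes from an actual past experiment, the augmented dataset contains no synthetic measurements; only the selection of which related proteins to include is a modeling choice.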
