A small startup founded just three years ago has, for the first time, used a deep learning language model to generate entirely new proteins that do not exist in nature, setting off a revolution in protein design.
Artificial intelligence has greatly accelerated protein engineering research.
Recently, a fledgling startup in Berkeley, California has made another striking advance.
Its scientists used ProGen, a deep learning language model for protein engineering that works much like ChatGPT, to design new proteins with AI for the first time.
These proteins differ markedly from known ones, with sequence identity as low as 31.4% in some cases, yet they are as effective as natural proteins.
The work has now been formally published in Nature Biotechnology.
Paper address: https://www.nature.com/articles/s41587-022-01618-2
The experiment also shows that although natural language processing was developed for reading and writing text, it can learn some basic principles of biology as well.
Technology comparable to the Nobel Prize
The researchers suggest that this new technology may become more powerful than directed evolution, a Nobel Prize-winning protein design technique.
It could revitalize the 50-year-old field of protein engineering by speeding up the development of new proteins for almost any purpose, from therapeutic agents to biodegradable plastics.
The company, named Profluent, was founded by a former Salesforce AI research leader and has received $9 million in startup funding to build an integrated wet lab and recruit machine learning scientists and biologists.
In the past, mining proteins from nature or tuning them toward a desired function was very laborious. Profluent's goal is to make this process effortless.
They did it.
Ali Madani, Founder and CEO of Profluent
Madani said in an interview that Profluent has designed multiple families of proteins. These proteins function like the natural proteins they were modeled on, and they are highly active enzymes.
The task is very difficult and was completed in a zero-shot manner: there were no repeated rounds of optimization, and the model was given no wet-lab data at all.
The final designs are highly active proteins of a kind that usually takes hundreds of years to evolve.
ProGen: built on a language model
Conditional language models are a type of deep neural network. They can generate novel natural language text that is semantically and grammatically correct, and they can also use input control labels to steer style, topic, and more.
In the same spirit, the researchers developed the subject of this story: ProGen, a 1.2-billion-parameter conditional protein language model.
Specifically, ProGen is built on the Transformer architecture, models interactions between residues through self-attention, and can generate artificial protein sequences across different protein families based on the input control labels.
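To make the idea of a control label concrete, here is a minimal sketch of how a conditional protein language model might see its input: a label token placed in front of one token per amino acid residue. The label names, the vocabulary layout, and the helper functions are all hypothetical and chosen for illustration; they are not ProGen's actual tokenizer.

```python
# Illustrative sketch only: the label names and vocabulary layout below are
# assumptions, not ProGen's real vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
CONTROL_LABELS = ["<lysozyme_family_A>", "<lysozyme_family_B>"]  # made-up names
END_TOKEN = "<eos>"


def build_vocab(labels, residues):
    """Map every control label, residue, and the end token to an integer id."""
    tokens = list(labels) + list(residues) + [END_TOKEN]
    return {token: index for index, token in enumerate(tokens)}


def encode(label, sequence, vocab):
    """Return ids for: control label, then one id per residue, then end token."""
    unknown = sorted({residue for residue in sequence if residue not in vocab})
    if unknown:
        raise ValueError(f"unknown residues: {unknown}")
    return [vocab[label]] + [vocab[r] for r in sequence] + [vocab[END_TOKEN]]


VOCAB = build_vocab(CONTROL_LABELS, AMINO_ACIDS)
# A short, arbitrary fragment used purely as example input.
ids = encode("<lysozyme_family_A>", "KVFGRC", VOCAB)
print(ids)
```

At generation time, a model trained on inputs of this shape would be given only the label token and asked to continue, so the label steers which protein family the sampled sequence belongs to.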
Generate artificial proteins using conditional language models
To create the model, the researchers fed it the amino acid sequences of 280 million different proteins and let it "digest" them for several weeks.
They then fine-tuned the model using 56,000 sequences from five lysozyme families, along with information about these proteins.
ProGen's algorithm is similar to GPT-3.5, the model behind ChatGPT: it learns the patterns in which amino acids are ordered in proteins and how those patterns relate to protein structure and function.
Quickly, the model generated one million sequences.
Based on the similarity with natural protein sequences and the naturalness of amino acid syntax and semantics, the researchers selected 100 for testing.
Of these, 66 produced chemical reactions similar to those of the natural proteins that destroy bacteria in egg white and saliva.
That is to say, these new proteins generated by AI can also kill bacteria.
The generated artificial proteins are diverse and well expressed in the experimental system
Furthermore, the researchers selected the five proteins with the strongest reactions and added them to the samples of Escherichia coli.
Two of these artificial enzymes were able to break down bacterial cell walls.
A comparison with hen egg white lysozyme (HEWL) showed that their activity is comparable to that of HEWL.
The researchers then imaged the proteins using X-rays.
Although the amino acid sequences of the two artificial enzymes differ from known proteins by up to about 30% and are only about 18% identical to each other, their shapes resemble those of natural proteins and their function is comparable.
Applicability of Conditional Language Modeling to Other Protein Systems
For a highly evolved natural protein, by contrast, a single small mutation can be enough to stop it from working.
Yet in another round of screening, the researchers found that even AI-generated enzymes whose sequences were only 31.4% identical to any known protein could still show considerable activity and adopt similar structures.
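Figures such as "31.4% identical" refer to sequence identity: the share of aligned positions at which two sequences carry the same amino acid. The sketch below shows one simple way to compute it for two sequences that have already been aligned to the same length, with "-" marking a gap. Several conventions exist for the denominator, and the one used here (positions where neither sequence has a gap) is an assumption, not necessarily the one used in the paper. The function name and example strings are illustrative.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free aligned positions where both residues match.

    Both inputs must be pre-aligned to equal length; '-' marks a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == "-" or residue_b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if residue_a == residue_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy aligned fragments, invented for this example.
score = percent_identity("KVFGRCELAA", "KIFSRC-LAT")
print(f"{score:.1f}% identical")
```

Real comparisons first need an alignment step (for example with a dedicated alignment tool), which this sketch deliberately leaves out.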
Protein design, entering a new era
As this shows, ProGen works in much the same way as ChatGPT.
By learning from massive amounts of data, ChatGPT can sit MBA and bar exams and write university essays.
ProGen learned to generate new proteins by studying the grammar of how amino acids combine, drawn from 280 million existing proteins.
In an interview, Madani said, "Just like ChatGPT learns human languages like English, we are learning the language of biology and proteins."
"The artificially designed proteins perform much better than designs inspired by the evolutionary process," said James Fraser, one of the authors of the paper and a professor of bioengineering and therapeutic sciences at the University of California, San Francisco School of Pharmacy.
"The language model is learning aspects of evolution, but it is different from the normal evolutionary process. We now have the ability to tune the generation of these properties for specific effects. For example, making an enzyme incredibly thermostable, or making it prefer acidic environments, or keeping it from interacting with other proteins."
Salesforce Research developed ProGen as early as 2020. It is based on a natural language processing approach that was originally developed to generate English text.
From that earlier work, the researchers knew that AI systems can teach themselves grammar and the meanings of words, as well as other basic rules that make writing well organized.
"When you train sequence-based models with large amounts of data, they are very powerful at learning structure and rules," said Dr. Nikhil Naik, Director of Artificial Intelligence Research at Salesforce Research and senior author of the paper. "They learn which words can occur together and how to combine them."
"Now, we have demonstrated that ProGen can generate new proteins, and we have released it publicly so that anyone can build on our findings."
Lysozyme is a very small protein, with at most about 300 amino acids.
But with 20 possible amino acids at each position, there are 20^300 possible combinations.
That is more than the number of all humans who have ever lived, multiplied by the number of grains of sand on Earth, multiplied by the number of atoms in the universe.
Given these almost limitless possibilities, it is remarkable that ProGen was able to design an effective enzyme so easily.
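The scale of that search space is easy to check with a few lines of arithmetic. The sketch below counts sequences of exactly 300 residues over a 20-letter alphabet. The comparison figures (humans who have ever lived, grains of sand, atoms in the universe) are rough order-of-magnitude estimates added here as assumptions; they do not come from the article and vary widely between sources.

```python
import math

ALPHABET_SIZE = 20     # standard amino acids
SEQUENCE_LENGTH = 300  # approximate upper bound on lysozyme length

# Exact count using Python's arbitrary-precision integers.
total_sequences = ALPHABET_SIZE ** SEQUENCE_LENGTH
digit_count = len(str(total_sequences))

# Same quantity on a log scale: 300 * log10(20).
log10_total = SEQUENCE_LENGTH * math.log10(ALPHABET_SIZE)

# Assumed order-of-magnitude estimates (powers of ten), for comparison only.
LOG10_HUMANS_EVER = 11    # roughly 100 billion people
LOG10_SAND_GRAINS = 19    # rough estimate; published figures differ
LOG10_ATOMS_UNIVERSE = 80  # commonly quoted for the observable universe
log10_comparison = LOG10_HUMANS_EVER + LOG10_SAND_GRAINS + LOG10_ATOMS_UNIVERSE

print(digit_count)            # 391
print(round(log10_total, 1))  # 390.3
print(log10_total > log10_comparison)  # True
```

In other words, 20^300 is a 391-digit number, while the product of the three comparison quantities has only on the order of 110 digits under these assumptions.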
Dr. Ali Madani, founder of Profluent Bio and a former research scientist at Salesforce Research, said, "The ability to generate functional proteins from scratch, right out of the box, shows that we are entering a new era of protein design."
"This is a versatile new tool that all protein engineers can use, and we look forward to seeing it applied in therapeutics."
At the same time, researchers are still improving ProGen in an attempt to overcome more limitations and challenges.
One limitation is that it relies heavily on data.
"We have explored improving sequence design by incorporating structure-based information," Naik said. "We are also researching how to improve the model's generative ability when you don't have much data on a specific protein family or domain."
It is worth noting that other startups are trying similar technologies, such as Cradle, and Generate Biomedicines, which came out of the biotechnology incubator Flagship Pioneering, but their work has not yet been peer reviewed.
References:
https://endpts.com/exclusive-profluent-debuts-to-design-proteins-with-machine-learning-in-bid-to-move-past-ai-sprinkled-on-top/
https://www.newscientist.com/article/2356597-ai-has-designed-bacteria-killing-proteins-from-scratch-and-they-work/
https://www.sciencedaily.com/releases/2023/01/230126124330.htm