Journal of Cosmology, 2010, Vol 10, 3374-3380.
JournalofCosmology.com, August, 2010

Reconstruction of the Molecular Origin of Life

Edward N. Trifonov, Ph.D.,
Genome Diversity Center, Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel.
Department of Functional Genomics and Proteomics, Institute of Experimental Biology, Faculty of Science, Masaryk University, Kamenice 5, Brno CZ-62500, Czech Republic.


Abstract

Instead of starting with one or another plausible chemical scenario for the origin of life, the author looks for traces of early molecular evolution in modern sequences, assuming that sequences successful in the past are still successful today. Reconstruction of the molecular past on this basis has turned out to be very fruitful, resulting in several highly non-trivial predictions confirmed by analysis of modern sequences. The reconstruction covers events from a simple repetitive RNA duplex to the stages of evolution of the genetic code and the very first proteins of the LUCA (Last Universal Common Ancestor) repertoire. The events prior to the repetitive RNA can only be speculated upon, but they are also partially confirmed experimentally, suggesting a path towards experimental reconstruction.

Keywords: polymerization, complementarity, evolution of codons, minigenes, replicator, life definition, first proteins, LUCA



I. RECONSTRUCTIONS

Consecutive transformation from simple to complex is a rather common-sense view, accepted by almost everybody who enters the very speculative, hot field of the origin of life. Charles Darwin was, probably, the first to clearly take this stand in his monumental work (Darwin, 1859, in concluding passages): "life... having been originally breathed into a few forms or into one;... from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved". Of modern thinkers, Sidney Brenner (1988) put it in molecular terms: "I assume that the earliest proteins were small peptides of about ten amino acids, and specified by small primitive genes, probably made of RNA... In the next stage, I postulate that the genes become joined together at random and a primitive splicing mechanism concatenates the peptides into longer molecules... This process produced compact protein domains with a characteristic size of about a hundred amino acids from which, by further concatenation, more complex proteins were then assembled". It would be, thus, rather natural to build the cosmology of life along the "simple-to-complex" vector, starting with… what?

There are two ways to go. One route starts at the zero point – the completely unknown very first molecular events that, presumably, have led to what could be called the earliest life. "Presumably", of course, means speculatively. There could be many chemically different scenarios for the start of life, and many have been suggested and explored. No matter how attractive intellectually, they all remain speculation, since the only confirmation of the zero-point predictions would be the construction (creation!) of an experimental model of life, which has never been accomplished. The "Creation" at hand, the advanced living matter of which we are the crown part, is so complex that every resemblance to the zero-point picture is already long lost (is it?). The second route is backwards along the vector, that is, from known modern molecular structures to their earlier precursors, as far back as one can reach. This strategy would be, hopefully, less dependent on speculation, as it starts, after all, from solid observations of present reality, in contrast to the fully speculative past.

The name of the game is reconstruction, a daring attempt to trace those ingenious consecutive steps that have been taken by the Builder-Nature. The starting hypothesis of the reconstruction is that at least some of the earliest molecular structures (amino acid and nucleotide sequences) may have survived the 3.9 billion year journey of perpetuated self-reproduction and are still around, presumably, due to their exceptional performance all the way. This winning property would give the sequences a high degree of conservation, perhaps even full conservation. One should, thus, look for the highly conserved, best performing sequences.

Genomes of eukaryotes carry numerous simple repeating sequences. One such sequence, the repeat (GCC)n, occasionally expands spontaneously to a very high copy number, which is believed to be a molecular cause of several neurodegenerative dysfunctions (e.g., Lenzmeier and Freudenreich, 2003). The (GCC)n repeat, thus, not only survives within the genome but also manifests the ability to expand, ignoring the wellbeing of the host genome and performing the expansion better than any other simple repeat. Interestingly, when the repeats are introduced into bacterial plasmids they expand as well (Ohshima et al. 1996). This special property of (GCC)n has led us (Trifonov & Bettecken, 1997) to an outrageous speculation: (GCC)n could be the sequence of one of the earliest RNAs that outcompeted all other simple repeating sequences due to its exceptional expandability. The GCC triplet today encodes alanine. Was not the GCC triplet the very first codon, corresponding to alanine – the highest yield amino acid synthesized abiotically in the imitation experiments of Miller (1987)? A burst of further speculations followed. If that one was the first, then the next codons to come would have been, naturally, single point mutations of GCC, like GAC, GCA, CCC, etc. Would not they all be the triplets coding for the earliest amino acids? We speculated then that the earliest amino acids were chemically the simplest, that they were served by the ancient type of aminoacyl-tRNA synthetases, Class II (Carter, 1993), and that they should have been present in the primordial environment on Earth, being synthesized abiotically, as in the experiments of Miller (1987). Six amino acids satisfy these conditions: ala (GCN), gly (GGC), asp (GAC), pro (CCC), ser (UCC) and thr (ACC). They should have been encoded by point-mutated derivatives of the GCC triplet. And they are still served by such triplets today (in parentheses)!
The reader may mind the emotional attitude of the author, but these observations themselves are loud exclamations. The combination of several simple speculations boiled down to one more speculation, or rather prediction, that we did not even dare to consider: the codons of the billion-year-old past encoded the same amino acids as they do today. And we have got the "prediction" nearly confirmed. (Note that the observation above puts the set of the earliest amino acids in correspondence with the set of the earliest codons, not in one-to-one correspondence.)
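
The neighbourhood of GCC discussed above can be enumerated directly. The following is a minimal sketch, assuming the modern standard genetic code; the names CODON_TABLE and single_point_mutants are illustrative and do not come from the cited papers.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in the order UUU, UUC, UUA, UUG, UCU, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def single_point_mutants(codon):
    """Return every codon that differs from `codon` at exactly one position."""
    mutants = []
    for pos, original in enumerate(codon):
        for base in BASES:
            if base != original:
                mutants.append(codon[:pos] + base + codon[pos + 1:])
    return mutants


# Map each one-step neighbour of GCC to the amino acid it encodes today.
neighbours = {m: CODON_TABLE[m] for m in single_point_mutants("GCC")}
for codon, aa in sorted(neighbours.items()):
    print(codon, aa)
```

Under this assumption the nine neighbours of GCC encode alanine, glycine, aspartate, proline, serine and threonine, that is, the six amino acids listed above, plus valine (GUC).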

This success in reconstruction of the earliest days of the triplet code developed further into a complete "evolutionary tree" of codons (Trifonov, 2000, 2004), which brought new surprises. It turned out that the temporal order of engagement of amino acids in evolution, calculated as a consensus of now more than 100 different chronological criteria, not just three criteria as in (Trifonov & Bettecken, 1997), follows the order of descending thermostability of the respective codons paired with their Watson-Crick complements. The amino acids encoded by complementary codons enter the evolutionary chart simultaneously, like alanine (GCC) and glycine (GGC) or valine (GUC) and aspartate (GAC) (Trifonov, 2004). From the temporal order it also follows that the amino acids of Miller are at the top of the list, while the codon capture cases (Osawa et al. 1992) are all the latest amino acids. The abiotic start, complementarity, thermostability, codon capture last – all make sense, which turns the reconstructed evolutionary chart of codons into a very reasonable composite speculation, a basis for further rounds of speculations turning, indeed, into theory via confirmed predictions.
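
The complementary pairing of codons mentioned here is easy to check. This is a small sketch, assuming that "complement" means the Watson-Crick complement read on the opposite strand (the reverse complement); the function name and the ASSIGNMENTS dictionary are illustrative only.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Codon assignments quoted in the text (standard genetic code).
ASSIGNMENTS = {"GCC": "ala", "GGC": "gly", "GUC": "val", "GAC": "asp"}


def reverse_complement(codon):
    """Watson-Crick complement of an RNA codon, read on the opposite strand."""
    return "".join(COMPLEMENT[base] for base in reversed(codon))


for codon, amino_acid in ASSIGNMENTS.items():
    partner = reverse_complement(codon)
    print(f"{codon} ({amino_acid}) pairs with {partner} ({ASSIGNMENTS[partner]})")
```

The sketch pairs GCC with GGC and GUC with GAC, the two codon pairs named in the paragraph.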

From the way the chart is built it follows that in all cases (more accurately, in 31 of 32 cases) the changes in already acquired codons occur only in third positions and (complementarily) in first positions of the codons. As a result the middle bases stay unchanged. For example, the initial GGC for glycine turns into GGG (glycine as well), and complementarily into CCC (proline). That is, the GGC/GCC pair turns into the GGG/CCC pair, so that the middle G and the middle C, respectively, stay unchanged. In one case, at the first step of the ladder, the pair GGC/GCC is changed to GAC/GUC, for aspartate and valine, respectively, next in the temporal order to the initial alanine and glycine. In this case the middle letters keep their purine (pyrimidine) quality unchanged. This leads to the following general statement describing one basic property of the chart: during the establishment and evolution of the codon table two independent groups of codons were formed, one with central purines, descending from GGC (gly), and another with central pyrimidines, descending from GCC (ala), making two parallel (complementary) lines of descent. The pyrimidine-central codons, the first two columns of the standard presentation of the triplet code, correspond to mostly non-polar (hydrophobic) amino acids (phe, leu, ile, met, val, ser, pro, thr and ala). The other family, the last two columns of the standard codon table, contains among others all polar amino acids (tyr, his, gln, asn, lys, glu, asp, cys, trp, arg, ser, gly). As serine is encoded by two types of codons, it belongs to both families.
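
The split into two families by the middle base can be reproduced from the standard codon table. The sketch below assumes the modern standard genetic code; the names CODON_TABLE, pyrimidine_family and purine_family are illustrative, and stop codons are simply left out.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in the order UUU, UUC, UUA, UUG, UCU, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

pyrimidine_family = set()  # codons with central U or C (alanine line, GCC)
purine_family = set()      # codons with central A or G (glycine line, GGC)

for codon, aa in CODON_TABLE.items():
    if aa == "*":
        continue  # stop codons carry no amino acid
    if codon[1] in "UC":
        pyrimidine_family.add(aa)
    else:
        purine_family.add(aa)

print("central pyrimidine:", "".join(sorted(pyrimidine_family)))
print("central purine:    ", "".join(sorted(purine_family)))
print("in both families:  ", "".join(sorted(pyrimidine_family & purine_family)))
```

Serine is the only amino acid that lands in both sets, because it is encoded by UCN (central pyrimidine) and by AGU/AGC (central purine).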

After the codon table had been completed during its evolution, this family separation should have been maintained for some time because of two different pressures. First, mutations of the middle base are likely to have been more frequently of the transition type (purine to purine, or pyrimidine to pyrimidine), very much as happens today. Second, hydrophobic amino acids would have been more often replaced by other hydrophobic residues, and polar amino acids by polar ones, to maintain the balance between these types, important for protein folding and stability. One would expect, therefore, that the early separation into the two families would still be visible in modern amino acid substitutions. This prediction is confirmed, indeed, as the known substitution matrices do split into two independent boxes, with almost all substitutions confined within the boxes (Trifonov, 2006; Gabdank et al. 2006). The 20-letter alphabet protein sequences may now be presented in binary form, with the letters A (alanine family) and G (glycine family), leaving the dual serine uncertain. Such a sequence can be considered as representing the ancestral form of the protein.
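
The binary presentation can be sketched as a simple recoding, using the two families as listed in the text. The "?" symbol for serine, the function name to_binary and the example peptide are assumptions made for illustration only.

```python
# One-letter codes of the two families as listed in the text.
ALA_FAMILY = set("FLIMVPTA")      # central pyrimidine codons
GLY_FAMILY = set("YHQNKEDCWRG")   # central purine codons
UNCERTAIN = "?"                   # assumed placeholder for the dual serine


def to_binary(protein):
    """Recode a one-letter protein sequence into the A/G binary alphabet."""
    recoded = []
    for residue in protein.upper():
        if residue in ALA_FAMILY:
            recoded.append("A")
        elif residue in GLY_FAMILY:
            recoded.append("G")
        elif residue == "S":
            recoded.append(UNCERTAIN)
        else:
            raise ValueError(f"unexpected residue: {residue!r}")
    return "".join(recoded)


# Hypothetical peptide, not taken from any database.
print(to_binary("MKVLAGSD"))
```

Any other symbol could stand for serine; the only point is that it is kept apart from both A and G.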

The next round of prediction-confirmation is at the door. The most ancestral genes, presumably encoding oligoglycines and (in the complementary strand) oligoalanines, were most probably short, encoding "of the order of 10 amino acid residues" (Brenner, 1988). A size of 7-8 residues is more realistic, as longer peptides simply would not be soluble (Ogata et al. 2000). At a later stage they should have been fused into longer chains. Since long chains of primarily hydrophobic residues of the Ala-family would rather aggregate, while long chains of primarily polar residues of the Gly-family would dissolve in an extended form that would exclude any folding, the most likely type of fusion product would be an alternation of runs of A-type residues and runs of G-type residues of a certain unit size. The prediction would be: the ancient alternation of runs of G and A may survive in modern protein sequences in hidden form. Positional cross-correlation of these types of amino acids in large ensembles of prokaryotic protein sequences, indeed, shows the alternation of A-rich and G-rich sections with a period of ~13 residues, which corresponds to a run size of 6-7 residues (Trifonov et al. 2001). The size of 7 residues for the ancient unit was confirmed later by an independent study of traces of ancient RNA hairpins in modern prokaryotic mRNA (Gabdank et al. 2006).
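
How a hidden alternation of A-runs and G-runs shows up as a period can be illustrated on a toy sequence. This sketch is not the cross-correlation procedure of Trifonov et al. (2001); it is a simplified self-match count on an idealized, noise-free sequence, and the names match_fraction and dominant_period are hypothetical.

```python
def match_fraction(seq, lag):
    """Fraction of positions i where seq[i] equals seq[i + lag]."""
    pairs = list(zip(seq, seq[lag:]))
    return sum(a == b for a, b in pairs) / len(pairs)


def dominant_period(seq, min_lag=2, max_lag=30):
    """Smallest lag in [min_lag, max_lag] with the highest self-match."""
    return max(range(min_lag, max_lag + 1), key=lambda lag: match_fraction(seq, lag))


# Idealized binary sequence: runs of 7 A-type and 6 G-type residues, repeated.
toy = ("A" * 7 + "G" * 6) * 10

print(dominant_period(toy))
```

Real protein ensembles are noisy, so a signal of this kind would appear only as a weak excess over many sequences rather than as a perfect match.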

Thus, the backward reconstruction that resulted in the speculative codon evolution chart also yielded a few confirmed steps forward (the binary alphabet and minigenes encoding A7 and G7), turning the speculation, the evolutionary chart of codons, into a theory (Trifonov, 2006). Such path-beating proceeds, of course, along a unique route within the evolutionary tree or, probably, along its trunk, as the reconstruction has not yet encountered anything that would suggest a bifurcation of the tree.

We started with the presumed (now with good measure of confidence) earliest codons, backwards from triplet expansion diseases, walking then again up the trunk. Let us try to repeat the exercise, now with protein sequences. The whole idea of the reconstruction rests on the assumption that at least some of the earliest functional sequences, of nucleic acids and proteins, are still around, since the times of the last common ancestor or even before. If such highly conserved sequences (motifs) do exist in proteins, then they would be, likely, present in every species, that is, omnipresent. The full conservation of such sequences would be due to their exceptional performance, from the moment of their establishment on the evolutionary scene.

This expectation has been lavishly confirmed by the existing genomic (proteomic) databases (Sobolevsky and Trifonov, 2005; Sobolevsky et al. 2007). We were able to identify 27 such motifs from 6 to 9 residues long, many of which are already known to be conserved elements of longer protein sequences responsible for important cellular processes (Sobolevsky and Trifonov, 2006). Although some of these motifs are completely dissimilar, they are all found to belong to the same large network in the formatted sequence space (Sobolevsky et al. 2007). That is, all sequence fragments of length 20 residues that contain the omnipresent elements are related to each other via several intermediate fragments, all of which are close relatives of their immediate neighbors in the space. One can build a tree of gradual branching of the fragments from a common stem (ibid), suggesting that all the omnipresent elements have a common origin. Remarkably, in binary presentation most of the omnipresent motifs follow the same consensus (Trifonov, 2009a), which is nothing but an alternation of almost perfect A7 with almost perfect G7 (one of seven A is replaced by G, and one of seven G is replaced by A). Thus, the perfect A7 and G7 meet the almost perfect ones in two reconstructions in opposite directions: from the first codons up and from modern omnipresent motifs down. In the 20-letter alphabet the motifs cluster (fuse) into two sequences that correspond to the ancient sequence prototypes identified earlier, Aleph and Beth (Berezovsky & Trifonov, 2002; Trifonov & Berezovsky, 2002). These prototypes are found in modern ATP-binding motifs and in ATPases of ABC-transporters. This suggests that the very first genes, according to the reconstructions, were responsible for energy supply to the primordial cell – a more than expected result, though no such clear statement could be uttered before the reconstruction.

Each of the omnipresent elements appears to represent today a different sequence/structure module, a closed loop of 25-30 residues (Sobolevsky and Trifonov, 2006) – the structure identified earlier as the universal building block of proteins (Berezovsky et al. 2000). The modules, in their turn, appear in modern proteins in various combinations (Sobolevsky et al. 2007). Those combinations which are present in every species can be considered as omnipresent cassettes of the ancient modules, representing, presumably, the very first multimodular proteins. These are ABC-cassettes of transporters, cell division proteins (with protease activity), translation initiation and elongation factors, aminoacyl-tRNA synthetases and RNA polymerase (Trifonov, 2009b). Interestingly, none of these most ancient protein cassettes is involved in the synthesis of amino acids and nucleotides. Their functions are either supply of the monomers (cross-membrane transport and proteolysis of peptides) or polymerization (translation factors, aminoacyl-tRNA synthetases, and RNA polymerase).

The LUCA cells, thus, were still dependent on abiotic syntheses in the environment, as were the very first molecular steps of life.

The moment of the start of life is suggested (Trifonov, 2009a) by the reconstructed evolutionary chart of codons. This is the transition point between mere self-reproduction (the first complementary messengers GCC7 and GGC7, and the first mini-proteins ala7 and gly7) and self-reproduction with variations (the appearance of the second complementary pair of codons, GUC and GAC, for valine and aspartate). The association of GCC7, GGC7, ala7 and gly7 would then represent the very first composite replicator (Trifonov, 2006). One question, however, remains unanswered: what is the origin of the repeating oligonucleotides? The homo-oligopeptides, probably, could have been synthesized abiotically (Lahav et al. 1978). The nucleobases and nucleoside triphosphates could have been synthesized abiotically as well (Costanzo et al. 2007; Saladino et al. 2009; Powner et al. 2009). The reconstructed elementary protein composition of LUCA cells also suggests that all monomers necessary for condensation of RNA and polypeptides were provided by the environment. Is the abiotic condensation of the repeating RNA sequences possible?

Recent experiments of the group of Di Mauro are very encouraging. Not only were they able to overcome the condensation/degradation barrier (Pino et al. 2008), but some of the long RNA molecules synthesized in water contained all bases, demonstrating the potential of the system to produce RNA of mixed sequences (Costanzo et al. 2009). In what follows the author abandons the safe path of reconstruction, engaging in speculations on even earlier stages of life, as a matter of fact – on earlier extinct molecular lives (plural).

II. SPECULATIONS

The problem for mixed sequence condensation in water is the reverse reaction (e.g., van Holde, 1980). Because of this, even tetranucleotides and pentanucleotides may be barely detectable in the reaction mixture. The presumed abiotic system(s) could contain only short homo- and hetero-oligonucleotides. Di Mauro and colleagues (Pino et al. 2008) discovered that if the condensation is conducted in the presence of presynthesized hexariboA it continues well beyond this size, apparently due to formation of secondary structure within or between the oligoA molecules. It was found long before that polyA may form hairpin-like structures stabilized by "complementary" A●A base pairs (Brahms et al. 1966). It is not Watson-Crick pairing, of course, but this special secondary structure apparently catalyzed further elongation of the chain. The plausible scenario of the very first breakthrough of the earliest molecular evolution would be, thus, 1) appearance of hexaA in minute probabilistic amounts, 2) oligoA hairpin formation, and 3) oligoA elongation. As further studies of the Di Mauro group indicate (Costanzo et al. 2009), long chains of RNA can then be formed, allowing for a certain amount of mixed sequence RNA as well. One could speculate, thus, on the following additional stages of the earliest molecular evolution (continuing): 4) synthesis of mixed sequence oligoribonucleotides, 5) first (probabilistic) Watson-Crick complementary complexes, and 6) more efficient synthesis of the complementary complexes (replication). Filling the remaining gap between mixed sequence duplex RNA and repeating sequence duplex RNA is straightforward (continuing): 7) occasional repeating sequences in the mixture of the RNA duplexes take over due to their unique expandability (formation of slippage structure), 8) the most expandable sequences GCC*GGC prevail, and we arrive at the first step of the evolution of the triplet code, as outlined above.

Suggestions of experiments on the basis of a theory are not often appreciated by the community of experimentalists, not to mention suggestions coming from speculations, as above. I do have, however, a feeling that further experimental developments will eventually take the anticipated direction. After all, quite a reward is at stake: an experimental remake of emerging life.

Acknowledgements: This work is supported by Czech Ministry of Education (grant SM0021622415). Fruitful discussions with E. Di Mauro are highly appreciated.




References

Berezovsky, I. N., Grosberg, A. Y., Trifonov, E. N. (2000). Closed loops of nearly standard size: common basic element of protein structure. FEBS Letters 466, 283-286.

Berezovsky, I. N., Trifonov, E. N. (2002). Flowering buds of globular proteins: transpiring simplicity of protein organization. Comp Funct Genom, 3, 525-534.

Brahms, J., Michelson, A. M., Van Holde, K. E. (1966). Adenylate oligomers in single-and double-strand conformation. J Mol Biol, 15, 467-488.

Brenner, S. (1988). The molecular evolution of genes and proteins: a tale of two serines. Nature, 334, 528-530.

Carter, C. W. (1993). Cognition, mechanism, and evolutionary relationships in aminoacyl-tRNA synthetases. Annu Rev Biochem, 62, 715-748.

Costanzo, G., Saladino, R., Crestini, C., Ciciriello, F., Di Mauro, E. (2007). Nucleoside phosphorylation by phosphate minerals. J Biol Chem, 282, 16729-16735.

Costanzo, G., Pino, S., Ciciriello, F., Di Mauro, E. (2009). Generation of long RNA chains in water. J Biol Chem, 284, 33206-33216.

Darwin, C. (1859). On the origin of species. Murray, London.

Gabdank, I., Barash, D., Trifonov, E. N. (2006). Tracing ancient mRNA hairpins. J Biomol Str Dyn, 24, 163-170.

Lahav, N., White, D., Chang, S. (1978). Peptide formation in prebiotic era - thermal condensation of glycine in fluctuating clay environments. Science, 201, 67-69.

Lenzmeier, B. A., Freudenreich, C. H. (2003). Trinucleotide repeat instability: a hairpin curve at the crossroads of replication, recombination, and repair. Cytogenet Genome Res, 100, 7-24.

Ogata, Y., Imai, E-I., Honda, H., Hatori, K., Matsuno, K. (2000). Hydrothermal circulation of seawater through hot vents and contribution of interface chemistry to prebiotic synthesis. Origin Life Evol Biosph, 30, 527-537.

Ohshima, K., Kang, S., Larson, J. E., Wells, R. D. (1996). Cloning, characterization, and properties of seven triplet repeat DNA sequences. J Biol Chem, 271, 16773-16783.

Osawa, S., Jukes, T. H., Watanabe, K., Muto, A. (1992). Recent evidence for evolution of the genetic code. Microbiol Rev, 56, 229-264.

Pino, S., Ciciriello, F., Costanzo, G., Di Mauro, E. (2008). Nonenzymatic RNA Ligation in Water. J Biol Chem, 283, 36494-36503.

Powner, M. W., Gerland, B., Sutherland, J. D. (2009). Synthesis of activated pyrimidine ribonucleotides in prebiotically plausible conditions. Nature, 459, 239-242.

Saladino, R., Crestini, C., Ciciriello, F., Pino, S., Costanzo, G., Di Mauro, E. (2009). From formamide to RNA: the roles of formamide and water in the evolution of chemical information. Res Microbiol, 160, 441-448.

Sobolevsky, Y., Trifonov, E. N., (2005). Conserved sequences of prokaryotic proteomes and their compositional age. J Mol Evol, 61, 591-596.

Sobolevsky, Y., Trifonov, E. N., (2006). Protein modules conserved since LUCA. J Mol Evol, 63, 622-634.

Sobolevsky, Y., Frenkel, Z. M., Trifonov, E. N. (2007). Combinations of ancestral modules in proteins. J Mol Evol, 65, 640-650.

Trifonov, E. N. (2000). Consensus temporal order of amino acids and evolution of the triplet code. Gene, 261, 139-151.

Trifonov, E. N. (2004). The triplet code from first principles. J Biomolec Str Dyn, 22, 1-11.

Trifonov, E. N. (2006). Theory of early molecular evolution: Predictions and confirmations. In: Eisenhaber, F., (Ed), Discovering Biomolecular Mechanisms with Computational Biology, Landes Bioscience, Georgetown, pp. 107-116.

Trifonov, E. N., (2008). Tracing Life back to elements. Physics of Life Reviews 5, 121-132.

Trifonov, E. N., (2009a). Origin of the genetic code and of the earliest oligopeptides. Res Microbiol, 160, 481-486.

Trifonov, E. N. (2009b). The origin of triplet code and reconstruction of LUCA. Symp. Ital Soc Gen Microb Microb Biotech (SIMGBM), Spoleto.

Trifonov, E. N., Bettecken, T. (1997). Sequence fossils, triplet expansion, and reconstruction of earliest codons. Gene 205, 1-6.

Trifonov, E. N., Kirzhner, A., Kirzhner, V. M., Berezovsky, I. N. (2001). Distinct stages of protein evolution as suggested by protein sequence analysis. J Mol Evol, 53, 394-401.

Trifonov, E. N., Berezovsky, I. N. (2002). Proteomic code. Mol Biol 36, 239-243.





Copyright 2010, 2011, All Rights Reserved