Aug 1, 2020

5’-UTR of Coronaviruses Mutates Twice Slower Than the Whole Genome, SARS-CoV-2 Does Not Differ in This Respect


Coronaviruses, particularly betacoronaviruses, are among the usual causative agents of the common cold and other respiratory infections, together with common cold picornaviruses, influenza viruses, adenoviruses, human respiratory syncytial virus, and parainfluenza viruses. Newly appeared strains caused epidemics of severe acute respiratory syndrome (SARS): SARS-CoV-1 in 2003 and SARS-CoV-2 in 2019 (reviewed here).


In view of 15 million diagnosed cases and 600,000 deaths caused by CoV-2 by 20 July 2020, tens of thousands of publications have tried to elucidate the origin of the new coronavirus, which is known to be hosted by many mammals, and specifically by bats and pangolins in China. The virus has been broadly studied in China since 2003, and many new strains have been reported, some of them artificially prepared (here). Attempts to explain the path from an animal host to humans have not been successful. The closest animal coronavirus published so far, Bat CoV RaTG13, has 96% RNA homology (also called % similarity or % identity). As the epidemic started close to the high-security Virology Institute in Wuhan, an escape from the laboratory is among the scenarios considered, also in view of numerous such escapes and the low level of security in Chinese laboratories (here, here), and in view of known cases of laboratory animals being sold to meat markets (here); an escape was also considered by the director of the coronavirus laboratory in Wuhan, Dr. Zheng-Li Shi (here).


Naturally, the first structural feature considered important for high infectivity is the sequence of the spike protein, which binds to the human receptor ACE2. An insertion of 12 nucleotides into the viral RNA was found to produce four extra amino acids at positions 681-684 of the spike protein, which may have improved the contagiousness of the virus; these four extra residues are unique to this human virus and are not found in any other species (here). Additional elements seem to be highly important for the virus, including the envelope protein E, which seems highly conserved: many coronaviruses have 100% homology in its amino acid sequence, but CoV-2 seems to be rather different (here).

Thus, CoV-2 does differ from other coronaviruses in various aspects. Other regions of the viral RNA have been considered as the source of the special properties of CoV-2, including the first 265 bases in the 5’-untranslated region (5’-UTR). The current text examines whether SARS-CoV-2 differs from other coronaviruses in the mutation rate (mutation extent, diversity) of its 5’-UTR.


The sequences of several coronaviruses are compared here using BLAST of NCBI (here), and the divergence among their 5’-UTR segments is compared with the divergence among their whole genomes, thereby examining whether the starting segment of 260 bases is more conserved than the whole genome in coronaviruses.
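As a rough illustration of what % identity means, the sketch below computes it for two sequences that are already aligned, with gaps written as '-'. The function name percent_identity and the toy sequences are hypothetical and are not viral data; BLAST works on its own local alignments, so its reported values may differ from this simplified count.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences carry the same base.

    Assumes the two strings are already aligned and of equal length,
    with '-' marking a gap. Gap columns count as differences.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("empty alignment")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Toy example (made-up sequences): 10 columns, 8 of them identical.
identity = percent_identity("ATTAAAGGTT", "ATTAGAGG-T")
divergence = 100.0 - identity
print(f"identity {identity:.1f}%, divergence {divergence:.1f}%")
```

Counting gap columns as differences is one possible convention; other tools may treat gaps differently.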

The following 11 frequently mentioned coronaviruses are considered (accession numbers and the publishing date are given):
1. SARS-CoV-2 (MT192773), Mar 2020,
2. Bat coronavirus (DQ648857), 2005,
3. Bat SARS-like coronavirus (GQ153547), 2010,
4. SARS-CoV-2 (MT764166.1), Jul 2020,
5. Bat CoV RaTG13 (MN996532.1), Jan 2020, so far closest to CoV-2,
6. Human coronavirus OC43 (NC005147.1), 2005, mild common cold symptoms,
7. SARS-CoV-1 SIN25000 (AY283794.1), 2003, 1st SARS epidemic,
8. Human CoV 229E (MF542265.1), 2016, mild common cold symptoms,
9. Bat SARS-like coronavirus (MG772934.1), 2018,
10. Bat SARS-like coronavirus SHC014 (KC881005), 2013, replicates in human but is not virulent,
11. Mouse SARS-like coronavirus SARS-CoV MA-15 (DQ49700.8), 2007, virulent in mouse and converted to human-virulent by incorporating the spike from bat SHC014, making chimera SHC014-MA15 in 2015.

Doublets formed from the above genomes, the x-th and the y-th, are compared below (x,y) by BLAST to obtain the homology (% identity) in their whole genomes, as well as in their first 260 bases. The genome sizes n1 and n2 of the compared viruses are given below (n1/n2), followed by the homology h1 (%) in the whole sequence and the homology h2 (%) in the 260-base segment of the 5’-UTR. The mutation extent may be characterized by the % divergence, i.e. the percentage of differing positions (100 - h), so that if the homology is 90%, the divergence is 10%. The ratio R of the divergence in the 260-base segment to the divergence in the whole RNA genome is calculated and given in curly brackets below:
R = (100 - h2)/(100 - h1).
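The ratio can be checked with a few lines of Python. This is only a minimal sketch of the formula above; the function name divergence_ratio is illustrative, and the two inputs are the h1 and h2 values reported for doublets (1,2) and (1,9).

```python
def divergence_ratio(h1: float, h2: float) -> float:
    """R = (100 - h2) / (100 - h1), with h1 and h2 given in percent."""
    if h1 >= 100.0:
        raise ValueError("whole-genome divergence is zero; R is undefined")
    return (100.0 - h2) / (100.0 - h1)


# Doublet (1,2): h1 = 81.12 %, h2 = 90.31 %.
print(round(divergence_ratio(81.12, 90.31), 2))  # 0.51
# Doublet (1,9): h1 = 87.22 %, h2 = 93.75 %.
print(round(divergence_ratio(87.22, 93.75), 2))  # 0.49
```

An R below 1 means the 5’-UTR segment has diverged less than the genome as a whole.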

In the first stage, CoV-2 is compared with several other coronaviruses, and in the second stage, various non-CoV-2 coronaviruses are compared to each other:

CoV-2 vs others
(x,y)       n1/n2                                   h1                          h2                           R
(1,2)       29890/29741                       81.12%                  90.31%                  {0.51}
(1,3)       29890/29704                       80.85%                  90.31%                  {0.60}
(1,4)       29890/29902                       99.92%                  98.85%
(1,5)       29890/29855                       96.11%                  96.75%                  {0.84}
(1,6)       29890/30738                       65.28%                  n.d.
(1,7)       29890/29711                       80.26%                  90.16%                  {0.50}
(4,7)       29902/29711                       80.24%                  89.39%                  {0.54}
(1,8)       29890/27271                       64.19%                  n.d.
(1,9)       29890/29732                       87.22%                  93.75%                  {0.49}
(1,10)     29890/29787                       80.56%                  89.49%                  {0.54}
(1,11)     29890/29726                       80.24%                  89.88%                  {0.52}

Non-CoV-2 vs each other
(x,y)       n1/n2                                   h1                          h2                           R
(2,3)       29741/29704                       90.32%                  96.11%                  {0.40}
(2,5)       29741/29855                       80.90%                  89.87%                  {0.53}
(2,7)       29741/29711                       89.59%                  95.51%                  {0.43}
(2,9)       29741/29732                       82.03%                  88.49%                  {0.64}
(3,5)       29704/29855                       80.83%                  89.54%                  {0.55}
(3,7)       29704/29711                       89.25%                  97.17%                  {0.26}
(3,9)       29704/29732                       83.21%                  88.03%                  {0.71}
(5,6)       29855/30738                       65.50%                  n.d.
(5,7)       29855/29711                       80.12%                  89.96%                  {0.51}
(5,9)       29855/29732                       87.13%                  95.12%                  {0.38}
(5,10)     29855/29787                       80.45%                  88.94%                  {0.57}
(5,11)     29855/29726                       80.10%                  89.36%                  {0.53}
(6,7)       30738/29711                       66.35%                  n.d.
(6,8)       30738/27271                       65.89%                  n.d.
(7,9)       29711/29732                       81.26%                  87.76%                  {0.65}

It can be seen that the common cold viruses 6 and 8 differ most from all other viruses and from each other, corresponding to their great evolutionary distance; BLAST could not determine the % homology for their short 260-base segments (n.d.). The closest to each other, of course, are the two CoV-2 strains, viruses 1 and 4, even though they are not identical.

The mutual homologies (% identities) h1 among the genomes of different coronavirus species are in the range of 65% to 96% (= 4% to 35% divergence). The two closest species in the group (apart from the two CoV-2 strains 1 and 4, which have 99.9% homology) are human SARS-CoV-2 and bat CoV RaTG13 (viruses 1 and 5), which share 96.1% of their sequence. Such a difference in coronaviruses may correspond to up to about 100 years of normal separate evolution (here), but quicker events can be considered, including recombination, an accelerated mutation rate, or artificial intervention.

The homologies h2 among the 5’-UTR segments are always higher than the corresponding h1 values (except for the doublet of two CoV-2 strains, and for doublets with the remote viruses 6 and 8, where h2 could not be determined). In short, the mutation extent of the coronavirus 5’-UTR is lower than that of the whole genome, which confirms the importance of the starting segment.


Sequence divergence % of 5’-UTR
When the mutation extents in the 5’-UTR and in the whole genome are compared as R = (100 - h2)/(100 - h1), ratios of about 0.5 are obtained, showing that RNA mutations occur in the initial segment about half as fast as in the whole genome. Specifically, R values of doublets comprising CoV-2 are in the range of 0.49 to 0.84, the mean value being 0.57; R values of doublets comprising only non-CoV-2 viruses are in the range of 0.26 to 0.71, the mean value being 0.52. Thus, CoV-2 exhibits a slightly higher mutation extent in the 5’-UTR than the other coronaviruses, but the difference is small. The difference of 0.05 (DR = 0.57 - 0.52) between the group comprising CoV-2 and the group comprising only other viruses is too small in relation to the whole observed R range of 0.26 to 0.84; moreover, the ranges of the two groups, 0.26-0.71 and 0.49-0.84, broadly overlap.

Importantly, R for doublet (1,7) is 0.50 and R for doublet (4,7) is 0.54, so the difference DR between two doublets that both comprise CoV-2 (two different strains of CoV-2, each compared with SARS-CoV-1) is 0.04. Consequently, the difference DR of 0.05 between the group comprising CoV-2 and the group without it is not significant.
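The group comparison above can be retraced with simple arithmetic. The sketch below takes the means and ranges exactly as stated in the text rather than recomputing them from the tables; the variable names are illustrative only.

```python
# Values as stated in the text above (not recomputed from the tables).
mean_cov2, mean_other = 0.57, 0.52
range_cov2 = (0.49, 0.84)    # R values of doublets comprising CoV-2
range_other = (0.26, 0.71)   # R values of doublets without CoV-2

# Difference between the group means (DR in the text).
delta_r = mean_cov2 - mean_other

# Spread of all observed R values across both groups.
overall_low = min(range_cov2[0], range_other[0])
overall_high = max(range_cov2[1], range_other[1])
spread = overall_high - overall_low

# Interval shared by both groups (positive width means they overlap).
overlap_low = max(range_cov2[0], range_other[0])
overlap_high = min(range_cov2[1], range_other[1])
overlap_width = overlap_high - overlap_low

print(f"difference of means: {delta_r:.2f}")
print(f"overall spread: {spread:.2f}")
print(f"difference / spread: {delta_r / spread:.2f}")
print(f"shared interval: {overlap_low:.2f} to {overlap_high:.2f}")
```

This is a descriptive comparison only, not a formal significance test.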

Incidentally, the two strains of CoV-2 (1,4) differ in their 5’-UTR segments more than in their whole genomes (h2 = 98.85% versus h1 = 99.92%), which may or may not result from slight sequencing errors.

Thus, SARS-CoV-2 does not differ from other coronaviruses in the mutation rate of its 5’-UTR, when measured by the sequence divergence of the 5’-UTR relative to the whole genome.

Nucleotide replacements in 5’-UTR
The numbers of base changes (NBC) in the first 260 bases were compared as follows. A doublet from the CoV-2 group and a doublet from the non-CoV-2 group were chosen so that both have nearly the same overall genome homology h1; the NBC was then calculated for each of the doublets from its h2 value. For example, the CoV-2 doublet (1,9) has h1 of 87.22%, and the non-CoV-2 doublet (5,9) has nearly the same h1 of 87.13%; the NBC values are calculated as NBC = (100-h2)*260/100, namely:
NBC(1,9) = (100-93.75)*260/100 = 16 for CoV-2 doublet, and
NBC(5,9) = (100-95.12)*260/100 = 13 for non-CoV-2 doublet.
This means that SARS-CoV-2 (MT192773) differs from bat SARS-like coronavirus MG772934.1 (virus 9) in 16 bases of 260 in the 5’-UTR, whereas two other coronaviruses, 5 and 9 (which also have about 87% genome homology), differ from each other in 13 bases of 260 in the 5’-UTR.
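The same calculation can be written as a small helper. This is a minimal sketch; the function name base_changes is illustrative, and rounding to the nearest whole base is an assumption that reproduces the values quoted in this section.

```python
def base_changes(h2: float, segment_length: int = 260) -> int:
    """Number of differing bases implied by % identity h2 over the segment."""
    return round((100.0 - h2) * segment_length / 100.0)


print(base_changes(93.75))  # doublet (1,9): 16
print(base_changes(95.12))  # doublet (5,9): 13
```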

Four such matched pairs of doublets could be formed among the considered cases, providing four comparisons as follows:

(1,2), h1 = 81.12%, versus (7,9), h1 = 81.26%: NBC = 25 bases for CoV-2 versus 32 for non-CoV-2
(1,3), h1 = 80.85%, versus (3,5), h1 = 80.83%: NBC = 25 bases for CoV-2 versus 27 for non-CoV-2
(1,7), h1 = 80.26%, versus (5,7), h1 = 80.12%: NBC = 26 bases for CoV-2 versus 26 for non-CoV-2
(1,9), h1 = 87.22%, versus (5,9), h1 = 87.13%: NBC = 16 bases for CoV-2 versus 13 for non-CoV-2

When the CoV-2 doublets are compared with the non-CoV-2 doublets, the divergence of the 5’-UTR was higher in CoV-2 in one case (16 versus 13 bases), the same in one case (26 versus 26), and lower in CoV-2 in two cases (25 versus 27, and 25 versus 32). All these differences are in accordance with random changes and do not imply an unexpectedly increased number of mutations in the 5’-UTR of CoV-2 (for example, when the probabilities of the base differences are evaluated using the Poisson distribution, or otherwise).
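One possible way to carry out such a Poisson evaluation is sketched below. It is an assumption of this sketch, not something stated in the text, that the non-CoV-2 count of each comparison serves as the expected value and that a one-sided tail probability of the CoV-2 count is computed; the function names are illustrative.

```python
import math


def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i    # P(X = i) obtained from P(X = i - 1)
        total += term
    return total


def tail_probability(observed: int, expected: float) -> float:
    """One-sided probability of a count at least as extreme as `observed`,
    taken on the side of `expected` where `observed` lies."""
    if observed >= expected:
        return 1.0 - poisson_cdf(observed - 1, expected)  # P(X >= observed)
    return poisson_cdf(observed, expected)                # P(X <= observed)


# (CoV-2 NBC, non-CoV-2 NBC) for the four comparisons listed above.
comparisons = [(25, 32), (25, 27), (26, 26), (16, 13)]
for cov2, other in comparisons:
    p = tail_probability(cov2, other)
    print(f"{cov2} vs {other}: one-sided p = {p:.2f}")
```

Under these assumptions, none of the four tail probabilities falls below the conventional 0.05 threshold.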

So, SARS-CoV-2 does not differ from other coronaviruses in the mutation rate of its 5’-UTR, when measured by the number of base changes.

Insertions – deletions
The alignment of the 260-base 5’-segments of CoV-2 virus 1 and non-CoV-2 virus 2 shows one 2-base deletion and one 1-base insertion, besides 25 single-base replacements. The alignment of two non-CoV-2 viruses, 7 and 9, shows one 2-base deletion, one 2-base insertion, and one 1-base insertion, besides about 32 single-base replacements.
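Events of this kind can be counted from a pairwise alignment as sketched below. The function name count_events and the toy sequences are hypothetical, and the sketch assumes equal-length aligned strings with '-' as the gap symbol. Whether a gap run is called an insertion or a deletion depends on which sequence is taken as the reference.

```python
def count_events(aligned_a: str, aligned_b: str) -> dict:
    """Count single-base replacements and gap runs in a pairwise alignment.

    A run of consecutive gaps in one sequence counts as one indel event.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    replacements = 0
    gap_runs_first = 0   # runs of '-' in the first sequence
    gap_runs_second = 0  # runs of '-' in the second sequence
    prev_a = prev_b = ""
    for a, b in zip(aligned_a, aligned_b):
        if a == "-":
            if prev_a != "-":
                gap_runs_first += 1
        elif b == "-":
            if prev_b != "-":
                gap_runs_second += 1
        elif a != b:
            replacements += 1
        prev_a, prev_b = a, b
    return {
        "replacements": replacements,
        "gap_runs_first": gap_runs_first,
        "gap_runs_second": gap_runs_second,
    }


# Toy alignment (made-up sequences): one replacement, one 2-base gap run
# in the first sequence, and one 1-base gap run in the second.
events = count_events("ATTA--GGTTC", "ATCAAAGG-TC")
print(events)
```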

SARS-CoV-2 does not seem to differ from other coronaviruses in the mutation rate of its 5’-UTR, when assessed by the deletion-insertion events in 5’-UTR.


The origin of SARS-CoV-2 has not been explained so far, nor has the origin of SARS-CoV-1. Although an escape of CoV-2 from one of the Wuhan labs seems hardly refutable, the origin of its genome remains unclear. The genome may or may not have been artificially edited; many publications address the mysterious origin of the virus (for example, here, here, here, here, here), and while they do not support an artificial intervention in its structure, their findings do not disprove such an intervention, and still less a possible lab escape.

Whatever the origin of the CoV-2 genome sequence, the comparison of the mutation rate in its 5’-untranslated region with other coronaviruses does not indicate any unexpected difference. Mutation is about half as fast in the 5’-UTR as in the whole genome for all checked coronaviruses, and the results do not indicate that SARS-CoV-2 is less conserved in its 5’-UTR than other coronaviruses.
