Using in-silica Analysis and Reverse Vaccinology Approach for COVID-19 Vaccine Development

Background: The recent pandemic of COVID19 that has struck the world is yet to be battled by a potential cure. Countless lives have been claimed due to the existing pandemic and the societal normalcy has been damaged permanently. As a result, it becomes crucial for academic researchers in the field of bioinformatics to combat the existing pandemic. Materials and Methods: The study involved collecting the virulent strain sequence of SARS-nCoV19 for the country USA against human host through publically available bioinformatics databases. Using in-silica analysis, reverse vaccinology, and 3-D modelling, two leader proteins were identified to be potential vaccine candidates for development of a multi-epitope drug. Results: It was revealed that the two leader proteins ORF1ab MT326102 and MT326715 had the highest extinction coefficient and the lowest score on the GRAVY. Along with the given parameters, these leader proteins were highly stable and were also antigenic in nature. The two selected epitopes were then docked against their respective alleles to obtain the global energy scores, which was the lowest of all possible pairs. Conclusion: The epitopes which displayed the lowest global energy score on docking with the alleles were selected and proposed as successful and potential vaccine candidates for multi-epitope vaccine development.


Introduction
A new infection respiratory disease was first observed in the month of December 2019, in Wuhan, situated in the Hubei province, China. Studies have indicated that the reason of this disease was the emergence of a genetically-novel coronavirus closely related to SARS-CoV. This coronavirus, now named as nCoV-19, is the reason behind the spread of this fatal respiratory disease, now named as COVID-19. The initial group of infections is supposedly linked with the Huanan seafood market, most likely due to animal contact. Eventually, human-to-human interaction occurred and resulted in the transmission of the virus to humans.
Since then, nCoV-19 has been rapidly spreading within China and other parts of World. At the time of writing this article (mid-March 2020), COVID-19 has spread across 146 countries. A count of 164,837 cases has been confirmed of being diagnosed with COVID-19, and a total of 6470 deaths have occurred. The cumulative cases have been depicting a rising trend and the numbers are just increasing. WHO has declared COVID-19 to be a "global health emergency" [1]. 97 article.
Though, nCoV-19 and SARS-CoV are almost genetically similar, the respiratory syndrome caused by both of them, COVID-19 and SARS respectively, are completely different. Studies have indicated that -"SARS was more deadly but much less infectious than COVID-19."-World Health Organization.
The spread of SARS epidemic has provided with any useful insights as during that time the virus epidemic was contained only through general prevention means and treatment of the individual symptoms. As a result, there exists only a limited number of tools available to test the coronavirus for their ability to infect humans. This has acted as a major limitation for predicting the next zoonotic viral outbreak [2].
In response to the current medical crisis, the World Health Organization has activated the R&D Blueprint for the acceleration of the development of diagnostics, therapeutics and vaccines for treatment of this novel coronavirus [1][2][3]. The objective of this study lies in compliance to the guidelines for research activated by WHO. This study aims to utilize a reverse-vaccinology approach in order to identify potential vaccine candidates for COVID19 for the country USA. We shall be deploying open-access bioinformatics tools for our analysis of the same.

Strain Selection
The virulent strain of SARS-CoV-2 was selected for in-silico analysis. The genome of the viral strain is made available by NCBI -National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov.in). The reference identification for SARS-CoV-2 is given by RefSeq NC_045512.2.

Protein Identification and Retrieval
14534 viral protein sequences of the SARS-CoV-2 were obtained from the ViPR -Virus Pathogen Database and Analysis [4]. These protein sequences were identified and downloaded in a tabular format for the Host -Humans and the Country -USA. Table 1 depicts randomly selected four protein sequences obtained out of the 14534 collected under the ViPR databases along with their respective properties.

Physiochemical Property Analysis
The FASTA-file for all the 14535 protein sequences was downloaded and loaded in R using "SeqinR" and "Biostrings" packages. The different physicochemical properties were analysed using the "Peptides" package in the R. [5][6][7].

Protein Antigenicity
VaxiJen 2.0 is used to predict the antigenicity of the protein based on the FASTA-files that contain their respective amino acid sequences. The online software predicts the antigenic score for the same [8][9].

B-Cell and T-Cell Epitope Prediction
The T-cell and B-cell epitopes of the selected two leader protein sequences were predicted using the IEDB (The Immune Epitope Database). IEDB is a freely available resource funded by NIAID. It is a catalog that contains the experimental data on antibody and T cell epitopes studied in humans, non-human primates, and other animal species in the context of the infectious disease, allergy, autoimmunity, and transplantation. It also assists in hosting tools for predicting and analysing the epitopes [10].
For MHC Class-I T-cell epitope prediction for ORF1ab leader proteins, NetMHCpan EL 4.0 method was used. This method was applied for the HLA-A*11-01 allele. For MHC Class-II T-cell epitope prediction for ORF1ab leader proteins, Sturniolo prediction method was used. This method was applied for the HLA DRB1*04-01 allele. For B-cell lymphocyte epitope prediction, Bepipered Prediction Method was used [11]. Ten epitopes were selected on random for analysis of their antigenicity, allergenicity, topology and conservancy.

Prediction of Conservancy of the Epitopes selected
The conservancy of the selected epitopes was analysed using the conservancy analysis tool of the IEDB server. For this analysis, the parameter for the sequence identity threshold was adjusted to '>=50' [10].

3D Structure Generation for Epitopes selected
In order to generate the 3D structure of the epitopes choses, the PEP-FOLD3 tool was used. PEP-FOLD (http://bioserv.rpbs.univ-paris-diderot.r/services/PEP-FOLD3/) uses a de novo approach to predict the peptide sequences from the amino acid sequences [12][13][14]. It utilizes the structural alphabet SA letters to describe the conformations of four consecutive residues.

Molecular Docking of the Epitopes selected
The pre-docking of the selected epitopes was done using the UCSF Chimera. It was also used to perform predocking of the selected alleles HLA-A*11-01 (for MHC Class-I) and HLA DRB1*04-01. Later, the docking of the peptide-protein was done using HPEPDOCK. HPEPDOCK is a web server of performing blind peptide-protein utilizing hierarchical algorithm [15]. Instead of performing length stimulations to refine peptide conformations, HPEPDOCK studies the peptide flexibility through an ensemble of peptide conformations produced by the MODPEP program.

Selection and Retrieval of Potential Vaccine Candidate information
The SARS-CoV-2 strain was identified. All the 14534 protein sequences were analysed on the basis of the physicochemical properties to select the top five candidates for further analysis. For the physicochemical analysis, the number of amino acids, instability index, aliphatic index, and the grand average of hydrophobicity (GRAVY) scores of the all the 14,534 proteins were taken using the Peptides package in R. This package allows the identification, selection and analysis of multiple amino acid sequences in the same FASTA-file. Hence, it was utilized.
The physicochemical study unveiled five potential candidates, as shown in Table 2, having the lowest score of GRAVY and instability index less than 40 (hence, displaying stability). These five candidates were individually analysed for their molar extinction coefficients and antigenicity.

Antigenicity and Extinction Coefficients of the Potential Candidates
The VaxiJen 2.0 tool was used to analyse the antigenicity of the potential five candidates and the molar extinction coefficient was analysed using the ExPASy's online tool -ProtParam. Out of the five potential candidates (depicted in Table 3), only two were selected for further analysis. This was based on the criteria of having the highest score of predicted antigenicity and highest values of the extinction coefficients. These two were: ORF1ab leader protein MT326102 and the ORF1ab leader protein MT326175. They both had the same molar extinction coefficient of 12950 but, different yet high scores of predicted antigenicity -0.4624 and 0.5166 respectively. Both the leader proteins were then used for further analysis.

T-cell and B-cell Epitope Prediction
The T-cell epitopes of the MHC Class-I for both the leader protein were determined using the NetMHCpan EL 4.0 prediction method of the IEDB server keeping the sequence length at 9. This tool allows the generation of the epitopes and sorts them on the basis of their percentile scores. Randomly, ten potential epitopes, depicted in Tables 4 and 6, were selected randomly for the antigenicity, allergenicity, toxicity, and conservancy tests.
For MHC class-II, T-cell epitopes (HLA DRB1*04-01 allele) of the proteins were also determined using the IEDB Analysis tools. The Sturniolo prediction methods were used for the same. Again, ten potential candidates were chosen based on the same criteria as that of MHC Class-I. These potential candidates, along with their physiochemical properties have been depicted in Tables 5 and 7. The B-cell epitopes of the proteins were selected using the Bepipered Linear Epitope Prediction method of the IEDB server.

Topology Identification of the Epitope
The topology of the chosen epitopes was determined using the TMHMM v2.0 server (http://www.cbs.dtu.dk/services/TMHMM/). The potential T-cell epitopes, whose topology, antigenicity, allergenicity, toxicity and conservancy was analysed, for ORF1ab leader proteins MT326102 and MT326175 are depicted in the following tables.

Antigenicity, Allergenicity, Toxicity, and Conservancy Analysis of the Epitopes
On the analysis of the antigenicity, allergenicity, toxicity, and conservancy analysis of the T-cell epitopes, it was found that most of them were antigenic, simultaneously being non-allergenic, non-toxic and higher values of conservancy. From the ten selected MHC Class-I and MHC Class-II T-cell epitopes, one from each category were selected from both the leader proteins. The criteria for being selected were having the higher antigenic scores, nonallergenicity, non-toxicity, and conservancy value above 90%. The selected epitopes were: RSDARTAPH, VQLNNNNNN, LSLPVLQVR, and VQLSLPVLQ. The B-cell epitopes of the ORF1ab leader proteins are displayed in the Figures 1 and 2.

Peptide-Protein Docking using HPEPDOCK Server
HPEPDOCK Server was used to perform docking of the peptide and protein. The purpose of the same was to analyse which of the two selected T-cell MHC Class-I epitope: RSDARTAPH and LSLPVLQVR had the lowest global energy. The epitope having the lowest global energy acts as a better vaccine candidate. The docking was performed against the HLA-A*11-01 allele whose pdb format was obtained through pre-docking using UCSF Chimera.
The global energy for the selected Class-I epitopes were: -182.706 and -191.198 respectively for RSDARTAPH and LSLPVLQVR. Out of the two MHC Class-I epitopes selected for leader proteins ORF1ab MT326102 and MT326175, the global energy was lowest for LSLPVLQVR (as modelled in Figure 4). For the MHC Class-II T-cell epitope, the docking was performed against the HLA DRB1*04-01 allele. Same procedure was followed as mentioned before. The two selected epitopes after the previous analysis were: VQLNNNNNN and VQLSLPVLQ. The global energy for the selected epitopes were -203.369 and -238.196 respectively.
Out of the two T-cell MHC Class-II epitopes selected for the ORF1ab leader proteins MT326102 and MT326175, the lowest global energy was of VQLSLPVLQ (as modelled in Figure 5).

Discussion and Conclusion
The scope of this study involved performing an in-silica analysis of the SARS-CoV-2 viral strain for the country of USA against human host. The ViPR database was used for the same to obtain all the protein sequences. A total of 14534 viral protein sequences were obtained whose extensive physicochemical analysis was done to select a group of two. This extensive analysis was performed using Peptides package in the R Software. It was revealed that the two leader proteins ORF1ab MT326102 and MT326715 had the highest extinction coefficient and the lowest score on the GRAVY. Along with the given parameters, these leader proteins were highly stable and were also antigenic in nature.
The FASTA-formatted files of these selected proteins were taken and analysed to obtain the potential T-cell and Bcell epitopes. The T-cell epitopes of MHC Class-1 and MHC Class-II were analysed on the basis of their scores. Ten randomly selected T-cell epitopes from both the classes were taken for further analysis of allergenicity, toxicity, conservancy scores and antigenicity. Only one epitope from both the classes was selected which possessed a higher conservancy score (more than 90%), was non-toxic, non-allergic and antigenic in nature. The two selected epitopes were then docked against their respective alleles to obtain the global energy scores. The epitopes which displayed the lowest global energy score on docking with the alleles were selected and proposed as successful and potential vaccine candidates.