in-silica Analysis of SARS-CoV-2 viral strain using Reverse Vaccinology Approach: A Case Study for USA

The recent pandemic of COVID19 that has struck the world is yet to be battled by a potential cure. Countless lives have been claimed due to the existing pandemic and the societal normalcy has been damaged permanently. As a result, it becomes crucial for academic researchers in the field of bioinformatics to combat the existing pandemic. The study involved collecting the virulent strain sequence of SARS-nCoV19 for the country USA against human host through publically available bioinformatics databases. Using in-silica analysis and reverse vaccinology, two leader proteins were identified to be potential vaccine candidates for development of a multi-epitope drug. The results of this study can provide further researchers better aspects and direction on developing vaccine and immune responses against COVID19. This work also aims at promoting the use of existing bioinformatics tools to faster streamline the pipeline of vaccine development. The Situation of COVID19 A new infection respiratory disease was first observed in the month of December 2019, in Wuhan, situated in the Hubei province, China. Studies have indicated that the reason of this disease was the emergence of a genetically-novel coronavirus closely related to SARS-CoV. This coronavirus, now named as nCoV-19, is the reason behind the spread of this fatal respiratory disease, now named as COVID-19. The initial group of infections is supposedly linked with the Huanan seafood market, most likely due to animal contact. Eventually, human-to-human interaction occurred and resulted in the transmission of the virus to humans. [13]. Since then, nCoV-19 has been rapidly spreading within China and other parts of World. At the time of writing this article (mid-March 2020), COVID-19 has spread across 146 countries. A count of 164,837 cases have been confirmed of being diagnosed with COVID-19, and a total of 6470 deaths have occurred. The cumulative cases have been depicting a rising trend and the numbers are just increasing. WHO has declared COVID-19 to be a “global health emergency”. [14]. Current Scenario and Objectives Currently, research is being conducted on a massive level to understand the immunology and genetic characteristics of the disease. However, no cure or vaccine of nCoV-19 has been developed at the time of writing this article. Though, nCoV-19 and SARS-CoV are almost genetically similar, the respiratory syndrome caused by both of them, COVID-19 and SARS respectively, are completely different. Studies have indicated that – “SARS was more deadly but much less infectious than COVID-19”. -World Health Organization


Current Scenario and Objectives
Currently, research is being conducted on a massive level to understand the immunology and genetic characteristics of the disease. However, no cure or vaccine of nCoV-19 has been developed at the time of writing this article.
Though, nCoV-19 and SARS-CoV are almost genetically similar, the respiratory syndrome caused by both of them, COVID-19 and SARS respectively, are completely different. Studies have indicated that -"SARS was more deadly but much less infectious than COVID-19."

-World Health Organization
The spread of SARS epidemic has provided with any useful insights as during that time the virus epidemic was contained only through general prevention means and treatment of the individual symptoms. As a result, there exists only a limited number of tools available to test the coronavirus for their ability to infect humans. This has acted as a major limitation for predicting the next zoonotic viral outbreak. [13][14][15].
In response to the current medical crisis, the World Health Organization has activated the R&D Blueprint for the acceleration of the development of diagnostics, therapeutics and vaccines for treatment of this novel coronavirus.
The objective of this study lies in compliance to the guidelines for research activated by WHO. This study aims to utilize a reverse-vaccinology approach in order to identify potential vaccine candidates for COVID19 for the country USA. We shall be deploying open-access bioinformatics tools for our analysis of the same.

Strain Selection
The virulent strain of SARS-CoV-2 was selected for in-silico analysis. The genome of the viral strain is made available by NCBI -National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov.in).
The reference identification for SARS-CoV-2 is given by RefSeq NC_045512.2

Protein Identification and Retrieval
14534 viral protein sequences of the SARS-CoV-2 were obtained from the ViPR -Virus Pathogen Database and Analysis. [1]. These protein sequences were identified and downloaded in a tabular format for the Host -Humans and the Country -USA.

Physiochemical Property Analysis
The FASTA-file for all the 14535 protein sequences was downloaded and loaded in R using "SeqinR" and "Biostrings" packages. The different physicochemical properties were analysed using the "Peptides" package in the R. [2][3][4].

Protein Antigenicity
VaxiJen 2.0 is used to predict the antigenicity of the protein based on the FASTA-files that contain their respective amino acid sequences. The online software predicts the antigenic score for the same. [5][6].

B-Cell and T-Cell Epitope Prediction
The T-cell and B-cell epitopes of the selected two leader protein sequences were predicted using the IEDB (The Immune Epitope Database). IEDB is a freely available resource funded by NIAID. It is a catalog that contains the experimental data on antibody and T cell epitopes studied in humans, non-human primates, and other animal species in the context of the infectious disease, allergy, autoimmunity, and transplantation. It also assists in hosting tools for predicting and analysing the epitopes. [7].
For MHC Class-I T-cell epitope prediction for ORF1ab leader proteins, NetMHCpan EL 4.0 method was used. This method was applied for the HLA-A*11-01 allele. For MHC Class-II T-cell epitope prediction for ORF1ab leader proteins, Sturniolo prediction method was used. This method was applied for the HLA DRB1*04-01 allele. For B-cell lymphocyte epitope prediction, Bepipered Prediction Method was used. [8].
Ten epitopes were selected on random for analysis of their antigenicity, allergenicity, topology and conservancy.

Prediction of Conservancy of the Epitopes selected
The conservancy of the selected epitopes was analysed using the conservancy analysis tool of the IEDB server. For this analysis, the parameter for the sequence identity threshold was adjusted to '>=50'. [7].

3D Structure Generation for Epitopes selected
In order to generate the 3D structure of the epitopes choses, the PEP-FOLD3 tool was used. PEP-FOLD (http://bioserv.rpbs.univ-paris-diderot.r/services/PEP-FOLD3/) uses a de novo approach to predict the peptide sequences from the amino acid sequences. [9][10][11]. It utilizes the structural alphabet SA letters to describe the conformations of four consecutive residues.

Molecular Docking of the Epitopes selected
The pre-docking of the selected epitopes was done using the UCSF Chimera. It was also used to perform pre-docking of the selected alleles HLA-A*11-01 (for MHC Class-I) and HLA DRB1*04-01.
Later, the docking of the peptide-protein was done using HPEPDOCK. HPEPDOCK is a web server of performing blind peptide-protein utilizing hierarchical algorithm. [12]. Instead of performing length stimulations to refine peptide conformations, HPEPDOCK studies the peptide flexibility through an ensemble of peptide conformations produced by the MODPEP program.

Selection and Retrieval of Potential Vaccine Candidate information
The SARS-CoV-2 strain was identified. All the 14534 protein sequences were analysed on the basis of the physicochemical properties to select the top five candidates for further analysis.
For the physicochemical analysis, the number of amino acids, instability index, aliphatic index, and the grand average of hydrophobicity (GRAVY) scores of the all the 14,534 proteins were taken using the Peptides package in R. This packages allows the identification, selection and analysis of multiple amino acid sequences in the same FASTA-file. Hence, it was utilized.  The physicochemical study unveiled five potential candidates having the lowest score of GRAVY and instability index less than 40 (hence, displaying stability). These five candidates were individually analysed for their molar extinction coefficients and antigenicity.

Antigenicity and Extinction Coefficients of the Potential Candidates
The VaxiJen 2.0 tool was used to analyse the antigenicity of the potential five candidates and the molar extinction coefficient was analysed using the ExPASy's online tool -ProtParam.
Out of the five potential candidates, only two were selected for further analysis. This was based on the criteria of having the highest score of predicted antigenicity and highest values of the extinction coefficients.  Both the leader proteins were then used for further analysis.

T-cell and B-cell Epitope Prediction
The T-cell epitopes of the MHC Class-I for both the leader protein were determined using the NetMHCpan EL 4.0 prediction method of the IEDB server keeping the sequence length at 9. This tool allows the generation of the epitopes and sorts them on the basis of their percentile scores. Randomly, ten potential epitopes were selected randomly for the antigenicity, allergenicity, toxicity, and conservancy tests.
For MHC class-II, T-cell epitopes (HLA DRB1*04-01 allele) of the proteins were also determined using the IEDB Analysis tools. The Sturniolo prediction methods were used for the same. Again, ten potential candidates were chosen based on the same criteria as that of MHC Class-I.
The B-cell epitopes of the proteins were selected using the Bepipered Linear Epitope Prediction method of the IEDB server.

Topology Identification of the Epitope
The topology of the chosen epitopes was determined using the TMHMM v2.0 server (http://www.cbs.dtu.dk/services/TMHMM/). The potential T-cell epitopes, whose topology, antigenicity, allergenicity, toxicity and conservancy was analysed, for ORF1ab leader proteins MT326102 and MT326175 are depicted in the following tables.  Table 4

. List of MHC Class-I epitopes for selected leader protein ORF1ab MT326102 along with their respective antigenicity score, allergenicity, topology, conservancy and toxicity measure. The epitope in bold italics represented the chosen epitope for 3D
Generation.

Antigenicity, Allergenicity, Toxicity, and Conservancy Analysis of the Epitopes
On the analysis of the antigenicity, allergenicity, toxicity, and conservancy analysis of the Tcell epitopes, it was found that most of them were antigenic, simultaneously being nonallergenic, non-toxic and higher values of conservancy. From the ten selected MHC Class-I and MHC Class-II T-cell epitopes, one from each category were selected from both the leader proteins. The criteria for being selected was having the higher antigenic scores, nonallergenicity, non-toxicity, and conservancy value above 90%.
The B-cell epitopes of the ORF1ab leader proteins are displayed in the following two figures.

Generation of the 3D Structures of Epitopes
The following figures depict the PEP-FOLD3 generated 3D structures of the selected T-cell epitopes of MHC Class-I and Class-II: RSDARTAPH, VQLNNNNNN, LSLPVLQVR, and VQLSLPVLQ.

Figure 3. (a-d) 3D structure generation of the selected epitopes. (a) Selected T-cell MHC Class-I epitope for ORF1ab leader protein MT326102. (b) Selected T-cell MHC Class-II epitope for ORF1ab leader protein MT326102. (c) Selected T-cell MHC Class-I epitope for ORF1ab leader protein MT326175. (d) Selected T-cell MHC Class-II epitope for ORF1ab
leader protein MT326175.