Variation in the repulsive guidance molecule family in human populations

Abstract
Repulsive guidance molecules, RGMA, RGMB, and RGMC, are related proteins discovered independently through different experimental paradigms. They are encoded by single copy genes in mammalian and other vertebrate genomes, and are ~50% identical in amino acid sequence. The importance of RGM actions in human physiology has not been realized, as most research has focused on non-human models, although mutations in RGMC are the cause of the severe iron storage disorder, juvenile hemochromatosis. Here I show that repositories of human genomic and population genetic data can be used as starting points for discovery and for developing new testable hypotheses about each of these paralogs in human biology and disease susceptibility. Information was extracted, aggregated, and analyzed from the Ensembl and UCSC Genome Browsers, the Exome Aggregation Consortium, the Genotype-Tissue Expression project portal, the cBio portal for Cancer Genomics, and the National Cancer Institute Genomic Data Commons data site. Results identify extensive variation in gene expression patterns, substantial alternative RNA splicing, and possible missense alterations and other modifications in the coding regions of each of the three genes, with many putative mutations being detected in individuals with different types of cancers. Moreover, selected amino acid substitutions are highly prevalent in the world population, with minor allele frequencies of up to 37% for RGMA and up to 8% for RGMB. These results indicate that protein sequence variation is common in the human RGM family, and raise the possibility that individual variants will have a significant population impact on human physiology and/or disease predisposition.


Introduction
The repulsive guidance molecule (RGM) family consists of three members, RGMA, RGMB, and RGMC (also known as HFE2 and HJV) (Monnier et al. 2002; Kuninger et al. 2004; Niederkofler et al. 2004; Papanikolaou et al. 2004; Samad et al. 2004; Schmidtmer and Engelkamp 2004), that are encoded by single-copy genes in human and other vertebrate genomes (Severyn et al. 2009). The family received its name from a then-novel axonal guidance molecule termed RGM that was characterized in 2002 (Monnier et al. 2002). Subsequent studies identified two related proteins in mammals, termed RGMB and RGMC (Papanikolaou et al. 2004; Samad et al. 2004; Schmidtmer and Engelkamp 2004), and a fourth member in teleosts, called RGMD (Corradini et al. 2009; Siebold et al. 2017). The original RGM is now named RGMA (Corradini et al. 2009; Severyn et al. 2009; Siebold et al. 2017). RGMA and RGMB have been shown to be expressed in the central nervous system during development (Schmidtmer and Engelkamp 2004), and their discoveries indicated that they were involved in controlling axonal patterning and neuronal survival (Monnier et al. 2002; Matsunaga et al. 2004; Niederkofler et al. 2004; Rajagopalan et al. 2004; Samad et al. 2004). In contrast, RGMC was initially characterized through its gene, which was found within a locus that was linked to a severe form of an iron storage disease that primarily affects children, termed juvenile hemochromatosis (Papanikolaou et al. 2004). The gene was termed HFE2 after HFE (high iron [chemical symbol Fe]), the initial gene whose mutations were found in hemochromatosis (Papanikolaou et al. 2004). The encoded protein, RGMC, is also called hemojuvelin (HJV), because of its relationship with juvenile hemochromatosis (Papanikolaou et al. 2004). Unlike RGMA and RGMB, RGMC/HFE2/HJV is produced in the liver and in cardiac and skeletal muscle, and not within the nervous system (Kuninger et al. 2004; Papanikolaou et al. 2004; Schmidtmer and Engelkamp 2004).
RGMA, RGMB, and RGMC are glycosylphosphatidylinositol (GPI)-linked cell membrane-associated glycoproteins (Corradini et al. 2009; Severyn et al. 2009; Siebold et al. 2017), and the paralogs share ~50% amino acid identity and several structural motifs, including 14 cysteine residues in comparable locations within the three proteins (Corradini et al. 2009; Severyn et al. 2009; Siebold et al. 2017). All three RGMs also appear to undergo a series of similar biosynthetic and processing steps leading to both cell-associated and soluble protein species (Babitt et al. 2005; Samad et al. 2005; Kuninger et al. 2006). All three proteins also interact with members of the bone morphogenetic protein (BMP) family, where they function as co-receptors (Core et al. 2014). BMPs are members of the transforming growth factor-β (TGF-β) super-family, and play key roles in different developmental and cell fate decisions (Hata and Chen 2016; Morikawa et al. 2016; Siebold et al. 2017). BMPs bind as dimers to specific type I and type II serine/threonine kinase receptors, and initiate a protein kinase cascade which culminates in the activation by serine phosphorylation of Smads 1, 5, and 8, signal transducers and transcription factors that regulate the expression of many BMP-dependent target genes (Hata and Chen 2016; Morikawa et al. 2016).
RGM proteins also bind to the cell surface trans-membrane molecule, neogenin (Matsunaga et al. 2004, 2006; Rajagopalan et al. 2004; Kuns-Hashimoto et al. 2008; Yang et al. 2008), a member of the netrin-binding, deleted in colon cancer family, which also includes DCC and UNC5 (Keino-Masu et al. 1996; Leonardo et al. 1997; Mehlen and Mazelin 2003; Bernet and Mehlen 2007). The actions of RGMA in both neuronal guidance and neuronal survival are mediated by neogenin (Matsunaga et al. 2004; Conrad et al. 2007). The other RGM proteins also can bind to neogenin, but it does not appear to play the predominant role in their biological actions (Xia et al. 2008; Yang et al. 2008; Corradini et al. 2009; Siebold et al. 2017).
Major recent advances in human genetics and genomics now present distinct opportunities for improving our knowledge of human physiology and disease susceptibility, and for gaining new insights into human variation, human origins, and evolution (Acuna-Hidalgo et al. 2016; Katsanis 2016; Quintana-Murci 2016; Battle et al. 2017; eGTEx Project, 2017). Here I use the RGM family to show how to understand and integrate this information, by accessing publicly available genomic and gene expression repositories to examine human RGM genes in detail. Results reveal extensive variation in gene expression patterns, substantial alternative RNA splicing, and a range of possible missense alterations and other modifications in the coding regions of each of the three genes. Taken together, these observations will provide new opportunities to define the dynamics and range of RGM actions in different physiological and pathological contexts, and will serve as a template and guide that can be applied to other gene families in humans and other species.

Databases and analyses
Information on human RGMA, RGMB, and RGMC (HFE2/HJV) loci and genes was obtained from the Ensembl (www.ensembl.org) and UCSC Genome Browsers (https://genome.ucsc.edu), by searching genome assembly GRCh38 with each gene name. The different classes of transcripts for each gene were also derived from the Ensembl and UCSC browsers. Data on levels of RGMA, RGMB, and RGMC mRNAs in human tissues were extracted from the Genotype-Tissue Expression project (GTEx) portal (Battle et al. 2017) (https://www.gtexportal.org/) by searching the "transcriptome" menu with the name of each gene. Relative levels of specific mRNA isoforms were calculated from primary information within the "exon expression" sub-menu of GTEx. Human RGMA, RGMB, and RGMC protein sequences were obtained from the National Center for Biotechnology Information (NCBI) Consensus CDS Protein Set (https://www.ncbi.nlm.nih.gov/CCDS/). Information on predicted population variation in these three proteins was obtained from the Exome Aggregation Consortium (ExAC) genome browser (http://exac.broadinstitute.org/), by examining the primary data from each gene after it was downloaded as a series of CSV files. ExAC contains results of sequencing the exons of 60,706 individuals (Karczewski et al. 2017). Data on predicted alterations in RGMA, RGMB, and RGMC proteins in different cancers were extracted from the cBio portal for Cancer Genomics (http://www.cbioportal.org/), which lists gene alterations from 65,690 different individuals from 225 cancer studies (Cerami et al. 2012; Gao et al. 2013), and from the National Cancer Institute Genomic Data Commons data portal (https://portal.gdc.cancer.gov/), which contains analogous information on 32,555 cancer cases.
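The tallying of a per-gene CSV export by variant class can be sketched as follows. This is only a minimal illustration, not the procedure actually used: the column names ("Annotation", "Allele Count", and so on) are assumptions about the export format, and the four rows are invented toy values rather than data from any RGM gene.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for a per-gene CSV export. The column names and every value
# below are illustrative assumptions, not data from the study.
TOY_CSV = """Protein Consequence,Annotation,Allele Count,Allele Number
p.Ala10Val,missense,3,121412
p.Gly20Ser,missense,1,121412
p.Arg30Ter,stop gained,1,121412
p.Leu40Leu,synonymous,50,121412
"""


def summarize_coding_variants(handle, skip=("synonymous",)):
    """Count distinct variants and sum allele counts per annotation class."""
    n_variants = defaultdict(int)
    n_alleles = defaultdict(int)
    for row in csv.DictReader(handle):
        annotation = row["Annotation"].strip()
        if annotation in skip:
            continue  # only protein-altering changes are tallied here
        n_variants[annotation] += 1
        n_alleles[annotation] += int(row["Allele Count"])
    return dict(n_variants), dict(n_alleles)


variants, alleles = summarize_coding_variants(io.StringIO(TOY_CSV))
total_modified = sum(alleles.values())
for annotation in sorted(alleles):
    share = 100.0 * alleles[annotation] / total_modified
    print(f"{annotation}: {variants[annotation]} variants, "
          f"{share:.0f}% of modified alleles")
```

With a real download, `io.StringIO(TOY_CSV)` would be replaced by an opened file, and the column names would need to be checked against the header of that file.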

Topography of human RGM loci
The three single-copy human RGM genes reside on different autosomes. Three other genes are found within the 300 kb segment of chromosome 15q26.1 containing RGMA (Fig. 1A), and the locus is conserved with both mouse and chicken genomes (Severyn et al. 2009). RGMB on human chromosome 5q15 is also a part of a chromosomal region with conserved synteny with mouse and chicken genomes (Severyn et al. 2009), but only two genes, CHD1 and DDX18P4, are found within the 300 kb region depicted in Figure 1B. Of note, the paralogous relationship between adjacent genes on both chromosomes, CHD2 and RGMA on chromosome 15 and CHD1 and RGMB on chromosome 5, and their shared convergent transcriptional orientation indicates that these loci were generated by segmental duplication (Severyn et al. 2009). By contrast with the chromosomal regions of RGMA or RGMB, the RGMC/HFE2/HJV locus on human chromosome 1q21.1 is far more gene dense (Fig. 1C), and contains nine other genes that also are present in the orthologous mouse locus (Severyn et al. 2009).

RGM gene structures and expression patterns in human tissues
The human RGMA gene spans ~45 kb of chromosomal DNA and consists of seven exons that are used in the vast majority of transcripts reported within the human Genotype-Tissue Expression project (GTEx) (Battle et al. 2017; eGTEx Project, 2017) (Fig. 2). The five predominant RGMA mRNA isoforms described in GTEx consist of either 3 or 4 exons, and encode one of three very similar RGMA proteins of 434, 450, or 458 amino acids, with all differences being located at the NH2-termini of the proteins (Fig. 2B). RGMA mRNAs are expressed in 48 of the 51 different human organs and tissues found in the GTEx portal. The 10 organs and tissues with the highest abundance of RGMA mRNAs include esophagus, colon, skeletal muscle, uterus, tibial nerve, testes, ovary, several brain regions, and adipose tissue (range of expression, in order, from 139 to 32 transcripts per kilobase million reads [TPM]; Fig. 2C). By contrast, glyceraldehyde 3-phosphate dehydrogenase (GAPDH), a typical "control" transcript in gene expression studies, was 10- to 70-fold more abundant than RGMA in the organs and tissues examined here (Fig. 2C). The vast majority of RGMA mRNAs found in human tissues according to GTEx comprised isoforms 1 or 2 (~93-99% of all transcripts; Fig. 2D). In contrast, the major RGMA protein in the Exome Aggregation Consortium (ExAC) gene dataset is predicted to have 458 amino acids, which is encoded by isoform 5 in Figure 2B. This mRNA is expressed minimally in the ten human tissues catalogued by GTEx and presented here (Fig. 2D, and see below).
The human RGMB gene is slightly more compact than RGMA, and its five major exons extend over ~26 kb of genomic DNA (Fig. 3A). The three predominant transcripts in GTEx also are derived by alternative RNA splicing, but only two of these mRNAs appear to encode RGMB proteins, of either 478 or 437 amino acids (Fig. 3B). RGMB transcripts are expressed in 49 of the 51 different organs and tissues found in GTEx; except for esophagus, mRNA levels are 2- to 3-fold lower than for RGMA mRNAs (compare Figs. 3C, 2C). The majority of expressed RGMB mRNAs encode RGMB proteins, primarily the 478 amino acid species (isoform 1, Fig. 3D).
Human RGMC/HFE2/HJV, at ~4.5 kb in length, is substantially smaller than either RGMA or RGMB, and is composed of four exons and three introns (Fig. 4A). There are five major transcripts expressed in human organs and tissues, and they encode proteins of variable lengths, from 93 to 426 amino acids (Fig. 4B). Unlike RGMA or RGMB, RGMC/HFE2/HJV mRNAs can be detected only in human skeletal muscle, liver, and heart, and were found at steady-state levels that were 9- to 25-fold less abundant than GAPDH (Fig. 4C). Perhaps surprisingly, the RGMC/HFE2/HJV mRNA that encodes the full-length 426-residue RGMC/HJV protein (isoform 2) comprises only 10-20% of transcripts in human organs and tissues according to GTEx (Fig. 4D). The reasons for the low level of gene expression for isoform 2 are unknown, but could reflect differential RNA stability, or the technical conditions under which the tissues were obtained and RNA samples isolated and processed (see Discussion).
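The relative isoform levels quoted in this section are percentages of the total transcript pool for a gene in a given tissue. The exact calculation from the GTEx "exon expression" data is not spelled out, so the following is only a sketch of the final normalization step, assuming a per-isoform abundance estimate is already in hand; the function name and the toy abundances are hypothetical.

```python
def isoform_shares(abundance_by_isoform):
    """Return each isoform's percentage of the summed abundance."""
    total = sum(abundance_by_isoform.values())
    if total <= 0:
        raise ValueError("total abundance must be positive")
    return {name: 100.0 * value / total
            for name, value in abundance_by_isoform.items()}


# Hypothetical per-isoform abundances for one tissue (arbitrary units);
# these are not values taken from GTEx.
toy_tissue = {"isoform 1": 6.0, "isoform 2": 1.5, "isoform 3": 2.5}
shares = isoform_shares(toy_tissue)

for name, percent in shares.items():
    print(f"{name}: {percent:.1f}% of transcripts")
```

Expressing each isoform as a share of the per-tissue total, rather than as an absolute abundance, is what allows tissues with very different overall expression levels to be compared in a single panel.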

Predicted variation in RGM proteins in human populations
ExAC contains DNA sequence information from the exons of genes from 60,706 people representing different population groups from around the world (Bahcall 2016; Lek et al. 2016; Ruderfer et al. 2016; Karczewski et al. 2017). The data have revealed substantial variation within the coding regions of genes in this large population, but also showed that most alterations were uncommon, as the majority were detected in a single allele, and over 99% were found in <1% of the study group. Most of this previously described variation consists of synonymous nucleotide changes and amino acid substitutions.
Examination of RGM family members in ExAC revealed a wide range of potential alterations in their exons, with most of the predicted changes consisting of missense mutations (92-96% of modified alleles, depending on the gene, Table 1). Second most common were changes in the reading frame, including inserted stop codons (1-7%, Table 1). Overall, the total number of different allelic variants per gene was similar for all RGM family members, and ranged from 143 for RGMB to 185 for RGMA, but their population frequency varied by a factor of 60, from 1.4% for RGMC/HFE2/HJV to 86% for RGMA, with the vast majority of changes being accounted for by just a few modifications (Fig. 5). As 99.1% of missense alleles were detected in ≤1% of the ExAC study population, overall results regarding the frequency of differences in the human RGM family proteins are consistent with the general conclusions from ExAC, with the exception of the few highly prevalent allelic variants depicted in Figure 5.

Population variation in RGMA
Alterations in RGMA have not been linked to date with the pathogenesis of any specific human diseases. Thus, the functional consequences of three prevalent specific amino acid substitutions in the RGMA protein (Leu 4 to Pro in the signal peptide (8.5% in the population), the conservative substitution of Asp 423 to Glu in the C-terminal RGM domain (63.1%), and Ala 439 to Val in the GPI-anchor segment (11.9%), Fig. 5A) in either human physiology or pathology are not known.
As with some other proteins, a large number of alterations in RGMA have been found to be associated with a variety of different cancers, according to the analysis of data in the cBio portal for Cancer Genomics (Table 2) and the National Cancer Institute Genomic Data Commons portal, although the functional consequences are unknown. Potential mutations at 78 different locations in RGMA coding exons have been detected in 38 different neoplasms, with the prevalence of these changes ranging from 3.5% in ovarian cancer and 2.8% in esophageal, gastric, and small cell lung cancer, and in soft tissue sarcoma, to <0.3% in prostate and renal carcinoma, various leukemias and lymphomas, and others (see cancer [...]; Table 2). However, three of the cancer-associated amino acid substitutions were among the more common allelic variants in the population (Leu 4 to Pro, 8.5% of ExAC alleles, Ala 439 to Val, 11.9%, and Arg 441 to Trp, 0.6%; Fig. 5A), and thus may have been detected by chance rather than through disease association. However, the most prevalent allele in ExAC, Asp 423 to Glu, seen in 63.1% of the population (Fig. 5A), was absent in any of the cancer studies compiled here (Table 2). Other changes associated with different neoplasms included premature stop codons and frame-shifts, none of which were found in ExAC (seven examples, Table 2).
Figure 2 (legend, partial). Exons are represented as boxes, with coding regions in black and non-coding segments in white, and introns as horizontal lines. A scale bar is shown. (B) Diagrams of the four major classes of human RGMA mRNAs represent the following transcripts from the Ensembl genome browser: isoform 1, ENST00000543599.5; isoform 2, ENST00000329082.11; isoform 3, ENST00000542321.6; and isoform 4, ENST00000425933.6. The protein encoded by each transcript is listed to the right of each diagram. (C) RGMA gene expression in 10 different human tissues and organs. Data were obtained from the GTEx portal, and are graphed as the mean number of transcripts per kilobase million reads (TPM), with the mean transcript abundance of glyceraldehyde 3-phosphate dehydrogenase (GAPDH) listed to the right of each RGMA RNA level. The number of samples for each organ and tissue is as follows: esophagus (370), sigmoid colon (233), skeletal muscle (564), uterus (111), tibial nerve (414), testes (259), ovary (133), substantia nigra (88), cerebral cortex (158).

Population variation in RGMB
Changes in RGMB also have not been connected to the pathogenesis of any human diseases to date. As with RGMA, the functional consequences to human physiology or pathology of the single predicted amino acid substitution in RGMB that is prevalent in the ExAC population (Ser 63 to Arg in the signal peptide (7.8%), Fig. 5B) are unknown. Changes in RGMB also have been detected in a number of different cancers (Table 3), but as with RGMA, the possible functional impacts are not known. Potential mutations have been identified at 69 different locations in coding portions of the RGMB gene in 38 cancer studies, with the prevalence of these changes ranging from nearly 10% in prostate cancer and 7.5% in adrenocortical carcinoma to 0.3% or less in cervical, thyroid, bone, skin, and brain cancers, in leukemias and lymphomas, and in other neoplasms (see cancer type in: http://www.cbioportal.org/index.do?session_id=5b609381498eb8b3d5672df4). Most of the alterations consisted of amino acid substitutions (73 different modifications; Table 3), of which 26 at 19 sites were identified in ExAC at a frequency of 0.1-0.001% (Table 3), except for the highly prevalent Ser 63 to Arg allele at 7.8% (Fig. 5B). The other 18 changes, which included both premature stop codons and frame-shifts that led to stop codons, were not found in ExAC (Table 3).

Disease links and population variation in RGMC/HFE2/HJV
Unlike other members of the human RGM family, RGMC/HFE2/HJV was first characterized as the gene associated with the severe iron storage disease, juvenile hemochromatosis (Papanikolaou et al. 2004), and identification of mutations in the gene in affected individuals defined causality (Lanzara et al. 2004; Papanikolaou et al. 2004; Gehrke et al. 2005), which was confirmed by mouse gene knockout models (Huang et al. 2005; Niederkofler et al. 2005). The majority of over 40 different mutations that have been found in individuals with juvenile hemochromatosis are amino acid substitutions, but more than a third predict truncated proteins because of introduced premature stop codons (Table 4). Almost half of these disease-associated alleles can be found in the ExAC population, but nearly all are present at very low prevalences of 0.025-0.001% (Table 4). The only exception, Ala 310 to Gly, is the most common RGMC/HFE2/HJV variant in ExAC, and has a population frequency of 0.7% (Fig. 5C).
Potential alterations in RGMC/HFE2/HJV also are present in different cancers, but as with RGMA and RGMB, the possible functional consequences have not been determined. Predicted mutations (116, Table 5) have been identified at 102 different codons in 38 different neoplastic diseases, with the prevalence of these alterations ranging from 25% in prostate cancer, 10% in ovarian cancer, and 8.4% in melanoma, to 0.6% or less in colorectal carcinoma, salivary gland and renal cancer, leukemia, lymphomas, and others (see http://www.cbioportal.org/index.do?session_id=5b60fc90498eb8b3d5672fba). Putative amino acid substitutions or deletions predominated (106 different modifications at 92 locations; Table 5). Only 11 of these alterations were present in ExAC, with 9 having allelic frequencies of <0.002% (Table 5), and the others, a deletion or a duplication of Gly 69, at 0.06 or 0.13%, respectively (Table 5). The other 10 changes consisted of premature stop codons and frame-shifts, and except for Arg 385 to stop codon were not found in ExAC (Table 5).

Discussion
Information extracted from publicly available databases has been collected and then analyzed here to gain insights into the genomics and population genetics of the RGM family in humans. Results identify extensive variation in gene expression patterns, substantial alternative RNA splicing, and a range of possible missense alterations and other modifications in the coding regions of each of the three genes studied, which were not apparent previously, and in many cases are detected in individuals with different types of cancers (Tables 2, 3, 5). In addition, the data show that selected amino acid substitutions are highly prevalent in the world's population, with minor allele frequencies of up to 37% for RGMA and up to 8% for RGMB (Fig. 5). Collectively, these results indicate that protein sequence variation is common in the human RGM family, as has been observed for some other human proteins (Rotwein 2017a,b), and it thus appears likely that these variants could have a significant population impact on human physiology and/or disease predisposition.
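The minor allele frequencies quoted here can be related to the variant frequencies given in Results with a short calculation. The sketch below assumes that the reported percentages (63.1% for Asp 423 to Glu in RGMA, 7.8% for Ser 63 to Arg in RGMB) are frequencies of the variant allele among all sequenced alleles; the function name is hypothetical. Under that assumption the variant Glu 423 allele is actually the more common one, and the minor allele is the reference Asp 423.

```python
def minor_allele_frequency(variant_allele_frequency):
    """Frequency of the less common of two alleles at a biallelic site."""
    if not 0.0 <= variant_allele_frequency <= 1.0:
        raise ValueError("frequency must lie between 0 and 1")
    return min(variant_allele_frequency, 1.0 - variant_allele_frequency)


# Assumption: the percentages reported in Results are variant allele frequencies.
maf_rgma = minor_allele_frequency(0.631)  # RGMA, Asp 423 to Glu
maf_rgmb = minor_allele_frequency(0.078)  # RGMB, Ser 63 to Arg

print(f"RGMA minor allele frequency: {100 * maf_rgma:.1f}%")
print(f"RGMB minor allele frequency: {100 * maf_rgmb:.1f}%")
```

Rounded to whole percentages, these values correspond to the "up to 37%" and "up to 8%" figures.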

RGMA and RGMB: genes, mRNAs, and proteins
By combining information from the Ensembl and UCSC Genome Browsers with data extracted from GTEx, complex patterns of expression have been elucidated here for each human RGM gene, particularly in the distribution of different mRNA isoforms (Figs. 2-4). For example, these results now demonstrate that both RGMA and RGMB genes are widely expressed in many different adult human organs and tissues, with most of the transcripts encoding one of the several "full-length" proteins, as differences among these isoforms are found primarily at the NH2-terminus in the presumptive signal peptides (Figs. 2, 3). Although a few studies have examined possible effects of RGMA or RGMB in humans (Demicheva et al. 2015; Shi et al. 2015; Li et al. 2016; Muller et al. 2016), most publications to date have focused on experimental model systems (Matsunaga et al. 2004, 2006; Niederkofler et al. 2004; Rajagopalan et al. 2004; Samad et al. 2004; Hata et al. 2006; Tanabe and Yamashita 2014). Thus, these new observations will provide opportunities to develop new insights into RGMA and RGMB gene regulation and their protein functions in a variety of human physiological and pathological processes. Of particular note here is the fact that according to GTEx both RGMA and RGMB are expressed at similarly high transcript levels in the muscularis region of the esophagus, and within the gastro-esophageal junction (Figs. 2C, 3C, and not shown), raising the question of whether either or both proteins might be involved in aspects of smooth muscle function, such as its coordination by the sympathetic and parasympathetic nervous systems or other signals during swallowing or digestion of food (Woodland et al. 2013). As mRNAs encoding neogenin (NEO1) and BMP receptors (BMPR1A, BMPR1B, and BMPR2) also are expressed in these parts of the esophagus, it is conceivable that different RGM-mediated signaling pathways could be active in different parts of this organ.
Another surprising observation with regard to RGMA and RGMB is their expression in a range of different cancers, with transcripts encoding mutant proteins being detected in up to 10% of cases of prostate cancer (RGMB) and in 3.5% of ovarian carcinomas (RGMA, see Results), again providing evidence for their unexplored roles in human disease. As the majority of these predicted mutations were found to be rare in the general population used in ExAC (although nearly all of the most highly prevalent amino acid substitution alleles were present; see Tables 2 and 3), these data argue for possible pathophysiological actions for RGMA and RGMB in human neoplasms, and represent another illustration in which focused analysis of information extracted from large-scale databases can help identify new areas of investigation with possible biomedical consequences.

The special case of RGMC/HFE2/HJV
Data collected and assessed from Ensembl, the UCSC Genome Browser, and GTEx also have revealed some unexpected aspects of human RGMC/HFE2/HJV gene expression (Fig. 4). Even though restriction of transcripts to skeletal muscle, liver, and heart had been recognized previously (Kuninger et al. 2004; Papanikolaou et al. 2004; Schmidtmer and Engelkamp 2004), remarkably it now appears that only ~20% of RGMC/HFE2/HJV mRNAs found in human tissues encode the 426-amino acid full-length protein (Fig. 4D). The other mRNAs, which comprise the vast majority of transcripts in each tissue type (80 to 90%, Fig. 4D), encode proteins that are truncated at the NH2-terminus. These latter species lack most of the N-RGM domain (313-residue isoform), all of the N-RGM segment and the entire von Willebrand factor type D domain (200-amino acid protein), or all but 93 amino acids in the center of the molecule (Figs. 4B, 5C). The observations also raise questions regarding which of these variant RGMC/HJV proteins are biologically active molecules, and what are their presumptive activities. In animal and cell-based studies, several different-length versions of RGMC/HJV have been noted, but these have been characterized as being derived from differential protein processing during biosynthesis, and from proteolytic cleavage of the mature GPI-linked cell surface molecule either by pro-protein convertases such as furin (Kuninger et al. 2006; Silvestri et al. 2008a), or by the serine protease, matriptase-2 (Silvestri et al. 2008b). Thus, these new observations, which have resulted from analyses of information in databases, define a potentially novel and alternative way that different RGMC/HJV protein isoforms are produced in humans.
Unlike what is observed for RGMA and RGMB, presumptive RGMC/HJV protein variants within the ExAC population are very uncommon, collectively occurring in <1.5% of 60,706 genomes versus 86% for RGMA and 9% for RGMB (Fig. 5). Moreover, even though 17 of 43 amino acid substitution, frame-shift, and stop codon mutations associated with juvenile hemochromatosis have been found in the ExAC study cohort, only a single disease-associated allele is present in more than 0.025% of the population (Ala 310 to Gly, at ~0.7%), and 13 are represented just 1-3 times in the 121,412 ExAC alleles (Table 4). This result suggests that any possible contribution of RGMC/HFE2/HJV heterozygosity toward iron overload in the general population is minimal, in marked contrast to the high prevalence of HFE protein variants, at least in European-derived groups (Barton et al. 2015; Wallace and Subramaniam 2016).
As seen for RGMA and RGMB, predicted mutations of RGMC/HJV are found in many different cancers, with transcripts encoding mutant proteins being detected in 25% of prostate cancers, 10% of ovarian carcinomas, and 8.4% of melanomas (see Results). Remarkably, both prostate and ovarian cancers are the diseases in which mutant RGMB and RGMA molecules also have been found at highest prevalence, respectively (see Results and above). Moreover, only ~10% of the 106 different mutations in RGMC/HJV detected in cancers are present in ExAC, with all but one of them being rare (found fewer than 5 times) in the 121,412 alleles studied (Table 5).
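The link between allele counts and the percentages quoted above follows from the cohort size: 60,706 diploid individuals contribute 121,412 alleles at an autosomal site. The sketch below works through that arithmetic under the simplifying assumption that every individual is successfully genotyped at the site in question, which need not hold for every position in a real dataset; the function names are hypothetical.

```python
EXAC_INDIVIDUALS = 60_706
EXAC_ALLELES = 2 * EXAC_INDIVIDUALS  # two alleles per person at autosomal sites


def allele_percent(allele_count, total_alleles=EXAC_ALLELES):
    """Allele frequency, in percent, for a given observed allele count."""
    return 100.0 * allele_count / total_alleles


def alleles_for_percent(percent, total_alleles=EXAC_ALLELES):
    """Allele count that corresponds to a given percent frequency."""
    return percent / 100.0 * total_alleles


# Alleles seen one to three times in the cohort.
for count in (1, 2, 3):
    print(f"{count} allele(s): {allele_percent(count):.4f}%")

# Approximate allele count at the 0.025% threshold mentioned in the text.
print(f"0.025% corresponds to about {alleles_for_percent(0.025):.0f} alleles")
```

On this basis a singleton has a frequency just under 0.001%, matching the lower bound of the range reported for Table 4, and the 0.025% upper bound corresponds to roughly 30 observed alleles.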

Limitations and strengths of population-based sequence data for understanding RGM actions
As with any large-scale DNA or RNA-based sequencing project, ExAC and GTEx respectively contain the potential materials for new biological and biomedical applications, as well as errors and ambiguities. From the perspective of the three RGM family genes, potential problems include the choice of minor transcripts as the reference sequences for proteins. This is especially true for RGMA, in which the mRNA species encoding the 458-amino acid protein isoform selected by ExAC (see Table 1) appears to comprise ≤2% of transcripts in human organs and tissues in GTEx (isoform 5, Fig. 2D). In contrast, for RGMB, the predominant transcript in 9 of the 10 tissues surveyed in GTEx encodes the major 478-residue protein species (all but testes, Fig. 3D). Another complication here is the potential variation in RNA quality in GTEx samples, especially since both the time from tissue harvesting to RNA extraction and the methods employed to isolate RNA are unknown. It thus seems possible that transcript degradation may skew the results seen in GTEx RNA-sequencing libraries derived from at least some of the different organs and tissues. Furthermore, as the population distribution of the GTEx dataset is unknown, there are no data to determine whether or not expression of different mRNA isoforms varies among different groups, perhaps in conjunction with population-specific DNA polymorphisms (Khera et al. 2018; Yengo et al. 2018). Other limitations that could contribute to problems in data interpretation include the potential non-representative nature of the ExAC study population, as over 60% of samples are derived from European individuals, with ~20% from South or East Asians, and only ~8% each from Hispanic or African groups.
Thus, the actual rate and potential extent of variation among RGM proteins has not been established fully yet, and could change once exome sequencing data are obtained from more individuals and are expanded to include larger numbers of people from different human population groups. Moreover, there is an undefined but probable error rate associated with nucleotide changes that appear only once or just a few times in the 121,412 ExAC chromosomes studied. Despite these challenges and difficulties, the data in ExAC, GTEx, and in the various cancer medicine portals examined here provide potentially exciting new opportunities to evaluate contributions of the RGM family, and RGMA and RGMB in particular, to human physiology and disease. Since RGMA and RGMB are expressed in the vast majority of adult human organs and tissues (48 of 51 for RGMA and 49 of 51 for RGMB), the encoded proteins are likely to be involved in some regulatory processes. Perhaps immune cell function is one of these areas, since RGMA is expressed in dendritic cells and neogenin is found in CD4+ T lymphocytes (Muramatsu et al. 2011).
Modern human populations represent the outcomes of many interactions over long time frames with different ancestral groups. Not only do the DNA marks in our genomes derived from extinct populations such as Neanderthals, Denisovans, and others document these past relationships (Jones et al. 2015; Vattathil and Akey 2015; Clarkson et al. 2017; Hublin et al. 2017), but some of the introgressed DNA continues to influence human physiology or disease susceptibility to the present day (Dannemann and Kelso 2017; Prufer et al. 2017). Opportunities abound to use the data in ExAC, GTEx, and other large-scale population-based repositories such as the British Biobank (Khera et al. 2018; Yengo et al. 2018) as the springboard toward developing novel and medically important research questions with high biological and biomedical significance.