What do we study?
1.1 What was the central aim of this study? The aim of this study was to conduct the first systematic review of how Genome-Wide Association Studies (GWAS, for definition see FAQ 1.5), which identify the genetic loci that distinguish us from one another, are undertaken. We consider the scientometrics (see FAQ 1.3) of GWAS, including who the researchers are, who funds them and who (and what) they study. Since the human genome was sequenced, thousands of papers and accessions have identified multiple genetic loci related to diseases such as type 2 diabetes and Alzheimer’s, psychiatric disorders like schizophrenia or autism, physical traits such as height and BMI, and behavioural and psychological characteristics such as neuroticism.
1.2 What was the motivation of this study? We noticed some striking aspects of the data being used and of the way in which research was conducted. For example, research was geographically concentrated in only a few countries and the data itself was highly selective. Although we had some hunches, we found that no systematic or empirical analysis had tested whether this was the case. The results still surprised us, despite our prior hypotheses.
1.3 What is 'scientometrics'? Scientometrics broadly refers to the study of measuring and analysing science, technology and innovation. The main goal is often to map scientific fields and take stock of the research to date. The methods we employ are primarily computational, but we also manually code the primary data sources used, to avoid false positives that could arise from text mining. Scientometrics has gained importance due to the interest of governments, funders and research bodies in assessing the impact of the research which they have funded. A prominent individual-level indicator is the h-index (a metric based on an individual's most cited papers), often used to gauge the productivity and impact of scientists. For example, an h-index of 50 would mean that the author had 50 papers that have each been cited at least 50 times. As described in FAQ 2.2, we develop a new measure in the form of a 'GWAS h-index'. This allows us to determine the extent to which participating in GWAS research contributes to a scientist's impact in this specific field.
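The h-index definition above can be made concrete with a short sketch. It computes the standard h-index from a list of citation counts. The citation counts are hypothetical, and applying the same function to an author's GWAS papers only is an assumption about how a field-specific variant might be computed, not a description of the measure defined in the article.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for all of one author's papers.
all_papers = [120, 85, 60, 33, 12, 9, 4, 4, 1, 0]
# Hypothetical subset: the same author's GWAS papers only (an assumed
# way of restricting the index to one field).
gwas_papers = [120, 60, 12, 4]

print(h_index(all_papers))
print(h_index(gwas_papers))
```

Sorting first means the check at each rank only has to compare one paper's citation count with its position in the list.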
1.4 What are the scientific benefits of diversity in genetic research? GWAS which utilize data from diverse populations will provide more accurately targeted therapeutic treatments to more of the world’s population, extend insights into the allelic architecture of traits and uncover rare variants with significant effect sizes which replicate across ancestries. A recent example was a GWAS on Greenlandic Inuit which found selected alleles with large effect sizes for height: a finding which replicates in Europeans but had previously gone undetected because these alleles occur infrequently in commonly utilized samples. Increasing the number of different combinations of DNA which are studied can have huge scientific benefits. Recent genetic sequencing of Khoi-San bushmen showed that even two people from adjacent villages were as different from one another as any two individuals of European or other non-African ancestry.
1.5 What is a Genome-Wide Association Study (GWAS)? A Genome-Wide Association Study or GWAS (pronounced 'Gee-WAS') is a search across the entire human genome, examining each genetic locus (or region) one by one to see if there is a statistical relationship (an 'association') between that locus and a 'trait'. A trait is often interchangeably called a phenotype or outcome. The genetic loci that are identified contain SNPs (pronounced SNIPs: single-nucleotide polymorphisms), or in other words, the genetic variants that distinguish us from each other. Humans are 99.9% identical to each other (about one SNP per 1000 bases), and it is the 0.1% by which we differ that makes us all genetically unique. Only in 'Mendelian diseases' like Huntington's disease is a single gene at play. The majority of traits which we are interested in are complex and are the result of multiple genetic loci combining (referred to as 'polygenic traits'), where often hundreds or thousands of genetic variants each have a small influence on a phenotype.
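The "one by one" search described above can be illustrated with a toy scan. This is a minimal sketch under simplifying assumptions: each person's genotype at a SNP is coded as a count of one allele (0, 1 or 2), the trait is quantitative, and the association is summarised by a plain Pearson correlation. Real GWAS typically use regression models with covariates and a stringent significance threshold. The SNP names and all values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation, so no measurable association
    return cov / sqrt(var_x * var_y)


def scan(genotypes, trait):
    """Test every SNP in turn; return (snp_id, r) pairs, strongest first."""
    results = [(snp, pearson_r(counts, trait)) for snp, counts in genotypes.items()]
    return sorted(results, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical allele counts (0, 1 or 2) for six people at three SNPs.
genotypes = {
    "snp_a": [0, 1, 2, 0, 1, 2],
    "snp_b": [1, 1, 0, 2, 0, 1],
    "snp_c": [2, 2, 2, 2, 2, 2],  # everyone identical: uninformative
}
# Hypothetical trait values (height in cm) for the same six people.
height_cm = [160, 165, 171, 161, 166, 170]

for snp, r in scan(genotypes, height_cm):
    print(snp, round(r, 3))
```

A SNP at which everyone carries the same genotype, like `snp_c` here, cannot show an association. This is one reason variants that are rare in commonly studied samples can go undetected.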
1.6 How many genetic discoveries have been made to date? Between the publication of the first GWAS in March 2005 and December 31st 2018, over 3,700 papers were indexed by the NHGRI-EBI GWAS Catalog (hereafter the 'Catalog') produced by the US National Human Genome Research Institute (NHGRI), comprising over 6,000 different study accessions. Figure 1 in the article shows the rapid growth over time. As the number of people studied (and the size of individual samples) grew, so too did the number of diseases studied, the number of associations found and the number of different journals which publish such work.
1.7 What did we already know about the undertaking of genetic discovery before this study? Why is it important to examine this topic? There have been some excellent scientific reviews by leading geneticists and genetic ethicists in the field, but these were narrative and more subjective. Although genetic discoveries require substantial funding and are increasingly used for pharmaceutical development, we found it striking that no systematic review had taken place regarding the broader undertaking of such research. A few recent studies noted that this research was largely based on European-ancestry populations, but they lacked a deeper analytical aspect. For example, none of the previous studies distinguished between 'discovery' and 'replication' samples. When we analysed this, we found that non-European ancestry groups were largely examined at the replication stage, to test whether the genetic loci found in European ancestry populations were the same in other groups (i.e. not to discover new loci). Research to date had also focused almost exclusively on ancestry and rarely examined other types of diversity. We therefore found it important to extend the discussion beyond ancestry.
1.8 What is the difference between ancestry as defined in genetics and race? There is often considerable confusion and misinterpretation of the terms 'ancestry' and 'race' in human genetics research, and these terms are not interchangeable. Genetic variation needs to be distinguished from the social, cultural and political meanings ascribed to different human groups. Race is not a biological category, since genetic variation is traced to geographical locations and does not map onto the perpetually evolving and socially defined racial or ethnic groups. Populations are the product of admixture across tens of thousands of years. The concentration of genetic alleles in some groups is thus related to where those groups descended from and has nothing to do with the social categorisation of race.
1.9 How are the findings different from other similar studies? Previous studies suggested that ancestral diversity in particular was becoming more equal, but by examining change over time we show that this is not the case (see Table 1 in the main article). No one had previously looked at the geographical location of the people studied. It was remarkable that 72% came from just 3 countries: the US, the UK and Iceland (see FAQ 3.2 for why this is a problem). The data used was also very selective and not representative (see FAQ 3.3 for why this is a problem). The network of researchers themselves had never been studied. We found a tight-knit group of researchers who often held key datasets, and also a strong gender imbalance, with 70% of senior last author positions held by men.
How did we study it?
2.1 What kinds of data did we use? We examined over thirteen years of GWAS discoveries (March 2005 to October 2018) from the Catalog. We then linked the Catalog to an external database called PubMed, which has information about the researchers themselves, the journal where each paper was published and funding acknowledgements. We also compared the GWAS samples to world population data from the United Nations (UN).
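The linkage between the Catalog and PubMed can be pictured as a join on a shared key. The sketch below assumes that each Catalog entry carries a PubMed identifier and that PubMed metadata has already been retrieved into a dictionary. All field names, identifiers and values are hypothetical and are not taken from either database.

```python
# Hypothetical Catalog entries: one row per study accession.
catalog_entries = [
    {"accession": "ACC001", "pubmed_id": "1001", "sample_size": 5000},
    {"accession": "ACC002", "pubmed_id": "1001", "sample_size": 1200},
    {"accession": "ACC003", "pubmed_id": "1002", "sample_size": 30000},
    {"accession": "ACC004", "pubmed_id": "9999", "sample_size": 800},
]

# Hypothetical PubMed metadata, keyed by PubMed identifier.
pubmed_records = {
    "1001": {"journal": "Journal A", "last_author": "Author A", "funders": ["Funder X"]},
    "1002": {"journal": "Journal B", "last_author": "Author B", "funders": ["Funder Y"]},
}


def link(entries, pubmed):
    """Attach PubMed metadata to each Catalog entry; report entries with no match."""
    linked, unmatched = [], []
    for entry in entries:
        record = pubmed.get(entry["pubmed_id"])
        if record is None:
            unmatched.append(entry["accession"])
        else:
            linked.append({**entry, **record})
    return linked, unmatched


linked, unmatched = link(catalog_entries, pubmed_records)
print(len(linked), "linked;", "unmatched:", unmatched)
```

One paper can contribute several accessions, so the join is many-to-one. Keeping the unmatched accessions in a separate list makes gaps in the linkage visible rather than silently dropping them.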
2.2 What kinds of analytical methods did we use? We used a primarily automated scientometric approach (see FAQ 1.3) but also manually extracted information on the most commonly used datasets with the help of three diligent research assistants. This covered over 85% of all GWAS by cumulative sample size, giving us reasonable confidence in our conclusions. We rank and map top funders by ancestry and disease, isolate key consortiums, analyse gender and authorship, create a unique GWAS h-index and undertake a social network analysis of author centrality.
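To give a sense of what a social network analysis of author centrality involves, the sketch below builds a co-authorship network from author lists and computes degree centrality, the share of other authors with whom each author has co-published. Degree centrality is an assumption chosen for simplicity, since this FAQ does not state which centrality measure the article uses. The author names and papers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_graph(papers):
    """Map each author to the set of people they have co-authored with."""
    graph = defaultdict(set)
    for authors in papers:
        unique = sorted(set(authors))
        for author in unique:
            graph[author]  # register the author even if they have no co-authors
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other authors in the network that each author is linked to."""
    n = len(graph)
    if n < 2:
        return {author: 0.0 for author in graph}
    return {author: len(links) / (n - 1) for author, links in graph.items()}


# Hypothetical author lists, one per paper.
papers = [
    ["Author A", "Author B", "Author C"],
    ["Author A", "Author D"],
    ["Author A", "Author E"],
    ["Author D", "Author E"],
]

centrality = degree_centrality(coauthor_graph(papers))
for author, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(author, score)
```

An author who has published with everyone else in the network scores 1.0, while peripheral authors score close to 0.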
2.3 What are the strengths of our approach? We are the first to engage in a:
2.4 What are the limitations of our approach? There are numerous limitations to our approach, including the fact that we:
What are some of the central findings?
What are the medical, societal, scientific and commercial implications of this study? Our study is important for helping researchers, data providers, editors and consortiums working in this area to understand the strengths and potential gaps in current research. It is also essential for funders, research bodies and national governments planning future investments in data collection and science policy. Pharmaceutical companies are making growing investments in this area to translate this research into drugs. Our study highlights that the genetic findings to date come from highly selective populations, largely from 3 countries, and are often not representative of the wider global population. We offer ten concrete evidence-based policy recommendations:
Additional general questions (updated over time as we receive additional questions)