Prof. Melinda C. Mills

Frequently Asked Questions:

A Scientometric Review of Genome-Wide Association Studies

Journal: Communications Biology

DOI: 10.1038/s42003-018-0261-x

Authors: M.C. Mills & C. Rahal
Affiliation: University of Oxford and Nuffield College
Correspondence: Professor Melinda Mills, Nuffield College, 1 New Road, Oxford, OX1 1NF, UK.
This FAQ provides a range of accessible answers to some of the questions you might have regarding our article.


  1. What do we study?

    • 1.1 What was the central aim of this study? The aim of this study was to conduct the first systematic review of the undertaking of Genome-Wide Association Studies (GWAS, for definition see FAQ 1.5) which identify the genetic loci that distinguish us from one another. We consider the scientometrics (see FAQ 1.3) of GWAS, including who the researchers are, who funds them and who (and what) they study. Since the human genome was sequenced, thousands of papers and accessions have identified multiple genetic loci related to diseases such as type 2 diabetes and Alzheimer’s, psychiatric disorders like schizophrenia or autism, physical traits such as height and BMI and behavioural and psychological characteristics such as neuroticism.

    • 1.2 What was the motivation of this study? We noticed some striking aspects of the data being used and the way in which research was conducted. For example, research was geographically concentrated in only a few countries and the data themselves were highly selective. Although we had some hunches, we found that no systematic empirical analysis had tested whether this was the case. Even with these prior hypotheses, we were surprised by what we found.

    • 1.3 What is 'scientometrics'? Scientometrics broadly refers to the study of measuring and analysing science, technology and innovation. The main goal is often to map scientific fields and take stock of the research to date. The methods we employ are primarily computational, but we also include qualitative manual coding of the primary data sources used, to avoid the false positives that text mining can produce. Scientometrics has gained importance due to the interest of governments, funders and research bodies in assessing the impact of the research which they have funded. A prominent individual indicator of scientific prowess is the h-index (a metric based on an individual's most cited papers), often used to gauge the productivity and impact of scientists. For example, an h-index of 50 would mean that the author has 50 papers that have each been cited at least 50 times. As described in FAQ 2.2, we develop a new measure in the form of a 'GWAS h-index'. This allows us to determine the extent to which participating in GWAS research contributes to a scientist's impact in this specific field.

    • 1.4 What are the scientific benefits of diversity in genetic research? GWAS which utilize data from diverse populations will provide more accurately targeted therapeutic treatments to more of the world’s population, extend insights into the allelic architecture of traits and uncover rare variants with significant effect sizes which replicate across ancestries. A recent example was a GWAS on Greenlandic Inuit which found selected alleles with large effect sizes for height: a finding which replicates in Europeans but had previously gone undetected because these alleles occur infrequently in commonly utilized samples. Increasing the number of different combinations of DNA which are studied can have huge scientific benefits, and recent genetic sequencing of Khoi-San bushmen showed that even two people from adjacent villages were as different from one another as any two individuals of European or other non-African ancestry.

    • 1.5 What is a Genome-Wide Association Study (GWAS)? A Genome-Wide Association Study or GWAS (pronounced Gee-WAS) is a search across the entire human genome, examining each genetic locus (or region) one by one to see if there is a statistical relationship (an 'association') between that locus and a 'trait'. A trait is often interchangeably called a phenotype or outcome. The genetic loci that are identified contain SNPs (pronounced SNIPs: single-nucleotide polymorphisms), or in other words, the genetic variants that distinguish us from each other. Humans are 99.9% identical to each other (about one SNP per 1,000 bases), and it is the 0.1% by which we differ that makes us all genetically unique. Only in 'Mendelian diseases' such as Huntington's disease is a single gene at play. The majority of traits which we are interested in are complex and are the result of multiple genetic loci combining (referred to as 'polygenic traits'), where often hundreds or thousands of genetic variants each have a small influence on a phenotype.

    • 1.6 How many genetic discoveries have been made to date? Since the publication of the first GWAS in March 2005 up until December 31st 2018, there have been over 3,700 papers indexed by the NHGRI-EBI GWAS Catalog (hereafter the 'Catalog') produced by the US National Human Genome Research Institute (NHGRI), comprising over 6,000 different study accessions. Figure 1 in the article shows the incredible growth over time. As the number of people (and the size of individual samples) studied grew, so too did the number of diseases studied, the number of associations found and the number of different journals which publish such work.

    • 1.7 What did we already know about the undertaking of genetic discovery before this study? Why is it important to examine this topic? There have been some excellent scientific reviews by leading geneticists and genetic ethicists in the field, but these were narrative and more subjective. Although genetic discoveries require substantial funding and are increasingly used for pharmaceutical development, we found it striking that no systematic review had taken place regarding the broader undertaking of such research. A few recent studies noted that this research was largely based on European-ancestry samples, but they lacked a deeper analytical aspect. For example, none of the previous studies distinguished between 'discovery' and 'replication' samples. When we analysed this, we found that when non-European ancestry groups were examined, it was largely in the replication phase, to test whether the genetic loci found in European ancestry populations were the same in other groups (i.e. not to discover new loci). Research to date had also focused almost exclusively on ancestry and rarely examined other types of diversity. We therefore found it important to extend the discussion beyond ancestry.

    • 1.8 What is the difference between ancestry as defined in genetics and race? There is often considerable confusion and misinterpretation of the terms 'ancestry' and 'race' in human genetics research, and these terms are not interchangeable. Genetic variation needs to be distinguished from the social, cultural and political meanings ascribed to different human groups. Race is not a biological category: genetic variation is traced to geographical locations and does not map onto the perpetually evolving and socially defined racial or ethnic groups. Populations are the product of admixture across tens of thousands of years, so the concentration of particular alleles in some groups reflects where those groups descended from and has nothing to do with the social categorisation of race.

    • 1.9 How are the findings different from other similar studies? Previous studies suggested that ancestral diversity in particular was improving, but by examining change over time we show that this is not the case (see Table 1, main article). No one had previously looked at the geographical location of the people studied. It was remarkable that 72% came from just three countries: the US, the UK and Iceland (see FAQ 3.2 for why this is a problem). The data used were also very selective and not representative (see FAQ 3.3 for why this is a problem). The network of researchers themselves had never been studied. We showed a tightly knit group of researchers who often held key datasets, but also a strong gender imbalance, with 70% of senior last author positions held by men.
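
To make the h-index definition in FAQ 1.3 concrete, here is a minimal Python sketch. It is an illustration only and not the code from our Replication Material: the function name `h_index` and the citation counts are hypothetical. One plausible reading of a 'GWAS h-index' is the same calculation restricted to an author's GWAS papers, which is what the example assumes; the article's exact definition may differ in detail.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most cited paper first
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break  # later papers have even fewer citations, so h cannot grow
    return h


# Hypothetical citation counts for one author's GWAS papers only
# (assumption: a 'GWAS h-index' ignores the author's non-GWAS output).
gwas_citations = [120, 45, 30, 8, 5, 5, 2, 0]
print(h_index(gwas_citations))
```

In this made-up example five papers have at least five citations each, but there are not six papers with at least six, so the index is 5.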

  2. How did we study it?

    • 2.1 What kinds of data did we use? We examined over thirteen years of GWAS discoveries (March 2005 to October 2018) from the Catalog. We then linked the Catalog to an external database called PubMed, which holds information about the researchers themselves, the journal in which each study was published and funding acknowledgements. We also compared the GWAS samples with world population data from the United Nations (UN).

    • 2.2 What kinds of analytical methods did we use? We used a primarily automated scientometric approach (see FAQ 1.3) but also manually extracted information on the most commonly used datasets with the help of three diligent research assistants. This covered over 85% of all GWAS by cumulative sample size, giving us reasonable confidence in our conclusions. We ranked and mapped top funders by ancestry and disease, isolated key consortiums, analysed gender and authorship, created a unique GWAS h-index and undertook a social network analysis of author centrality.

    • 2.3 What are the strengths of our approach? We are the first to engage in a:

      • systematic empirical analysis of all GWAS discoveries to date.

      • detailed longitudinal analysis over time.

      • breakdown of ancestry coverage over time and across discovery and replication phases.

      • series of analyses to identify the main countries where participants are recruited from.

      • ranking of the most commonly used datasets.

      • mapping of research by funder, showing which ancestry groups and diseases are studied across funding sources.

      • network and gendered analysis of the scientists conducting this research.

    • 2.4 What are the limitations of our approach? There are numerous limitations to our approach, including the fact that we:

      • were limited by the lack of clear identifiers for datasets within the free text of articles (meaning that our manual data curation was extremely labour intensive, limiting it to a third of the largest GWAS ranked by total sample size).

      • were limited by the quality and validity of the meta-data provided by authors.

      • were only able to provide broad conclusions, with more detailed and nuanced analyses left for follow-up work.
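
The idea of coverage 'by cumulative sample size' (FAQ 2.2 and 2.4) can be sketched as follows. This is a hedged illustration rather than our actual pipeline: the function name `top_n_coverage` and the sample sizes are invented, and the sketch simply assumes that studies are ranked by size and that the largest ones are curated first.

```python
def top_n_coverage(sample_sizes, top_n):
    """Share of all participants found in the `top_n` largest studies."""
    ranked = sorted(sample_sizes, reverse=True)  # largest study first
    total = sum(ranked)
    if total == 0:
        return 0.0  # no participants at all, so nothing to cover
    return sum(ranked[:top_n]) / total


# Hypothetical sample sizes for nine studies (not taken from the Catalog).
hypothetical_sizes = [500_000, 200_000, 100_000, 20_000, 10_000,
                      5_000, 3_000, 2_000, 1_000]

top_third = len(hypothetical_sizes) // 3  # curate only the largest third
share = top_n_coverage(hypothetical_sizes, top_third)
print(f"Top {top_third} of {len(hypothetical_sizes)} studies "
      f"cover {share:.1%} of participants")
```

Because sample sizes are highly skewed, curating a minority of studies by hand can still account for most participants, which is why a share measured in participants can be much larger than the share measured in studies.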

  3. What are some of the central findings?

    • 3.1. Genetic discoveries largely arise from samples composed of European-ancestry populations. Contrary to messages that this share is decreasing over time, we show that this may not be the case. We also show that although non-European groups are increasingly included, it is largely to ‘replicate’ European-ancestry results (and not for initial genetic discovery). Although large funders such as the NIH (National Institutes of Health) have implemented diversity policies, the diversity of participants included internationally has actually fluctuated markedly over time (see Table 1, main article).

    • 3.2. 72% of all genetic discoveries come from people living in just three countries: the United States, the United Kingdom and Iceland. Until now, the focus has been on ancestral diversity, with very little attention paid to where subjects are recruited from (Figure 3 and Table 2, main article). For example, Iceland has a population of around 334,000 but represents around 11.5% of all subjects included in genetic discoveries to date. These countries are distinctive in their social, cultural and economic backgrounds and also in their specific disease profiles. A large body of epidemiological and demographic research has shown that when and where you are born matters for life expectancy and disease prevalence.

    • 3.3. The most frequently studied people in GWAS discoveries are often older and more likely to be female, with large samples consisting of ‘healthy volunteers’ with a higher socio-economic status. Considering 'healthy volunteer selection' is important. If we study smoking or other detrimental health behaviours and outcomes in a sample such as the UK Biobank, where few people smoke cigarettes and most are largely healthy, we may falsely conclude that particular genetic loci are protective or beneficial for health. Conversely, if we used data from a sample where smoking was widespread and individuals were in general less healthy, those with particular genetic loci might in fact be the ones with the poorest health outcomes. If we used a broad and representative sample, we might not observe any clear association at all. As Jason Boardman and colleagues conclude: “Without a complete representation of the individuals across the full range of environments, researchers can only tell one part of the story.” (p. S68).

    • 3.4. We summarize the network of contributing authors to show a tightly knit group of researchers who worked hard to construct, maintain and provide access to the most frequently used cohorts. Many of the most 'central' and prolific authors are in some way related to deCODE genetics (an influential biopharmaceutical company based in Reykjavík, Iceland) or are leaders of longitudinal cohort studies.

    • 3.5. Gender imbalance is not specific to this area of research, with a growing number of studies flagging gender imbalance in scientific publications and funding. However, we estimate that:

      • 37% of authors are female, which is considerably higher than the historical average of women undertaking research in Molecular and Cell Biology (21% from 1665-1989).

      • 44% of junior ‘first author’ positions are held by women.

      • 70% of senior ‘last author’ positions are held by men.

      • Women had similar, albeit slightly lower, GWAS h-indexes (4.85 female; 5.34 male), publications per author (6.15 female; 7.17 male) and citation counts (648 female; 781 male).

      • There were gender differences in what was studied.
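
As a rough worked example of the geographical concentration described in FAQ 3.2, the sketch below compares Iceland's share of GWAS participants with its share of the world population. The Icelandic figures come from FAQ 3.2, but the world population of 7.6 billion is our own round assumption for this illustration and is not taken from the article. The function name `representation_ratio` is also hypothetical.

```python
def representation_ratio(share_of_participants, population, world_population):
    """How many times larger a country's share of study participants is
    than its share of the world population (1.0 means proportional)."""
    if population <= 0 or world_population <= 0:
        raise ValueError("populations must be positive")
    share_of_world = population / world_population
    return share_of_participants / share_of_world


ICELAND_POPULATION = 334_000              # from FAQ 3.2
ICELAND_SHARE_OF_PARTICIPANTS = 0.115     # 11.5%, from FAQ 3.2
ASSUMED_WORLD_POPULATION = 7_600_000_000  # assumption: round figure, not from the article

ratio = representation_ratio(
    ICELAND_SHARE_OF_PARTICIPANTS,
    ICELAND_POPULATION,
    ASSUMED_WORLD_POPULATION,
)
print(f"Iceland is over-represented roughly {ratio:,.0f}-fold")
```

Under these assumptions the ratio is in the low thousands, and the order of magnitude matters more than the exact value, which shifts with the world population figure chosen.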

  4. What are the medical, societal, scientific and commercial implications of this study? 

    • Our study is crucial for researchers, data providers, editors and consortiums working in this area to understand the strengths and potential gaps in current research, and it is essential for funders, research bodies and national governments planning future investments in data collection and science policy. Pharmaceutical companies are making growing investments in this area to translate this research into drugs. Our study highlights that the genetic findings to date are based on highly selective populations, come largely from three countries and are often not representative of the wider global population. We offer ten concrete evidence-based policy recommendations:

      • 4.1. Prioritize multiple types of diversity (ancestral, geographical, environmental, temporal and demographic) and recognize the impact that this lack of diversity has on research findings.

      • 4.2. Monitor diversity or gaps in research with funding sanctions and consequences.

      • 4.3. Careful interpretation of genetic differences between ancestral groups.

      • 4.4. Local participant and researcher involvement in under-represented communities.

      • 4.5. Action to reduce inequalities in authorship and principal investigators.

      • 4.6. Reform incentive structures that intertwine the roles of authorship, data ownership and data sharing.

      • 4.7. Enforce ways to trace this type of research better. This includes requiring Digital Object Identifiers (DOIs) for datasets and linked ORCIDs for authors.

      • 4.8. Coordinated governance from multiple stakeholders for genomic data collection and sharing.

      • 4.9. Enforce sharing of GWAS summary results.

      • 4.10. Call to all within the genomics ecosystem to utilize their influence for the good of more people.

  5. Additional general questions (updated over time as we receive additional questions)

    • 5.1. Can I find the full study online for free? Yes! This article is freely available through open access.

    • 5.2. Can I see all of the data and computer code that you used if I want to check or replicate your result? Yes you can! A full standalone GitHub repository which predominantly runs off a Jupyter Notebook and supporting functions accompanies this article as Replication Material.

    • 5.3. Where can I find the full list of all GWAS author rankings, all datasets used or funder list? In the article we were limited to showing only the top 10 most prominent GWAS authors (Table 4, main article). To find a full list of author rankings, please see the GitHub page for this project.

  6. References

    • Boardman, J. D., Daw, J. & Freese, J. Defining the Environment in Gene–Environment Research: Lessons From Social Epidemiology. Am. J. Public Health 103, S64–S72 (2013).

    • Fry, A. et al. Comparison of Sociodemographic and Health-Related Characteristics of UK Biobank Participants with Those of the General Population. Am. J. Epidemiol. 186, 1026–1034 (2017).

    • Fumagalli, M. et al. Greenlandic Inuit show genetic signatures of diet and climate adaptation. Science 349, 1343–1347 (2015).

    • MacArthur, J. et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 45, D896–D901 (2017).

    • Popejoy, A. B. et al. Genomics is failing on diversity. Nature 538, 161–164 (2016).

    • Tropf, F. C. et al. Hidden heritability due to heterogeneity across seven populations. Nat. Hum. Behav. (2017). doi:10.1038/s41562-017-0195-1

    • Schuster, S. C. et al. Complete Khoisan and Bantu genomes from southern Africa. Nature 463, 943–947 (2010).

    • Visscher, P. M. et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 101, 5–22 (2017).

    • Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001–D1006 (2014).

    • West, J. D. et al. The Role of Gender in Scholarly Authorship. PLoS One 8, e66212 (2013). doi:10.1371/journal.pone.0066212