JDRF/Wellcome Trust

Diabetes and Inflammation Laboratory


HLA data

Data subjects:

Briefly, HLA typing is available, in part, on family subjects who are part of the BDA/Warren Affected Sib Pair collection:

Bain, S.C., Todd, J.A. & Barnett, A.H. (1990).
The British Diabetic Association: Warren Repository.
Autoimmunity 7, 83-85 [PubMed: 2151758]

and for the case-control collection, especially as represented in two Genome-Wide Association Studies (GWAS):

Wellcome Trust Case Control Consortium (2007)
Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls.
Nature, 447, 661-678. [PubMed: 17554300]

Barrett, J.C., Clayton, D.G., Concannon, P., Akolkar, B., Cooper, J.D., Erlich, H.A., Julier, C., Morahan, G., Nerup, J., Nierras, C., Plagnol, V., Pociot, F., Schuilenburg, H., Smyth, D.J., Stevens, H., Todd, J.A., Walker, N.M. & Rich, S.S. The Type 1 Diabetes Genetics Consortium (2009)
Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes.
Nat Genet, advance online publication 10.1038/ng.381. [PubMed: 19430480]

Access arrangements are discussed below.

Typing methods and scoring:

We currently perform HLA typing using DYNAL technologies from Invitrogen.

We normally start with their medium-throughput product "RELI SSO" and, if the resolution is insufficient, then use SSP - see their schematic, with links to the SSO and SSP protocols.

The key points for the non-HLA-specialist are:

  • HLA "alleles" are actually exon+ length haplotypes from the very polymorphic HLA genes. Further details on the naming conventions may be found at http://www.anthonynolan.org.uk/HIG/lists/nomenlist.html;
  • the methods we use to measure HLA alleles are not DNA strand-specific, and so require scoring software to identify plausible pairs of alleles that make up the set of results seen;
  • some results could be generated by 2 or more independent pairs of HLA alleles;
  • many HLA alleles are not unambiguously measured by the process.

The upshot is that there are some systematic ambiguities in the data.

These may be dealt with either by assuming that the most common result is the correct one - HLA alleles are ethnicity-specific, and many rare alleles are almost unknown in white northern European populations - and/or by storing these ambiguities. We do both.

First, we constrain alleles to come from population-specific lists; these are for white northern Europeans:

Please note: allele frequencies are indicative only, and need not add to 100%. While these data have been accumulated in the lab (and at DYNAL) over several years, wider public efforts are maintained at dbMHC and http://www.allelefrequencies.net/.

If the scoring software suggests only one pair of alleles is possible, and one or both of these is not on the list, then the sample is re-typed, and the result accepted if identical on a second attempt.

On the other hand, if the scoring software suggests that the result may have come from two or more (pairs of) plausible alleles, then this is recorded as an ambiguous result.

A decision is then taken as to which ambiguities are important to our research, and these are typed with a different assay. For example, DYNAL SSO cannot distinguish between HLA-DRB1 alleles 1101 and 1104, both of which are reasonably common, but we're content to treat these as a generic "11". Failure to distinguish common HLA-DRB1 04 subtypes (e.g. 0401 vs 0405) will require further typing, as these are known to have different disease risks in Type 1 Diabetes.
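The triage rules described above can be sketched as follows. This is a minimal illustration, not the scoring software's actual logic: the function name `triage`, the return labels, and the contents of `EXAMPLE_LIST` are all hypothetical, and the sketch assumes that a single candidate pair whose alleles are both on the population list is accepted without re-typing.

```python
def triage(candidate_pairs, population_list):
    """Classify one scoring result as 'accept', 'retype' or 'ambiguous'.

    candidate_pairs: list of (allele, allele) tuples proposed by the
        scoring software (hypothetical representation).
    population_list: set of allele names expected in the population.
    """
    if not candidate_pairs:
        raise ValueError("no candidate allele pairs supplied")
    # Two or more plausible pairs: record the result as ambiguous.
    if len(candidate_pairs) > 1:
        return "ambiguous"
    # Exactly one pair: accept if both alleles are on the list (assumption),
    # otherwise the sample is sent for re-typing.
    pair = candidate_pairs[0]
    if all(allele in population_list for allele in pair):
        return "accept"
    return "retype"


# Illustrative list only, not the lab's actual population list.
EXAMPLE_LIST = {"0301", "0401", "0405", "1101", "1104"}

print(triage([("0301", "0401")], EXAMPLE_LIST))
print(triage([("0301", "1305")], EXAMPLE_LIST))
print(triage([("0301", "1101"), ("0301", "1104")], EXAMPLE_LIST))
```

The sketch does not model the second typing attempt, nor the later decision about which ambiguities are worth resolving with a different assay.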

The overall impact, however, is that some ambiguities remain, at least in the short term - and in extreme cases, when the data is ambiguous at the 2-digit naming level, the data is considered missing, awaiting retyping, giving rise to "half-calls" on export.

As a note of warning: if these half-calls are dropped, this may give rise to bias in analysis. For example, we have found that the HLA-DRB1 result:

  • allele 1 = (0301 or 1305), exported as 0 (missing)
  • allele 2 = (1101 or 1104), exported as 11

is much more common in controls than cases.
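One way to picture how such a half-call arises on export is the sketch below. It assumes, purely for illustration, that each allele is held as a list of candidate 4-digit names and that the export step falls back to the shared 2-digit group, or to 0 when the candidates disagree at the 2-digit level. The function name `export_allele` is hypothetical.

```python
def export_allele(candidates):
    """Collapse candidate 4-digit allele names to a single export code.

    Returns the 4-digit code when there is one candidate, the shared
    2-digit group when all candidates agree on it, and 0 (missing)
    otherwise. Codes are integers, so leading zeroes are dropped.
    """
    names = sorted(set(candidates))
    if not names:
        return 0
    if len(names) == 1:
        return int(names[0])
    groups = {name[:2] for name in names}
    if len(groups) == 1:
        return int(groups.pop())
    # Ambiguous at the 2-digit level: treated as missing.
    return 0


half_call = (export_allele(["0301", "1305"]), export_allele(["1101", "1104"]))
print(half_call)  # (0, 11)
```

Under these assumptions the genotype keeps its informative allele (11) even though the other allele is missing, which is why discarding half-calls wholesale can bias an analysis.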

Data recoding:

The Anthony Nolan webpage referenced above states:

The first two digits describe the type, which often corresponds to the serological antigen carried by an allotype. The third and fourth digits are used to list the subtypes, numbers being assigned in the order in which DNA sequences have been determined. Alleles whose numbers differ in the first four digits must differ in one or more nucleotide substitutions that change the amino acid sequence of the encoded protein. Alleles that differ only by synonymous nucleotide substitutions (also called silent or non-coding substitutions) within the coding sequence are distinguished by the use of the fifth and sixth digits. Alleles that only differ by sequence polymorphisms in the introns or in the 5' or 3' untranslated regions that flank the exons and introns are distinguished by the use of the seventh and eighth digits.
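As a small illustration of the digit convention quoted above, the following sketch splits an allele's digit string into its 2-digit fields. The function name `split_allele_name` and the field labels are assumptions made for this example, not part of any official nomenclature tool.

```python
def split_allele_name(digits):
    """Split an allele digit string (2, 4, 6 or 8 digits) into named fields."""
    if not digits.isdigit() or len(digits) % 2 or not 2 <= len(digits) <= 8:
        raise ValueError("expected 2, 4, 6 or 8 digits, got %r" % digits)
    # Hypothetical labels for the four 2-digit fields described in the quote.
    labels = ["type", "subtype", "synonymous", "noncoding"]
    fields = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    # zip stops at the shorter sequence, so shorter names give fewer fields.
    return dict(zip(labels, fields))


print(split_allele_name("0401"))
print(split_allele_name("03010101"))
```

Fields are kept as strings here so that leading zeroes survive; the numeric codings described later drop them.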

DYNAL SSO resolves to the level of a series of alternate alleles.

We typically reduce or recode these, and in analysis we use:

Gene        Codings
HLA-A       HLAA_2digit, HLAA_4digit
HLA-B       HLAB_Bw4_Bw6, HLAB_supertype, HLAB_2digit, HLAB_4digit
HLA-Cw      HLAC_2digit, HLAC_4digit, HLAC_NK (NK specificity)
HLA-DQB1    DQB1_4digit
HLA-DRB1    DRB1_34x, DRB1_2digit, DRB1_04subtypes, DRB1_epitope

Each gene is linked to a file of aggregate results from the National Child Development Study, aka 1958 British birth cohort, as available, showing frequencies for each allele (for alleles where n >= 10, for a total maximum n of 2 * 1829), remaining ambiguities, and the mappings between the various codings - as at 9th January 2008, these are in need of updating.

The rough geographical spread of the 1958 British birth cohort typed for HLA-DRB1 is also available.

A note on codings:

In each case, these codings can be considered to be simplifying mappings of the underlying data.

To give a concrete example, the codings used for HLA-DRB1:

First, all codes are numbers, and so leading zeroes are removed, and 0 is reserved to be a missing value.

The DRB1_34X variable comes about because HLA-DRB1 is the single most associated gene in type 1 diabetes (T1D), and the big association is a genotype effect related to having one each of 2 particular groups of alleles. Simplifying, these groups are the "03" alleles (mostly 0301 but also 0302), and the "04" alleles (mostly 0401, but also 0402, 0403, 0404, 0405, 0406, 0407, 0408, and even the relatively rare 0409 and 0413). This grouping is not just an artefact, but reflects ancient haplotypes that predate speciation - see e.g.:


Since DR51 (incl. DR1/10) and DR52 (incl. DR8) haplotypes seem to share a common ancestry, it is possible to divide all HLA-DR haplotypes into two evolutionarily related groups: DR53 group and non-DR53 group as direct descendants of the two primordial DRB genes, i.e., HLA-DRB1*04 and HLA-DRB1*03, respectively.

Thus a nice, simplified approximation of HLA risk in T1D is given by the 6 DRB1_34X genotypes:

3/3, 3/4, 3/(not 3 or 4), 4/4, 4/(not 3 or 4), and (not 3 or 4)/(not 3 or 4)

The (not 3 or 4) is traditionally (in T1D-circles) called X, which we code as 9. Therefore, DRB1_34X is represented by only 7 genotypes:

3/3, 3/4, 3/9, 4/4, 4/9, 9/9 and 0/0 for missing.
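The mapping from allele codes to these 7 genotypes might be sketched as below. The sketch assumes alleles arrive as integers with leading zeroes removed (so 0301 is 301, and a 2-digit call such as 11 stays 11), and that any genotype with a missing allele is reported as 0/0. The function names `to_34x` and `genotype_34x` are hypothetical.

```python
def to_34x(allele):
    """Map one DRB1 allele code (int, 0 = missing) to 3, 4, 9 (X) or 0."""
    if allele == 0:
        return 0
    # Codes below 100 are already 2-digit groups; 4-digit codes
    # (e.g. 401) are reduced to their group by dropping the last two digits.
    group = allele if allele < 100 else allele // 100
    return group if group in (3, 4) else 9


def genotype_34x(allele1, allele2):
    """Return the unordered DRB1_34X genotype as a sorted tuple."""
    a, b = to_34x(allele1), to_34x(allele2)
    # Assumption: a genotype with any missing allele is coded 0/0.
    if 0 in (a, b):
        return (0, 0)
    return tuple(sorted((a, b)))


print(genotype_34x(401, 301))   # (3, 4)
print(genotype_34x(405, 1101))  # (4, 9)
print(genotype_34x(0, 11))      # (0, 0)
```

Sorting the pair reflects the fact that the data are unphased, so 3/4 and 4/3 are the same genotype.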

The DRB1_2DIGIT coding is next up, in terms of increasing complexity: it gives the first 2 digits of all the (not 3 or 4) alleles. We only use this if some of the HLA typing lacks resolution.

The coding we have been most likely to analyse is DRB1_04SUBTYPES. This is like DRB1_2DIGIT, only [a] with 4-digit resolution for the "04" alleles, where known, and [b] some sense of classes for hard-to-resolve "04" alleles. This was introduced when it was discovered that some of the 04 alleles are higher risk than others.

The fullest coding is the DRB1_4DIGIT one, which is our best representation of the allele. Sometimes we can't distinguish a pair of alleles, in which case we include it by its 2 digit representation - e.g. if we can't tell 1101 from 1104, it will be included as 11.

Data access:

Data for the Warren family study are available from this website - please see further documentation.

Data for the 1958 Birth Cohort linked to ID is not available here, but must be requested through the 1958 Birth Cohort, who have chosen to delegate this role to the WTCCC Data Access Committee.

Data for the T1D cases that appeared in the WTCCC is also available from the WTCCC, while the status of data sharing for T1D cases in the T1DGC GWAS is currently unclear.

To date, we have delivered:

  • January 2005: HLA-DRB1, -DQB1 and -B data for up to 1829 subjects
  • January 2006: update, especially on "half-called" HLA-DRB1 genotypes
  • March 2007: update, with the addition of HLA-A genotypes
  • January 2008: update, with the addition of HLA-C genotypes
  • July 2008: general update, especially of WTCCC subjects
  • May 2009: general update, especially of WTCCC subjects, including data from T1DGC GWAS subjects

Please note: on export, data is presented as pairs of alleles for some or all of the above codings. Data is not phased.