Research Article: Immunology

Overlap and Effective Size of the Human CD8+ T Cell Receptor Repertoire


Science Translational Medicine  01 Sep 2010:
Vol. 2, Issue 47, pp. 47ra64
DOI: 10.1126/scitranslmed.3001442

Abstract

Diversity in T lymphocyte antigen receptors is generated by somatic rearrangement of T cell receptor (TCR) genes and is concentrated within the third complementarity-determining region (CDR3) of each chain of the TCR heterodimer. We sequenced the CDR3 regions from millions of rearranged TCR β chain genes in naïve and memory CD8+ T cells of seven adults. The CDR3 sequence repertoire realized in each individual is strongly biased toward specific Vβ-Jβ pair utilization, dominated by sequences containing few inserted nucleotides, and drawn from a defined subset comprising less than 0.1% of the estimated 5 × 10^11 possible sequences. Surprisingly, the overlap in the naïve CD8+ CDR3 sequence repertoires of any two of the individuals is ~7000-fold larger than predicted and appears to be independent of the degree of human leukocyte antigen matching.

Introduction

The antigenic specificity of T lymphocytes, which primarily recognize peptide antigens presented by class I and II molecules of the major histocompatibility complex (MHC), is in large part determined by the amino acid sequence in the hypervariable complementarity-determining region 3 (CDR3) of the α and β chains of the T cell receptor (TCR). The nucleotide sequences that encode the CDR3 regions are generated by somatic rearrangement of noncontiguous variable (V), diversity (D), and joining (J) region gene segments for the β chain, and V and J segments for the α chain. The existence of multiple V, D, and J gene segments in germline DNA permits substantial combinatorial diversity in receptor composition, and receptor diversity is further augmented by the deletion of nucleotides adjacent to the recombination signal sequences (RSSs) of the V, D, and J segments and template-independent insertion of nucleotides at the Vβ-Dβ, Dβ-Jβ, and Vα-Jα junctions.

Using a standard model (1) that allows for up to six nucleotide insertions each at the Vβ-Dβ and Dβ-Jβ junctions, together with updated human genome data, we calculated the possible CDR3 amino acid sequence diversity of the TCR β chain alone at ~5 × 10^11. We previously determined that the number of unique TCRβ CDR3 sequences expressed in peripheral blood T cells of a healthy adult is ~3 × 10^6 (2), and thus, the potential diversity of TCRβ CDR3 sequences far exceeds the diversity that is realized in one individual at one time. The magnitude of this diversity has to date made global comparison of the TCR repertoires present in different individuals virtually impossible.

Using a molecular and computational approach that exploits the sequencing capacity of the high-throughput Illumina Genome Analyzer (GA) II DNA sequencer (2), we can now directly determine, in a single experiment, the rearranged TCRβ CDR3 sequences carried in millions of αβ T cells. Moreover, the use of T cell genomic DNA, as opposed to complementary DNA, as a template for sequencing permits estimation of the relative frequency of each CDR3 sequence in the population of T cells. The development of these techniques now enables us to perform global comparisons of the actual TCRβ CDR3 repertoires in two or more individuals. Here, we examine the repertoire of TCRβ CDR3 sequences expressed in naïve and memory CD8+ T cells from the peripheral blood of seven healthy adults and assess the extent to which the CD8+ TCRβ CDR3 repertoires realized in each individual share specific sequences. We find that the observed repertoire overlap between any two individuals is several thousandfold larger than expected, and the repertoire of each individual is strongly biased toward sequences using specific Vβ and Jβ gene segments.

Results

We applied a high-throughput sequencing approach (2) to genomic DNA from purified naïve (CD45RO−, CD45RAhi, CD62Lhi) and memory (CD45RO+, CD45RAint/neg) CD8+ T cells and assessed the realized CD8+ TCRβ CDR3 sequence repertoire in the blood of seven healthy adults (Table 1). More than 5 million TCRβ CDR3 sequence reads were generated from ~1 million template genomes in each of the seven naïve and seven memory samples. A mean of 420,000 unique CDR3 nucleotide sequences was observed in the naïve samples, and 69,000 unique nucleotide sequences in the memory samples (Table 2).

Table 1

Donor characteristics. Demographic information; HLA-A, HLA-B, and HLA-C typing data; and EBV serostatus for the seven individuals who were studied in this report. “+” indicates detectable serum immunoglobulin G against the viral capsid antigen of EBV. Donors 1 and 3 are full siblings and the daughters of donor 2; the other four donors are unrelated.

Table 2

Summary of TCRβ CDR3 sequence data. The number of in-frame, readthrough and out-of-frame TCRβ CDR3 nucleotide sequence reads obtained from the CD8+CD45RO−CD45RAhiCD62L+ (naïve) and CD8+CD45RO+CD45RAlow (memory) T cell samples from each of the seven donors, and the corresponding numbers of unique in-frame CDR3 nucleotide sequences are indicated.


Utilization of Vβ, Dβ, and Jβ gene segments in the naïve and memory CD8+ repertoires

The frequency with which specific Vβ-Jβ combinations were used was highly variable in each of the seven individuals (Fig. 1, A and B). Although every possible Vβ-Jβ combination was observed, the frequency with which we saw specific combinations varied by more than 10,000-fold. Vβ-Jβ utilization was surprisingly consistent between individuals, however, especially for the rare Vβ-Jβ pairs, as reflected by the fact that the variance in Vβ-Jβ utilization was proportional to mean utilization. A fraction of the TCRβ CDR3 sequences in the genomic DNA from the naïve and memory CD8+ T cells of each of the seven donors was predicted to generate out-of-frame TCRβ transcripts that do not encode functional TCRβ chains (Table 2). The Vβ-Jβ utilization in the out-of-frame CDR3 sequences was highly nonuniform and qualitatively similar to that observed in in-frame transcripts (Fig. 1C). The variability of Vβ-Jβ utilization in the out-of-frame CDR3 sequences cannot be attributed to positive or negative selection in the thymus of T cells bearing specific receptors because these sequences do not generate proteins that participate in the selection process. The similarity in the utilization of specific Vβ-Jβ combinations in out-of-frame, nonfunctional and in-frame, functional TCRβ transcripts therefore suggests that the variability in Vβ-Jβ utilization in both sets of sequences is attributable, at least in part, to mechanisms that operate before the stage of thymic selection.

Fig. 1

Nonuniform utilization of Vβ-Jβ gene segment combinations in CD8+ T cells. (A and B) Mean utilization frequency of specific Vβ-Jβ gene segment combinations in the TCRβ CDR3 sequences expressed in naïve (A) and memory (B) CD8+ T cells of seven healthy adults. All 13 Jβ segments are indicated along one axis, and 38 of the 54 Vβ segments are indicated along the other axis. Combinations containing gene segments belonging to the Vβ5, Vβ6, and Vβ12 families are not displayed, because the segments in these families have extremely high sequence similarity at their 3′ ends and could not be unambiguously distinguished given the 60-nucleotide sequence reads obtained in this study. (C) Mean utilization frequency of specific Vβ-Jβ gene segment combinations in TCRβ CDR3 sequences observed in naïve CD8+ T cells and predicted to generate out-of-frame TCRβ transcripts that would not encode functional TCRβ chain proteins. The error bars in (A) to (C) indicate SD.

The observed frequencies of specific Vβ-Dβ-Jβ combinations suggest that rearrangement between Vβ and Dβ gene segments is random, whereas that between Dβ and Jβ gene segments is not (Fig. 2). The apparent nonrandom association between specific Dβ and Jβ gene segments is likely attributable to the organization of the TCRβ locus, in which Dβ1 lies 5′ of all 13 Jβ segments, whereas Dβ2 lies 3′ of the six members of the Jβ1 cluster but 5′ of the seven members of the Jβ2 cluster. The Dβ1 segment is observed at roughly equal frequency with all 13 Jβs, whereas Dβ2 is much more frequently paired with members of the Jβ2 compared with the Jβ1 family. Dβ2 is observed with members of the Jβ1 family about a third (0.30 ± 0.05) as often as would be expected if the pairing were random.
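The comparison with random pairing can be illustrated with a small calculation. The sketch below assumes that the count "expected if the pairing were random" is the product of the marginal Dβ and Jβ-cluster frequencies times the total; the counts are invented for illustration and are not data from this study, and the function name is hypothetical.

```python
def observed_to_expected(counts, d_segment, j_cluster):
    """Ratio of the observed pair count to the count expected if D and J
    were chosen independently (product of the marginal frequencies)."""
    total = sum(counts.values())
    d_total = sum(c for (d, _), c in counts.items() if d == d_segment)
    j_total = sum(c for (_, j), c in counts.items() if j == j_cluster)
    expected = d_total * j_total / total
    return counts[(d_segment, j_cluster)] / expected


# Invented counts of CDR3 sequences by (D segment, J cluster); not study data.
toy_counts = {
    ("D1", "J1"): 300, ("D1", "J2"): 300,
    ("D2", "J1"): 60, ("D2", "J2"): 340,
}

ratio = observed_to_expected(toy_counts, "D2", "J1")
print(f"D2-J1 observed/expected: {ratio:.2f}")
```

A ratio near 1 would indicate pairing consistent with independence; a ratio well below 1, as reported for Dβ2 with the Jβ1 cluster, indicates under-representation.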

Fig. 2

Relative utilization of specific Dβ-Jβ gene segment pairs in naïve and memory CD8+ T cells. (A and B) The proportion of all in-frame, readthrough TCRβ CDR3 sequences obtained from naïve or memory CD8+ T cells from each of the seven donors, and with either the Dβ1 or Dβ2 gene segments and a Jβ gene segment from either the Jβ1 or the Jβ2 cluster, is indicated.

Much of the predicted diversity in the TCRβ CDR3 repertoire is generated by nontemplated nucleotide insertions at the Vβ-Dβ and Dβ-Jβ junctions. The cumulative distribution of TCRβ CDR3 sequences observed in the CD8+ naïve and memory compartments, respectively, of the seven donors as a function of the number of junctional insertions demonstrates that sequences with 12 or more insertions were observed, but constitute only 10% of the total (Fig. 3, A and B). In contrast, more than 10% of the observed sequences had zero, one, or two insertions, and 50% of the sequences in each donor had six or fewer total insertions at the two junctions.

Fig. 3

Distribution of the total number of nucleotide insertions in CD8+ TCRβ CDR3 sequences. (A and B) Cumulative distribution of TCRβ CDR3 amino acid sequences in naïve (A) and memory (B) CD8+ T cells in the blood of seven healthy adults as a function of the total number of nucleotide insertions at the Vβ-Dβ and Dβ-Jβ junctions. The red lines were added to facilitate assessment of the fraction of sequences containing six or fewer total junctional insertions. An analogous cumulative distribution for unique amino acid sequences, in which each unique amino acid sequence is counted once instead of according to its copy number, is found in fig. S3. (C and D) The TCRβ CDR3 sequences observed in the CD8+ naïve and memory compartments, respectively, of each donor were compared to the complete set of unique sequences generated by a model of TCRβ VDJ rearrangement that allows deletion of up to 10 nucleotides adjacent to the RSS of the Vβ, Dβ, and Jβ gene segments, followed by less than or equal to the indicated number of total nucleotide insertions at the two junctions. The fraction of observed sequences in the seven donors that match a sequence generated by the model is shown for naïve cells in (C) and memory cells in (D). (E) The exact number of unique TCRβ CDR3 sequences predicted by the model for 0, 1, 2, …, 7 total insertions, as well as an estimate of the number of sequences predicted by the model for 12 total insertions.

Effective size of the CD8+ CDR3 repertoire

To determine whether the CD8+ TCRβ CDR3 repertoire in each individual is randomly sampled from the set of 5 × 10^11 possible sequences, we explicitly enumerated the complete set of CDR3 sequences predicted by a model of VDJ recombination that allowed up to 10 nucleotide deletions from the ends of the Vβ, Dβ, and Jβ gene segments adjacent to the RSS, followed by insertion of a total of up to seven nontemplated nucleotides at the Vβ-Dβ and Dβ-Jβ junctions (fig. S1). The model allowed the total CDR3 length to range from 27 to 69 nucleotides, encoding 9 to 23 amino acids, consistent with our experimentally observed sequence data. Generation of all unique CDR3 amino acid sequences containing a total of seven or fewer nucleotide insertions at the Vβ-Dβ and Dβ-Jβ junctions required ~10,000 central processing unit (CPU) hours on a 2.3-GHz processor (Materials and Methods). Comparison of the set of sequences observed in the naïve and memory CD8+ repertoires of each of the seven donors with the full set of sequences generated by the model (Fig. 3, C and D) reveals that 51.1 ± 3.5% of all the sequences observed in the naïve CD8+ compartment of each donor are found in the subset of 5.7 × 10^8 predicted sequences containing six or fewer total insertions (Fig. 3E). Thus, most TCRβ CDR3 amino acid sequences are drawn from a defined subset of less than 0.11% of the estimated 5 × 10^11 possible TCRβ sequences.

Large overlap of TCRβ CDR3 sequence repertoires between individuals

We previously demonstrated that at least 3 × 10^6 distinct TCRβ CDR3 amino acid sequences are found in the peripheral blood T cell compartment of an adult (2), which implies that any two individuals would be expected to share less than five CDR3 sequences if the TCRβ CDR3 repertoire in an individual were randomly chosen from a uniform distribution of 5 × 10^11 different sequences (Materials and Methods). The fact that the effective size of the possible CD8+ TCRβ CDR3 repertoire implied by our sequence data is far smaller than the estimated 5 × 10^11 possible sequences suggested that the overlap between the CD8+ CDR3 repertoires of different individuals might be significantly larger than expected. Indeed, when we compared the naïve CD8+ subsets of any two of the seven individuals who were studied, we found more than 10,000 identical TCRβ CDR3 amino acid sequences (Figs. 4 and 5A; Materials and Methods). The seven individuals include two sisters who had identical human leukocyte antigen–A (HLA-A), HLA-B, and HLA-C alleles; their mother; and four unrelated individuals of diverse ethnic and geographic origin who shared few HLA-A, HLA-B, or HLA-C alleles with each other or with the mother-daughter trio (Table 1). An overlap of more than 10,000 TCRβ CDR3 sequences was even observed between the naïve CD8+ repertoires of individuals 6 and 7, who shared no HLA-A, HLA-B, or HLA-C alleles. The overlap between the memory CD8+ repertoires of any two individuals was smaller, as expected, because the composition of each individual’s memory repertoire is determined by his or her cumulative history of antigenic exposures (Fig. 5A). Nonetheless, the mean overlap between the memory CD8+ TCRβ CDR3 repertoires of any two individuals exceeded 1000 sequences.

The pairwise overlap of the naïve CD8+ subsets of the seven donors predicted by our model of TCRβ VDJ rearrangement is 1.44 × 10^4 ± 1.66 × 10^3 sequences (Materials and Methods), which agrees closely with the overlap calculated between all 21 possible pairs of the seven individuals studied (Fig. 5A). CDR3 sequences that were shared between individuals had fewer inserted nucleotides than the mean number observed across the entire repertoire (Fig. 5, B and C).
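The pairwise comparison itself can be sketched as a set intersection over every pair of donors. The example below is only an illustration with invented toy sequences and a hypothetical function name; it counts sequences observed in both samples and does not perform the extrapolation to whole repertoires described in Materials and Methods.

```python
from itertools import combinations


def pairwise_overlaps(repertoires):
    """Map each unordered donor pair to the number of shared CDR3 sequences.

    repertoires: dict of donor label -> iterable of CDR3 amino acid strings.
    """
    sets = {donor: set(seqs) for donor, seqs in repertoires.items()}
    return {
        (a, b): len(sets[a] & sets[b])
        for a, b in combinations(sorted(sets), 2)
    }


# Invented toy repertoires; not sequences reported in this study.
toy = {
    "donor1": ["CASSLGQYF", "CASSPGTEAFF", "CASSQETQYF"],
    "donor2": ["CASSLGQYF", "CASSQETQYF", "CASSYNEQFF"],
    "donor3": ["CASSLGQYF", "CASRDNEQFF"],
}

overlaps = pairwise_overlaps(toy)
for pair, shared in overlaps.items():
    print(pair, shared)
```

With seven donors, the same function would return 21 entries, one per pair.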

Fig. 4

Calculation of the total number of TCRβ CDR3 amino acid sequence overlaps between the naïve CD8+ T cell compartments of two donors. The colored symbols indicate the actual number of TCRβ CDR3 amino acid sequences that were observed in both the naïve CD8+ T cell repertoire of donor 3 and that of the indicated other donors, identified with the sorting approach outlined in Materials and Methods. The solid lines indicate the regression models fitted to the observed data, and the y-axis intercepts at the right-hand edge of the figure (at x = 10 × 10^11) indicate the estimated number of CDR3 sequences that are shared between the entire repertoires of donor 3 and the indicated other donors.

Fig. 5

Characteristics of shared CD8+ TCRβ CDR3 amino acid sequences. (A) Number of shared sequences in the naïve and memory CD8+ CDR3 repertoires of every possible pair of individuals, with the HLA-identical sisters indicated by a blue circle, each of those sisters paired with their mother indicated by orange squares, and the remaining pairs, all of which contained two unrelated individuals, indicated by black triangles. There are 7!/(2!5!) = 21 different pairs. (B and C) Frequency distribution of shared (red) and nonshared (blue) CDR3 sequences observed in the naïve (B) and memory (C) CD8+ compartments of every possible pair of individuals as a function of the total number of nucleotide insertions at the Vβ-Dβ and Dβ-Jβ junctions.

Characteristics of TCRβ CDR3 sequences with few junctional insertions

Using the TCRβ CDR3 sequence to identify clonally derived T cells, we compared the CDR3 sequence repertoires of the naïve and memory CD8+ compartments of each of the seven donors and identified the subset of sequences that were observed in both compartments to track the fate of individual T cell clones. CDR3 sequences with high relative frequencies in the naïve CD8+ compartment were more likely than sequences with low relative frequencies to be observed in the memory compartment (Fig. 6). Confirming previous observations from our group (2) and others (3), we also observed that CDR3 sequence abundance is inversely correlated with total junctional insertions in both the naïve (Fig. 7A) and the memory CD8+ compartments. The higher frequency with which sequences carrying few or no junctional insertions were observed was not due to biased amplification or sampling, because the abundance of CDR3 sequences generating out-of-frame TCRβ transcripts showed no such dependence on the number of junctional insertions (Fig. 7B). Analyzing the subset of in-frame CDR3 sequences that were observed in both the naïve and the memory CD8+ compartments, we observed no correlation between the frequency with which individual sequences were observed in the naïve compartment and their frequency in the memory compartment (fig. S4). These results suggest that the size of CD8+ clones in the naïve compartment is correlated with their probability of entering the memory compartment, but not with their size in the memory compartment. By extension, these results imply that the high prevalence of clones in the memory compartment bearing receptors with few junctional insertions is not attributable to their high prevalence in the naïve compartment.

Fig. 6

Relative frequency of TCRβ CDR3 sequences observed in the CD8+ naïve compartment classified according to their appearance in the CD8+ memory compartment. The mean copy number in the naïve subset of CDR3 sequences observed in both the CD8+ naïve and the memory compartments of the seven donors is higher than the mean copy number of sequences observed only in the naïve compartment.

Fig. 7

The high prevalence of CD8+ TCRβ CDR3 sequences with few nucleotide insertions is not due to biased amplification or sampling. (A and B) Relative abundance of (A) in-frame, readthrough and (B) out-of-frame TCRβ CDR3 sequences as a function of the total number of nucleotide insertions at the Vβ-Dβ and Dβ-Jβ junctions observed in naïve CD8+ T cells from each of the seven donors studied.

The preponderance of CDR3 sequences with few nontemplated insertions in the CD8+ TCRβ repertoire, particularly in the memory compartment, suggests that the capacity to insert nucleotides at the Vβ-Dβ and Dβ-Jβ junctions may not be required for many CD8+ immune responses. Mice deficient for terminal deoxynucleotidyl transferase, the enzyme that catalyzes the template-independent insertion of nucleotides at the junctions, have 10-fold less diversity in their TCR CDR3 repertoires, with few insertions, yet these mice appear healthy, make efficient and specific immune responses, and display no increased susceptibility to infection (4, 5). This, in turn, suggests the possibility that the Vβ, Dβ, and Jβ segment sequences that contribute to recurrently generated TCRs could be subjected to evolutionary pressures favoring sequences recognizing antigens from common pathogens, because these sequences are present in the germline. Indeed, components of the CD8+ T cell response to ubiquitous pathogens such as Epstein-Barr virus (EBV) are characterized by highly conserved TCRβ CDR3 amino acid sequences that are found in multiple individuals and encoded by nucleotide sequences with few junctional insertions (3, 6, 7). We looked for 12 such “public” TCRβ CDR3 sequences that have been associated with the CD8+ response to EBV in individuals who express either HLA-A*0201 or HLA-B*0801 and detected 5 HLA-A*0201–associated, EBV-specific CDR3 sequences in the memory CD8+ compartments of donors 1 and/or 3, both of whom are HLA-A*0201+, and an HLA-B*0801–associated, EBV-specific CDR3 sequence in the memory compartment of donor 7, who is HLA-B*0801+. None of these responses were detected in the other four donors, all of whom were HLA-A*0201− and HLA-B*0801− (table S1).

The observation of the HLA-A*0201– and HLA-B*0801–associated, EBV-specific CDR3 sequences only in the three donors expressing one of the associated HLA alleles was statistically significant (P = 0.0002 by two-tailed Fisher’s exact test; table S2).

Discussion

The advent of high-throughput DNA sequencing technology has enabled not only direct determination of the number of distinct TCRβ CDR3 sequences found in the repertoire of a single individual (2), but also, as demonstrated in this study, quantitative comparison of the repertoires realized in two or more individuals. We previously demonstrated that the number of unique CDR3 sequences found in the repertoire of a single individual is 3 × 10^6 to 4 × 10^6 (2), which, although large, constitutes a negligible fraction of the estimated 5 × 10^11 theoretically possible TCRβ CDR3 sequences. If the 3 × 10^6 to 4 × 10^6 sequences present in each individual were randomly chosen from a uniform distribution of 5 × 10^11 sequences, one would expect fewer than 5 sequences shared between any two individual repertoires. We tested this prediction by comparing the TCRβ CDR3 sequence repertoires present in pairs of healthy adults and found that the actual repertoire overlap between any two individuals is several thousandfold larger than predicted. Moreover, the degree of overlap does not appear to depend strongly on the extent of HLA-A, HLA-B, or HLA-C matching. This finding implies that individual TCRβ sequence repertoires are not randomly selected from the space of possible sequences, that the sequences in this space are not uniformly probable, or both. Distinguishing between these two possibilities is important, because the first possibility would be consistent with convergent evolution during T cell development, and the second implies that the sequence space of receptors in the cellular adaptive immune system is much smaller than presently believed. Convergent evolution is the possibility that a diverse set of TCRs rearranges in the thymus, and the positive and negative selection process favors the same lower-diversity subset of TCRs in each individual.

Arranging the TCRβ CDR3 sequences observed in each donor according to the number of junctional insertions demonstrates that the sequences are not randomly sampled from the set of possible sequences. Indeed, most sequences in each donor are sampled from a small corner of the set of possible sequences, from which one concludes that the probability distribution of the sequences in the possible sequence space is not uniform. To determine whether the distribution is sufficiently nonuniform to account for the larger than expected observed overlaps between individual repertoires, we created a simple model that assumes that the probability distribution is uniform within each subset of sequences carrying a specified number of insertions. The model lets the fraction of each subset (with a specified number of insertions) equal the empirically determined value. Given this model, we predicted the expected overlap when the CDR3 sequences are randomly sampled from this distribution. We found that the predicted overlap is ~14,000, which is indistinguishable from the observed overlap in the cohort. Thus, one does not need to invoke nonrandom selection of the sequences in the possible sequence space to account for the observed overlap. In other words, we find no evidence of convergent evolution.

Our results suggest that the small effective size of the CD8+ CDR3 sequence repertoire is primarily attributable to the fact that CDR3 sequences with large numbers of junctional insertions have very low probability. The number of nucleotides inserted into a given junction varies according to a probability distribution, which we have empirically determined from our sequence data. Sequences with few junctional insertions are more probable than those with many insertions and, consequently, are more likely to be observed in multiple individuals.

The significant overlap that exists between the CDR3 repertoires of unrelated individuals suggests that the phenomenon of public T cell responses to common pathogens may be more common than previously thought, and has potential implications for the diagnosis of disease. Public CDR3 sequences associated with the response to specific foreign or self-antigens could possibly serve as surrogate markers for infections or autoimmune diseases, respectively, and such sequences would have considerable clinical significance if they could be detected before the onset of disease. Although the considerable polymorphism in HLA alleles might potentially confound the interpretation of such a diagnostic, we note that risk for the development of many autoimmune diseases is also strongly linked to specific HLA alleles.

Interest in the application of high-throughput sequencing techniques to the study of adaptive immune receptors is growing rapidly (2, 8–12). Here, we have compared the TCRβ CDR3 repertoires of different individuals. A previous study of the immunoglobulin heavy (IGH) chain CDR3 sequence repertoire in zebrafish (8) found a fish-to-fish overlap that was much higher than expected by chance. Without further study of zebrafish similar to our analysis presented here for human TCRs, one cannot resolve the origin of the overlap as convergent evolution or a bias in the VDJ recombination mechanism. Application of high-throughput sequencing to the human IGH CDR3 sequence repertoire (10, 11) will likely provide useful insights into this question.

Published studies of the human TCRβ CDR3 sequence repertoire with high-throughput sequencing technologies (2, 9, 12) have used a variety of presequencing molecular strategies, sequencing platforms, and analytical approaches to assess the repertoire realized in single individuals (2, 12) or a pool of 550 individuals (9). Because of the different molecular strategies used in the three studies, the number of unique TCRβ CDR3 nucleotide sequences observed varied widely, from a low of ~34,000 [derived from analysis of 40.5 million primary sequence reads obtained on the Illumina GA platform (9)] to more than 500,000 unique sequences [obtained from a comparable number of reads on the Illumina platform (2)]. In our study of TCRβ CDR3 repertoires realized in the naïve and memory CD8+ compartments of seven healthy adults, we observed ~3 million unique CDR3 sequences from ~40 million primary sequence reads on the Illumina platform and compared the repertoires of different individuals.

Analysis of the crystal structure of several ternary αβTCR–peptide–MHC class I complexes [reviewed in (13)] has revealed that the CDR3 loop of the TCRβ chain primarily makes contact with bound peptide, rather than the α1 and α2 helices of the MHC class I heavy chain. The identification of a much larger than expected overlap in the naïve CD8+ CDR3 repertoires of individuals with few, if any, shared HLA-A, HLA-B, or HLA-C alleles suggests that the ensemble of self-peptides that participates in positive and negative selection of the T cell repertoire may likewise share significant overlap despite the distinct peptide-binding characteristics of different HLA alleles.

Materials and Methods

Isolation and purification of naïve and memory CD8+ T cells

After obtaining written informed consent, we isolated naïve and memory CD8+ T cells from blood of seven healthy adults by flotation on Ficoll-Hypaque. CD8+ T cells were enriched from freshly isolated peripheral blood mononuclear cells by immunomagnetic selection with CD8 microbeads (Miltenyi) and then labeled with CD8–Pacific Blue (RPA-T8), CD45RO–fluorescein isothiocyanate (UCHL1), CD45RA-allophycocyanin (HI100), and CD62L-phycoerythrin (Dreg 56; all antibodies obtained from BD Biosciences). Naïve and memory CD8+ subsets were flow-cytometrically purified with FACSAria (BD Biosciences), equipped with 405-, 488-, and 633-nm lasers, and FACSDiva acquisition software. Naïve CD8+ T cells were identified as CD8+/CD45RAhi/CD45RO−/CD62Lhi events, and memory CD8+ T cells were identified as CD8+/CD45RAint/neg/CD45RO+ events. Total genomic DNA was extracted from sorted cells with the QIAamp DNA Blood Mini kit (Qiagen). The approximate mass of a single haploid genome is 3 pg. To sample millions of rearranged TCRβ CDR3 regions in each T cell compartment, we isolated 6 to 27 μg of template DNA from each compartment.

Determination of EBV serostatus

EBV immune status of the seven individuals studied in this report was determined by the University of Washington Virology Laboratory.

Sequencing of CDR3 regions from rearranged TCRβ genes

Polymerase chain reaction amplification and sequencing of rearranged TCRβ CDR3 regions were performed as described (2).

Preprocessing of GA sequence data

Raw GA sequence data were preprocessed to remove errors in the primary sequence of each read and to compress the data, as previously described (2).

Identification of TCRβ CDR3 sequences and VDJ decomposition

The TCRβ CDR3 region was identified according to the definition previously established by the International ImMunoGeneTics collaboration (14). Identification of the Vβ, Dβ, and Jβ gene segments contributing to each TCRβ CDR3 sequence was performed with a standard algorithm (14).

Explicit generation of TCRβ CDR3 sequences

All possible TCRβ CDR3 nucleotide sequences containing a total of n = 0, 1, 2, 3, …, 7 nucleotide insertions at the Vβ-Dβ and Dβ-Jβ junctions were systematically generated with the model of VDJ rearrangement shown in fig. S1. The model was governed by a set of rules that were empirically determined from the observed TCRβ CDR3 sequence data from the seven donors. The model allowed all possible V-D-J combinations, deletion of up to 10 nucleotides from the 3′ end of the Vβ segment and from the 5′ end of the Jβ segment, deletion of nucleotides from the 5′ and 3′ ends (up to complete deletion) of the Dβ segment, and up to seven total junctional insertions. The total length of the CDR3 sequence, defined as the interval from the codon for the conserved cysteine at the 3′ end of the Vβ gene segment to the codon for the conserved phenylalanine in the 5′ portion of the Jβ gene segment, was constrained such that it could encode from 9 to 23 amino acid residues. Generation of the CDR3 amino acid sequences containing a total of seven junctional insertions was performed on a 50-node Linux cluster with eight CPUs per node and required 24 hours to complete. Each CDR3 nucleotide sequence was translated into the corresponding amino acid sequence and then stored in a binary tree ordered alphabetically. For each new sequence not found in the tree, a new leaf was created with that sequence. For each sequence already found in the tree, the count for the corresponding leaf was incremented by 1.
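The translate-and-count step can be sketched as follows. This is an illustration only: it substitutes a hash-based counter (Python's `collections.Counter`) for the alphabetically ordered binary tree described above, uses the standard genetic code, and the function names and input sequences are hypothetical.

```python
from collections import Counter

# Standard genetic code, indexed by codon position in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nucleotides):
    """Translate an in-frame nucleotide string into amino acids."""
    if len(nucleotides) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(
        CODON_TABLE[nucleotides[i:i + 3]]
        for i in range(0, len(nucleotides), 3)
    )


def count_amino_acid_sequences(nucleotide_seqs):
    """Count how many nucleotide sequences encode each amino acid sequence."""
    return Counter(translate(seq) for seq in nucleotide_seqs)


# Invented fragments: the first two differ in nucleotides but encode the
# same amino acids, so they share one entry with a count of 2.
counts = count_amino_acid_sequences(
    ["TGTGCCAGCAGC", "TGCGCTAGTTCA", "TGTGCCAGCTGG"]
)
print(counts)
```

A hash-based counter gives the same per-sequence counts as a tree; the tree additionally keeps the sequences in sorted order, which the counter does not.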

Calculation of expected overlap between TCRβ CDR3 amino acid sequence repertoires

(i) Expected overlap if all 5 × 10^11 TCRβ CDR3 amino acid sequences are equally likely: The general calculation is in (ii), below. Here, n ≈ 1 × 10^6 is the total number of unique TCRβ CDR3 amino acid sequences in the naïve CD8+ compartment and f is n/(5 × 10^11).

$$E[O] \approx n \times f$$

Therefore, the expected overlap is ≈2.
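The arithmetic in (i) can be checked with a few lines. This is a minimal sketch using the values of n and the total number of possible sequences stated above; the function name is hypothetical.

```python
def expected_overlap_uniform(n, total_possible):
    """Expected number of shared sequences when two repertoires of n
    sequences are each drawn uniformly from total_possible sequences."""
    f = n / total_possible  # chance that a given sequence is in one repertoire
    return n * f


n = 1_000_000                     # unique sequences, naive CD8+ compartment
total_possible = 500_000_000_000  # 5 x 10^11 possible sequences

overlap = expected_overlap_uniform(n, total_possible)
print(f"expected overlap: {overlap:.1f}")
```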

(ii) Estimate of expected overlap from the empirical distribution of sequences by number of insertions: The expected overlap O is the sum of the expected overlaps from the sequences with different numbers of insertions labeled by k.

$$E[O] = \sum_{k=0}^{15} E[O_k]$$

We have explicitly generated the set of all TCRβ sequences with a fixed number k of insertions (Table 1), which we call $T_k$. We have also empirically determined the fraction $f_k$ of the number of TCRβ CDR3 sequences found in blood carrying k junctional insertions (see Fig. 3A). Multiplying $f_k$ by $n$, where $n$ is the total number of TCRβ CDR3 sequences in the naïve CD8+ compartment as determined in (2), gives $n_k = f_k \times n$, the number of unique TCRβ CDR3 sequences in the blood with a fixed number of insertions k.

The estimate of the number of sequences with k insertions that overlap between two individuals is equivalent to the problem of drawing $n_k$ elements twice from a total distribution with $T_k$ elements and determining the number of matches. (Here we are assuming that each sequence with a fixed number of insertions is equally likely.) The expectation is determined from an approximately binomial distribution:

$$E[O_k] = \sum_{i=1}^{n_k} i \binom{n_k}{i} f_k^{\,i} (1 - f_k)^{n_k - i}$$

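We calculate this using the generating function:

$$M(t) = \sum_{i} e^{ti} \binom{n_k}{i} f_k^{\,i} (1 - f_k)^{n_k - i} = \left[ f_k e^{t} + (1 - f_k) \right]^{n_k}$$

$$E[O_k] = M'(0) = n_k f_k$$

$$E[O] \approx \sum_{k=0}^{15} n_k f_k$$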
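
For readers who want the intermediate step, the derivative of the generating function can be written out explicitly. This is a standard binomial identity, with $f_k$ treated simply as the per-draw probability appearing in the binomial expression:

$$M'(t) = n_k f_k e^{t} \left[ f_k e^{t} + (1 - f_k) \right]^{n_k - 1}, \qquad M'(0) = n_k f_k \left[ f_k + (1 - f_k) \right]^{n_k - 1} = n_k f_k .$$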

Inserting our empirical data, we find that $E[O] \approx 1.54 \times 10^{4}$, which is consistent with the observed pairwise overlap.

Estimation of the actual overlap between different TCRβ CDR3 repertoires

We previously established a lower bound for the total number of unique TCRβ CDR3 amino acid sequences expressed in the naïve and memory CD8+ T cells circulating in the peripheral blood of a healthy adult (2). Only a fraction of these sequences can be observed experimentally, because the volume of blood that can be sampled at one time represents a small fraction of the total blood volume. The sets of TCRβ CDR3 sequences that are observed experimentally in each donor are enriched for sequences expressed in CD8+ T cell clones that are common in the blood. Estimating the CDR3 sequence overlap between two entire T cell repertoires must therefore take into account the shape of the empirically observed distribution to avoid overestimation. To estimate the total number of TCRβ CDR3 sequences that are shared between the CD8+ T cell compartments of two individuals, we use a nonlinear regression approach in which we fit the observed overlap data with a simple two-parameter model, $Y = aX^{b}$, where X is the input variable described below, Y is the number of overlapping sequences, and (a, b) are parameters to be estimated from the data.

We first sort the unique TCRβ CDR3 amino acid sequences encoded by the in-frame, readthrough TCRβ CDR3 nucleotide sequences observed in each donor in descending order according to their observed relative frequency. Let $N_j$ be the total number of amino acid sequences in donor j (j = 1, 2), and let $n_{ij}$ be the top (or most frequent) $n_i$ sequences from donor j, where i indexes the ith sequence. We then define the ith input variable, $X_i = n_{i1} \times n_{i2}$, as the area equal to the maximum possible number of overlaps between $n_{i1}$ and $n_{i2}$ sequences. The outcome $Y_i$ is the observed number of overlaps between the $n_{i1}$ sequences from donor 1 and the $n_{i2}$ sequences from donor 2. The model parameters (a, b) are chosen to minimize the error between the observed number of overlaps and the regression model with the nonlinear least-squares method. The colored symbols in Fig. 4 represent the observed pairwise overlap data ($X_i$, $Y_i$), where $X_i = n_{i1} \times n_{i2}$, between the naïve CD8+ compartments of donor 3 and the other six donors; the black lines show the fit of the regression model, $Y = aX^{b}$. The $R^{2}$ values for all of the fitted curves are greater than 0.998, which demonstrates that the model accurately fits the observed data. We previously identified ~$1 \times 10^{6}$ as a lower bound for the number of unique TCRβ CDR3 amino acid sequences in the naïve CD8+ T cell compartment of a healthy adult (2). To estimate the minimum overlap between the entire naïve CD8+ repertoires of two donors, we compute $Y = aX^{b}$ for $i = 1 \times 10^{6}$ and therefore, $X_i = n_{i1} \times n_{i2} = 1 \times 10^{12}$. These values for the pairwise overlaps between the naïve CD8+ compartments of donor 3 and the other six donors are indicated at the right-hand side of the graph in Fig. 4. Figure S2 shows an analogous set of calculations for the TCRβ CDR3 sequence overlaps between the naïve and the memory CD8+ compartments of the seven donors.
To estimate the overlap between the naïve and memory CD8+ compartments of a single donor, however, we use $i = 1 \times 10^{6}$ for the naïve compartment and $i = 3 \times 10^{5}$ for the memory compartment, because the lower bound for the number of unique TCRβ CDR3 amino acid sequences in the memory CD8+ compartment is ~$3 \times 10^{5}$ (2). The calculated overlaps between the naïve and memory CD8+ TCRβ CDR3 sequence repertoires for the full set of 7!/(2!5!) = 21 pairwise comparisons are summarized in Fig. 5A.
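
The fit-and-extrapolate procedure for the two-parameter power-law model can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in the study: the data are synthetic (generated from hypothetical parameters a = 0.5, b = 0.35), and the least-squares minimization is done by a golden-section search over the exponent with the best prefactor solved in closed form for each trial exponent, which is one of several ways to perform nonlinear least squares.

```python
import math

def fit_power_law(xs, ys, b_lo=0.0, b_hi=2.0, iters=100):
    """Least-squares fit of Y = a * X**b; returns (a, b)."""
    def best_a(b):
        # For a fixed exponent b, the least-squares prefactor has a closed form.
        num = sum(y * x ** b for x, y in zip(xs, ys))
        den = sum(x ** (2 * b) for x in xs)
        return num / den

    def sse(b):
        a = best_a(b)
        return sum((y - a * x ** b) ** 2 for x, y in zip(xs, ys))

    # Golden-section search for the exponent that minimizes the squared error.
    inv_phi = (math.sqrt(5) - 1) / 2
    lo, hi = b_lo, b_hi
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = sse(c), sse(d)
    for _ in range(iters):
        if fc < fd:
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = sse(c)
        else:
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = sse(d)
    b = (lo + hi) / 2
    return best_a(b), b

# Synthetic example: top-n_i sequences taken from each of two donors.
depths = [1_000, 3_000, 10_000, 30_000, 100_000, 200_000]  # hypothetical n_i
xs = [n * n for n in depths]              # X_i = n_i1 * n_i2 (equal depths here)
TRUE_A, TRUE_B = 0.5, 0.35                # hypothetical parameters
ys = [TRUE_A * x ** TRUE_B for x in xs]   # noiseless synthetic overlaps Y_i

a_hat, b_hat = fit_power_law(xs, ys)
# Extrapolate to the full naive repertoires: X = 1e6 * 1e6 = 1e12.
predicted_overlap = a_hat * (1e6 * 1e6) ** b_hat
print(a_hat, b_hat, predicted_overlap)
```

For the naïve-versus-memory comparison within a donor, the same fitted curve would be evaluated at X = 1e6 × 3e5 instead.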

Evaluation of correlation between observation of public EBV-associated TCRβ CDR3 sequences and expression of the corresponding class I MHC allele

We use a two-sided Fisher’s exact test to test the hypothesis that there is a correlation in our sequence data between the observation of 12 public (that is, found in multiple individuals) EBV-associated TCRβ CDR3 amino acid sequences reported in the literature and the expression of their associated class I MHC–restricting elements. Previous studies have identified at least 11 public TCRβ CDR3 amino acid sequences used by CD8+ T cells specific for EBV-encoded peptides presented by HLA-A*0201 (7), and 1 public CDR3 sequence specific for an EBV-encoded peptide presented by HLA-B*0801 (6). The seven individuals in our study included two individuals expressing HLA-A*0201 and one individual expressing HLA-B*0801 (Table 1). Thus, there are 2 × 11 = 22 CDR3 sequence–HLA-A*0201 combinations and one CDR3 sequence–HLA-B*0801 combination possible in our data set that would support a correlation between the observation of a public EBV-specific CDR3 sequence and the expression of the associated MHC-restricting allele. Similarly, there are 5 × 11 = 55 possible CDR3 sequence–HLA-A*0201 combinations and 6 × 1 = 6 possible CDR3 sequence–HLA-B*0801 combinations, for a total of 61 combinations, that would not support a correlation. We observed five of the public EBV-specific CDR3 sequences associated with HLA-A*0201 in the memory CD8+ compartments of one or both of the HLA-A*0201+ donors, but in none of the five HLA-A*0201− donors, and observed the public EBV-specific CDR3 sequence associated with HLA-B*0801 in the memory CD8+ compartment of the sole HLA-B*0801+ donor but in none of the HLA-B*0801− donors. Thus, 6 of the 23 possible public CDR3 sequence–MHC allele combinations that would support a correlation, and none of the 61 possible public CDR3 sequence–MHC allele combinations that would not support a correlation, were observed in our data (table S1).
To test the null hypothesis of no correlation, we use Fisher’s exact test (table S2), which yields a P value of 0.0002; we therefore reject the null hypothesis.
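
The test can be reproduced from the counts stated above with a short, dependency-free Python sketch. The 2 × 2 layout used here (rows: combinations that would or would not support a correlation; columns: observed or not observed) is an assumption about how table S2 is arranged, and the function name is illustrative.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Supporting combinations: 6 observed, 17 not observed (23 total).
# Non-supporting combinations: 0 observed, 61 not observed (61 total).
p_value = fisher_exact_two_sided(6, 17, 0, 61)
print(f"P = {p_value:.4f}")
```

With these counts the observed table is the most extreme one possible, so the two-sided P value equals the single hypergeometric term for that table.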

Supplementary Material

www.sciencetranslationalmedicine.org/cgi/content/full/2/47/47ra64/DC1

Fig. S1. Model of TCRβ VDJ rearrangement used to calculate the exact number of unique TCRβ CDR3 amino acid sequences containing specified numbers of total nucleotide insertions at the Vβ-Dβ and Dβ-Jβ junctions.

Fig. S2. Calculation of the total number of TCRβ CDR3 amino acid sequence overlaps between the naïve and the memory CD8+ compartments of individual donors.

Fig. S3. Cumulative distribution of the unique TCRβ CDR3 amino acid sequences in naïve and memory CD8+ T cells observed in the seven donors as a function of the total number of nucleotide insertions at the Vβ-Dβ and Dβ-Jβ junctions.

Fig. S4. Observed relative frequencies of specific TCRβ CDR3 amino acid sequences in the naïve and memory CD8+ compartments of each of the seven donors.

Table S1. Identification of previously described public TCRβ CDR3 sequences associated with CD8+ T cell responses to EBV in HLA-A*0201+ or HLA-B*0801+ individuals.

Table S2. Fisher’s exact test to test the hypothesis of a correlation between the observation of public EBV-associated TCRβ CDR3 sequences and the expression of the corresponding class I MHC allele.

Footnotes

  • These authors contributed equally to this work.

  • Citation: H. S. Robins, S. K. Srivastava, P. V. Campregher, C. J. Turtle, J. Andriesen, S. R. Riddell, C. S. Carlson, E. H. Warren, Overlap and effective size of the human CD8+ T cell receptor repertoire. Sci. Transl. Med. 2, 47ra64 (2010).

References and Notes

  1. Acknowledgments: We thank M. McIntosh and S. McCarroll for helpful discussions. Funding: This study was supported by grants from NIH (CA106512, CA015704, and DK056465), the Fred Hutchinson Cancer Research Center Interdisciplinary Dual Mentor Fellowship, the Thomsen Family Fellowship, and a gift from the Bob and Pat Herbold Foundation. Author contributions: H.S.R., P.V.C., S.R.R., E.H.W., and C.S.C. designed the research; J.A. recruited subjects, acquired written informed consent from subjects, and arranged for collection of blood samples; C.J.T. purified naïve and memory CD8+ T cells from peripheral blood mononuclear cells by fluorescence-activated cell sorting; P.V.C. extracted genomic DNA from T cells and constructed the sequencing libraries; H.S.R., S.K.S., C.S.C., and E.H.W. analyzed results and made the figures; H.S.R., S.K.S., C.S.C., and E.H.W. wrote the paper. All authors discussed the results and commented on the paper. Competing interests: The Fred Hutchinson Cancer Research Center, on behalf of H.S.R., C.S.C., and E.H.W., has applied for a patent on the method for high-throughput sequencing of TCRβ CDR3 sequences used in this report.