Reports | Regulatory Science

Impact of guidance documents on translational large animal studies of cartilage repair


Science Translational Medicine  21 Oct 2015:
Vol. 7, Issue 310, pp. 310re9
DOI: 10.1126/scitranslmed.aac7019

Advise and consent

Cicero¹ was wrong: Sometimes someone CAN give you wiser advice than yourself. In the case of preclinical study design, sound advice can come from those who regulate the next steps in translation. For example, new therapy development for cartilage repair relies on costly preclinical studies in large animal models—horses, pigs, sheep, and others—for translation to human clinical trials. To ensure the efficacy and safety of these therapies before testing in human subjects, regulatory agencies such as the U.S. Food and Drug Administration and the European Medicines Agency have published guidance documents that outline best practices for study design and execution as well as appropriate metrics for data analysis. These documents were expected to influence the design and execution of large animal studies. Here, Pfeifer et al. put that hypothesis to the test with a systematic review of more than 100 publications stretching back two decades.

Surprisingly, and somewhat disappointingly, the availability of the documents appears to have little effect on adherence by researchers to the recommendations therein. Although overall adherence has been slowly increasing over time, the authors observed no correlation with the publication of guidance documents. In fact, the authors subjected one of their recent studies to the same analysis and found that it fared no better than average for the industry when it came to adherence to the regulatory guidance documents. Possible explanations for poor compliance include access to necessary infrastructure and expertise, differing intents for the data, and expense. The authors call for an increased effort by the field to conform to the guidance documents and for regulatory agencies to provide additional resources to facilitate the transition.

¹Cicero (106 BC–43 BC) was a Roman philosopher credited with counseling that “Nobody can give you wiser advice than yourself.”


Promising therapies for cartilage repair are translated through large animal models toward human application. To guide this work, regulatory agencies publish recommendations (“guidance documents”) to direct pivotal large animal studies. These are meant to aid in study design, outline metrics for judging efficacy, and facilitate comparisons between studies. To determine the penetrance of these documents in the field, we synthesized the recommendations of the American Society for Testing and Materials, U.S. Food and Drug Administration, and European Medicines Agency into a scoring system and performed a systematic review of the past 20 years of preclinical cartilage repair studies. Our hypothesis was that the guidance documents would have a significant impact on how large animal cartilage repair studies were performed. A total of 114 publications meeting our inclusion criteria were reviewed for adherence to 24 categories extracted from the guidance documents, including 11 related to study design and description and 13 related to study outcomes. Overall, a weak positive trend was observed over time (P = 0.004, R2 = 0.07, slope = 0.63%/year), with overall adherence (the sum of study descriptors and outcomes) ranging from 32 ± 16% to 58 ± 14% in any individual year. There was no impact of the publication of the guidance documents on adherence (P = 0.264 to 0.50). Given that improved adherence would expedite translation, we discuss possible reasons for poor adherence and outline approaches to promote more widespread adoption of the guidance documents.


Articular cartilage defects are common (1–4) and, when left untreated, can lead to disabling joint disease (5). Hence, there has been considerable focus on developing new treatments. The field of regenerative cartilage therapeutics has evolved substantially since the advent of bone marrow stimulation and abrasion arthroplasty techniques (6, 7) to include both autograft and allograft osteochondral transplantation (8, 9) and the widespread use of cell-based autologous chondrocyte implantation (ACI) (10). Likewise, regenerative medicine–based approaches have been developed to improve on these techniques [for review, see (11)], including in vitro–grown engineered constructs (12–14), biomaterial-based cell delivery systems (for next-generation ACI), and materials that serve as adjuvants to repair (used in conjunction with microfracture), with several “first-in-human” clinical trials of new repair technologies reported over the last few years (15–17). The appearance of these reports speaks to the intense innovation occurring within this domain to provide both functional and durable repair of cartilage injuries.

In most (if not all) cases, translation of emerging regenerative approaches to clinical practice involves preclinical evaluation in an animal model. Large animal models (in particular, equine, ovine, caprine, porcine, and canine) are commonly used for final (pivotal) preclinical studies. Critical evaluation of new technologies in large animals can highlight the safety and efficacy of new therapies and also redirect designs to improve efficacy when preliminary versions fail. A number of recent reviews outline the strengths and weaknesses of these various large animal repair models (18–21).

Beyond selecting a species in which to perform the pivotal trial, investigators must carry out a well-performed study that will be accepted by their peer community and by the regulatory agencies that will consider the data generated toward the initiation of a human clinical trial. To that end, several resources exist to orient investigators to best practices, including guidance documents published by governing agencies such as the U.S. Food and Drug Administration (FDA; first draft 2007) (22) and the European Medicines Agency (EMA; first draft 2008) (23), as well as expert panel reports by the American Society for Testing and Materials (ASTM International; first draft 2005) (24) and the International Cartilage Repair Society (ICRS) (20). These guidance documents provide specific details related to study design, desired time points, and endpoint analyses to guide the proper execution of the study (tables S1 and S2). In each document, however, there is room for interpretation, acknowledging ongoing developments in the field and the difficult decisions that are made in carrying out such studies. For example, the language on an appropriate defect location is vague, recommending that investigators choose an “orthotopic” (23) or “similar to human” (22) site (see table S1). Furthermore, these documents provide guidance but are not per se mandates, and so investigators can make their own decisions as to the appropriate study design and the type and number of outcome assays used.

Given the maturation of the field of cartilage repair as well as the recent publication of these guidance documents, we were curious as to the extent to which the field had incorporated these recommendations into their work. To answer this question, we performed a systematic review inclusive of the most common large animal models, capturing data from full-length peer-reviewed publications over the last two decades (1994 to 2014, see the Supplementary Materials publication listing). We then reviewed the extant literature that met our inclusion criteria (studies reporting on repair of chondral defects in large animal species) and used quantitative analysis to determine the degree to which these studies adhered to the guidance document recommendations.

Our overall hypothesis was that the field as a whole would show a quantifiable improvement in adherence to the guidance documents over the time period considered. Given that these documents appeared during the period of our review (for example, 2005, 2007, and 2008 for ASTM, FDA, and EMA, respectively), we postulated that there would be an inflection in the trajectory of adherence, with greater adherence after the publication of the guidance documents. We further expected that those studies with longer-term time points and those using larger species (both of which would increase study costs) would show a greater adherence to these published criteria, because these would more likely represent pivotal rather than preliminary studies.


The initial review of the extant guidance documents defined the categories for subsequent analysis. These are listed in table S1 for study descriptors and table S2 for study outcomes, according to the governing agency defining the standard. Agencies that focus on standardization of measures (for example, ASTM) provide the greatest detail in terms of execution of outcome assays, whereas documents intended as guidelines (for example, those from the EMA) provide less-specific details (Fig. 1).

Fig. 1. Recommendations from regulatory agencies for appropriate descriptors and outcomes for translational large animal studies in cartilage repair.

Color scheme indicates whether the documents provide detailed (green), loose (yellow), or no (red) recommendations in a category.

Next, we used our established inclusion/exclusion criteria to identify peer-reviewed studies for consideration. Our search identified 1187 articles on cartilage repair in large animals published as full-length manuscripts over the last 20 years (see Materials and Methods). Application of exclusion criteria reduced the total number to 123 for analysis. Of these, nine were not eligible upon detailed review. Thus, 114 studies were included in this analysis, of which 20 were conducted in horses, 23 in sheep, 18 in goats, 41 in (mini-)pigs, and 12 in dogs (Fig. 2).

Fig. 2. Literature search, screening, and eligibility testing using the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) guidelines (29).

All studies were reviewed and scored by one of three reviewers. When questions arose, each of the three reviewers read and discussed the publication, and a consensus score was assigned. Scores for all studies considered (columns) and their degree of adherence to each of the 24 categories (rows) were visualized by color mapping the resulting scores for each study (Fig. 3). These were further analyzed by calculating the percentage of studies reporting on each study descriptor and study outcome.

Fig. 3. Array depiction and quantification of adherence to guidance document recommendations.

Each row indicates a different category (top, descriptors; bottom, outcomes), whereas each column represents an individual study. Green indicates full adherence, red indicates nonadherence, and yellow indicates partial adherence. Bars at right show average percentage adherence across all studies in a given category.

The number of published large animal cartilage repair studies ranged from 1 per year to a peak of 16 in 2009 (Fig. 4A). Overall adherence to guidance documents (that is, overall assessment) showed a weak positive trend with time (P = 0.004, R2 = 0.07, slope = 0.63%/year; Fig. 4B), ranging from 32 ± 16% to 58 ± 14% for a single year. However, there was no effect of the publication of the guidance documents (published in 2005, 2007, and 2008) on adherence. That is, the effect of time on the level of adherence did not change in the period following any one of these publications (P = 0.26 to 0.50). If anything, the slope of adherence versus time decreased after issuance of the guidance documents (0.89 to 1.11%/year before publication versus 0.07 to 0.48%/year after publication). Since 2003, most published studies had an overall adherence of ~50%.

Fig. 4. Study number and adherence as a function of year, species, study duration, and number of animals.

(A) Number of studies per year (*through 18 April 2014). (B) Overall adherence by study and year. Vertical lines indicate the time of publication of the *ASTM, #FDA, and §EMA documents. Regression shows slight increase in overall adherence with time (P = 0.004, R2 = 0.07, slope = 0.63%/year). (C) Mean adherence per year, broken out into overall, study descriptors, and study outcomes. Study descriptors were reported at a higher level than study outcomes. Standard deviation is shown for overall adherence only. (D) Adherence (in percent) across species (white bars, study descriptors; hatched bars, study outcomes; black bars, overall adherence; *P < 0.05). (E and F) Correlation between overall adherence and study duration (R2 = 0.075, P = 0.003) (E) and overall adherence and the number of animals (R2 < 0.001, P = 0.994) (F).

We also considered adherence in each subset of categories, study descriptors (11 categories) and study outcomes (13 categories). Not surprisingly, this analysis showed that adherence to study descriptors was, without exception, higher than adherence to study outcomes (Fig. 4C). The only category not well covered in the subset of study descriptors was relative lesion size (<5% of all studies). This is an important category on which to report given that relatively small lesions may occupy a significant portion of the joint in some species (25). Of the 13 study outcomes, histological analysis was most commonly reported, with 97% of studies reporting on this category. Note, however, that there was little consensus on the histological methodology used, with some studies simply reporting very basic histology and others performing multiple stains and visualization by advanced imaging modalities (for example, polarized light). Likewise, some studies provided semiquantitative scoring of histological outcomes, whereas others did not, and those that did provide semiquantitation used a number of different scoring systems. Of the remaining study outcomes, only gross view (80%) and defect fill (60%) were reported by >50% of the studies examined. As with histology, a variety of qualitative and semiquantitative assessments were used. All other study outcomes fell below 50% adherence, with follow-up arthroscopy (reported by <8%) and gene expression (reported by 11%) being the two least reported.

When these metrics were analyzed across species, a similar pattern was observed (Fig. 4D). In general, findings showed that for overall adherence, equine > canine > ovine = caprine > porcine. For all species, study descriptors showed a higher level of adherence than the study outcomes. Somewhat in keeping with one of our initial hypotheses, studies that used horses showed a slightly higher score in terms of overall adherence (59 ± 12%) and study outcomes (50 ± 20%) compared to other species. These differences were only significant in comparison to (mini-)pigs, however, where overall adherence was 49 ± 10% (P < 0.001, versus equine) and study outcome adherence was 30 ± 12% (P < 0.01, versus equine). No other differences were observed among species.

Another of our original hypotheses was that studies that were longer or larger would have better adherence. Contrary to this hypothesis, the correlation between duration and overall adherence was 0.27 (with a P value of 0.003), indicating a weak correlation (Fig. 4E). Likewise, the correlation relating the number of animals used in a study to overall adherence was 0.001 (P = 0.994), suggesting that no correlation existed between these factors (Fig. 4F). Given the labor and cost involved in studies using a large number of animals for a long duration, it was surprising to find such weak correlations. Additional data comparing numbers of animals used, unilateral versus bilateral surgeries, numbers of groups per study, samples per group, and defects per joint across species are presented in the Supplementary Materials.


Activity in the field of cartilage repair and regeneration is well evidenced by the increase in the number of translational cartilage repair studies over the last two decades (Fig. 4A), as well as by several recent first-in-human reports. Guidance documents published by the U.S. and European regulatory agencies provide direction for performing animal studies (20, 22–24) and a framework in which to validate safety and efficacy before human trials, and they should enable better comparison between studies to expedite progress in the field. Given that the first of these guidance documents was published ~2005, our overriding hypothesis was that the field as a whole would show a quantifiable improvement in reporting on the study design and outcome criteria outlined in these documents. To assess this, we performed a formal systematic review and extracted quantifiable adherence data from the past two decades of large animal research in terms of study descriptors and study outcomes. In doing so, we tested the hypothesis that there would be an inflection in the trajectory of adherence to these guidance documents based on the timing of their publication. On the basis of our experience with such model systems, we further expected to find that those studies with longer-term time points and those that used larger species and a greater number of animals (both of which would increase study costs) would show greater adherence to these published criteria.

Unfortunately, our initial hypotheses were generally not supported by the data extracted from the literature. Although there has been a slow increase in overall adherence to the guidance criteria over the past two decades (with recent studies reaching 50 to 60%), there was little impact (that is, change in trajectory) as a consequence of the publication of the guidance documents. Even more surprising, there seemed to be only weak or nonexistent correlations between overall adherence and the length of the study or cost of the study animals used. On the basis of the publications analyzed and the methodology used in this study, it is clear that the field has not responded to the publication of the guidance documents, and so there is little homogeneity in the reporting of these studies and the outcome assays that are being used.

To probe the data set a bit further, we subdivided the analysis into study descriptors and study outcomes, in which the former enumerated basic information regarding the animals and surgical procedures and the latter summed the number of outcome assays used with respect to those suggested by the guidance documents. This analysis showed that the study descriptor categories were in general met at a rate of ~75%. It was surprising that the rate of adherence was not higher and suggested to us that there is simply a lack of standardization in the field in terms of reporting on these parameters. Improvement in adherence in these categories could be achieved if authors simply reported on the age, weight, and gender of the animals used; these last two parameters were reported by <50% of the studies analyzed but can have a major influence on cartilage repair potential (in humans) (26–28). In addition, in a complicated study design, the overall study is best understood when visualized in a flow chart, as is commonly used in clinical studies (29). The above are standard components of any animal protocol but are not generally included in the extant literature. One recommendation for the field might be to better adhere to the ARRIVE (Animals in Research: Reporting In Vivo Experiments) guidelines for publication (30), with select components of approved animal protocols (such as flow diagrams, tables of weights, gender, and groups for analysis) provided with each published animal study (as supplementary material), so that the full complement of information is easily accessible to the community while not making published methods burdensomely long.

Some descriptors that directly influence experimental outcomes were reported on but done so with insufficient detail so as to enable replication. For instance, most studies reported some aspect of postsurgical rehabilitation, but few provided detailed descriptions (with the exception of those studies performed in horses), despite this having a profound impact on outcomes in both humans and animals (31). In addition, the description of the lesion depth and type was often vague or missing and did not always specifically differentiate between partial chondral, full chondral, and osteochondral defects [as described by Cucchiarini et al. (32)]. If the technique and visualization allow, the description of the defect creation should also carefully describe whether the calcified cartilage was removed and if bleeding into the defect was present—again, these are rarely noted in published studies despite their impact on regeneration.

When it came to study outcomes, adherence to the guidance documents was far less robust, with most studies falling in the range of ~40% adherence. Indeed, only three categories (histology, gross visualization, and assessment of defect fill) were reported on by more than 50% of the studies we analyzed. Many other important categories (for example, biomechanics and biochemistry of the regenerate tissue) were reported on by fewer than 25% of these studies. In some categories, such as relative defect size and gene expression, only one of the documents recommended reporting on the parameter, and so low adherence here might reflect an overall lack of consensus as to the value of such outcomes. Alternatively, some assays, such as follow-up arthroscopy, are not possible in all species, and so poor adherence in this category might reflect the technical challenges inherent to some species. Even with these caveats, however, our findings suggest that there is very little consensus in the field as to the most important study outcomes to measure and report in these large animal models.

Do as they say, not as we do

Although it seems logical to follow the guidelines set forth by regulatory agencies (“do as they say”), researchers appear to largely ignore these recommendations (“not as we do”); the gulf between what the agencies say we should do and what we actually do is quite large. Subjecting our own recent large animal model of cartilage repair (33) to this same analysis shows that our study fares no better than the industry standard. For our recent cartilage repair study in minipigs, we achieved an overall adherence of only 56%; whereas our adherence to study descriptors was a respectable 73%, our adherence to study outcomes was a lowly 42%. Given that adherence to study descriptors is generally at an acceptable level and improvements in this area could easily be made, we will focus our recommendations on study outcomes, where a change in approach will be required for significant improvement across the field.

There are several potential reasons for a lack of adherence in terms of study outcomes. First and foremost, not every study group has access to the infrastructure and expertise needed to conduct certain measurements. Indeed, those outcomes that are relatively easy and inexpensive to perform (histology and gross view imaging) are well covered across studies, whereas those requiring specialized expertise (for example, mechanical testing) and equipment [for example, follow-up magnetic resonance imaging (MRI)] are underrepresented. A solution to this infrastructure and expertise problem might be to build better collaborations and research networks or to create designated federally funded core facilities for the performance of these more advanced assays.

The second (and simpler) reason for the low adherence rate might be that not every study has the same intent (for example, some being considered “pilot” and others “pivotal”). In pilot studies [including our own (33)], one naturally focuses on only one or a few outcomes that lie within the expertise of the particular group and that are (in the opinion of that group) most predictive of success for the experimental groups to be pursued in future studies. Such pilot studies are critical steps in advancing any technology and certainly need to be performed. However, it is still crucial to extract as much useful information as possible by increasing the number of outcome measurements obtained from each specimen in a nondestructive manner. Further, development and use of tools for repeated measures (for example, MRI and arthroscopy) on the same animals might improve the depth of reported outcomes while not overly increasing costs.

Another consideration is that different cohorts (basic scientists, engineers, and clinicians) likely have differing opinions as to which of the outcomes is most important. The guidance documents are under constant review and improvement, and new insights from the field are continually incorporated (34). Given that so many outcome measures are simply not reported at present, it might well be that these documents should be refined further to develop a minimal set of recommended outcome measurements that is agreed upon (and followed) by all stakeholders.

The final, and perhaps the most important, factor defining the overall level of adherence is of course the amount and source of funding. Large animal trials are expensive, and not every group has the resources to analyze the full set of recommended outcomes even though the marginal costs for performing additional assays might be small compared to the animal study cost. In some cases, funding might be available only for a pilot trial, whereas in other cases, funding might be available for a pivotal study with the intent of translating the findings to humans. This is not to say that pilot trials should not be done. Rather, these studies should be performed and defined, perhaps with a different set of criteria for moving technology to the next and more rigorous (pivotal) level. An additional complication is that when studies are industry-sponsored, there is the added element of financial pressure, and not every industry-sponsored study is intended for open publication in the literature (limiting our ability to capture these data). Here, the need to commercialize a product as soon as possible might set objectives at the level of providing just enough evidence to regulatory bodies (but not to the general public) so as to meet the bar for market admittance and initiation of a human clinical phase 1 trial. This is in no way a judgment, but rather speaks to the reality of the world we live in, wherein timing, competitive advantage, and economy might mean the difference between market success and bankruptcy.

To address these deficiencies, our summary recommendation for those performing large animal studies is to “do as they say, not as we do.” The guidance documents are quite robust, and all suggested outcomes have value and will accelerate progress if more routinely used. The community should convene and come to consensus on definitions of minimal requirements that identify pilot and pivotal studies, and both editorial and regulatory bodies should enforce these criteria in advancing publications or new products. To make this possible and to increase adherence overall, regulatory and funding agencies should provide additional resources to enable this more rigorous transition to translation. Coordinated effort and funding are necessary to improve outcomes and to expedite the development of new technologies that can provide functional repair for the many patients suffering from articular cartilage injuries.


Study design

For this systematic review, guidelines from three agencies were first synthesized to develop categories by which to assess the literature. Next, the literature on animal models to study cartilage repair was systematically reviewed and scored on the basis of their adherence to these categories (search methodology and inclusion/exclusion criteria are described below). Data generated were then analyzed statistically to assess the impact of publication of the guidance documents on whether these categories were reported on by the identified studies, as well as to assess species-specific differences in methods and reporting standards.

Synthesis of recommendations and categories for assessment

Three guidelines were considered: (i) the U.S. FDA Guidance for Industry: Preparation of IDEs and INDs for Products Intended to Repair or Replace Knee Cartilage (last accessed 30 April 2014) (22), (ii) the EMA Guideline on Human Cell–Based Medicinal Products (23) with International Standard ISO/EN 10993 (35) (last accessed 30 April 2014), and (iii) the ASTM International F2451-05 (2010) Standard Guide for In Vivo Assessment of Implantable Devices Intended to Repair Articular Cartilage (24). Recommendations from these documents were sorted as “study descriptors” (table S1) and “study outcomes” (table S2). A total of 24 categories were identified (11 study descriptors and 13 study outcomes).

Literature search and scoring

The literature search, screening, and eligibility testing followed the PRISMA guidelines (29). To retrieve published studies using the five most common large animal models [horse, sheep, goat, (mini-)pig, and dog], we searched the PubMed database and the official journal of the ICRS, Cartilage, on 18 April 2014, using the search term “cartilage [specific species name] repair” (see Fig. 2). The two primary inclusion criteria were that the study be conducted in a large animal (dog, pig, goat, sheep, or horse) and that the defect be chondral only (that is, not including the subchondral bone). The former restricted the analysis to studies testing a mature technology in a valid preclinical model, and the latter was enforced to focus (and restrict) our analysis on cartilage repair approaches (rather than approaches seeking to repair both cartilage and bone at the same time). After screening articles for inclusion and exclusion criteria and eligibility, a total of 114 studies were identified and reviewed for adherence. If a category was fully reported, the study was credited one point; if partially reported, a score of 0.5 was assigned; if not reported/measured, a score of 0 was assigned. Eligibility and scoring were reconciled by group discussion when not clear (that is, all authors read the manuscript in question and weighed in to assign the score). Points per study were calculated as percentages of “overall adherence” (all 24 categories included), “study descriptors” (11 categories), or “study outcomes” (13 categories).
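The scoring arithmetic described above is simple enough to sketch. The following is a hypothetical illustration of the point scheme (the example scores are invented, not drawn from any study in the review):

```python
# Sketch of the per-study scoring described above: each of the 24 categories
# receives 1 (fully reported), 0.5 (partially reported), or 0 (not reported),
# and sums are converted to percentage adherence scores.

N_DESCRIPTORS = 11  # study descriptor categories (table S1)
N_OUTCOMES = 13     # study outcome categories (table S2)

def adherence(descriptor_scores, outcome_scores):
    """Return (descriptor %, outcome %, overall %) for one study."""
    assert len(descriptor_scores) == N_DESCRIPTORS
    assert len(outcome_scores) == N_OUTCOMES
    assert all(s in (0, 0.5, 1) for s in descriptor_scores + outcome_scores)
    d = 100 * sum(descriptor_scores) / N_DESCRIPTORS
    o = 100 * sum(outcome_scores) / N_OUTCOMES
    total = sum(descriptor_scores) + sum(outcome_scores)
    overall = 100 * total / (N_DESCRIPTORS + N_OUTCOMES)
    return d, o, overall

# Invented example: 8 descriptors fully reported, 2 partial, 1 missing;
# 5 outcomes fully reported, 1 partial, 7 missing.
d, o, overall = adherence([1] * 8 + [0.5] * 2 + [0],
                          [1] * 5 + [0.5] + [0] * 7)
```

This hypothetical study would score 81.8% on descriptors, 42.3% on outcomes, and 60.4% overall, a profile similar to the descriptor-high, outcome-low pattern reported in Results.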

Statistical analysis

Statistical analysis was performed using SPSS (version 20, IBM). Pearson’s correlation coefficients were computed for study duration and number of animals versus overall adherence (α = 0.05). Analysis of variance (ANOVA) with Bonferroni post hoc testing (α = 0.05, two-tailed) was used for species-specific comparisons after establishing normality of the data (Kolmogorov-Smirnov test). To assess whether individual guidance documents affected adherence, analysis of covariance (ANCOVA) was performed, with a separate analysis for each document (that is, using years 2005, 2007, or 2008 as the grouping variable) and time as a covariate (α = 0.05, two-tailed). A significant interaction term between group and time would indicate that a particular document altered the trajectory of adherence.
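The simplest of these analyses can be sketched in plain Python. The following illustration computes a Pearson correlation coefficient and an ordinary least-squares slope on invented data (the ANCOVA interaction test itself would require a statistics package such as SPSS, as used in the study):

```python
# Minimal sketch of Pearson's r and the OLS slope (the "%/year" trend in
# Results), on hypothetical year/adherence data, not values from the study.
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ols_slope(x, y):
    """Slope of the least-squares regression line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# Invented data: publication year and overall adherence (%) for five studies.
years = [2000, 2003, 2006, 2009, 2012]
adh = [40.0, 42.0, 45.0, 44.0, 50.0]
r = pearson_r(years, adh)
slope = ols_slope(years, adh)  # in %/year
```

In the study, the analogous slope of adherence versus year was 0.63%/year, and the interaction between the year covariate and the pre/post-publication grouping variable was the test for whether a guidance document shifted that slope.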


Supplementary analyses: Species-specific findings (methods, findings, and figures).

Fig. S1. Species-specific animals and study numbers.

Fig. S2. Species-specific study design and treatment parameters.

Table S1. Recommendations by regulatory agencies, part 1.

Table S2. Recommendations by regulatory agencies, part 2.

List of publications assessed in the systematic review and analysis.


  1. Acknowledgments: We gratefully acknowledge the many colleagues whose work provided the source material for this analysis, as well as those individuals who provided critical input and discussion as we analyzed these data and thought through their implications for the field. Funding: This work was supported by the AO Foundation, the U.S. NIH (R01 EB008722), the U.S. Department of Veterans Affairs (I01 RX000700), and the Deutsche Forschungsgemeinschaft (DFG). Author contributions: All authors contributed to the study design, data analysis, and discussion, as well as to the writing and editing of the manuscript. Study conception and design: C.G.P., R.L.M., M.B.F., and J.L.C.; data search: C.G.P. and R.L.M.; data analysis, scoring, and discussion: C.G.P., M.B.F., R.L.M., and J.L.C.; and manuscript writing, editing, and revision: C.G.P., R.L.M., M.B.F., and J.L.C. Competing interests: The authors declare that they have no competing interests.
