The Accuracy of the Single Intradermal Comparative Skin Test for the Diagnosis of Bovine Tuberculosis-Estimated from a Systematic Literature Search

Introduction
No currently available diagnostic test has both 100% sensitivity and 100% specificity; that is, none is a perfect gold standard able to determine accurately the infection status of every animal tested. In the UK, the Department for Environment, Food and Rural Affairs (DEFRA) claims a specificity of 99.9% for the single intradermal comparative cervical tuberculin test (SICCT) [12]. For a disease with low prevalence and a test specificity below 100%, the number of false-positive animals can exceed the number of infected animals. Several reviews [13][14][15] and a large number of studies report a wide range of diagnostic performance for the SICCT. Despite these studies, however, there has been no systematic evaluation of the performance of the SICCT or indeed of any of the tests used in the diagnosis of bTB. This is in contrast to the evaluation of diagnostic procedures for tuberculosis in humans [16]. A meta-analysis of field studies on bovine tuberculosis skin tests in United States cattle herds, covering publications from 1953 to 2011, identified seven publications in total for the caudal fold tuberculin test and for the serial interpretation of the caudal fold and comparative cervical tuberculin tests [17].
Typically, culture is used as a confirmatory test for a SICCT-positive animal. While the diagnostic specificity of bacteriological culture may be assumed to be 100%, its diagnostic sensitivity is less than 100%, leading to potential misclassification of samples, i.e. false-negative test results.
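The point about false positives at low prevalence can be made concrete with a short calculation. The sketch below is illustrative only: the herd size, prevalence and sensitivity are hypothetical values chosen for the example, and only the 99.9% specificity corresponds to the figure claimed in [12].

```python
def screening_outcomes(n_animals, prevalence, sensitivity, specificity):
    """Expected outcomes when every animal in a population is tested once."""
    infected = n_animals * prevalence
    uninfected = n_animals - infected
    true_positives = infected * sensitivity
    false_positives = uninfected * (1.0 - specificity)
    # Positive predictive value: share of test-positive animals truly infected.
    ppv = true_positives / (true_positives + false_positives)
    return {
        "infected": infected,
        "true_positives": true_positives,
        "false_positives": false_positives,
        "positive_predictive_value": ppv,
    }


# Hypothetical scenario (assumed values, not taken from any cited study):
# 100,000 animals, prevalence 0.05%, sensitivity 80%, specificity 99.9%.
result = screening_outcomes(100_000, 0.0005, 0.80, 0.999)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Under these assumed values the expected number of false positives is larger than the number of infected animals, so most reactors would be uninfected even though the specificity is very close to 100%.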
Historically, bovine tuberculosis (bTB) was widespread in many European countries. Control efforts, which comprised skin testing and the subsequent removal of reactors, have led to a significant reduction in bTB and to the recognition of a number of countries in the European Union (EU) as officially bTB free according to Directive 64/432/EEC [1]. bTB is monitored under Directive 2003/99/EC [2] because Mycobacterium bovis is a zoonotic pathogen [3]. Control efforts such as milk pasteurization have led to a significant reduction in human M. bovis infections. Typically, the proportion of human tuberculosis due to M. bovis is below 2% in countries with an official bTB control program [4,5]. Nevertheless, continued vigilance is obligatory. For example, M. bovis is considered to be amongst the zoonotic pathogens with the highest risk for the Netherlands, with a low current but a high historic burden [6], and the need to maintain control measures for human and bovine tuberculosis is emphasized [7]. bTB is considered a "re-emerging disease" in several EU countries that are officially free [8], leading to discussion of the re-introduction of skin testing as part of effective surveillance [9]. Some European countries have not been able to eliminate bTB and to obtain officially bTB free status. In Great Britain in 2009 there were more than 2,400 new herd incidents and more than 36,000 animals slaughtered under bovine TB control measures [10]. In 2008 and 2009 more than £100 million were spent annually on bTB control in Great Britain. bTB is also a public health concern in some developing countries [11].
Although no perfect gold standard with both sensitivity and specificity of 100% exists for the diagnosis of bTB, it should be possible to obtain unbiased estimates of diagnostic performance by performing multiple tests on the same animals and using a latent class approach.
Latent class models with a Bayesian approach can then be used to estimate the diagnostic accuracy of the tests [18][19][20][21]. Bayesian approaches allow for the incorporation of prior knowledge, whereas frequentist methods are based solely on the data. Note that the prior knowledge in a Bayesian analysis can be uninformative. This procedure is already prescribed in the OIE manual for diagnostic tests with regard to the validation and certification of new diagnostic tests [22]. Latent class analysis owes its name to the idea that the disease status of each animal is unobserved (latent) and needs to be recovered from the observed data. Multiple tests are used to improve the estimation of diagnostic accuracy. However, multiple tests might be conditionally dependent, for example if they are based on similar biological processes, and ignoring such conditional dependency might lead to biased estimates. If conditional dependency is present, more parameters may be unknown than can be estimated from the data.
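To illustrate what conditional dependence means in such a model, the following minimal sketch computes the probability of each result pattern for two binary tests, with a covariance term added within the diseased and the non-diseased class. This is a commonly used parameterization and is offered here as an assumption for illustration, not as the exact model of this study; the function names and all numerical values are hypothetical.

```python
from itertools import product


def pair_probabilities(p1, p2, cov):
    """Joint result probabilities for two binary tests within one latent class.

    p1 and p2 are the probabilities of a positive result for each test in that
    class; cov is the covariance between the two results in that class.
    """
    lower = max(-p1 * p2, -(1 - p1) * (1 - p2))
    upper = min(p1, p2) - p1 * p2
    if not lower <= cov <= upper:
        raise ValueError("covariance outside the range allowed by p1 and p2")
    return {
        (1, 1): p1 * p2 + cov,
        (1, 0): p1 * (1 - p2) - cov,
        (0, 1): (1 - p1) * p2 - cov,
        (0, 0): (1 - p1) * (1 - p2) + cov,
    }


def pattern_probabilities(prevalence, se, sp, cov_pos=0.0, cov_neg=0.0):
    """Marginal probability of each observed pattern (latent status summed out)."""
    diseased = pair_probabilities(se[0], se[1], cov_pos)
    # In non-diseased animals a positive result has probability 1 - Sp.
    healthy = pair_probabilities(1 - sp[0], 1 - sp[1], cov_neg)
    return {
        pattern: prevalence * diseased[pattern] + (1 - prevalence) * healthy[pattern]
        for pattern in product((1, 0), repeat=2)
    }


# Hypothetical values chosen only to show the effect of a positive covariance.
independent = pattern_probabilities(0.3, (0.9, 0.8), (0.95, 0.9))
dependent = pattern_probabilities(0.3, (0.9, 0.8), (0.95, 0.9), cov_pos=0.05)
print(independent[(1, 1)], dependent[(1, 1)])
```

With a positive covariance among diseased animals, the two tests agree more often than independence would predict; a model that ignores this treats the agreement as extra evidence and can misestimate the sensitivities.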

Mycobacterial Diseases
Within a Bayesian approach prior information and/or constraints can be used to assist with parameter estimation.
The aim of this study was to undertake a systematic literature search of published research on the diagnostic performance of tests used in the diagnosis of bTB, including the single comparative cervical skin test as described in Directive 64/432/EEC. Suitable data from this literature were then used in a latent class analysis to obtain estimates of the diagnostic performance of these tests.

Systematic literature search
We undertook a systematic review, similar to [23,24], to find data suitable for assessing the diagnostic accuracy of the single cervical comparative skin test. We searched PubMed, Agricola, Biosis, Medline and Web of Science from 1986 to 2009. The combination of search terms included the following: "bovine tuberculosis", "bovine tb", "Mycobacterium bovis", "tuberculin or intradermal or skin", "test or assay", "interferon-gamma or bovigam". Our search strategy was formulated to identify all available primary studies published in English, French, German and Spanish containing data which could be used for latent class analysis. The following studies were excluded: (a) studies with fewer than 20 animals, to allow for robust estimates; (b) studies with animals other than Bos taurus, to avoid introducing heterogeneity due to potential differences in neck skin thickness; (c) studies with fewer than three different tests applied to the same animals, to comply with the Hui-Walter paradigm (for one population); (d) studies with follow-up only for animals that tested positive with the skin test, and studies based on populations with presumably only diseased or only non-diseased animals; (e) studies using a skin test cut-off other than that described in 64/432/EEC; (f) studies with experimental infection or BCG vaccination; and (g) studies focusing on other factors interfering with test accuracy (e.g. dexamethasone, paratuberculosis and Fasciola hepatica).

Latent class analysis
Analysis was performed on data extracted from the systematic review using a model for four tests that allowed for conditional dependence and for the incorporation of prior information. With four tests of unknown accuracy, the model contains, in addition to the prevalence, four sensitivities and four specificities, further covariance terms. In the simplest case there would be a total of 12 two-way covariance terms. For reasons of parsimony, higher-order terms were not considered. Under the assumption that all bacteriologically confirmed M. bovis isolates are truly M. bovis, we set the specificity of culture equal to 1. Assuming a culture specificity of 1 and conditional independence of culture from the other three tests leaves potentially 14 parameters to estimate, including the two-way dependence structures of the tests. The presence of conditional dependencies between tests was checked by assessing separately the impact of pairs of covariance terms (conditional on a subject being disease positive or disease negative, each with a beta(1,1) prior) on the other estimates, compared with a covariance term set to 0. Model selection was performed by monitoring the Deviance Information Criterion (DIC) and the effective number of parameters (pD) in the fitted model [25], where a lower DIC and a higher pD indicated a better model fit. Models were fitted using Markov chain Monte Carlo sampling in the software OpenBUGS version 3.2.1 [26]. Model diagnostics were performed by visually checking the convergence of three independent chains and by using the usual Gelman-Rubin diagnostics [27]. For technical details please refer to the supplementary file.
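The parameter counts quoted above can be checked with a few lines of arithmetic. The sketch below is a bookkeeping aid under the assumption of a single population and two-way covariances only; the function names are hypothetical and the code is not part of the fitted model.

```python
from math import comb


def degrees_of_freedom(n_tests):
    """Free cell probabilities when n binary tests are cross-classified."""
    return 2 ** n_tests - 1


def n_parameters(n_tests, n_fixed=0, n_independent=0):
    """Prevalence, sensitivities, specificities and two-way covariances.

    n_fixed is the number of accuracy parameters fixed in advance;
    n_independent is the number of tests assumed conditionally independent
    of all other tests, which therefore contribute no covariance terms.
    """
    accuracies = 2 * n_tests - n_fixed
    dependent_tests = n_tests - n_independent
    # One covariance per pair of tests in each of the two disease states.
    covariances = 2 * comb(dependent_tests, 2)
    return 1 + accuracies + covariances


full_model = n_parameters(4)                               # 1 + 8 + 12
constrained = n_parameters(4, n_fixed=1, n_independent=1)  # 1 + 7 + 6
available = degrees_of_freedom(4)                          # 2**4 - 1
print(full_model, constrained, available)
```

The unconstrained model has more parameters than four cross-classified binary tests can supply, whereas the constrained model does not. Counting parameters is a necessary condition for identifiability rather than a sufficient one, which is one reason prior information is still needed.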

Systematic literature search
Of the 375 studies identified in PubMed and the 261 studies identified in Agricola, Biosis, Medline and Web of Science, 112 full-text papers remained after removal of duplicates and were screened by the two authors; only one met our eligibility criteria [28]. This finding precluded a quality assessment, which is a typical part of a systematic literature review. In most studies, the diagnosis of bTB is confirmed bacteriologically and/or by post mortem examination. Many studies provide data for only the subset of animals that was followed up, assess diagnostic test accuracy solely in populations assumed to be diseased or disease-free, or include only small numbers of animals. Very few studies report essential information about study design, such as independent and blinded interpretation of the different test results.

Latent class analysis
The study by Liebana (2008) [28] focused on the pathology of naturally occurring bovine tuberculosis. Four diagnostic tests were applied to 400 animals: the single intradermal comparative cervical tuberculin test, post mortem detection of visible lesions, histopathology and bacteriology. For the detection of visible lesions ("gross lesion detection"), lymphatic tissues at different anatomical sites (head: eight lymph nodes and two tonsils; chest: five lymph nodes; abdomen: ten lymph nodes; and six other lymph nodes) and all lung lobes were examined. In animals with visible lesions, a standard panel of samples from 16 sites (all head, thoracic and mesenteric), as well as the lungs and any other lymph nodes with visible lesions, were examined histologically and bacteriologically. In animals without visible lesions, four different pools of lymph nodes (head, chest, mesenteric and tonsils) and any other suspicious lesion were submitted to bacteriology and histopathology. Due to the study design, 200 animals were reactors (skin test positives) and 200 animals were in-contact animals originating from 242 farms. The latent class model applied to Liebana's data was constrained by setting the specificity of culture to 1. Conditional dependencies between the sensitivities and between the specificities of histology and lesions were parameterized with flat beta(1, 1) distributions as priors. Density distributions for all diagnostic accuracies are shown in Figures 1 and 2. Estimates of the prevalence, the diagnostic accuracies and the covariance between visible lesions and histopathology are presented in Table 1.

Discussion
Our systematic literature search identified very few eligible studies that examined the diagnostic performance of bTB tests. This could be due to the narrow focus of our research question. However, this variant of the skin test is a standard test in the EU, and our inclusion criteria are therefore arguably appropriate. This is evidence that, despite the widespread use of diagnostic tests in compulsory official disease elimination programs, few have been rigorously evaluated in terms of their performance.
To our knowledge, only one paper uses a latent class approach to estimate the sensitivity and specificity of the single intradermal comparative cervical skin test at different cut-offs (for zebu cattle in Chad) [29], and another paper uses a latent class approach to estimate the test accuracies of the single intradermal test used in Spain [21]. The different cut-offs for skin thickness used in different countries clearly show that there is no unambiguous interpretation of the test's results.
There is evidence that diagnostic studies with methodological shortcomings, such as evaluating tests in a diseased population and using a separate control group (with non-representative individuals/ subjects), or interpretation of the test result with knowledge of the reference test, may overestimate the accuracy of diagnostic tests [30]. Sensitivity and specificity may vary with the population that is being tested [31]. Although a number of factors influencing the diagnostic accuracy of the skin test are well known [32], attempts to estimate, empirically, diagnostic performances in specific populations or settings not relying on a gold standard approach are scarce.
The problem of a missing perfect gold standard in bTB diagnosis (it is well known that the sensitivity of culture is less than 100% when it is used as a confirmatory test) may be partially overcome by using a Bayesian latent class approach, which is already recognized by the OIE. If conditional dependence between diagnostic tests is assumed to be present, more parameters need to be estimated than is possible using the data alone, which makes the use of prior information necessary. This procedure of using prior information and constraining the model is subjective, but intuitively justifiable. If the prior information is justified and the rationale for its use is given, this may be more appropriate than assuming a gold standard. In our analysis the specificity of culture was set to 100%, which is in line with the assumption that false-positive results of bacteriological culture, if performed lege artis, are not possible. Fixing this parameter eliminated several of the covariance terms and allowed an identifiable model.
The merit of the study by Liebana (2008) lies in its attempt to describe accurately the pathology of field bovine tuberculosis and to obtain a contemporary data set with 400 animals. With regard to a Bayesian latent class approach, considering the possibility of conditional dependencies between test results seems plausible for this study, since the detection of visible lesions might also facilitate the detection of typical histopathological lesions.
Bayesian latent class approaches allow for the removal of the unrealistic assumption that tests are gold standard. From a pathological perspective, however, a cellular response will be detected earlier in time than bacteria or post mortem lesions and such time-dependency issues would be challenging to be routinely included in studies. In virtually all studies, culture or a combination of culture/lesions/PCR was considered as a perfect gold standard (also without consideration of the time course of the infection and the subsequent detectable immune response).
Our analysis of the data from Liebana (2008), with a median of 65% for the specificity of the skin test, clearly shows a discrepancy with the 99.9% cited, without a reference, in a DEFRA leaflet for farmers [12]. High specificities for the skin test are also described in reviews [13][14][15], all citing a paper by Lesslie [33] published in 1975. The tuberculins used at that time might differ from those used nowadays owing to efforts in standardization. Lesslie himself does not estimate the specificity to be 99.9%. From the data given in Lesslie's paper it may be concluded that the specificity is 99.71%, with an upper confidence limit of 99.79%. The author himself adds a note of caution: "these results cannot be considered as a true indication of the false positive errors of the tuberculin tests".
Estimates of the specificity of the SICCT test used in the elimination program across the UK can be made using data published by DEFRA [34]. The maximum number of false positives nationally would be those animals that were SICCT positive, but from which M. bovis was not successfully cultured (i.e. not confirmed).    it does illustrate that false positives certainly do occur and given that the specificity of tests can and does vary with the populations they are tested on, in certain herds the specificity of the SICCT test may sometimes be lower than that estimated from national screening statistics. However, our results do demonstrate that the culture of M. bovis at postmortem appears to be highly sensitive, although the study design of Liebana's study suggests that a more rigorous approach to finding the lesions and the organism was undertaken than might be possible in routine post mortem examination. Nevertheless our analysis lends substantial credence to the hypothesis that animals that are positive to other tests such as the SICCT or the more recently introduced bovigam test are likely to be false positives if no lesion or bacteria are isolated from the animal following slaughter. Given the evidence in our analysis this is much more likely than the generally accepted dogma that failure to isolate the organism is due to poor sensitivity of culture rather than problems with the specificity of these two tests [13].
Our analysis of the data from Liebana's study [28], based on a Bayesian approach, was done in accordance with previously published work [19,20] and fulfilled model checking criteria. The aim of that study was to deliver a contemporary data set in tuberculin reactor and in-contact animals. Thus our results cannot be seen as defining the specificity of the SICCT test in the official bovine TB elimination program, as the data set included positives selected by this test and hence would have contained a much higher proportion of false-positive SICCT reactors than would be expected in routine surveillance. However, this analysis confirms the widely held assumption that the sensitivity of the SICCT test is somewhat less than 100% [13][14][15]. Furthermore, it can be argued that it gives an unbiased estimate of the diagnostic performance of culture, lesions and histopathology. What is clearly lacking are properly designed studies to evaluate the diagnostic performance of the SICCT test and of other tests such as the relatively recently introduced bovigam test [13].
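Why a sample enriched with reactors cannot define the specificity seen in routine surveillance can be shown with a small calculation. The sketch below assumes, purely for illustration, that the in-contact animals were all skin-test negative and that sampling was otherwise random within each group; the prevalence, sensitivity and specificity values are hypothetical and are not estimates from this study.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Positive and negative predictive values in the source population."""
    tp = prevalence * sensitivity
    fp = (1 - prevalence) * (1 - specificity)
    fn = prevalence * (1 - sensitivity)
    tn = (1 - prevalence) * specificity
    return tp / (tp + fp), tn / (tn + fn)


def apparent_specificity(n_reactors, n_nonreactors,
                         prevalence, sensitivity, specificity):
    """Share of truly uninfected sampled animals that are skin-test negative,
    when the numbers of reactors and non-reactors are fixed by design."""
    ppv, npv = predictive_values(prevalence, sensitivity, specificity)
    uninfected_reactors = n_reactors * (1 - ppv)   # false positives in sample
    uninfected_nonreactors = n_nonreactors * npv   # true negatives in sample
    return uninfected_nonreactors / (uninfected_nonreactors + uninfected_reactors)


# Hypothetical source population: prevalence 1%, sensitivity 80%,
# specificity 99%; the sample holds 200 reactors and 200 non-reactors.
true_specificity = 0.99
sample_specificity = apparent_specificity(200, 200, 0.01, 0.80, true_specificity)
print(f"specificity in source population: {true_specificity:.3f}")
print(f"specificity within the selected sample: {sample_specificity:.3f}")
```

Because reactors are heavily over-represented, false positives make up a much larger share of the uninfected animals in the sample than in the population, so the specificity computed within the sample falls below the population value.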
The highest herd prevalence in the EU is reported for Ireland, with 5.9% of herds infected [35]. In countries with a control program in place, the within-herd prevalence is assumed to be low, with only a small number of animals infected per herd [9]. Today the skin test is used not only to classify a herd but also on individual animals (e.g. pre-movement testing), in which case an animal with a positive skin test result will be slaughtered for diagnostic reasons. Our analysis cautions against the re-introduction of nation-wide skin testing in its current form in countries officially free of bTB. False positives would occur, possibly at a greater incidence than true positives, as the infection is likely to be very rare in such countries. However, rigorously applied methods for culturing the organism potentially have a high sensitivity and could be used to ameliorate this problem.
In our view, the test and cull program is of little public health benefit, and the economic benefit to animal health has not been proven [36]. The welfare of affected farming families might be adversely affected by the current TB control program [37]. However, to maintain confidence in the program among farmers and other stakeholders who are affected by official disease elimination programs, confidence in the testing regimen should be high. An alternative to test and cull might be to introduce vaccination of cattle and to estimate the diagnostic performance of the SICCT within the development of a DIVA concept (Differentiate Infected from Vaccinated Animals) [38]. However, diagnostic test evaluation in the target region and population is a prerequisite for an effective control program. It is also clear from the work presented that, despite the large investment in bTB elimination programs globally, there are very few if any studies that have attempted to define the diagnostic performance of key tests in a rigorous manner. These results provide data against which new tests can be evaluated. More accurate test results will improve consumers' confidence in a program. Inaccurate or, worse, unknown accuracy of diagnostic tests, together with a lack of stakeholder confidence in the program, will contribute to the difficulties in eliminating this disease.