Stratification of risk of malignancy in atypical breast fine needle aspiration: A cytomorphological approach

Although atypical (C3) breast cytology is a legitimate reporting category, it is not very useful clinically because it yields benign as well as malignant histological outcomes. Therefore, a cytomorphological approach to minimise inappropriate use of the C3 category may improve the clinical usefulness of this category. In a previous training set of 180 atypical breast cytology (C3) cases we had identified the best discriminating cytological criteria to predict the histological outcomes (malignant, benign proliferative and benign non-proliferative). Using a selection of these previously identified statistically significant criteria (cystic background, cohesiveness, myoepithelial cells or bare bipolar nuclei, papillary fragments and tubules) we tested their ability to predict the same outcomes on 182 subsequent C3 cases (validation set). The diagnostic accuracy for each outcome was compared with the training set. A probability calculator with nominated cut-points was developed to stratify and re-grade C3 cases. Cases outside the cut-points were either up-graded to suspicious (C4) or down-graded to benign (C2). Re-graded cases were reviewed to identify reasons for over- or under-interpretation. Statistical analysis showed comparable diagnostic accuracy between the training set and the validation set for malignant and benign proliferative outcomes but not for the benign non-proliferative outcome. The probability calculator resulted in an up-grade of 18% (33/182) of the cases. Malignant histology was seen in 25/33 (76%) of the up-graded cases but 8/33 (24%) produced a proliferative non-malignant outcome. A meaningful lower cut-point could not be established without including malignant cases. The C3 category remains a legitimate category with a heterogeneous mix of pathological entities. Attempts to minimise inappropriate allocation of cases into this category have met with limited success.


Introduction
Fine needle aspiration (FNA) is a well-established investigative procedure useful in the diagnosis of palpable and impalpable breast lesions. The results are reported using a five-tier categorical system based on probability of malignancy. This system complements the clinical and radiological BI-RADS reporting systems, and together these modalities form the triple test. The cytology categories are C1 (non-diagnostic), C2 (benign), C3 (atypical), C4 (suspicious) and C5 (malignant). The C1 category usually reflects technical difficulties and provides little informative value. C2 and C5 have high predictive values, as shown by many studies. The suspicious C4 category is not as certain as C5 but still suggests a high probability of malignancy. The C3 category is an equivocal category commonly used when there is some diagnostic uncertainty regarding the true nature of the lesion (1-5). There are no standard clinical management strategies for C3 cytology. Patients with C3 cytology and with low levels of suspicion clinically and on imaging are often observed and the FNA is sometimes repeated 3 to 6 months later. Immediate repeat FNA, core biopsy or even surgical excision is performed in C3 instances with higher levels of clinical or radiological concern (6, 7). Therefore, determining the risk of malignancy within the C3 category may improve the clinical utility of C3 FNA (8).
We, like others, have shown a heterogeneous mix of pathological outcomes during the follow-up of C3 cases (5, 9-14). In our earlier study of 230 atypical (C3) cases, we found 37.4% to be malignant on follow-up, and the remaining 62.6% yielded benign outcomes. We grouped the final outcomes into three mutually exclusive histological entities (malignant, benign proliferative and benign non-proliferative). These outcomes were established by correlating the histology in 87%, repeat FNA in 3.5% and clinical and radiological follow-up 2 years after the index FNA episode in 9.5%. Cases were placed into the C3 category because the material was suboptimal in 34.8% of the cases; the remaining 65.2% were diagnostically challenging, making interpretation difficult (9,15).
To further our investigations, we tested the ability of a number of cytomorphological criteria to predict malignant, benign proliferative or benign non-proliferative pathological outcomes from the training set of C3 cases (16).
Using the data from the training set, we now aim to create and validate a cytomorphological approach based upon the probability of cancer to stratify the final outcomes of C3. The use of a probability calculator to deliver a practical approach to help in the diagnosis of difficult breast lesions and their appropriate allocation into the C3 category will be tested.

Material and Methods
Ethical approval was granted for this study by the Hunter New England Human Research Ethics Committee and the University of Newcastle, N.S.W., Australia.
A search of the laboratory archive to identify C3 cases was conducted for both the training set and the validation set. The search retrieved reports containing any of the following phrases: "atypical cytological pattern", "indeterminate cytological pattern", "suggestive/consistent with papillary lesion", "Code 3", "Category 3" or "C3". The training set contained 230 FNA-sourced C3 cases comprising 73 screen-detected lesions and 157 lesions from symptomatic patients. All specimens were from women aged 16 to 85 years, with an average age of 56.3 years. In comparison, the validation set contained 228 C3 cases consisting of 94 screen-detected lesions and 134 lesions from symptomatic patients, with an average age of 53.1 years and an age range of 23 to 89 years.
The observations from a blind rescreen of the training set were previously analysed using a logistic regression and receiver operating characteristic (ROC) curve to determine which cytomorphological criteria showed a statistically significant association with the pathological outcomes of C3. The outcomes were classified into three groups; malignant, benign proliferative and benign non-proliferative. The malignant outcome included ductal carcinoma in situ (DCIS), invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC). Benign papilloma, fibroadenoma (FA), proliferative fibrocystic change (FCC), complex sclerosing lesions (CSL)/radial scar (RS), sclerosing adenosis, epithelial hyperplasia and uncommon specific tumours such as hamartoma, adenoma and benign phyllodes were grouped together under the umbrella of benign proliferative outcome.
The remaining conditions comprised the benign non-proliferative outcome and included non-proliferative FCC, lactation-associated change, radiation-induced changes, fat necrosis and clinically or radiologically stable but cytologically atypical lesions, not otherwise specified, with no histological follow-up.
Five common key criteria with high predictive power for the final outcomes (malignant, benign proliferative or benign non-proliferative) were identified from this previous study (16). These were: (1) cystic background, defined as having foamy macrophages within a proteinaceous wash; (2) cohesiveness, which was assessed as being predominantly cohesive if more than 95% of material was present in groups, or discohesive if more than 5% of cells presented as single intact cells; (3) myoepithelial cells or bare bipolar nuclei, recorded as the presence or absence of small dark elliptical nuclei overlying sheets of ductal epithelial cells or of bipolar nuclei in the background of the smear; (4) papillary fragments, which were noted to be present if there was peripheral palisading or anatomical edges with a fibrovascular core; and (5) tubular structures, which were described as angulated hose-like structures with parallel sides, sharp ends and no obvious myoepithelial cells.
Age was also factored into the calculations because of its significant predictive power for each of the final outcomes and was grouped into 2 categories, less than 50 years old and 50 years or older.
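The two thresholded inputs described above, cohesiveness and age group, can be written as simple rules. The following Python sketch is illustrative only: the function names and the fractional input are assumptions and are not part of the study's software. The source does not state how a smear with exactly 5% single cells was handled, so the sketch treats it as cohesive.

```python
def classify_cohesion(fraction_single_cells: float) -> str:
    """Classify a smear from the fraction of cells seen as single intact cells.

    Discohesive if more than 5% of cells are single; otherwise cohesive.
    Treating exactly 5% as cohesive is an assumption of this sketch.
    """
    if not 0.0 <= fraction_single_cells <= 1.0:
        raise ValueError("fraction_single_cells must lie between 0 and 1")
    return "discohesive" if fraction_single_cells > 0.05 else "cohesive"


def age_group(age_years: float) -> str:
    """Place a patient into one of the two age categories used in the analysis."""
    return "50 or older" if age_years >= 50 else "under 50"
```

For example, `classify_cohesion(0.10)` gives `"discohesive"` and `age_group(49)` gives `"under 50"`.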
Cellularity was used as a filter for inclusion in the study, hence only cases with more than 30 groups of epithelial cells (a group of cells was defined as more than 3 epithelial cells in a cluster) were included. This left 180 cellular cases from the original 230 training set from which the statistical analysis was performed.
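The cellularity filter can be expressed as a short rule. The Python sketch below is a minimal illustration, assuming each case is summarised as a list of epithelial cluster sizes (cells per cluster); this input format and the function names are hypothetical and were not part of the study.

```python
def count_epithelial_groups(cluster_sizes):
    """Count clusters that qualify as a group (more than 3 epithelial cells)."""
    return sum(1 for cells in cluster_sizes if cells > 3)


def is_cellular(cluster_sizes):
    """A case is included only if it has more than 30 qualifying groups."""
    return count_epithelial_groups(cluster_sizes) > 30
```

A case with 31 clusters of 4 cells passes the filter, whereas a case with 30 such clusters, or with any number of 3-cell clusters, does not.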
A blind rescreen of the validation set was conducted by one author (JW), to assess for the presence or absence of the selected cytomorphological criteria only. These observations were recorded and the probability of the combination of identified key criteria to predict each of the final outcomes was calculated. All statistical analyses were completed using Stata 11.2 (17).
In our previous study, we developed an algorithm using the key criteria to predict malignant, benign proliferative and benign non-proliferative outcomes. We now apply this algorithm to the validation set using the coefficients derived from the training set. Table 1 lists the coefficient values for each key criterion. ROC curves and the area under the curve (AUC) were then used to assess how accurately the algorithm predicted each outcome in the validation set (16,18). A ROC curve plots sensitivity against one minus specificity (1 - specificity), and the AUC measures the accuracy and discriminating power of the combined criteria. An AUC of 1.0 indicates a perfect test and an AUC below 0.75 is considered not to be clinically useful (19). The outcomes were established by correlating the histological specimen or, if histology was not available, from the clinical and radiological findings in the subsequent 2 years. The resulting AUCs were then compared with the AUCs from the training set to find any statistically significant differences. The algorithm derived from the malignant ROC curve analysis of the training set was used to create a probability calculator.
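To make the ROC and AUC description concrete, the following Python sketch builds the curve as (1 - specificity, sensitivity) points and integrates it with the trapezoidal rule. It is a generic illustration and not the Stata procedure used in the study; the function names and the toy scores are assumptions.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) points, one per distinct threshold.

    scores: predicted probabilities; labels: 1 for the outcome, 0 otherwise.
    A case is called positive when its score is at or above the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both outcome classes must be present")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a ROC curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data only: one of the two negative cases outranks one positive case.
toy_auc = area_under_curve(roc_points([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))
```

With the toy data, three of the four positive-negative pairs are ranked correctly, so `toy_auc` is 0.75, which sits exactly at the threshold of clinical usefulness cited above.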
Observations were entered into the calculator, which then computed the probability. The calculated probabilities of malignancy for the C3 cases were ordered from lowest to highest and used to produce a ROC curve. Lower and upper cut-points were established from the ROC curve.
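A calculator of this kind can be sketched as a standard logistic model, in which the presence or absence of each criterion is weighted by a coefficient and the sum is passed through the logistic function. The Python example below is a hedged illustration only. The coefficient values are placeholders and are not the values in Table 1; their signs follow the directions described in this paper where these are stated (papillary fragments and the benign features lower the probability, tubules raise it), while the sign for age and the zero intercept are arbitrary assumptions.

```python
import math

# PLACEHOLDER coefficients for illustration; NOT the fitted values from Table 1.
PLACEHOLDER_COEFFICIENTS = {
    "intercept": 0.0,                  # arbitrary assumption
    "cystic_background": -1.0,
    "cohesive": -1.0,
    "myoepithelial_or_bipolar": -1.0,
    "papillary_fragments": -1.0,
    "tubules": 1.0,
    "age_50_or_older": 1.0,            # sign is an arbitrary assumption
}


def probability_of_malignancy(observations, coefficients=PLACEHOLDER_COEFFICIENTS):
    """Logistic probability from present/absent observations.

    observations maps every criterion name (all keys except "intercept")
    to True or False; a missing criterion raises KeyError.
    """
    linear = coefficients["intercept"]
    for name, weight in coefficients.items():
        if name == "intercept":
            continue
        linear += weight * (1 if observations[name] else 0)
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical case with no criteria recorded as present.
all_absent = {name: False for name in PLACEHOLDER_COEFFICIENTS if name != "intercept"}
baseline = probability_of_malignancy(all_absent)
```

With these placeholder values `baseline` is 0.5; marking tubules as present raises the probability and marking papillary fragments as present lowers it. Real use would require the fitted coefficients.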
The cut-points were selected at a value where the greatest number of cases was correctly classified as benign or malignant. The calculator was then tested against the validation set. The results of the probability calculator were ordered from lowest to highest probability and the cut-points applied. Those cases below the lower cut-point were down-graded into C2 and those above the upper cut-point were up-graded into C4. The re-graded cases were reviewed to identify probable reasons for under- or over-interpretation. The probability calculator has the potential to remove cases wrongfully placed into the C3 category without compromising the integrity of these categories.
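The re-grading rule can be written as a short function. The Python sketch below uses the cut-point values reported in the Results (0.093 and 0.75); the function names are hypothetical, and leaving a case that falls exactly on a cut-point in C3 is an assumption, since the text specifies only "below" and "above".

```python
LOWER_CUT_POINT = 0.093  # value reported in the Results
UPPER_CUT_POINT = 0.75   # value reported in the Results


def regrade(probability, lower=LOWER_CUT_POINT, upper=UPPER_CUT_POINT):
    """Return the reporting category for a C3 case after applying the cut-points."""
    if probability < lower:
        return "C2"  # down-graded to benign
    if probability > upper:
        return "C4"  # up-graded to suspicious
    return "C3"      # remains atypical


def regrade_all(probabilities):
    """Order probabilities from lowest to highest and re-grade each case."""
    return [(p, regrade(p)) for p in sorted(probabilities)]
```

For example, `regrade_all([0.9, 0.05, 0.5])` orders the cases and assigns C2, C3 and C4 respectively.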

Results
The training set consisted of 180 C3 cases from which five statistically significant key cytological criteria were identified for the outcomes (malignant, benign proliferative and benign non-proliferative). Age was found to have significant predictive value and was therefore included with the key cytological criteria. The same criteria were applied to the validation set, which comprised 182 subsequent cellular C3 FNA cases with follow-up. Figure 1 displays the type of follow-up sourced to establish the pathological outcomes for both the training set and the validation set. Table 2 demonstrates the percentage breakdown for each pathological outcome in the training set and validation set.
The ROC curve analyses produced from the calculated probability of the combined selected key criteria and age group for all outcomes for both the training set and the validation set are shown in Figure 2. The AUCs were lower in the validation set than in the training set for all outcomes; however, the differences were only statistically significant in the benign non-proliferative ROC curve analysis (see Table 3). The ROC curve produced from the malignant analysis was used to select lower and upper cut-points as shown in Figure 3. A lower cut-point was chosen at the probability value of 0.093. Below this point 61/182 (33.5%) cases could be down-graded to C2. However, this included 4 encysted papillary cancers and 1 DCIS arising in a papilloma, which were incorrectly classified. An upper cut-point was chosen at the probability value of 0.75. Above this point 33/182 (18.1%) cases could be up-graded into a higher reporting category. Malignant histology was seen in 25 (76%) of the up-graded cases but 8 (24%) produced a proliferative non-malignant outcome; these included 1 fibroadenoma, 4 complex sclerosing lesions/radial scars, 1 papilloma, 1 sclerosing adenosis with epithelial hyperplasia and 1 clinically and radiologically stable case.
Figure 1. Type of follow-up for the training set and the validation set.

Discussion
The pathological outcomes resulting from the atypical (C3) reporting category are heterogeneous. Although C3 is a legitimate category, its clinical and diagnostic value is limited because it contains a mix of benign and malignant entities (5, 9-12, 14, 20, 21). This grey zone causes diagnostic dilemmas which result in unclear management directions. Therefore, the production of an evidence-based probability calculator for difficult breast FNA cases is a novel approach to stratifying the risk of malignancy when allocating FNA cases to a diagnostic category.
Our previous study of 230 C3 cases showed 34.8% of the C3 cases to be suboptimal due to low cellularity or poorly preserved material, contributing to their allocation into the C3 category. Previous statistical testing showed less reliable results when only limited material was available for assessment (16). By excluding this suboptimal group (less than 30 well-preserved epithelial groups per case) we could conduct a more meaningful statistical analysis. This enabled us to focus on the remaining 180 (65.2%) cellular but challenging C3 cases.
Our previous work (16) produced a set of statistically tested criteria with discriminating power for the stratification of atypical breast cytology. In this current study, we applied five key cytomorphological criteria and age to a validation set of closely matched cases. The comparison between the two sets of data showed similar predictive ability for all outcomes except for the benign non-proliferative conditions. We note that the validation set produced lower AUCs than the training set. The training set utilised the best predictive model for the stated pathological outcomes. This may have introduced a selection bias. By applying this best predictive model to the validation set, which was sourced from the subsequent C3 cases and hence contained a different mix of pathological outcomes than the training set, we eliminated this bias. The unbiased validation set results still produced clinically relevant results with the exception of the benign non-proliferative conditions. This analysis contained the least number of cases and mostly included clinically and radiologically stable but cytologically atypical cases without histological evaluation. The small number of cases and the inability to specifically identify histological outcomes may have impacted upon the statistical analysis, producing low discriminatory values. This analysis was also affected by the absence of architectural features. Although the presence of papillary and tubular structures is highly predictive of papillary lesions and malignancy respectively, the benign non-proliferative cohort in the training set did not contain any of these features, thus producing an errant ROC curve when applied to the validation set, as seen in Figure 2.
Figure 2. Comparison of the ROC curves with AUC for the training set and validation set for each of the pathological outcomes.
Application of the probability calculator produced from the malignant analysis down-graded 61/182 (33.5%) cases from C3 to C2 and up-graded 33/182 (18.1%) cases from C3 to C4 in the validation set. The rationale behind choosing the lower cut-point was to find a point which included as many benign outcomes as possible below the selected probability value without including false negative cases. Similarly, the upper cut-point needed to identify malignant lesions beyond the upper probability value but limit those cases of high probability with non-malignant outcomes. These cut-points met with limited success.
Table 3. Comparison of the calculated AUC from the outcomes of the training set and validation set (note: the statistically different analysis is shown in bold).
When considering the cases below the lower cut-point, 5 cases from 61 (8.2%) were found to be malignant, including 4 encysted papillary carcinomas and 1 DCIS associated with a papilloma. Clearly, the calculator under-interpreted these abnormal papillary lesions, highlighting a failure of the calculator. The reason for low probability can be attributed to the presence of papillary fragments in all of these cases. This criterion has a protective effect when predicting cancer, thereby reducing the probability of cancer. However, in these 5 cases, identification of a papillary lesion was cytologically possible due to the presence of papillary fragments and discerning cytologists could override the calculator findings. This inherent failure of the calculator suggests a lower cut-point should not be used due to the risk of wrongful allocation of malignant cases into the benign category. These cases should all remain in the C3 category with specific mention of papillary features.
Of the 33 cases found beyond the upper cut-point, 25 (75.8%) resulted in malignancy; however, 8 (24.2%) were proliferative but not malignant. Further analysis of this group revealed 1 fibroadenoma, 4 complex sclerosing lesions/radial scars, 1 papilloma, 1 sclerosing adenosis with epithelial hyperplasia and 1 clinically and radiologically stable case. These overcalls are understandable as the above cases can be challenging both histologically and cytologically. Placement into the C4 (suspicious) category for these non-malignant lesions would not have changed treatment, as all these lesions in our institution undergo further histological evaluation before definitive treatment.
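The proportions quoted for the re-graded cases follow directly from the case counts. The short Python check below recomputes them from the counts given in the text; the variable and function names are illustrative.

```python
def percent(part, whole):
    """Percentage of whole represented by part, rounded to one decimal place."""
    return round(100 * part / whole, 1)


VALIDATION_CASES = 182
DOWNGRADED = 61                # below the lower cut-point
UPGRADED = 33                  # above the upper cut-point
MALIGNANT_IN_DOWNGRADED = 5    # 4 encysted papillary carcinomas + 1 DCIS
MALIGNANT_IN_UPGRADED = 25
NON_MALIGNANT_IN_UPGRADED = 1 + 4 + 1 + 1 + 1  # FA, CSL/RS, papilloma, SA, stable case

summary = {
    "down-graded": percent(DOWNGRADED, VALIDATION_CASES),
    "up-graded": percent(UPGRADED, VALIDATION_CASES),
    "false negatives among down-graded": percent(MALIGNANT_IN_DOWNGRADED, DOWNGRADED),
    "malignant among up-graded": percent(MALIGNANT_IN_UPGRADED, UPGRADED),
    "non-malignant among up-graded": percent(NON_MALIGNANT_IN_UPGRADED, UPGRADED),
}
```

The recomputed values are 33.5%, 18.1%, 8.2%, 75.8% and 24.2% respectively, matching the figures reported.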
Five of the eight over-interpreted cases contained tubules, which would have significantly increased their probability of cancer. Tubular structures were an uncommon finding and highly predictive of malignancy, as shown by our previous study (16). The tubular structures were found in the fibroadenoma, the clinically and radiologically stable case, one case of sclerosing adenosis and 2 of the complex sclerosing lesions/radial scars. On review, the tubular structures were misinterpreted. We defined these structures as angulated tubular structures with parallel sides, sharp ends and no obvious myoepithelial cells. Shabb et al. and Simsir et al. note that the finding of tubular structures should be interpreted with caution, as they can also be found in fibroadenomas, sclerosing adenosis and complex sclerosing lesions but are more frequently observed in tubular carcinoma (9,22). Kundu also warns against over-interpreting tubular structures as they can mimic malignancy in cases of sclerosing adenosis but states that these should also contain myoepithelial cells. Tubular carcinoma usually has greater cellularity with more rigid or acutely angled configurations when compared to sclerosing adenosis (23). These observations have also been reported by Bozanini et al. (24). The review of the specific cases containing tubules showed round rather than sharp ends and obvious myoepithelial cells. Figure 4 illustrates some of the misinterpreted tube-like structures seen in fibroadenomas and complex sclerosing lesions. The failure of the calculator in this instance is due to human interpretative error.
The other 3 non-malignant cases found beyond the upper cut-point all lacked benign features, including cohesion, myoepithelial cells or bare bipolar nuclei and cystic background. The absence of these benign features produced a higher probability, hence their up-grade to C4. These included 2 complex sclerosing lesions and 1 papilloma. Orell, when examining false positives caused by complex sclerosing lesions, comments that these cases often contain some proportion of co-existing benign epithelium and bare bipolar nuclei (25). Mak and Field also note the importance of not over-diagnosing cytological material as malignant in cases of complex sclerosing lesions/radial scars (22, 26-28). Certainly, the combination of diminished numbers of myoepithelial cells or bare bipolar nuclei in the background, loss of cohesion and absence of a cystic background amounts to a higher probabilistic malignant conclusion.
On review of these three non-malignant cases, poor presentation of the cytological material was evident, including distortion and air-drying artefact, contributing to the misinterpretation of the lesions and thereby limiting the use of the calculator. This statistically derived probability calculator was created to enhance decision making when interpreting difficult breast FNA and to reduce sources of bias (29,30). The methodology used for this project was intended to reduce subjectivity by providing an evidence-based tool which could calculate the risk of malignancy of the combined predictive criteria, thereby assisting in allocating a lesion into a suitable diagnostic category. However, this calculator has limitations, as it focuses only on the microscopic features and does not include the clinical and imaging findings of the triple test. Although some advancement in diagnosis can be made by using the calculator in some instances, it is not a substitute for experience and teamwork. The atypical category by its very nature is a subjective category with biologically variable lesions and is often limited by poor sampling and pre-analytical issues.
Overall, the calculator was found to be of limited value in re-assigning cases to either lower or higher categories. However, further evaluation of the calculator for inter-observer and intra-observer agreement is required. The probability calculator may be of value in cytologically difficult cases, but other diagnostic modalities such as core biopsy or surgical excision are likely to be required in these circumstances.
In summary, the atypical category (C3) of breast FNA cytology encompasses histologically benign and malignant conditions. Using an algorithm based upon five key cytomorphological criteria and age, developed from a training set of 180 cases and applied to a validation set of 182 subsequent C3 cases, we were able to predict malignant and benign proliferative outcomes with reasonable accuracy. Benign non-proliferative changes could not be reliably diagnosed using the selected criteria.
The ability of the probability calculator to aid in the allocation of cases into the C3 reporting category was limited. Cut-points based on the probability of malignancy were chosen to re-grade cases from C3 into lower or higher categories. At the lower cut-point, the calculator re-allocated 61 (33.5%) cases into C2; however, 5 (8.2%) of these were papillary malignancies that were under-interpreted (false negatives), a limitation of the calculator that was considered unacceptable. The upper cut-point up-graded 33/182 (18.1%) of the cases from C3 to a higher category (C4). However, this up-graded group contained 8 non-malignant lesions (24.2%), signifying over-interpretation and wrongful allocation into C4.
The C3 category remains a heterogeneous mix of pathological lesions. Attempts to minimise inappropriate allocation of cases into this category have met with limited success.
Figure 4. Illustration of some misidentified tube-like structures. Panel A depicts a tubular structure from a fibroadenoma. Note the rounded ends and overlying myoepithelial cells (QD stain, original magnification x60). Panel B is from a complex sclerosing lesion and shows a rigid type tubular structure with elongated peripheral nuclei and sharp ends but obvious overlying myoepithelial cells (Papanicolaou stain, original magnification x60).