Can AI Generate Clinically Appropriate X-Ray Reports? Judging the Accuracy and Clinical Validity of Deep Learning-generated Text Reports as Compared to Reports Generated by Radiologists: A Retrospective Comparative Study (RSNA 2019, Dec 1–6)

PURPOSE

Implementation of deep learning algorithms in clinical practice is limited by the nature of the output the algorithms provide. We evaluated the accuracy, clinical validity, clarity, consistency, and level of hedging of AI-generated chest X-ray (CXR) reports compared with radiologist-generated clinical reports.

METHOD AND MATERIALS

A total of 297 CXRs acquired on a conventional X-ray system (GE Healthcare, USA) fitted with a retrofit DR detector (Konica Minolta, Japan) were pulled from the PACS along with their corresponding reports. The anonymised CXRs were analysed by a CE-approved deep learning-based CXR analysis algorithm (ChestEye, Oxipit, Lithuania), which detects abnormalities and auto-generates clinical reports. The algorithm is an ensemble of multiple classification, detection, and segmentation neural networks capable of identifying 75 different radiological findings and extracting their locations. The outputs of these models are passed to a custom automatic text generator, tailored by multiple radiologists, to produce a structured and cohesive report (see the sketch below). The models were trained on around 1 million chest X-rays drawn from multiple data sources; the algorithm had not previously been trained or tested on CXRs from our institution. A radiologist with 9 years' experience performed an informed review of both sets of reports, evaluating their accuracy and clinical appropriateness.
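To make the described pipeline concrete, the following is a minimal sketch of how per-finding classifier outputs and localisations could be turned into templated report text. The finding names, threshold, and templates are hypothetical illustrations; ChestEye's actual models, finding vocabulary, and text generator are proprietary and not detailed in this abstract.

```python
# Hypothetical sketch of an ensemble-to-report pipeline; not ChestEye's
# actual implementation. Names, threshold, and templates are assumptions.

FINDING_TEMPLATES = {
    # finding -> sentence template; a real system would cover ~75 findings
    "cardiomegaly": "The cardiac silhouette is enlarged.",
    "pleural_effusion": "There is a pleural effusion in the {location}.",
    "consolidation": "Consolidation is seen in the {location}.",
}

def generate_report(classifier_probs: dict, locations: dict,
                    threshold: float = 0.5) -> str:
    """Combine per-finding probabilities (classification ensemble) with
    localisations (detection/segmentation nets) into a structured report."""
    sentences = []
    for finding, prob in classifier_probs.items():
        if prob >= threshold and finding in FINDING_TEMPLATES:
            template = FINDING_TEMPLATES[finding]
            sentences.append(
                template.format(location=locations.get(finding, "unspecified region"))
            )
    if not sentences:
        return "FINDINGS: No significant abnormality detected."
    return "FINDINGS: " + " ".join(sentences)

# Example usage with mock model outputs:
probs = {"cardiomegaly": 0.91, "pleural_effusion": 0.72, "consolidation": 0.08}
locs = {"pleural_effusion": "right costophrenic angle"}
print(generate_report(probs, locs))
```

A template-driven generator of this kind is one plausible way to obtain the "structured and cohesive" reports the abstract describes, since it constrains phrasing to radiologist-vetted sentences rather than free-form generation.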

RESULTS

In 236 of 297 (79%) cases, algorithm-generated reports were found to be as accurate as the radiologists' reports. In 16 (5%) cases, algorithm-generated reports were found to be either more accurate or more clinically appropriate. In 18 (6%) cases, the algorithm made significant diagnostic errors, and in 27 (9%) cases, the algorithm-generated reports were found to be clinically inappropriate or insufficient even though the significant findings were correctly identified and localised.
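As a quick sanity check, the four outcome categories partition the full cohort, and the reported percentages follow directly from the 297-case denominator:

```python
# Sanity check of the reported category counts and rounded percentages.
counts = {
    "as accurate as radiologist": 236,
    "more accurate/appropriate": 16,
    "significant diagnostic errors": 18,
    "clinically inappropriate/insufficient": 27,
}
total = sum(counts.values())
assert total == 297  # the categories cover every study in the cohort

for label, n in counts.items():
    print(f"{label}: {n}/{total} = {100 * n / total:.0f}%")
# -> 79%, 5%, 6%, 9% (rounded), matching the figures above.
```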

CONCLUSION

We demonstrate, for the first time to our knowledge, a comparison between reports auto-generated by a deep learning algorithm and reports written by a practicing radiologist. The algorithm-generated reports showed high accuracy and clinical appropriateness comparable to the radiologists' reports, paving the way for a new potential deployment strategy for AI in radiology.

CLINICAL RELEVANCE/APPLICATION

We report on an algorithm with the potential to produce standardised, accurate reports in a manner that is easily understandable and readily deployable in the clinical environment.