NLP for Symptom Detection

Natural Language Processing for Symptom Detection in Unstructured Provider-Patient Conversation -- MIT Machine Learning in Healthcare Semester Project

Full text:


A summary is shown below.

Objective

We identify symptom-related conversation segments in physician-patient dialogue using natural language processing (NLP) techniques, with the goal of developing an automated pipeline for symptom detection in unstructured clinical conversation. By helping to address the high symptom burden that patients face, this work aims to improve care satisfaction and decrease healthcare costs.

Dataset Overview
  • Turn-level conversation data between patients and their healthcare providers
  • Roughly 79,000 turns spanning 181 unique conversations and 94 unique patients
  • Around 13% of the turns are symptom-related
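
To make the turn-level framing concrete, here is a minimal sketch of how such a dataset could be represented and summarized. The `Turn` record, its field names, and the toy rows are all hypothetical and are not the project's actual schema or data. The prevalence matters when reading the results: a classifier that guesses at random has an expected precision equal to the fraction of positive turns, so at around 13% the no-skill baseline on a precision-recall plot sits near 0.13.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One conversational turn (hypothetical schema, for illustration only)."""
    conversation_id: str
    patient_id: str
    speaker: str        # "patient" or "provider"
    text: str
    is_symptom: bool    # True if the turn is labeled symptom-related


def summarize(turns):
    """Return basic dataset counts and the fraction of symptom-related turns."""
    if not turns:
        raise ValueError("expected at least one turn")
    return {
        "n_turns": len(turns),
        "n_conversations": len({t.conversation_id for t in turns}),
        "n_patients": len({t.patient_id for t in turns}),
        # Also the expected precision of a random classifier on these turns.
        "prevalence": sum(t.is_symptom for t in turns) / len(turns),
    }


# Toy rows, invented for this example.
toy_turns = [
    Turn("c1", "p1", "provider", "How have you been feeling?", False),
    Turn("c1", "p1", "patient", "The nausea has been worse this week.", True),
    Turn("c2", "p1", "provider", "Let's schedule the next visit.", False),
    Turn("c3", "p2", "patient", "Thanks, see you then.", False),
]

summary = summarize(toy_turns)
print(summary)
```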

Model Comparison



Results
Left: Precision-recall curves for each method tested. Right: Receiver operating characteristic (ROC) curves for each method. The dashed gray line indicates the curve for a random classifier. The AUROC for each curve is listed in Table 1. Error bars show 95% confidence intervals, estimated by bootstrapping with 250 resamples per threshold value.
Quantitative metrics for each model on the non-preprocessed data.
SHAP Plot using XGBoost.
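
The bootstrap error bars described in the figure caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: it uses a percentile bootstrap, resamples turns with replacement, and reads the interval from a simple floor-indexed percentile. The function names, the fallback values for undefined metrics, and the synthetic labels and scores are all our own choices.

```python
import random


def precision_recall(y_true, y_score, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = fp = fn = 0
    for y, s in zip(y_true, y_score):
        predicted = s >= threshold
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif y:
            fn += 1
    # Conventions chosen for this sketch when a metric is undefined.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def bootstrap_ci(y_true, y_score, threshold, n_resamples=250, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for precision and recall at one threshold."""
    rng = random.Random(seed)
    n = len(y_true)
    precisions, recalls = [], []
    for _ in range(n_resamples):
        # Resample examples with replacement, keeping labels and scores paired.
        idx = [rng.randrange(n) for _ in range(n)]
        p, r = precision_recall(
            [y_true[i] for i in idx], [y_score[i] for i in idx], threshold
        )
        precisions.append(p)
        recalls.append(r)

    def interval(values):
        ordered = sorted(values)
        last = len(ordered) - 1
        return ordered[int(alpha / 2 * last)], ordered[int((1 - alpha / 2) * last)]

    return interval(precisions), interval(recalls)


# Synthetic data with roughly 13% positives, invented for this example.
data_rng = random.Random(42)
y_true = [1 if data_rng.random() < 0.13 else 0 for _ in range(400)]
y_score = [
    min(1.0, max(0.0, data_rng.gauss(0.7 if y else 0.3, 0.15))) for y in y_true
]

precision_ci, recall_ci = bootstrap_ci(y_true, y_score, threshold=0.5)
print("precision 95% CI:", precision_ci)
print("recall 95% CI:", recall_ci)
```

Repeating the call over a grid of thresholds gives one interval per point on the curve, which matches the "per threshold value" wording in the caption.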



Takeaways
  • The Transformer-based BERT model performed best; the LSTM was second best and is less computationally intensive
  • BioBERT performed poorly, suggesting that conventional NLP models are sufficient for this task
  • Bag-of-words models are very interpretable; next steps may focus on interpreting the deep learning models
  • A future goal is to incorporate symptom detection into an automated pipeline used during patient care
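
One way to see why bag-of-words models are easy to interpret: every feature is a single word, so each can be scored directly against the label. The sketch below computes a smoothed log odds ratio per token as a simple stand-in. It is not the SHAP analysis on XGBoost reported in the results, and the tokenizer, the smoothing constant, and the example turns are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a turn and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def token_log_odds_ratio(texts, labels, smoothing=1.0):
    """Smoothed log odds ratio of each token appearing in symptom turns
    versus non-symptom turns. Positive values point toward symptom turns."""
    pos_counts, neg_counts = Counter(), Counter()
    for text, label in zip(texts, labels):
        # Count presence per turn, not raw frequency.
        (pos_counts if label else neg_counts).update(set(tokenize(text)))
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    scores = {}
    for token in set(pos_counts) | set(neg_counts):
        p_pos = (pos_counts[token] + smoothing) / (n_pos + 2 * smoothing)
        p_neg = (neg_counts[token] + smoothing) / (n_neg + 2 * smoothing)
        scores[token] = math.log(p_pos / (1 - p_pos)) - math.log(p_neg / (1 - p_neg))
    return scores


# Toy turns and labels, invented for this example (1 = symptom-related).
texts = [
    "I have a headache and nausea",
    "my pain is worse at night",
    "how was your weekend",
    "see you next week",
    "the pain started yesterday",
    "thanks for coming in",
]
labels = [1, 1, 0, 0, 1, 0]

scores = token_log_odds_ratio(texts, labels)
for token, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{token:10s} {score:+.2f}")
```

A deep model offers no such per-word table out of the box, which is why interpreting the BERT and LSTM models is listed as a next step.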

Github: