Leveraging existing corpora for de-identification of psychiatric notes using domain adaptation

AMIA Annu Symp Proc. 2018 Apr 16;2017:1070-1079. eCollection 2017.

Abstract

De-identification of clinical notes is a special case of named entity recognition. Supervised machine-learning (ML) algorithms have achieved promising results for this task. However, ML-based de-identification systems often require annotating a large number of clinical notes from the domain of interest, which is costly. Domain adaptation (DA) is a technique that enables learning from annotated datasets from different sources, thereby reducing the annotation cost of ML training in the target domain. In this study, we investigate the use of DA methods for de-identification of psychiatric notes. Three state-of-the-art DA methods (instance pruning, instance weighting, and feature augmentation) are applied to three source corpora: annotated hospital discharge summaries, outpatient notes, and a mixture of different note types written for diabetic patients. Our results show that DA can increase de-identification performance over the baselines, indicating that it can effectively reduce annotation cost for the target psychiatric notes. Feature augmentation increases performance the most among the three DA methods. Performance also varies with the type of source notes: the mixture of different note types brings the biggest increase in performance.
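As an illustration of the feature augmentation idea named above, the following is a minimal sketch. It assumes the widely used scheme in which every feature is copied into a shared ("general") version and a domain-specific version, so that a learner can weight shared and domain-specific evidence separately. The abstract does not specify the exact variant used in the study, and the function name `augment_features`, the `general`/`source`/`target` prefixes, and the toy token features are all hypothetical.

```python
from typing import Dict


def augment_features(features: Dict[str, float], domain: str) -> Dict[str, float]:
    """Return a copy of `features` with a shared and a domain-specific version of each.

    Hypothetical helper: the prefixes are illustrative, not taken from the paper.
    """
    augmented: Dict[str, float] = {}
    for name, value in features.items():
        augmented[f"general:{name}"] = value    # copy shared by all domains
        augmented[f"{domain}:{name}"] = value   # copy visible only to this domain
    return augmented


# Toy token-level features for a named entity recognition tagger (assumed, not real data).
token_features = {"word=Boston": 1.0, "is_capitalized": 1.0}

source_token = augment_features(token_features, "source")  # e.g. a discharge summary
target_token = augment_features(token_features, "target")  # e.g. a psychiatric note

# Only the "general:" copies overlap between the two domains.
shared_keys = sorted(set(source_token) & set(target_token))
print(shared_keys)
```

With this representation, a model trained on source and target data together can learn behavior common to both domains through the `general:` copies while still capturing target-specific patterns through the `target:` copies.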

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms
  • Data Anonymization*
  • Datasets as Topic
  • Diabetes Mellitus
  • Electronic Health Records*
  • Humans
  • Machine Learning*
  • Natural Language Processing
  • Outpatients
  • Patient Discharge Summaries
  • Psychiatry*