Nikola Rahman, HWAI Engineer at HTEC.
Medical images such as MRIs, CT scans, and X-rays are among the most important tools doctors use in diagnosing conditions ranging from spine injuries to heart disease and cancer. However, analyzing medical images and signals is often a difficult, time-consuming, expensive and error-prone process that requires specialists such as radiologists and ECG experts. Deep learning can help doctors make faster and more accurate diagnoses, and it can flag the risk of a disease early enough to prevent it.
The number of people with cardiac diseases and heart conditions has increased substantially. At the same time, medical staff are in short supply and the schedules of patients and their physicians often conflict, so regular check-ups and monitoring of such conditions are postponed in favor of other obligations. HTEC has stepped in with HUMEDS Cardea-3, an innovative telehealth technology that allows comfortable and straightforward ECG recording and analysis, and enables quick, automatic and precise diagnosis of cardiac arrhythmias using machine learning algorithms.
Atrial fibrillation (AFIB) is a type of cardiac arrhythmia that affects approximately 33.5 million individuals worldwide. AFIB is a significant risk factor for stroke: it is implicated in 15–30% of stroke cases and is often not diagnosed beforehand. Stroke risk can be reduced by about two thirds through the use of oral anticoagulant (OAC) therapy (including non-vitamin K antagonist OACs [NOACs]). This means that roughly two out of three AFIB-related strokes could be prevented if the arrhythmia is detected in time.
AFIB and normal sinus rhythm (NSR) episodes are illustrated in Figures 1 and 2. Figure 2 shows fibrillatory waves in the heart’s atrium, which is how atrial fibrillation got its name. There are two manifestations of an AFIB episode in a patient’s ECG signal:
- Fibrillatory waves, indicated by the black arrow in the top image;
- Irregular heart rhythm as a consequence of the fibrillatory waves. Since there is no organized electrical activity in the atrium, unlike in an NSR episode, the ventricles contract at irregular intervals, which shows up as an irregular heart rhythm.
Figure 1: An example of an AFIB episode (red) and an NSR episode (blue).
Figure 2: Animation of a heart in an AFIB episode (left) and an NSR episode (right).
Our goal at HTEC is to combine our machine learning expertise with the knowledge of cardiology experts and develop a binary classification model that predicts whether an input ECG signal is an AFIB episode or not. Detecting AFIB early is an important problem, since it could help prevent millions of strokes.
The dataset we used consisted of 2,400 hours of multi-channel ECG data from 256 patients. It contained ECG records from patients with permanent and paroxysmal (intermittent) AFIB, healthy patients, and patients with a known history of other types of heart diseases. The dataset was evenly balanced between AFIB and Non-AFIB examples.
To prepare the dataset for processing, we extracted 15-second ECG excerpts from each channel and used them as training, validation and testing examples. This gave us around 3 million 15-second examples. We used around 80% of the dataset for training and 10% each for validation and testing. The split was performed in an interpatient manner, meaning that all excerpts from a given patient went into exactly one subset, so no patient appeared in, e.g., both training and validation.
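The interpatient split described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function name `interpatient_split`, the `seed` parameter, and the idea of passing one patient ID per excerpt are all assumptions. It also splits by patient count, so the share of examples in each subset only approximates 80/10/10 when patients contribute different amounts of data.

```python
import random
from collections import defaultdict


def interpatient_split(patient_ids, train_frac=0.8, val_frac=0.1, seed=0):
    """Assign every patient to exactly one of train/val/test.

    patient_ids: sequence with one patient ID per example (hypothetical layout).
    Returns a dict mapping split name -> list of example indices.
    """
    # Group example indices by the patient they came from.
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    # Shuffle patients (not examples) so a patient can never straddle two splits.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_train = int(round(train_frac * len(patients)))
    n_val = int(round(val_frac * len(patients)))
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    return {
        name: [i for pid in pids for i in by_patient[pid]]
        for name, pids in groups.items()
    }
```

Shuffling at the patient level is the point of the design: shuffling individual excerpts would leak near-identical signals from the same person into both training and evaluation sets.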
We applied a bandpass filter with a 0.5–80 Hz passband to remove baseline wander and high-frequency noise, both of which are common in ECG recordings.
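One possible way to implement such a filter is sketched below with SciPy. Only the 0.5–80 Hz band comes from the text; the Butterworth design, the filter order, the zero-phase (forward-backward) filtering, and the 250 Hz sampling rate used in the usage note are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def bandpass_ecg(signal, fs, low_hz=0.5, high_hz=80.0, order=4):
    """Zero-phase Butterworth bandpass filter for a 1-D ECG excerpt.

    fs is the sampling rate in Hz and must be above 2 * high_hz.
    The filter type and order are illustrative assumptions.
    """
    # Second-order sections are numerically more stable than (b, a) coefficients.
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    # Filtering forward and backward cancels the phase distortion.
    return sosfiltfilt(sos, np.asarray(signal, dtype=float))
```

For example, `bandpass_ecg(excerpt, fs=250)` would filter a 15-second excerpt sampled at a hypothetical 250 Hz. The low cutoff suppresses slow baseline drift, while the high cutoff removes noise above the band that carries most ECG information.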
There is a lot of empirical evidence that Convolutional Neural Networks (CNNs) work well with unstructured data such as images, videos, sounds and medical images. Our approach to AFIB classification is based on the ResNet CNN architecture [ref1, ref2]. ResNets use skip connections that make the CNN parameters easier to optimize. This facilitates the training of deeper nets, which perform better on both the training and test sets. Figure 3 shows a block diagram of our ResNet for AFIB classification.
Figure 3: Block diagram of a CNN for ECG classification
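To illustrate what a skip connection does, here is a deliberately simplified, single-channel residual block in NumPy. It is a toy sketch and not the network in Figure 3: real ResNets use many channels, normalization layers and learned weights, and the function names and kernels here are hypothetical.

```python
import numpy as np


def relu(x):
    """Rectified linear unit applied element-wise."""
    return np.maximum(x, 0.0)


def residual_block_1d(x, kernel_a, kernel_b):
    """Toy residual block: relu(x + conv(relu(conv(x)))).

    x is a 1-D signal; kernel_a and kernel_b are 1-D convolution kernels
    shorter than x (stand-ins for learned filters).
    """
    h = relu(np.convolve(x, kernel_a, mode="same"))
    h = np.convolve(h, kernel_b, mode="same")
    # Skip connection: the input is added back to the convolutional branch.
    return relu(x + h)
```

Because the input is added back to the branch output, the block only has to learn a correction to the identity mapping. With all-zero kernels it simply returns `relu(x)`, which is one intuition for why stacking many such blocks does not make optimization harder.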
The network was trained with the following setup:
- Adam optimizer with default parameters
- Minibatch training with minibatch size 64
- Data augmentation: random scale, shift, inversion and adding electrode motion artifact
- Reduce the learning rate by an order of magnitude when the validation loss plateaus
- Train for 100 epochs (about six days on a GTX 1080 Ti)
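The learning-rate rule from the list above can be sketched in a few lines of plain Python. The factor of 0.1 reflects the "order of magnitude" reduction; the class name `ReduceOnPlateau`, the `patience` and `min_delta` parameters, and the starting rate of 1e-3 (the Adam default in common frameworks) are assumptions, since the text does not say how a plateau was detected.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` when validation loss stops improving."""

    def __init__(self, lr=1e-3, factor=0.1, patience=5, min_delta=0.0):
        self.lr = lr
        self.factor = factor        # 0.1 = one order of magnitude
        self.patience = patience    # epochs without improvement to tolerate (assumed)
        self.min_delta = min_delta  # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the current rate."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

In a training loop, `step` would be called after each validation pass and its return value handed to the optimizer. Deep learning frameworks ship equivalent schedulers, so a hand-written version like this is mainly useful for seeing the logic.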
Moving to Error Analysis
The error analysis showed that most errors were false positives. A closer look revealed that about 80% of the errors were ECG signals that were neither NSR nor AFIB, but some other type of arrhythmia. These examples were significantly under-represented in the dataset, comprising only 4% of it. We considered several approaches to overcome this problem:
- Removing some of the NSR and AFIB examples from the dataset in order to balance the classes (here, any kind of arrhythmia that is not AFIB is considered a separate class, ARRH). The examples to keep would be chosen using clustering.
- Copying examples from the under-represented class (ARRH)
- Weighted loss function — putting higher weight on examples that come from the ARRH class
- Sampling mini-batches from a categorical distribution (i.e., a generalized Bernoulli distribution): the probability $p_i$ of an example $x_i$ being sampled for the mini-batch is set to be inversely proportional to the frequency of that example’s class $y_i$, where $y_i \in \{\text{NSR}, \text{AFIB}, \text{ARRH}\}$. As a result, on average, one third of the examples in a mini-batch are NSR, one third AFIB, and one third ARRH episodes. This sampling is performed at training time.
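The last option in the list above can be sketched with NumPy as follows. The function names and the choice to sample with replacement are assumptions for illustration; the inverse-frequency weighting and the mini-batch size of 64 follow the text.

```python
import numpy as np


def balanced_sampling_probs(labels):
    """Per-example sampling probabilities, inversely proportional to class frequency."""
    labels = np.asarray(labels)
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    # Each example gets weight 1 / (number of examples in its class).
    weights = 1.0 / counts[inverse]
    return weights / weights.sum()


def sample_minibatch(labels, batch_size=64, rng=None):
    """Draw example indices for one mini-batch from the categorical distribution."""
    rng = np.random.default_rng() if rng is None else rng
    probs = balanced_sampling_probs(labels)
    # Sampling with replacement lets rare-class examples recur across batches.
    return rng.choice(len(labels), size=batch_size, replace=True, p=probs)
```

With these weights the total probability mass of each class is the same, so a dataset that is 48% NSR, 48% AFIB and 4% ARRH still yields mini-batches that contain about one third of each class on average.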
We chose the last approach since it neither excluded possibly useful AFIB or NSR examples nor required redundant copies of examples. We also experimented with weighting the loss function, but sampling mini-batches from the categorical distribution gave the best results on the validation set.
Our goal was to outperform a baseline model that used a feed-forward neural network and hand-engineered features for AFIB detection. These features model heart rhythm dynamics as well as the morphology and spectral content of the atrial activity. The positive predictive value (PPV) was used as the evaluation metric, with the constraint that sensitivity must not be lower than that of the baseline model (0.910).
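The evaluation criterion can be made concrete with a short sketch. Sensitivity is TP / (TP + FN) and PPV is TP / (TP + FP). The helper `select_threshold` is hypothetical: the text does not say how the sensitivity constraint was enforced, so choosing a decision threshold on the classifier score is only one plausible way to do it.

```python
def sensitivity_and_ppv(y_true, y_pred):
    """Return (sensitivity, PPV) for binary labels where 1 means AFIB."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv


def select_threshold(y_true, scores, thresholds, min_sensitivity=0.910):
    """Pick the threshold with the highest PPV among those meeting the constraint.

    Hypothetical helper; returns None if no threshold reaches min_sensitivity.
    """
    best = None
    for thr in thresholds:
        y_pred = [1 if s >= thr else 0 for s in scores]
        sens, ppv = sensitivity_and_ppv(y_true, y_pred)
        if sens >= min_sensitivity and (best is None or ppv > best[1]):
            best = (thr, ppv, sens)
    return best
```

Fixing a sensitivity floor and then maximizing PPV keeps the comparison fair: a model cannot raise its PPV simply by missing more AFIB episodes than the baseline does.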
Figure 4 compares the results of the baseline model and the CNN model. The CNN model is the clear winner, reducing the false positive rate by 40% and the false negative rate by 56%. These results support the assumption that a CNN gives better results when given enough data.
Figure 4: The comparison of the baseline model (blue) and the CNN model (green) results.
Conclusion and Future Steps
It is no surprise that the CNN model with minimal preprocessing performs better than a baseline model that takes hand-engineered features as inputs. The CNN model yields both higher sensitivity and higher PPV, while requiring far less feature-engineering effort and only minimal preprocessing. The downside of the CNN approach is its need for a vast amount of training data: the current error analysis shows that most errors come from arrhythmias that have very few examples in the training set. Collecting more examples of these types of arrhythmias would very likely boost the score, but collecting and labeling medical data is time-consuming and expensive. Until then, one should not forget the more traditional approaches in machine learning and signal processing, which perform better on smaller amounts of data.