Part B - Data Analysis of EEG Dataset
Dataset and Classification Problem
We use a modified and restructured form of the widely cited Epileptic Seizure Recognition Data Set (Andrzejak et al. 2001). The original data set is composed of five sets labelled A-E. Sets A and B are surface EEG recordings taken from five healthy volunteers; set A with their eyes closed, set B with eyes open. The remaining sets are all intracranial EEG recordings taken from patients undergoing presurgical diagnosis. In set C, readings were taken from a healthy area of the brain, while for set D the readings originate in an unhealthy, epileptogenic zone. Sets C and D contain no seizure activity, while set E consists solely of seizure activity, with readings from various regions of the brain during an ictal period (period of seizure).
Within each set are 100 files, each containing 23.6 seconds of EEG activity sampled at 173.61 Hz, giving 4097 data points per file. In total, therefore, there are 2,048,500 individual samples.
The modified form of the dataset that we use (Wu and Fokoue, 2017) was created by splitting each of the 500 files into 23 segments, then shuffling these 11,500 segments and annotating each with a label y ∈ {1, 2, 3, 4, 5}. Wu and Fokoue state that each segment covers 1 second, but each segment appears to span a slightly longer duration of 1.0253 seconds (using the 173.61 Hz sampling rate reported in Andrzejak et al. 2001), and 3 samples appear to have been dropped from each file, as the total size of their dataset is only 2,047,000 samples. The resulting dataset is saved as a single CSV file.
The labels correspond to the original 5 sets as follows:
y    Set    Description
1    E      Seizure
2    D      Unhealthy brain area
3    C      Healthy brain area
4    B      Eyes open
5    A      Eyes closed
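As a brief illustration, the restructured CSV could be loaded and inspected as follows. This is only a sketch: the filename data.csv and the column names X1-X178 and y are assumptions based on the UCI description of the dataset.

import pandas as pd

# Load the restructured dataset (filename assumed) and separate features/label.
df = pd.read_csv("data.csv")
X = df.filter(regex=r"^X\d+$")   # the 178 EEG readings per one-second segment
y = df["y"]                      # class labels in {1, 2, 3, 4, 5}

print(X.shape)            # expected: (11500, 178)
print(y.value_counts())   # expected: 2300 instances of each label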
Classification & Evaluation
Pre-processing
We consider classification of instances into two classes: S, representing seizure, and NS, representing non-seizure.
y     Description     Instances (original)    Instances (after SMOTE)
S     Seizure         2300                    4600
NS    Non-seizure     9200                    9200
After aggregating instances under the NS label, we have a ratio of 4:1 between NS and S. Such an imbalance may bias models towards the majority class. One way to combat this is to generate synthetic data for the minority class, using methods such as the Synthetic Minority Over-sampling Technique (SMOTE). Using the SMOTE algorithm with k=5 nearest neighbours, the dataset is rebalanced to a 2:1 ratio.
The dataset is then shuffled to disperse the synthetic instances and split 70:30 into training and test sets respectively.
Train                    Test
y     Instances          y     Instances
S     3201               S     1399
NS    6459               NS    2741
Since we have a large number of instances (13,800 after SMOTE), we can afford to keep a separate test set even though we will also use k-fold cross-validation [REF?].
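A minimal sketch of this pre-processing is shown below, using imbalanced-learn's SMOTE in place of Weka's SMOTE filter and assuming X and y from the loading sketch above; random seeds and the exact shuffling differ from our Weka run, so instance counts will not match exactly.

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Binary relabelling: 1 = S (seizure), 0 = NS (labels 2-5).
y_bin = (y == 1).astype(int)

# Oversample S from 2300 to 4600 with k=5 neighbours, giving a 2:1 NS:S ratio.
smote = SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X, y_bin)

# Shuffle (dispersing the synthetic instances) and split 70:30.
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.3, shuffle=True, random_state=0)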
Classification Algorithms
Two classification algorithms are compared on the training set - KNN (IBk in Weka) and C4.5 Decision Tree (J48 in Weka). We use 10-fold stratified cross-validation in each case.
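For orientation, a rough scikit-learn analogue of this evaluation setup is sketched below, assuming X_train and y_train from the pre-processing sketch. Note that sklearn's DecisionTreeClassifier (CART) is not identical to C4.5/J48, so results will not match the Weka output exactly.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 10-fold stratified cross-validation on the training set.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
models = {
    "KNN (k=1)": KNeighborsClassifier(n_neighbors=1),
    "Decision tree": DecisionTreeClassifier(min_samples_leaf=2, random_state=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
    print(f"{name}: mean AUC = {auc.mean():.3f}")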
The KNN classifier, with k=1, is applied to the training set.
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 9487 98.2091 %
Incorrectly Classified Instances 173 1.7909 %
Kappa statistic 0.9592
Mean absolute error 0.018
Root mean squared error 0.1338
Relative absolute error 4.0664 %
Root relative squared error 28.4273 %
Total Number of Instances 9660
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.952 0.003 0.994 0.952 0.972 0.960 0.974 0.962 S
0.997 0.048 0.977 0.997 0.987 0.960 0.974 0.976 NS
W. Avg. 0.982 0.033 0.982 0.982 0.982 0.960 0.974 0.971
=== Confusion Matrix ===
a b <-- classified as
3047 154 | a = S
19 6440 | b = NS
The J48 Decision Tree (J48 DT) classifier, with a confidence factor of 0.25 and a minimum of 2 instances per leaf, is applied to the training set (the tree itself is omitted from the Weka output for brevity).
Time taken to build model: 8.53 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 8945 92.5983 %
Incorrectly Classified Instances 715 7.4017 %
Kappa statistic 0.8329
Mean absolute error 0.0771
Root mean squared error 0.2675
Relative absolute error 17.3878 %
Root relative squared error 56.8197 %
Total Number of Instances 9660
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.888 0.055 0.889 0.888 0.888 0.833 0.902 0.840 S
0.945 0.112 0.944 0.945 0.945 0.833 0.902 0.896 NS
W. Avg. 0.926 0.093 0.926 0.926 0.926 0.833 0.902 0.877
=== Confusion Matrix ===
a b <-- classified as
2842 359 | a = S
356 6103 | b = NS
The way in which Weka reports true/false positives/negatives may be somewhat confusing, so to clarify our use of these terms, we show the corresponding locations in the Weka output below.
TP Rate FP Rate Class
TP FP S
TN FN NS
a b <-- classified as
TP FN | a = S
FP TN | b = NS
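As a small check of how these quantities map to the reported per-class figures, the rates for class S can be recomputed from the KNN confusion matrix above (S treated as the positive class):

# KNN confusion matrix entries with S as the positive class.
tp, fn, fp, tn = 3047, 154, 19, 6440

tp_rate   = tp / (tp + fn)                                    # 0.952 (recall for S)
fp_rate   = fp / (fp + tn)                                    # 0.003
precision = tp / (tp + fp)                                    # 0.994
f_measure = 2 * precision * tp_rate / (precision + tp_rate)   # 0.972

print(f"TP rate={tp_rate:.3f}  FP rate={fp_rate:.3f}  "
      f"precision={precision:.3f}  F-measure={f_measure:.3f}")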
KNN appears to have performed significantly better, with an AUC of 0.974 compared to 0.902 for the J48 DT. Both true positives and true negatives are higher with KNN. False negatives with KNN are less than half those of the J48 DT; this is an important metric for any practical application of the model, given the clinical nature of the data. In addition, KNN produces almost 20 times fewer false positives.
We test for a significant difference in the AUC measure using a two-tailed t-test and find that the difference is statistically significant at the 95% confidence level.
Analysing: Area_under_ROC
Datasets: 1
Resultsets: 2
Confidence: 0.05 (two tailed)
Sorted by: -
Date: 31/05/2020, 18:35
Dataset (1) lazy.IB | (2) tree
------------------------------------------------
data2class-weka.filters.s(100) 0.97 | 0.91 *
------------------------------------------------
(v/ /*) | (0/0/1)
Key:
(1) lazy.IBk '-K 1 -W 0 -A \"weka.core.neighboursearch.LinearNNSearch -A \\\"weka.core.EuclideanDistance -R first-last\\\"\"' -3080186098777067172
(2) trees.J48 '-C 0.25 -M 2' -217733168393644444
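A simplified sketch of such a comparison on per-fold AUC scores is shown below, assuming X_train and y_train from the earlier pre-processing sketch. Weka's Experimenter applies a corrected resampled t-test; scipy's plain paired t-test is used here purely as an illustration of the idea.

from scipy import stats
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc_knn = cross_val_score(KNeighborsClassifier(n_neighbors=1),
                          X_train, y_train, cv=cv, scoring="roc_auc")
auc_dt = cross_val_score(DecisionTreeClassifier(min_samples_leaf=2, random_state=0),
                         X_train, y_train, cv=cv, scoring="roc_auc")

# Two-tailed paired t-test on the 10 per-fold AUC values.
t_stat, p_value = stats.ttest_rel(auc_knn, auc_dt)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # p < 0.05 => significant at 95%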
Principal Component Analysis (PCA)
The main reason that PCA is used is to avoid the so-called curse of dimensionality. PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. [9]
In this way, one can take the first n principal components up to a chosen threshold of variance explained. This involves a trade-off between the loss of information and the benefit of working with fewer dimensions, but as can be seen in Figure 8, most of the information is concentrated in the first components, so the trade-off is in many if not most cases worthwhile.
PCA was performed on the EEG dataset and the cumulative variance explained for the first 100 components was calculated. The variance explained by using n principal components is plotted below.
Figure 8 - Variance explained as a function of number of principal components
The variance explained rises rapidly as n approaches 40; by n=38, over 95% of the variance is explained. The last 126 principal components account for less than 1% of the variance, as at n=52 over 99% of the variance is already explained.
PCA is also useful for visualising data by reducing it to 2 or 3 dimensions. However, in all but the simplest datasets, the majority of the information is lost in doing so. To plot the EEG dataset in 2D (Figure 9), almost 90% of the information contained in the dataset is discarded, as at n=2 just 10.85% of the variance is explained.
Figure 9 - Visualising the EEG dataset via PCA
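A sketch of how Figures 8 and 9 could be reproduced with scikit-learn is given below, assuming X and y from the loading sketch; whether and how the features are standardised before PCA is an assumption here and will affect the exact variance figures.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)   # standardisation is an assumption
pca = PCA().fit(X_std)

# Cumulative variance explained as a function of n (Figure 8).
cum_var = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cum_var) + 1), cum_var)
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative variance explained")
plt.show()

# 2D projection for visualisation (Figure 9), coloured by class label.
X_2d = PCA(n_components=2).fit_transform(X_std)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=2)
plt.show()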
Feature Selection
We apply Correlation-based Feature Selection (CFS), with Best First search. CFS scores the merit of a candidate feature subset, and this score is used as the heuristic by the search algorithm. Hall (2000) states that “a good feature subset is one that contains features highly correlated with the class yet uncorrelated with each other”.
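For illustration, Hall's merit heuristic for a candidate subset of k features can be sketched as below, using absolute Pearson correlations as a simplification; Weka's CfsSubsetEval uses different correlation measures internally, so the values will not match its reported merit exactly. A Best First search then expands candidate subsets, keeping the one with the highest merit (reported by Weka as "Merit of best subset found" below).

import numpy as np

def cfs_merit(X_subset: np.ndarray, y: np.ndarray) -> float:
    """CFS merit = k*mean(|feature-class corr|) / sqrt(k + k*(k-1)*mean(|feature-feature corr|))."""
    k = X_subset.shape[1]
    # Mean absolute feature-class correlation.
    r_cf = np.mean([abs(np.corrcoef(X_subset[:, i], y)[0, 1]) for i in range(k)])
    if k == 1:
        return r_cf
    # Mean absolute feature-feature correlation (off-diagonal entries only).
    corr = np.abs(np.corrcoef(X_subset, rowvar=False))
    r_ff = (corr.sum() - k) / (k * (k - 1))
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)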
=== Attribute Selection on all input data ===
Search Method:
Best first.
Start set: no attributes
Search direction: forward
Stale search after 5 node expansions
Total number of subsets evaluated: 11563
Merit of best subset found: 0.554
Attribute Subset Evaluator (supervised, Class (nominal): 179 y):
CFS Subset Evaluator
Including locally predictive attributes
Selected attributes: 1,2,6,8,9,13,16,17,20,21,25,26,30,31,32,33,37,38,41,44,47,50,51,53,56,58,61,63,65,69,71,75,77,82,85,87,89,93,94,96,97,98,102,103,106,107,109,112,115,116,118,120,122,126,127,129,130,133,135,138,140,142,144,146,150,152,153,156,157,161,162,165,167,169,171,173,174,175,176,178 : 80
CFS selects 80 predictive attributes out of the original 178 features. The J48 Decision Tree classifier is applied to this reduced dataset, with 10-fold stratified cross-validation.
Time taken to build model: 4.28 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 9001 93.1781 %
Incorrectly Classified Instances 659 6.8219 %
Kappa statistic 0.8461
Mean absolute error 0.0722
Root mean squared error 0.2568
Relative absolute error 16.297 %
Root relative squared error 54.5623 %
Total Number of Instances 9660
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.898 0.051 0.896 0.898 0.897 0.846 0.912 0.848 S
0.949 0.102 0.949 0.949 0.949 0.846 0.912 0.908 NS
W. Avg. 0.932 0.085 0.932 0.932 0.932 0.846 0.912 0.888
=== Confusion Matrix ===
a b <-- classified as
2874 327 | a = S
332 6127 | b = NS
Comparing with the previous results on the full feature set (repeated below), all measures have improved, but the improvements are small and could well be due to chance.
Correctly Classified Instances 8945 92.5983 %
Incorrectly Classified Instances 715 7.4017 %
Kappa statistic 0.8329
Mean absolute error 0.0771
Root mean squared error 0.2675
Relative absolute error 17.3878 %
Root relative squared error 56.8197 %
Total Number of Instances 9660
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.888 0.055 0.889 0.888 0.888 0.833 0.902 0.840 S
0.945 0.112 0.944 0.945 0.945 0.833 0.902 0.896 NS
W. Avg. 0.926 0.093 0.926 0.926 0.926 0.833 0.902 0.877
=== Confusion Matrix ===
a b <-- classified as
2842 359 | a = S
356 6103 | b = NS
We test for a significant difference in the AUC measure using a two-tailed t-test.
Analysing: Area_under_ROC
Datasets: 2
Resultsets: 1
Confidence: 0.05 (two tailed)
Sorted by: -
Date: 31/05/2020, 19:36
Dataset (1) trees.J48
---------------------------------------
data2class-weka.filters.s(100) 0.91 |
'data2class-weka.filters.(100) 0.91 |
---------------------------------------
(v/ /*) |
Key: (1) trees.J48 '-C 0.25 -M 2' -217733168393644444
Even after more than half of the features are removed from the dataset, performance is not significantly affected at the 95% confidence level. The AUC for the J48 Decision Tree classifier remains 0.91.
Figure 10 - ROC for J48 Decision Tree
Comparison of Time Taken
We expected the increase in time between training with the reduced feature set and training with the full set to be exponential. Here, using 80 features took around half the time needed to build the model with 178 features, which is close to linear. There are several factors in play that could affect the comparative times. For example, we have only around 2.5M data points and a binary classification problem. It is also possible that process start-up overhead during model building has masked the actual time needed to build each model. This could be confirmed by running many repetitions of model building, as sketched below, and performing statistical analysis on the timings.
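One way such repeated timings could be collected is sketched below, assuming the training data from the earlier sketches, an index list cfs_idx for the 80 CFS-selected columns (a placeholder name), and sklearn's decision tree as a stand-in for J48.

import time
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def time_builds(X, y, repeats=20):
    """Build the model repeatedly and return the list of build times in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        DecisionTreeClassifier(min_samples_leaf=2).fit(X, y)
        timings.append(time.perf_counter() - start)
    return timings

full_times    = time_builds(np.asarray(X_train), y_train)
reduced_times = time_builds(np.asarray(X_train)[:, cfs_idx], y_train)  # cfs_idx: assumed 80 CFS columns
print(np.mean(full_times), np.mean(reduced_times))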
References
[1] https://neo4j.com/docs/cypher-manual/current/administration/constraints [Accessed 03 May 2020]
[2] https://docs.mongodb.com/manual/tutorial/unique-constraints-on-arbitrary-fields [Accessed 03 May 2020]
[3] https://neo4j.com/developer/graph-algorithms [Accessed 03 May 2020]
[4] https://github.com/stellasia/neomap [Accessed 06 May 2020]
[5] http://bsonspec.org [Accessed 06 May 2020]
[6] Andrzejak, R. G., Lehnertz, K., Mormann, F., Rieke, C., David, P. and Elger, C. E. 2001. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Phys. Rev. E, 64 6 061907
[7] Wu, Q. and Fokoue, E., 2017. Epileptic seizure recognition data set [online]. UCI Machine Learning Repository. Available from: https://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition
[8] Hall, M. A. 2000. Correlation-based feature selection for discrete and numeric class machine learning. Proceedings of the 17th International Conference on Machine Learning, (ICML '00), 359-366