21 Lab: Exploring a Dataset

We want to take a look at this real-world dataset:
https://openneuro.org/datasets/ds005420/versions/1.0.0

import pathlib

path = pathlib.Path("../data/ds005420-download")

content = list(path.iterdir())
content[:10]

[PosixPath('../data/ds005420-download/sub-50'),
 PosixPath('../data/ds005420-download/sub-40'),
 PosixPath('../data/ds005420-download/sub-45'),
 PosixPath('../data/ds005420-download/sub-9'),
 PosixPath('../data/ds005420-download/sub-35'),
 PosixPath('../data/ds005420-download/sub-16'),
 PosixPath('../data/ds005420-download/CHANGES'),
 PosixPath('../data/ds005420-download/sub-2'),
 PosixPath('../data/ds005420-download/sub-36'),
 PosixPath('../data/ds005420-download/sub-21')]

As we can see, the listing contains both files and directories.

21.1 Exercises 1

List only the sub-directories in path.
List only the sub-directories with subject data.
Delete files: CHANGES and participants.tsv
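
One possible approach to the first two items, as a minimal sketch: `Path.is_dir()` separates directories from files, and a name filter keeps only the subject folders. The helper names `list_subdirs` and `list_subject_dirs` are hypothetical, and the `sub-` prefix is an assumption based on the listing above.

```python
import pathlib


def list_subdirs(path):
    """Return all immediate sub-directories of `path`, sorted by name."""
    return sorted(p for p in pathlib.Path(path).iterdir() if p.is_dir())


def list_subject_dirs(path):
    """Return only the sub-directories whose name starts with 'sub-'."""
    return [p for p in list_subdirs(path) if p.name.startswith("sub-")]
```

Calling `list_subject_dirs(path)` with the `path` defined above should return only the subject folders. Note that the sort is alphabetical, so `sub-10` comes before `sub-2`; sorting by the integer after the dash would give numeric order instead.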

21.2 Exercises 2

We will start by making sure, at a high level, that our data and metadata contain the information we expect.

Verify that every subject directory has an eeg sub-directory.
Verify that all files in a subject directory match that subject's number.
Assert that the EEG data for all subjects was recorded with 20 channels and a sampling frequency of 500 Hz.
(Optional) Write a file (discarded_subjects.txt) with the numbers of the subjects that do not meet these criteria.
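
The checks above could be sketched as follows. This is not a reference solution: it assumes a BIDS-style layout in which each subject has an `eeg` folder, every file name begins with `<subject>_`, and recording metadata is stored in a `*_eeg.json` sidecar with the keys `EEGChannelCount` and `SamplingFrequency`. The function names `check_subject` and `find_discarded` are hypothetical; inspect one subject folder first to confirm that these assumptions hold for this dataset.

```python
import json
import pathlib


def check_subject(subject_dir, n_channels=20, sfreq=500):
    """Return a list of problems found for one subject directory (empty if none)."""
    subject_dir = pathlib.Path(subject_dir)
    eeg_dir = subject_dir / "eeg"
    if not eeg_dir.is_dir():
        return ["missing eeg directory"]

    problems = []
    for f in sorted(eeg_dir.iterdir()):
        # Assumption: every file is named "<subject>_<rest>".
        if not f.name.startswith(subject_dir.name + "_"):
            problems.append(f"{f.name}: name does not match {subject_dir.name}")
        # Assumption: recording metadata lives in a "*_eeg.json" sidecar.
        if f.name.endswith("_eeg.json"):
            meta = json.loads(f.read_text())
            if meta.get("EEGChannelCount") != n_channels:
                problems.append(f"{f.name}: unexpected channel count")
            if meta.get("SamplingFrequency") != sfreq:
                problems.append(f"{f.name}: unexpected sampling frequency")
    return problems


def find_discarded(root):
    """Map subject name -> problems, for every subject that fails a check."""
    subjects = sorted(p for p in pathlib.Path(root).glob("sub-*") if p.is_dir())
    report = {s.name: check_subject(s) for s in subjects}
    return {name: problems for name, problems in report.items() if problems}
```

With `discarded = find_discarded(path)`, the assertion becomes `assert not discarded`, and the optional file can be written with `pathlib.Path("discarded_subjects.txt").write_text("\n".join(discarded))`. Returning the list of problems rather than a bare boolean makes it easier to see why a subject was rejected.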

21.3 Exercises 3

Now we want to look at the data. The recordings are stored in the .edf format, which Python cannot read with its standard library alone.
Hint: We need to install the third-party library mne to read .edf files.

Plot a histogram of RecordingDuration across all subjects.
Plot one time series.
Plot all time series with labels according to channel name.
Plot the channels whose names start with “T” or “O”.
Plot the correlation between the “T” and “O” channels as a heatmap.
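
As a rough sketch of the last two items: `mne.io.read_raw_edf` loads a recording, `raw.ch_names` lists its channel names, and `raw.get_data(picks=...)` returns an array of shape (channels, samples). The helper names `select_by_prefix` and `plot_correlation` are hypothetical, and the code assumes that channel names begin directly with the electrode letter (for example `T3`), which may not hold for this dataset.

```python
def select_by_prefix(ch_names, prefixes=("T", "O")):
    """Keep channel names that start with any of the given prefixes."""
    return [name for name in ch_names if name.startswith(tuple(prefixes))]


def plot_correlation(edf_path, prefixes=("T", "O")):
    """Load one .edf recording and draw a correlation heatmap of selected channels."""
    # Third-party imports are kept local so select_by_prefix works without them.
    import matplotlib.pyplot as plt
    import mne
    import numpy as np

    raw = mne.io.read_raw_edf(edf_path, preload=True)
    picks = select_by_prefix(raw.ch_names, prefixes)
    data = raw.get_data(picks=picks)  # shape: (n_channels, n_samples)
    corr = np.corrcoef(data)          # each row is treated as one variable

    fig, ax = plt.subplots()
    image = ax.imshow(corr, vmin=-1, vmax=1, cmap="RdBu_r")
    ax.set_xticks(range(len(picks)))
    ax.set_xticklabels(picks, rotation=90)
    ax.set_yticks(range(len(picks)))
    ax.set_yticklabels(picks)
    fig.colorbar(image, ax=ax, label="Pearson correlation")
    return fig
```

Fixing the colour scale to the range -1 to 1 keeps heatmaps comparable across subjects. If the channel names carry a prefix such as `EEG `, the name filter needs to be adapted accordingly.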

21.4 Exercises 4

After this quick look at the data, we want to start processing it.

Add a column called channel_mean that
Subtract the mean from each channel.
Plot the correlation matrix of all channels against each other.
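
The mean subtraction and the all-vs-all correlation can be sketched with NumPy alone, assuming the signals are already available as an array of shape (channels, samples), for instance from `raw.get_data()` in mne. The function names `demean_channels` and `channel_correlation` are hypothetical.

```python
import numpy as np


def demean_channels(data):
    """Subtract each channel's mean; `data` has shape (n_channels, n_samples)."""
    data = np.asarray(data, dtype=float)
    # keepdims=True keeps a column vector so it broadcasts across samples.
    return data - data.mean(axis=1, keepdims=True)


def channel_correlation(data):
    """Pearson correlation between every pair of channels (rows)."""
    return np.corrcoef(data)
```

The resulting matrix can be passed to a heatmap function such as matplotlib's `imshow`. Pearson correlation is unaffected by subtracting a constant from a channel, so the matrix is the same before and after centring; the subtraction matters for later steps that depend on the signal's offset.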