21  Lab: Exploring a Dataset

We want to take a look at this real-world dataset:
https://openneuro.org/datasets/ds005420/versions/1.0.0

import pathlib
path = pathlib.Path("../data/ds005420-download")
content = list(path.iterdir())
content[:10]
[PosixPath('../data/ds005420-download/sub-50'),
 PosixPath('../data/ds005420-download/sub-40'),
 PosixPath('../data/ds005420-download/sub-45'),
 PosixPath('../data/ds005420-download/sub-9'),
 PosixPath('../data/ds005420-download/sub-35'),
 PosixPath('../data/ds005420-download/sub-16'),
 PosixPath('../data/ds005420-download/CHANGES'),
 PosixPath('../data/ds005420-download/sub-2'),
 PosixPath('../data/ds005420-download/sub-36'),
 PosixPath('../data/ds005420-download/sub-21')]

As we can see, the listing contains both files and directories.

21.1 Exercises 1

  1. List only the sub-directories in path.
  2. List only the sub-directories with subject data.
  3. Delete the files CHANGES and participants.tsv.
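
One possible approach to these three exercises is sketched below. It runs on a small throwaway folder rather than on the real download, so the folder contents (for example the derivatives directory and the README file) are made up for illustration. It also reads "delete" as removing the files from disk, which is an assumption; filtering them out of the listing would be a safer reading on the real data.

```python
import pathlib
import tempfile

# Build a small stand-in for the dataset folder so the sketch runs anywhere.
root = pathlib.Path(tempfile.mkdtemp())
for name in ("sub-1", "sub-2", "sub-10", "derivatives"):
    (root / name).mkdir()
for name in ("CHANGES", "participants.tsv", "README"):
    (root / name).write_text("placeholder\n")

# 1. Only the sub-directories.
subdirs = sorted(p for p in root.iterdir() if p.is_dir())

# 2. Only the subject directories, sorted by subject number
#    (assumes names of the form sub-<number>).
subjects = sorted(
    (p for p in root.glob("sub-*") if p.is_dir()),
    key=lambda p: int(p.name.split("-", 1)[1]),
)

# 3. Remove two files from disk; missing_ok avoids an error if one is absent.
for name in ("CHANGES", "participants.tsv"):
    (root / name).unlink(missing_ok=True)

print([p.name for p in subdirs])
print([p.name for p in subjects])
```

Sorting by the integer part of the name keeps sub-10 after sub-2, which a plain alphabetical sort would not do.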

21.2 Exercises 2

We will start by making sure our data/metadata contains the information we expect at a high level.

  1. Verify that every subject directory has an eeg sub-directory.
  2. Verify that all data files in each subject directory match that subject's number.
  3. Assert that the EEG data for every subject was recorded with 20 channels at a sampling frequency of 500 Hz.
  4. (Optional) Write a file (discarded_subjects.txt) listing the subject numbers that do not meet these criteria.
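
The checks above could be organized as in the following sketch. It assumes the dataset follows the BIDS layout, where each eeg folder holds a JSON sidecar ending in _eeg.json with the keys EEGChannelCount and SamplingFrequency. The helper check_subject, the task-rest file names and the toy folder are all hypothetical, so the real file names should be inspected before relying on this.

```python
import json
import pathlib
import tempfile

# Expected sidecar values; the key names follow the BIDS EEG convention (assumed).
EXPECTED = {"EEGChannelCount": 20, "SamplingFrequency": 500}


def check_subject(subject_dir):
    """Return a list of problems found in one subject directory."""
    eeg_dir = subject_dir / "eeg"
    if not eeg_dir.is_dir():
        return ["missing eeg sub-directory"]
    problems = []
    # Every file name should start with the subject label, e.g. "sub-7_".
    for file in eeg_dir.iterdir():
        if not file.name.startswith(subject_dir.name + "_"):
            problems.append(f"file does not match subject: {file.name}")
    sidecars = list(eeg_dir.glob("*_eeg.json"))
    if not sidecars:
        problems.append("no *_eeg.json sidecar found")
    for sidecar in sidecars:
        metadata = json.loads(sidecar.read_text())
        for key, expected in EXPECTED.items():
            if metadata.get(key) != expected:
                problems.append(f"{key} is {metadata.get(key)!r}, expected {expected}")
    return problems


# Tiny stand-in dataset (hypothetical file names) so the sketch runs anywhere.
root = pathlib.Path(tempfile.mkdtemp())
for label, sfreq in (("sub-1", 500), ("sub-2", 250)):
    eeg_dir = root / label / "eeg"
    eeg_dir.mkdir(parents=True)
    sidecar = {"EEGChannelCount": 20, "SamplingFrequency": sfreq}
    (eeg_dir / f"{label}_task-rest_eeg.json").write_text(json.dumps(sidecar))
(root / "sub-3").mkdir()  # subject without an eeg sub-directory

report = {p.name: check_subject(p) for p in sorted(root.glob("sub-*")) if p.is_dir()}
discarded = [name for name, problems in report.items() if problems]
(root / "discarded_subjects.txt").write_text("\n".join(discarded) + "\n")
print(report)
```

Collecting problems in a list, instead of stopping at the first failed assert, makes it possible to write the optional discarded_subjects.txt file in the same pass.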

21.3 Exercises 3

Now we want to look at the data. The recordings are stored in the .edf format, which Python cannot read out of the box.
Hint: We need to install the third-party library mne to read .edf files.

  1. Plot a histogram of RecordingDuration across all subjects.
  2. Plot one time series.
  3. Plot all time series with labels according to channel name.
  4. Plot the channels that start with “T” and “O”.
  5. Plot a correlation plot of the “T” and “O” channels as a heatmap.
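
As a starting point for the channel plots, here is a minimal sketch. It assumes mne, numpy and matplotlib are installed and that channel names begin directly with the electrode label (for example T3 or O1); if the files use a prefix such as "EEG ", the filter needs adjusting. The helper names select_channels and plot_selected_channels are made up for this example.

```python
def select_channels(ch_names, prefixes=("T", "O")):
    """Return the channel names that start with any of the given prefixes."""
    return [name for name in ch_names if name.startswith(tuple(prefixes))]


def plot_selected_channels(edf_path, prefixes=("T", "O")):
    """Plot the selected channels of one recording and their correlation heatmap."""
    # Third-party imports are kept local so select_channels works without them.
    import matplotlib.pyplot as plt
    import mne
    import numpy as np

    raw = mne.io.read_raw_edf(edf_path, preload=True)
    picks = select_channels(raw.ch_names, prefixes)
    if not picks:
        raise ValueError(f"no channel names start with {prefixes}")
    data = raw.get_data(picks=picks)  # shape: (n_channels, n_samples)

    fig, (ax_ts, ax_corr) = plt.subplots(1, 2, figsize=(12, 4))

    # One labelled line per channel.
    for name, trace in zip(picks, data):
        ax_ts.plot(raw.times, trace, label=name)
    ax_ts.set_xlabel("Time (s)")
    ax_ts.legend()

    # Pearson correlation between channels, drawn as a heatmap.
    corr = np.corrcoef(data)
    image = ax_corr.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
    ax_corr.set_xticks(range(len(picks)))
    ax_corr.set_xticklabels(picks, rotation=90)
    ax_corr.set_yticks(range(len(picks)))
    ax_corr.set_yticklabels(picks)
    fig.colorbar(image, ax=ax_corr)
    return fig


print(select_channels(["Fp1", "T3", "O1", "Cz", "T4"]))
```

Calling plot_selected_channels with the path to one .edf file from the dataset, followed by plt.show(), would display both panels. Passing a different prefixes tuple selects other channel groups.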

21.4 Exercises 4

After having taken this quick look at the data, we want to start processing the data.

  1. Add a column called channel_mean that
  2. Subtract the mean from each channel
  3. Plot the correlation matrix of all-vs-all channels
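
The mean subtraction and the all-vs-all correlation can be sketched with NumPy on synthetic data. The array below is a made-up stand-in with one row per channel and one column per sample; with the real recordings, the array returned by mne would take its place. The variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_channels, n_samples = 4, 1000

# Synthetic stand-in: a shared signal plus noise, with a different offset per channel.
shared = rng.standard_normal(n_samples)
noise = 0.5 * rng.standard_normal((n_channels, n_samples))
offsets = np.arange(n_channels)[:, None]
data = shared + noise + offsets  # shape: (n_channels, n_samples)

# Mean of each channel over time; keepdims makes the subtraction broadcast per row.
channel_mean = data.mean(axis=1, keepdims=True)
demeaned = data - channel_mean

# Pearson correlation of every channel against every other channel.
corr = np.corrcoef(demeaned)
print(corr.shape)
```

The resulting matrix can be drawn with matplotlib's imshow. Since the Pearson correlation already centers each signal, the matrix is the same before and after the mean subtraction; the subtraction matters for later processing steps, not for this plot.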