CoCalc provides the best real-time collaborative environment for Jupyter Notebooks, LaTeX documents, and SageMath, scalable from individual users to large groups and classes!

GitHub Repository: adasegroup/NEUROML2022
Path: blob/main/seminar1/hw1-baseline.ipynb
Kernel: Python 3

HW1 - baseline

In this notebook we process the homework data to predict motion versus rest from EEG recordings.

In [1]:

# For Colab only
!pip install mne
!wget https://raw.githubusercontent.com/adasegroup/NEUROML2020/seminar1/seminar1/train.csv
!wget https://raw.githubusercontent.com/adasegroup/NEUROML2020/seminar1/seminar1/test.csv

In [28]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from mne.time_frequency import psd_array_multitaper

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.manifold import TSNE

%matplotlib inline

In [29]:

df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [30]:

df_train.head()

In [31]:

ch_names = df_train.columns[3:]

In [32]:

epochs = df_train['epoch'].unique()

In [33]:

epochs

array([  0,   2,   6,   8,  11,  13,  15,  16,  17,  18,  19,  20,  21,
        22,  23,  24,  25,  26,  29,  30,  31,  32,  33,  34,  36,  37,
        38,  39,  42,  43,  44,  46,  47,  48,  49,  50,  51,  52,  53,
        54,  55,  56,  58,  60,  61,  62,  64,  65,  66,  67,  68,  69,
        74,  77,  79,  80,  81,  86,  87,  88,  89,  90,  91,  93,  95,
        96,  97,  99, 101, 102, 105, 107, 109, 110, 111, 113, 115, 117,
       118, 126, 127, 128, 129, 131, 134, 135, 136, 137, 138, 139, 141,
       142, 143, 144, 145, 147, 151, 152, 154, 155, 156, 157, 158, 159,
       160, 162, 164, 166, 167, 169, 171, 172, 173, 174, 175, 176, 177,
       181, 182, 184, 185, 187, 192, 193, 194, 196, 197, 200, 201, 202,
       204, 205, 210, 212, 216, 217, 221, 222, 223, 225, 226, 227, 228,
       230, 231, 233, 234, 235, 237, 239, 240, 241, 244, 245, 246, 248,
       250, 253, 254, 255, 261, 262, 263, 265, 268, 269, 270, 276, 277,
       279, 281, 283, 285, 287, 290, 292, 293, 294, 297, 298])

In [34]:

def get_target(df):
    return df.drop_duplicates('epoch')[['epoch', 'condition']].reset_index(drop=True)
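
To make the behaviour of `get_target` concrete, here is a small sketch on a made-up table. The values and the `Fp1` column are invented for illustration and are not taken from `train.csv`. Each epoch spans several rows, and the function keeps a single `(epoch, condition)` pair per epoch.

```python
import pandas as pd

def get_target(df):
    return df.drop_duplicates('epoch')[['epoch', 'condition']].reset_index(drop=True)

# Toy data: two epochs, three time samples each (all values are invented).
toy = pd.DataFrame({
    'epoch':     [0, 0, 0, 2, 2, 2],
    'time':      [0, 1, 2, 0, 1, 2],
    'condition': [1, 1, 1, 3, 3, 3],
    'Fp1':       [0.1, 0.4, -0.2, 1.3, 0.9, 1.1],
})

# One row per epoch, with the index reset to 0..n_epochs-1.
target = get_target(toy)
print(target)
```

This relies on `condition` being constant within an epoch; if it were not, only the value from the first row of each epoch would survive.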

Idea for feature engineering

In [35]:

df_train[df_train['condition'] == 1].groupby('time')['F4'].mean().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7f1b9d283a90>

In [36]:

df_train[df_train['condition'] != 1].groupby('time')['F4'].mean().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7f1b9d0cbf10>

In [37]:

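def calc_features(df):
    """Build one feature row per epoch from the per-sample EEG table."""
    feats = []
    for epoch_idx, epoch_df in df.groupby('epoch'):

        epoch_df = epoch_df[ch_names]

        # Power spectral density per channel; 160 is the sampling frequency in Hz.
        psds, freqs = psd_array_multitaper(epoch_df.T.values, 160, verbose=False)

        total_power = psds.sum(axis=1)

        # Relative power in the 13-25 Hz band for each channel.
        # NOTE: b_pwr is computed here but is not added to the feature dict below.
        idx_from = np.where(freqs > 13)[0][0]
        idx_to = np.where(freqs > 25)[0][0]
        b_pwr = psds[:,idx_from:idx_to].sum(axis=1) / total_power

        d = {}
        d['epoch'] = epoch_idx

        # Per channel: count the samples above 5 after skipping the first
        # 40 samples (0.25 s at 160 Hz).
        for ch in ch_names:
            s = epoch_df.iloc[40:][ch]
            val = (s > 5).sum()
            d[ch.lower() + '_p300'] = val

        feats.append(d)

    feats_df = pd.DataFrame(feats)
    
    return feats_df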
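
`calc_features` computes a relative band power (`b_pwr`) but only the threshold counts end up in the feature table. As a minimal sketch of what a relative band-power feature measures, the example below uses a plain NumPy periodogram instead of the multitaper estimate from `mne`, and a half-open band `[f_lo, f_hi)` rather than the exact index bounds used in the notebook. The function name `relative_band_power` and the test signals are assumptions made for illustration.

```python
import numpy as np

def relative_band_power(signal, sfreq, f_lo, f_hi):
    """Fraction of total spectral power that falls in [f_lo, f_hi) Hz."""
    signal = np.asarray(signal, dtype=float)
    # Plain periodogram: squared magnitude of the real FFT.
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sfreq)
    band = (freqs >= f_lo) & (freqs < f_hi)
    return power[band].sum() / power.sum()

sfreq = 160                                # same sampling rate as in the notebook
t = np.arange(sfreq) / sfreq               # one second of samples
beta_like = np.sin(2 * np.pi * 20 * t)     # 20 Hz tone, inside 13-25 Hz
alpha_like = np.sin(2 * np.pi * 10 * t)    # 10 Hz tone, outside 13-25 Hz

beta_share = relative_band_power(beta_like, sfreq, 13, 25)
alpha_share = relative_band_power(alpha_like, sfreq, 13, 25)
print(beta_share, alpha_share)
```

Inside `calc_features` the per-channel values already exist as `b_pwr`, so storing them would amount to something like `d[ch.lower() + '_beta'] = b_pwr[i]` for the `i`-th channel, where the `_beta` key name is hypothetical.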

Common ML workflow

In [154]:

X = get_target(df_train)
X = X.merge(calc_features(df_train), on='epoch')
y = X['condition'].apply(lambda x: 0 if x == 1 else 1)
del X['epoch']
del X['condition']

In [155]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [156]:

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

In [157]:

model = LogisticRegression(C=1)

In [167]:

model.fit(X_train_sc, y_train)

LogisticRegression(C=1)

In [168]:

y_pred_train = model.predict_proba(X_train_sc)[:, 1]
roc_auc_score(y_train, y_pred_train)

0.7982700892857142

In [169]:

y_pred = model.predict_proba(X_test_sc)[:, 1]

In [170]:

roc_auc_score(y_test, y_pred)

0.7171945701357466
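
The two scores above come from a single train/test split, so they depend on which epochs land in the held-out part. One possible way to get a steadier estimate is to wrap the scaler and the model in a scikit-learn pipeline and cross-validate it. The sketch below runs on synthetic data from `make_classification` as a stand-in for the epoch features; the names `X_demo`, `y_demo`, and `pipe` are assumptions for illustration, and the numbers it prints say nothing about the homework data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the epoch features (not the homework data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=42)

# The scaler is refit inside each training fold, so no statistics from
# the held-out fold leak into the model.
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=1))

# Five-fold cross-validated ROC AUC: one score per fold.
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='roc_auc')
print(scores.mean(), scores.std())
```

With the notebook's variables, the same call would take `X` and `y` in place of `X_demo` and `y_demo`.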

Visualize t-SNE

In [171]:

X_embedded = TSNE(n_components=2).fit_transform(X)

In [172]:

plt.scatter(X_embedded[(y == 0).values, 0], X_embedded[(y == 0).values, 1])
plt.scatter(X_embedded[(y == 1).values, 0], X_embedded[(y == 1).values, 1])

<matplotlib.collections.PathCollection at 0x7f1b9c548e90>

Build submission

In [227]:

scaler = StandardScaler()
X_sc = scaler.fit_transform(X)

In [228]:

model.fit(X_sc, y)

LogisticRegression(C=1)

In [229]:

X_test = calc_features(df_test)
submission = X_test[['epoch']].copy()
del X_test['epoch']
X_test_sc = scaler.transform(X_test)

In [230]:

y_pred = model.predict_proba(X_test_sc)[:, 1]

In [231]:

submission['Predicted'] = y_pred

In [232]:

submission['Id'] = submission['epoch']
del submission['epoch']

In [233]:

submission.to_csv('baseline_submission.csv', index=False)
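
Before uploading, it can help to run a few basic checks on the table. The sketch below assumes the expected layout is exactly the `Id` and `Predicted` columns built above; `check_submission` is a hypothetical helper, not part of the homework code, and the toy values are invented.

```python
import pandas as pd

def check_submission(sub):
    """Hypothetical helper: basic sanity checks on a submission table."""
    assert {'Id', 'Predicted'} <= set(sub.columns), 'missing columns'
    assert sub['Id'].is_unique, 'duplicate epoch ids'
    # between() is False for NaN, so missing predictions fail here too.
    assert sub['Predicted'].between(0, 1).all(), 'probabilities out of range'
    return True

# Toy submission with invented values, in the same column layout as above.
toy_submission = pd.DataFrame({'Predicted': [0.12, 0.87, 0.5], 'Id': [1, 3, 4]})
ok = check_submission(toy_submission)
print(ok)
```

In the notebook this would be called as `check_submission(submission)` just before `to_csv`.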
