|
1422 | 1422 | "print(\"(n_repeats, n_samples_test, n_voxels) =\", Y_test.shape)" |
1423 | 1423 | ] |
1424 | 1424 | }, |
| 1425 | + { |
| 1426 | + "cell_type": "markdown", |
| 1427 | + "metadata": {}, |
| 1428 | + "source": [ |
| 1429 | + "Before fitting an encoding model, the fMRI responses are typically z-scored over time. This normalization step is performed for two reasons.\n", |
| 1430 | + "First, the regularized regression methods used to estimate encoding models generally assume the data to be normalized {cite:t}`Hastie2009`. \n", |
| 1431 | + "Second, the temporal mean and standard deviation of a voxel are typically considered uninformative in fMRI because they can vary due to factors unrelated to the task, such as differences in signal-to-noise ratio (SNR).\n", |
| 1432 | + "\n", |
| 1433 | + "To keep each run independent from the others, we z-score each run separately." |
| 1434 | + ] |
| 1435 | + }, |
| 1436 | + { |
| 1437 | + "cell_type": "code", |
| 1438 | + "execution_count": null, |
| 1439 | + "metadata": {}, |
| 1440 | + "outputs": [], |
| 1441 | + "source": [ |
| 1442 | + "from scipy.stats import zscore\n", |
| 1443 | + "\n", |
| 1444 | + "# index of the first sample of each run\n", |
| 1445 | + "run_onsets = load_hdf5_array(file_name, key=\"run_onsets\")\n", |
| 1446 | + "print(run_onsets)\n", |
| 1447 | + "\n", |
| 1448 | + "# zscore each training run separately\n", |
| 1449 | + "Y_train = np.split(Y_train, run_onsets[1:])\n", |
| 1450 | + "Y_train = np.concatenate([zscore(run, axis=0) for run in Y_train], axis=0)\n", |
| 1451 | + "# zscore each test run separately\n", |
| 1452 | + "Y_test = zscore(Y_test, axis=1)" |
| 1453 | + ] |
| 1454 | + }, |
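The per-run z-scoring pattern added in this cell can be sketched on synthetic data (the shapes, run boundaries, and random data below are made up for illustration):

```python
import numpy as np
from scipy.stats import zscore

# Toy data: 10 time samples over two runs (6 + 4 samples), 3 voxels.
rng = np.random.default_rng(0)
Y = rng.normal(loc=5.0, scale=2.0, size=(10, 3))
run_onsets = np.array([0, 6])  # index of the first sample of each run

# Split at every onset except the first, z-score each run over time,
# then re-concatenate along the time axis.
runs = np.split(Y, run_onsets[1:])
Y_normalized = np.concatenate([zscore(run, axis=0) for run in runs], axis=0)

# Each run now has zero mean and unit standard deviation in every voxel.
print(Y_normalized[:6].mean(axis=0))  # ~ [0, 0, 0]
print(Y_normalized[:6].std(axis=0))   # ~ [1, 1, 1]
```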
1425 | 1455 | { |
1426 | 1456 | "cell_type": "markdown", |
1427 | 1457 | "metadata": {}, |
|
1443 | 1473 | "outputs": [], |
1444 | 1474 | "source": [ |
1445 | 1475 | "Y_test = Y_test.mean(0)\n", |
| 1476 | + "# We need to zscore the test data again, because we took the mean across repetitions.\n", |
| 1477 | + "# This averaging step reduces the standard deviation below one (about 1/sqrt(n_repeats) if repeats were independent).\n", |
| 1478 | + "Y_test = zscore(Y_test, axis=0)\n", |
1446 | 1479 | "\n", |
1447 | 1480 | "print(\"(n_samples_test, n_voxels) =\", Y_test.shape)" |
1448 | 1481 | ] |
|
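The standard-deviation comment above can be checked numerically. Averaging n_repeats perfectly independent z-scored signals shrinks the standard deviation to about 1/sqrt(n_repeats); real fMRI repeats share a repeatable signal, so the true value lies between that and 1. A minimal check with independent noise (sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n_repeats, n_samples = 10, 100_000

# Independent standard-normal "repeats": each has mean 0 and std 1.
repeats = rng.standard_normal((n_repeats, n_samples))

# The average across repeats has std close to 1/sqrt(10) ~ 0.316.
print(repeats.mean(axis=0).std())
```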
1510 | 1543 | "following time sample in the validation set. Thus, we define here a\n", |
1511 | 1544 | "leave-one-run-out cross-validation split that keeps each recording run\n", |
1512 | 1545 | "intact.\n", |
1513 | | - "\n" |
| 1546 | + "\n", |
| 1547 | + "We define a cross-validation splitter, compatible with the ``scikit-learn`` API." |
1514 | 1548 | ] |
1515 | 1549 | }, |
1516 | 1550 | { |
|
1524 | 1558 | "from sklearn.model_selection import check_cv\n", |
1525 | 1559 | "from voxelwise_tutorials.utils import generate_leave_one_run_out\n", |
1526 | 1560 | "\n", |
1527 | | - "# indice of first sample of each run\n", |
1528 | | - "run_onsets = load_hdf5_array(file_name, key=\"run_onsets\")\n", |
1529 | | - "print(run_onsets)" |
1530 | | - ] |
1531 | | - }, |
1532 | | - { |
1533 | | - "cell_type": "markdown", |
1534 | | - "metadata": {}, |
1535 | | - "source": [ |
1536 | | - "We define a cross-validation splitter, compatible with ``scikit-learn`` API.\n", |
1537 | | - "\n" |
1538 | | - ] |
1539 | | - }, |
1540 | | - { |
1541 | | - "cell_type": "code", |
1542 | | - "execution_count": null, |
1543 | | - "metadata": { |
1544 | | - "collapsed": false |
1545 | | - }, |
1546 | | - "outputs": [], |
1547 | | - "source": [ |
1548 | 1561 | "n_samples_train = X_train.shape[0]\n", |
1549 | 1562 | "cv = generate_leave_one_run_out(n_samples_train, run_onsets)\n", |
1550 | 1563 | "cv = check_cv(cv) # copy the cross-validation splitter into a reusable list" |
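A leave-one-run-out splitter can be sketched as a plain generator of (train, validation) index pairs, which ``check_cv`` then copies into a reusable list; this is the role ``generate_leave_one_run_out`` plays above. The toy run layout below is made up for illustration:

```python
import numpy as np
from sklearn.model_selection import check_cv

# Toy layout: 9 samples over 3 runs starting at samples 0, 3, and 6.
run_onsets = [0, 3, 6]
runs = np.split(np.arange(9), run_onsets[1:])

def leave_one_run_out(runs):
    # Hold out each run once; train on the concatenation of the others.
    for ii, val in enumerate(runs):
        train = np.concatenate([r for jj, r in enumerate(runs) if jj != ii])
        yield train, val

cv = check_cv(leave_one_run_out(runs))  # copied into a reusable list
for train, val in cv.split(np.zeros((9, 1))):
    print(train, val)
```

Because ``check_cv`` materializes the generator, the splitter can be iterated multiple times, e.g. once per hyperparameter candidate.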
|
1558 | 1571 | "\n", |
1559 | 1572 | "Now, let's define the model pipeline.\n", |
1560 | 1573 | "\n", |
| 1574 | + "With regularized linear regression models, it is generally recommended to normalize \n", |
| 1575 | + "(z-score) both the responses and the features before fitting the model {cite:t}`Hastie2009`. \n", |
| 1576 | + "Z-scoring corresponds to removing the temporal mean and dividing by the temporal standard deviation.\n", |
| 1577 | + "We already z-scored the fMRI responses after loading them, so now we need to specify\n", |
| 1578 | + "in the model how to handle the features. \n", |
| 1579 | + "\n", |
1561 | 1580 | "We first center the features, since we will not use an intercept. The mean\n", |
1562 | 1581 | "value in fMRI recordings is non-informative, so each run is detrended and\n", |
1563 | 1582 | "demeaned independently, and we do not need to predict an intercept value in\n", |
1564 | 1583 | "the linear model.\n", |
1565 | 1584 | "\n", |
1566 | | - "However, we prefer to avoid normalizing by the standard deviation of each\n", |
1567 | | - "feature. If the features are extracted in a consistent way from the stimulus,\n", |
| 1585 | + "For this particular dataset and example, we do not normalize by the standard deviation \n", |
| 1586 | + "of each feature. If the features are extracted in a consistent way from the stimulus,\n", |
1568 | 1587 | "their relative scale is meaningful. Normalizing them independently from each\n", |
1569 | 1588 | "other would remove this information. Moreover, the wordnet features are\n", |
1570 | 1589 | "one-hot-encoded, which means that each feature is either present (1) or not\n", |
1571 | 1590 | "present (0) in each sample. Normalizing one-hot-encoded features is not\n", |
1572 | | - "recommended, since it would scale disproportionately the infrequent features.\n", |
1573 | | - "\n" |
| 1591 | + "recommended, since it would disproportionately scale the infrequent features." |
1574 | 1592 | ] |
1575 | 1593 | }, |
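The point about infrequent one-hot features can be illustrated directly: z-scoring a binary feature with presence probability p rescales its "present" value to sqrt((1-p)/p), so rare features get inflated. A small sketch, with feature frequencies chosen arbitrarily:

```python
import numpy as np
from scipy.stats import zscore

n_samples = 100
frequent = np.zeros(n_samples)
frequent[:50] = 1.0  # present in 50% of samples
rare = np.zeros(n_samples)
rare[:2] = 1.0       # present in 2% of samples

# After z-scoring, the rare feature's "present" value is 7x larger,
# so it would carry disproportionate weight in a regularized regression.
print(zscore(frequent).max())  # 1.0
print(zscore(rare).max())      # 7.0
```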
1576 | 1594 | { |
|
2096 | 2114 | "cell_type": "markdown", |
2097 | 2115 | "metadata": {}, |
2098 | 2116 | "source": [ |
2099 | | - "Similarly to [1]_, we correct the coefficients of features linked by a\n", |
| 2117 | + "Similarly to {cite:t}`huth2012`, we correct the coefficients of features linked by a\n", |
2100 | 2118 | "semantic relationship. When building the wordnet features, if a frame was\n", |
2101 | 2119 | "labeled with `wolf`, the authors automatically added the semantically linked\n", |
2102 | 2120 | "categories `canine`, `carnivore`, `placental mammal`, `mammal`, `vertebrate`,\n", |
|
2272 | 2290 | "voxel_colors = scale_to_rgb_cube(average_coef_transformed[1:4].T, clip=3).T\n", |
2273 | 2291 | "print(\"(n_channels, n_voxels) =\", voxel_colors.shape)\n", |
2274 | 2292 | "\n", |
2275 | | - "ax = plot_3d_flatmap_from_mapper(voxel_colors[0], voxel_colors[1],\n", |
2276 | | - " voxel_colors[2], mapper_file=mapper_file,\n", |
2277 | | - " vmin=0, vmax=1, vmin2=0, vmax2=1, vmin3=0,\n", |
2278 | | - " vmax3=1)\n", |
| 2293 | + "ax = plot_3d_flatmap_from_mapper(\n", |
| 2294 | + " voxel_colors[0], voxel_colors[1], voxel_colors[2], \n", |
| 2295 | + " mapper_file=mapper_file, \n", |
| 2296 | + " vmin=0, vmax=1, vmin2=0, vmax2=1, vmin3=0, vmax3=1\n", |
| 2297 | + ")\n", |
2279 | 2298 | "plt.show()" |
2280 | 2299 | ] |
2281 | 2300 | }, |
|
2379 | 2398 | "source": [ |
2380 | 2399 | "## Load the data\n", |
2381 | 2400 | "\n", |
2382 | | - "We first load the fMRI responses.\n", |
2383 | | - "\n" |
| 2401 | + "We first load and normalize the fMRI responses." |
2384 | 2402 | ] |
2385 | 2403 | }, |
2386 | 2404 | { |
|
2393 | 2411 | "source": [ |
2394 | 2412 | "import os\n", |
2395 | 2413 | "import numpy as np\n", |
| 2414 | + "from scipy.stats import zscore\n", |
2396 | 2415 | "from voxelwise_tutorials.io import load_hdf5_array\n", |
2397 | 2416 | "\n", |
2398 | 2417 | "file_name = os.path.join(directory, \"responses\", f\"{subject}_responses.hdf\")\n", |
2399 | 2418 | "Y_train = load_hdf5_array(file_name, key=\"Y_train\")\n", |
2400 | 2419 | "Y_test = load_hdf5_array(file_name, key=\"Y_test\")\n", |
2401 | 2420 | "\n", |
2402 | 2421 | "print(\"(n_samples_train, n_voxels) =\", Y_train.shape)\n", |
2403 | | - "print(\"(n_repeats, n_samples_test, n_voxels) =\", Y_test.shape)" |
| 2422 | + "print(\"(n_repeats, n_samples_test, n_voxels) =\", Y_test.shape)\n", |
| 2423 | + "\n", |
| 2424 | + "# index of the first sample of each run\n", |
| 2425 | + "run_onsets = load_hdf5_array(file_name, key=\"run_onsets\")\n", |
| 2426 | + "\n", |
| 2427 | + "# zscore each training run separately\n", |
| 2428 | + "Y_train = np.split(Y_train, run_onsets[1:])\n", |
| 2429 | + "Y_train = np.concatenate([zscore(run, axis=0) for run in Y_train], axis=0)\n", |
| 2430 | + "# zscore each test run separately\n", |
| 2431 | + "Y_test = zscore(Y_test, axis=1)" |
2404 | 2432 | ] |
2405 | 2433 | }, |
2406 | 2434 | { |
2407 | 2435 | "cell_type": "markdown", |
2408 | 2436 | "metadata": {}, |
2409 | 2437 | "source": [ |
2410 | 2438 | "We average the test repeats to remove the non-repeatable part of fMRI\n", |
2411 | | - "responses.\n", |
2412 | | - "\n" |
| 2439 | + "responses, and normalize the average across repeats." |
2413 | 2440 | ] |
2414 | 2441 | }, |
2415 | 2442 | { |
|
2421 | 2448 | "outputs": [], |
2422 | 2449 | "source": [ |
2423 | 2450 | "Y_test = Y_test.mean(0)\n", |
| 2451 | + "Y_test = zscore(Y_test, axis=0)\n", |
2424 | 2452 | "\n", |
2425 | 2453 | "print(\"(n_samples_test, n_voxels) =\", Y_test.shape)" |
2426 | 2454 | ] |
|
2479 | 2507 | "\n", |
2480 | 2508 | "We define the same leave-one-run-out cross-validation split as in the\n", |
2481 | 2509 | "previous example.\n", |
2482 | | - "\n" |
| 2510 | + "\n", |
| 2511 | + "We define a cross-validation splitter, compatible with the ``scikit-learn`` API." |
2483 | 2512 | ] |
2484 | 2513 | }, |
2485 | 2514 | { |
|
2493 | 2522 | "from sklearn.model_selection import check_cv\n", |
2494 | 2523 | "from voxelwise_tutorials.utils import generate_leave_one_run_out\n", |
2495 | 2524 | "\n", |
2496 | | - "# indice of first sample of each run\n", |
2497 | | - "run_onsets = load_hdf5_array(file_name, key=\"run_onsets\")\n", |
2498 | | - "print(run_onsets)" |
2499 | | - ] |
2500 | | - }, |
2501 | | - { |
2502 | | - "cell_type": "markdown", |
2503 | | - "metadata": {}, |
2504 | | - "source": [ |
2505 | | - "We define a cross-validation splitter, compatible with ``scikit-learn`` API.\n", |
2506 | | - "\n" |
2507 | | - ] |
2508 | | - }, |
2509 | | - { |
2510 | | - "cell_type": "code", |
2511 | | - "execution_count": null, |
2512 | | - "metadata": { |
2513 | | - "collapsed": false |
2514 | | - }, |
2515 | | - "outputs": [], |
2516 | | - "source": [ |
2517 | 2525 | "n_samples_train = X_train.shape[0]\n", |
2518 | 2526 | "cv = generate_leave_one_run_out(n_samples_train, run_onsets)\n", |
2519 | 2527 | "cv = check_cv(cv) # copy the cross-validation splitter into a reusable list" |
|
2964 | 2972 | "source": [ |
2965 | 2973 | "## Load the data\n", |
2966 | 2974 | "\n", |
2967 | | - "We first load the fMRI responses.\n", |
| 2975 | + "We first load and normalize the fMRI responses.\n", |
2968 | 2976 | "\n" |
2969 | 2977 | ] |
2970 | 2978 | }, |
|
2978 | 2986 | "source": [ |
2979 | 2987 | "import os\n", |
2980 | 2988 | "import numpy as np\n", |
| 2989 | + "from scipy.stats import zscore\n", |
2981 | 2990 | "from voxelwise_tutorials.io import load_hdf5_array\n", |
2982 | 2991 | "\n", |
2983 | 2992 | "file_name = os.path.join(directory, \"responses\", f\"{subject}_responses.hdf\")\n", |
2984 | 2993 | "Y_train = load_hdf5_array(file_name, key=\"Y_train\")\n", |
2985 | 2994 | "Y_test = load_hdf5_array(file_name, key=\"Y_test\")\n", |
2986 | 2995 | "\n", |
2987 | 2996 | "print(\"(n_samples_train, n_voxels) =\", Y_train.shape)\n", |
2988 | | - "print(\"(n_repeats, n_samples_test, n_voxels) =\", Y_test.shape)" |
| 2997 | + "print(\"(n_repeats, n_samples_test, n_voxels) =\", Y_test.shape)\n", |
| 2998 | + "\n", |
| 2999 | + "# index of the first sample of each run\n", |
| 3000 | + "run_onsets = load_hdf5_array(file_name, key=\"run_onsets\")\n", |
| 3001 | + "\n", |
| 3002 | + "# zscore each training run separately\n", |
| 3003 | + "Y_train = np.split(Y_train, run_onsets[1:])\n", |
| 3004 | + "Y_train = np.concatenate([zscore(run, axis=0) for run in Y_train], axis=0)\n", |
| 3005 | + "# zscore each test run separately\n", |
| 3006 | + "Y_test = zscore(Y_test, axis=1)" |
2989 | 3007 | ] |
2990 | 3008 | }, |
2991 | 3009 | { |
2992 | 3010 | "cell_type": "markdown", |
2993 | 3011 | "metadata": {}, |
2994 | 3012 | "source": [ |
2995 | 3013 | "We average the test repeats to remove the non-repeatable part of fMRI\n", |
2996 | | - "responses.\n", |
2997 | | - "\n" |
| 3014 | + "responses, and normalize the average across repeats." |
2998 | 3015 | ] |
2999 | 3016 | }, |
3000 | 3017 | { |
|
3006 | 3023 | "outputs": [], |
3007 | 3024 | "source": [ |
3008 | 3025 | "Y_test = Y_test.mean(0)\n", |
| 3026 | + "Y_test = zscore(Y_test, axis=0)\n", |
3009 | 3027 | "\n", |
3010 | 3028 | "print(\"(n_samples_test, n_voxels) =\", Y_test.shape)" |
3011 | 3029 | ] |
|
3457 | 3475 | "## Load the data\n", |
3458 | 3476 | "\n", |
3459 | 3477 | "As in the previous examples, we first load the fMRI responses, which are our\n", |
3460 | | - "regression targets.\n", |
3461 | | - "\n" |
| 3478 | + "regression targets. We then normalize the data independently for each run." |
3462 | 3479 | ] |
3463 | 3480 | }, |
3464 | 3481 | { |
3465 | 3482 | "cell_type": "code", |
3466 | 3483 | "execution_count": null, |
3467 | | - "metadata": { |
3468 | | - "collapsed": false |
3469 | | - }, |
| 3484 | + "metadata": {}, |
3470 | 3485 | "outputs": [], |
3471 | 3486 | "source": [ |
3472 | 3487 | "import os\n", |
3473 | 3488 | "import numpy as np\n", |
| 3489 | + "from scipy.stats import zscore\n", |
3474 | 3490 | "from voxelwise_tutorials.io import load_hdf5_array\n", |
3475 | 3491 | "\n", |
3476 | 3492 | "file_name = os.path.join(directory, \"responses\", f\"{subject}_responses.hdf\")\n", |
3477 | 3493 | "Y_train = load_hdf5_array(file_name, key=\"Y_train\")\n", |
3478 | 3494 | "Y_test = load_hdf5_array(file_name, key=\"Y_test\")\n", |
3479 | 3495 | "\n", |
3480 | 3496 | "print(\"(n_samples_train, n_voxels) =\", Y_train.shape)\n", |
3481 | | - "print(\"(n_repeats, n_samples_test, n_voxels) =\", Y_test.shape)" |
| 3497 | + "print(\"(n_repeats, n_samples_test, n_voxels) =\", Y_test.shape)\n", |
| 3498 | + "\n", |
| 3499 | + "# index of the first sample of each run\n", |
| 3500 | + "run_onsets = load_hdf5_array(file_name, key=\"run_onsets\")\n", |
| 3501 | + "\n", |
| 3502 | + "# zscore each training run separately\n", |
| 3503 | + "Y_train = np.split(Y_train, run_onsets[1:])\n", |
| 3504 | + "Y_train = np.concatenate([zscore(run, axis=0) for run in Y_train], axis=0)\n", |
| 3505 | + "# zscore each test run separately\n", |
| 3506 | + "Y_test = zscore(Y_test, axis=1)" |
3482 | 3507 | ] |
3483 | 3508 | }, |
3484 | 3509 | { |
|
3511 | 3536 | "metadata": {}, |
3512 | 3537 | "source": [ |
3513 | 3538 | "We average the test repeats to remove the non-repeatable part of fMRI\n", |
3514 | | - "responses.\n", |
3515 | | - "\n" |
| 3539 | + "responses, and normalize the averaged data." |
3516 | 3540 | ] |
3517 | 3541 | }, |
3518 | 3542 | { |
|
3524 | 3548 | "outputs": [], |
3525 | 3549 | "source": [ |
3526 | 3550 | "Y_test = Y_test.mean(0)\n", |
| 3551 | + "Y_test = zscore(Y_test, axis=0)\n", |
3527 | 3552 | "\n", |
3528 | 3553 | "print(\"(n_samples_test, n_voxels) =\", Y_test.shape)" |
3529 | 3554 | ] |
|