ENH scale regression coefficients with sqrt(R^2)

TomDLT · TomDLT · commit bae92ecab5b4 · 2022-05-10T18:56:42.000-07:00
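
As a rough illustration of what the rescaling in this commit does, the sketch below applies the same two NumPy lines to synthetic data. The array shapes, the per-voxel scales, and the `scores` values are made up for the example and are not taken from the tutorial. After rescaling, each voxel's coefficient vector should have a norm equal to the square root of its clipped R^2 score, whatever its original scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the tutorial's variables (not real data).
n_features, n_voxels = 6, 4
primal_coef = rng.standard_normal((n_features, n_voxels))
# Give each voxel a very different scale, as per-voxel regularization might.
primal_coef *= np.array([1e-3, 1.0, 10.0, 100.0])[None]
# Made-up R^2 scores, including a negative one and a zero.
scores = np.array([0.36, 0.04, -0.10, 0.0])

# Same two steps as in the diff: normalize each column (voxel) to unit norm,
# then scale it by sqrt(max(0, R^2)).
primal_coef /= np.linalg.norm(primal_coef, axis=0)[None]
primal_coef *= np.sqrt(np.maximum(0, scores))[None]

# Column norms now depend only on the scores: approximately 0.6, 0.2, 0, 0.
norms = np.linalg.norm(primal_coef, axis=0)
print(norms)
```

Voxels with negative or zero scores end up with a zero norm, which is why the explicit `scores > 0.05` selection is no longer needed. One caveat of this sketch, which it does not handle: a voxel whose coefficients are all exactly zero would give a division by zero in the first step.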
diff --git a/tutorials/notebooks/shortclips/03_plot_wordnet_model.ipynb b/tutorials/notebooks/shortclips/03_plot_wordnet_model.ipynb
@@ -443,7 +443,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "Here, we are only interested in the voxels with good generalization\nperformances. We select an arbitrary threshold of 0.05 (R^2 score).\n\n"
+        "Because the ridge model allows a different regularization per voxel, the\nregression coefficients may have very different scales. In turn, these\ndifferent scales can introduce a bias in the interpretation, focusing the\nattention disproportionately on voxels fitted with the lowest alpha. To\naddress this issue, we rescale the regression coefficients to have a norm\nequal to the square root of the $R^2$ scores. We found empirically that\nthis rescaling best matches results obtained with a regularization shared\nacross voxels. This rescaling also removes the need to select only the best\nperforming voxels, because voxels with low prediction accuracies are rescaled\nto have a low norm.\n\n"
       ]
     },
     {
@@ -454,7 +454,7 @@
       },
       "outputs": [],
       "source": [
-        "primal_coef_selection = primal_coef[:, scores > 0.05]"
+        "primal_coef /= np.linalg.norm(primal_coef, axis=0)[None]\nprimal_coef *= np.sqrt(np.maximum(0, scores))[None]"
       ]
     },
     {
@@ -472,7 +472,7 @@
       },
       "outputs": [],
       "source": [
-        "# split the ridge coefficients per delays\ndelayer = pipeline.named_steps['delayer']\nprimal_coef_per_delay = delayer.reshape_by_delays(primal_coef_selection,\n                                                  axis=0)\nprint(\"(n_delays, n_features, n_voxels) =\", primal_coef_per_delay.shape)\n\n# average over delays\naverage_coef = np.mean(primal_coef_per_delay, axis=0)\nprint(\"(n_features, n_voxels) =\", average_coef.shape)"
+        "# split the ridge coefficients per delays\ndelayer = pipeline.named_steps['delayer']\nprimal_coef_per_delay = delayer.reshape_by_delays(primal_coef, axis=0)\nprint(\"(n_delays, n_features, n_voxels) =\", primal_coef_per_delay.shape)\ndel primal_coef\n\n# average over delays\naverage_coef = np.mean(primal_coef_per_delay, axis=0)\nprint(\"(n_features, n_voxels) =\", average_coef.shape)\ndel primal_coef_per_delay"
       ]
     },
     {
@@ -551,7 +551,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "According to the authors of [1]_, \"this principal component distinguishes\nbetween categories with high stimulus energy (e.g. moving objects like\n`person` and `vehicle`) and those with low stimulus energy (e.g. stationary\nobjects like `sky` and `city`)\".\n\nIn this example, because we use only a single subject and we perform a\ndifferent voxel selection, our result is slightly different than in [1]_. We\nalso use a different regularization parameter in each voxel, while in [1]_\nall voxels had the same regularization parameter. We do not aim at\nreproducing exactly the results in [1]_, but we rather describe the general\napproach.\n\n"
+        "According to the authors of [1]_, \"this principal component distinguishes\nbetween categories with high stimulus energy (e.g. moving objects like\n`person` and `vehicle`) and those with low stimulus energy (e.g. stationary\nobjects like `sky` and `city`)\".\n\nIn this example, because we use only a single subject and we perform a\ndifferent voxel selection, our result is slightly different than in the\noriginal publication. We also use a different regularization parameter in\neach voxel, while in [1]_ all voxels had the same regularization parameter.\nHowever, we do not aim at reproducing exactly the results of the original\npublication, but we rather describe the general approach.\n\n"
       ]
     },
     {
@@ -569,7 +569,7 @@
       },
       "outputs": [],
       "source": [
-        "# split the ridge coefficients per delays\nprimal_coef_per_delay = delayer.reshape_by_delays(primal_coef, axis=0)\nprint(\"(n_delays, n_features, n_voxels) =\", primal_coef_per_delay.shape)\ndel primal_coef\n\n# average over delays\naverage_coef = np.mean(primal_coef_per_delay, axis=0)\nprint(\"(n_features, n_voxels) =\", average_coef.shape)\ndel primal_coef_per_delay\n\n# transform with the fitted PCA\naverage_coef_transformed = pca.transform(average_coef.T).T\nprint(\"(n_components, n_voxels) =\", average_coef_transformed.shape)\ndel average_coef\n\n# We make sure vmin = -vmax, so that the colormap is centered on 0.\nvmax = np.percentile(np.abs(average_coef_transformed), 99.9)\n\n# plot the primal weights projected on the first principal component.\nax = plot_flatmap_from_mapper(average_coef_transformed[0], mapper_file,\n                              vmin=-vmax, vmax=vmax, cmap='coolwarm')\nplt.show()"
+        "# transform with the fitted PCA\naverage_coef_transformed = pca.transform(average_coef.T).T\nprint(\"(n_components, n_voxels) =\", average_coef_transformed.shape)\ndel average_coef\n\n# We make sure vmin = -vmax, so that the colormap is centered on 0.\nvmax = np.percentile(np.abs(average_coef_transformed), 99.9)\n\n# plot the primal weights projected on the first principal component.\nax = plot_flatmap_from_mapper(average_coef_transformed[0], mapper_file,\n                              vmin=-vmax, vmax=vmax, cmap='coolwarm')\nplt.show()"
       ]
     },
     {
diff --git a/tutorials/notebooks/shortclips/merged_for_colab.ipynb b/tutorials/notebooks/shortclips/merged_for_colab.ipynb
@@ -2005,8 +2005,16 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Here, we are only interested in the voxels with good generalization\n",
-    "performances. We select an arbitrary threshold of 0.05 (R^2 score).\n",
+    "Because the ridge model allows a different regularization per voxel, the\n",
+    "regression coefficients may have very different scales. In turn, these\n",
+    "different scales can introduce a bias in the interpretation, focusing the\n",
+    "attention disproportionately on voxels fitted with the lowest alpha. To\n",
+    "address this issue, we rescale the regression coefficients to have a norm\n",
+    "equal to the square root of the $R^2$ scores. We found empirically that\n",
+    "this rescaling best matches results obtained with a regularization shared\n",
+    "across voxels. This rescaling also removes the need to select only the best\n",
+    "performing voxels, because voxels with low prediction accuracies are rescaled\n",
+    "to have a low norm.\n",
     "\n"
    ]
   },
@@ -2018,7 +2026,8 @@
    },
    "outputs": [],
    "source": [
-    "primal_coef_selection = primal_coef[:, scores > 0.05]"
+    "primal_coef /= np.linalg.norm(primal_coef, axis=0)[None]\n",
+    "primal_coef *= np.sqrt(np.maximum(0, scores))[None]"
    ]
   },
   {
@@ -2039,13 +2048,14 @@
    "source": [
     "# split the ridge coefficients per delays\n",
     "delayer = pipeline.named_steps['delayer']\n",
-    "primal_coef_per_delay = delayer.reshape_by_delays(primal_coef_selection,\n",
-    "                                                  axis=0)\n",
+    "primal_coef_per_delay = delayer.reshape_by_delays(primal_coef, axis=0)\n",
     "print(\"(n_delays, n_features, n_voxels) =\", primal_coef_per_delay.shape)\n",
+    "del primal_coef\n",
     "\n",
     "# average over delays\n",
     "average_coef = np.mean(primal_coef_per_delay, axis=0)\n",
-    "print(\"(n_features, n_voxels) =\", average_coef.shape)"
+    "print(\"(n_features, n_voxels) =\", average_coef.shape)\n",
+    "del primal_coef_per_delay"
    ]
   },
   {
@@ -2166,11 +2176,11 @@
     "objects like `sky` and `city`)\".\n",
     "\n",
     "In this example, because we use only a single subject and we perform a\n",
-    "different voxel selection, our result is slightly different than in [1]_. We\n",
-    "also use a different regularization parameter in each voxel, while in [1]_\n",
-    "all voxels had the same regularization parameter. We do not aim at\n",
-    "reproducing exactly the results in [1]_, but we rather describe the general\n",
-    "approach.\n",
+    "different voxel selection, our result is slightly different than in the\n",
+    "original publication. We also use a different regularization parameter in\n",
+    "each voxel, while in [1]_ all voxels had the same regularization parameter.\n",
+    "However, we do not aim at reproducing exactly the results of the original\n",
+    "publication, but we rather describe the general approach.\n",
     "\n"
    ]
   },
@@ -2191,16 +2201,6 @@
    },
    "outputs": [],
    "source": [
-    "# split the ridge coefficients per delays\n",
-    "primal_coef_per_delay = delayer.reshape_by_delays(primal_coef, axis=0)\n",
-    "print(\"(n_delays, n_features, n_voxels) =\", primal_coef_per_delay.shape)\n",
-    "del primal_coef\n",
-    "\n",
-    "# average over delays\n",
-    "average_coef = np.mean(primal_coef_per_delay, axis=0)\n",
-    "print(\"(n_features, n_voxels) =\", average_coef.shape)\n",
-    "del primal_coef_per_delay\n",
-    "\n",
     "# transform with the fitted PCA\n",
     "average_coef_transformed = pca.transform(average_coef.T).T\n",
     "print(\"(n_components, n_voxels) =\", average_coef_transformed.shape)\n",
diff --git a/tutorials/shortclips/03_plot_wordnet_model.py b/tutorials/shortclips/03_plot_wordnet_model.py
@@ -331,22 +331,32 @@
 print("(n_delays * n_features, n_voxels) =", primal_coef.shape)
 
 ###############################################################################
-# Here, we are only interested in the voxels with good generalization
-# performances. We select an arbitrary threshold of 0.05 (R^2 score).
-primal_coef_selection = primal_coef[:, scores > 0.05]
+# Because the ridge model allows a different regularization per voxel, the
+# regression coefficients may have very different scales. In turn, these
+# different scales can introduce a bias in the interpretation, focusing the
+# attention disproportionately on voxels fitted with the lowest alpha. To
+# address this issue, we rescale the regression coefficients to have a norm
+# equal to the square root of the :math:`R^2` scores. We found empirically that
+# this rescaling best matches results obtained with a regularization shared
+# across voxels. This rescaling also removes the need to select only the best
+# performing voxels, because voxels with low prediction accuracies are rescaled
+# to have a low norm.
+primal_coef /= np.linalg.norm(primal_coef, axis=0)[None]
+primal_coef *= np.sqrt(np.maximum(0, scores))[None]
 
 ###############################################################################
 # Then, we aggregate the coefficients across the different delays.
 
 # split the ridge coefficients per delays
 delayer = pipeline.named_steps['delayer']
-primal_coef_per_delay = delayer.reshape_by_delays(primal_coef_selection,
-                                                  axis=0)
+primal_coef_per_delay = delayer.reshape_by_delays(primal_coef, axis=0)
 print("(n_delays, n_features, n_voxels) =", primal_coef_per_delay.shape)
+del primal_coef
 
 # average over delays
 average_coef = np.mean(primal_coef_per_delay, axis=0)
 print("(n_features, n_voxels) =", average_coef.shape)
+del primal_coef_per_delay
 
 ###############################################################################
 # Even after averaging over delays, the coefficient matrix is still too large
@@ -405,26 +415,16 @@
 # objects like `sky` and `city`)".
 #
 # In this example, because we use only a single subject and we perform a
-# different voxel selection, our result is slightly different than in [1]_. We
-# also use a different regularization parameter in each voxel, while in [1]_
-# all voxels had the same regularization parameter. We do not aim at
-# reproducing exactly the results in [1]_, but we rather describe the general
-# approach.
+# different voxel selection, our result is slightly different than in the
+# original publication. We also use a different regularization parameter in
+# each voxel, while in [1]_ all voxels had the same regularization parameter.
+# However, we do not aim at reproducing exactly the results of the original
+# publication, but we rather describe the general approach.
 
 ###############################################################################
 # To project the principal component on the cortical surface, we first need to
 # use the fitted PCA to transform the primal weights of all voxels.
 
-# split the ridge coefficients per delays
-primal_coef_per_delay = delayer.reshape_by_delays(primal_coef, axis=0)
-print("(n_delays, n_features, n_voxels) =", primal_coef_per_delay.shape)
-del primal_coef
-
-# average over delays
-average_coef = np.mean(primal_coef_per_delay, axis=0)
-print("(n_features, n_voxels) =", average_coef.shape)
-del primal_coef_per_delay
-
 # transform with the fitted PCA
 average_coef_transformed = pca.transform(average_coef.T).T
 print("(n_components, n_voxels) =", average_coef_transformed.shape)
