=== 4. Write to share ===
  
A desirable endpoint of a successful analysis is that you can share the code and the raw data with anyone, and they will be able to generate all the figures and results in the paper. A nice example is [[http://www.jneurosci.org/content/34/5/1892.full|Bekolay et al. J Neurosci 2014]], where the Notes section gives the link to a %%GitHub%% release with very nicely organized and documented code that reproduces the results.
  
This means, among other things, that:
  
  * Annotate your data. We use two annotation files, %%ExpKeys%% and metadata, which contain a number of mandatory descriptors common across all our lab's tasks as well as more experiment-specific information. Our current lab-general specification for these files can be found [[https://github.com/mvdm/vandermeerlab/blob/master/doc/HOWTO_ExpKeys_Metadata.md|here]], and task-specific descriptors can be found [[http://ctnsrv.uwaterloo.ca/vandermeerlab/doku.php?id=analysis:dataanalysis#task_descriptions_and_metadata|here]]. When I was a postdoc in the Redish lab, a similar standardized annotation system enabled me to analyze and compare three large data sets, recorded by three different people from different brain regions ([[http://www.cell.com/neuron/abstract/S0896-6273(10)00507-6|van der Meer et al. Neuron 2010]]).
  * Don't hard-code the locations of any files. Follow the [[http://ctnsrv.uwaterloo.ca/vandermeerlab/doku.php?id=analysis:course-w16:week2#data_files_overview|database format and file naming conventions]] so that it is sufficient to specify the root folder where the data are located (see the sketch after this list).
  * Be explicit about what version numbers of various pieces of software you used to generate the results. Taken to the limit, this means also specifying the exact operating system version and shared libraries -- an issue best addressed by including an image or virtual machine (see e.g. [[http://www.russpoldrack.org/2015_12_01_archive.html|this blog post]] by Russ Poldrack for discussion). A nice way to handle this for code on %%GitHub%% is to create a [[https://help.github.com/articles/creating-releases/|release]] for a publication (essentially an easily linked-to snapshot of the code in the repository).
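
Below is a minimal MATLAB sketch of the last two points. The folder, session, and file names are hypothetical examples, not the lab's actual conventions (those are defined in the pages linked above); ''fullfile'', ''version'', and ''computer'' are standard MATLAB functions.

<code matlab>
% Build all file paths from a single root folder instead of hard-coding them.
% Only data_root is machine-specific; everything below it follows the naming convention.
data_root = 'C:\data';            % hypothetical root folder
session_id = 'R042-2013-08-18';   % hypothetical session name

session_folder = fullfile(data_root, session_id);               % portable path construction
keys_file = fullfile(session_folder, [session_id '_keys.m']);   % hypothetical annotation file name

% Record the software environment alongside your results.
matlab_version = version;   % MATLAB version string
platform = computer;        % operating system / architecture identifier
</code>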
=== 5. Be safe ===
  
Many data analysis projects will eventually require the use of statistics; arguably, for most projects, careful consideration of what statistics will eventually be done (such as whether your experimental design and power are appropriate for your question) should begin before you collect any data at all. As such, you should be aware of major statistical concepts, which we will encounter in several places throughout this course. These include, but are not limited to:
  
  * [[https://www.quora.com/What-is-an-intuitive-explanation-of-overfitting|Overfitting]]: modeling noise instead of the process of interest. Do not do this.
  * [[https://www.quora.com/What-is-an-intuitive-explanation-of-cross-validation|Cross-validation]]: a powerful, general-purpose tool for evaluating the "goodness" of a statistical model (and preventing overfitting).
  * [[https://en.wikipedia.org/wiki/Resampling_(statistics)|Resampling]] (aka bootstrapping, shuffling, permutation testing): generating synthetic data sets based on some known distribution, usually to compare to actual data (a minimal sketch follows this list).
  * Model comparison: the process of determining which model best describes the data.
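
To make the resampling idea concrete, here is a minimal sketch of a permutation (shuffle) test on made-up data; the group sizes, effect size, and number of shuffles are arbitrary choices for illustration only.

<code matlab>
% Permutation (shuffle) test: is the observed difference in means between
% two groups larger than expected by chance?
rng(0);                        % fix the random seed so the example is reproducible
groupA = randn(1, 20) + 0.5;   % synthetic data, group A (true mean 0.5)
groupB = randn(1, 20);         % synthetic data, group B (true mean 0)

obs_diff = mean(groupA) - mean(groupB);   % observed difference in means

all_data = [groupA groupB];
nA = numel(groupA); nShuf = 1000;
shuf_diff = zeros(1, nShuf);
for iShuf = 1:nShuf
    idx = randperm(numel(all_data));      % shuffle the group labels
    shuf_diff(iShuf) = mean(all_data(idx(1:nA))) - mean(all_data(idx(nA+1:end)));
end

p = mean(abs(shuf_diff) >= abs(obs_diff));   % two-sided permutation p-value
fprintf('observed difference %.2f, permutation p = %.3f\n', obs_diff, p);
</code>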
=== 7. Test on synthetic data ===
  
Analysis pipelines can get complicated quickly, such that it can be difficult to track down where things may be going wrong. A great tool to verify the integrity of single analysis steps, as well as entire workflows, is to test on data you generate yourself, so that you know what the answer should be. For instance, if you input Poisson (random) spike data with a constant firing rate, totally independent of your experimental conditions, your analysis had better not report a significant difference between conditions!
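
As a minimal sketch of this idea (using spike counts per trial rather than full spike trains, with hypothetical numbers; ''poissrnd'' and ''ttest2'' require the Statistics Toolbox):

<code matlab>
% Synthetic-data sanity check: spike counts are drawn from the SAME Poisson
% rate in both "conditions", so any significant difference is a false positive.
rng(0);                                 % reproducible example
nTrials = 50; rate = 5;                 % 5 spikes per trial on average, identical in both conditions
countsA = poissrnd(rate, nTrials, 1);   % fake "condition A" spike counts
countsB = poissrnd(rate, nTrials, 1);   % fake "condition B" spike counts

[~, p] = ttest2(countsA, countsB);      % compare conditions as your real analysis would
fprintf('synthetic data check: p = %.3f (should exceed 0.05 about 95%% of the time)\n', p);
</code>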