Analysis will be meaningless if performed on bad data. Even if you start out with good data, there are many analysis steps that have the power to corrupt.
  
//An important corollary of this principle is that you need to determine at every step whether you are dealing with garbage or not//. Two habits that help with this are visualization (explored in Module 3) and unit testing (put simply, the practice of testing specific pieces of functionality or "units"; employed throughout the modules).
  
To see why this principle is critical, consider an analogy: a complex multistep experimental procedure such as surgically implanting a recording probe into the brain. In this setting, the surgeon //always// verifies the success of the previous step before proceeding. It would be disastrous to attempt to insert a delicate probe without making sure the skull bone is removed first, or to apply dental cement to the skull without first making sure the craniotomy is sealed! Apply the same mindset to analysis and confirm the success of every step before proceeding!
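This habit can be sketched in code. Below is a minimal illustration in Python (the loader, function names, and thresholds are hypothetical, not part of any lab codebase): verify each intermediate result before proceeding to the next step.

```python
import numpy as np

def load_signal(n_samples=2000, seed=0):
    """Hypothetical stand-in for a real data loader: returns a noisy test signal."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(n_samples)

def check_signal(x, name="signal"):
    """Like the surgeon: confirm the previous step succeeded before moving on."""
    assert x.size > 0, f"{name} is empty"
    assert np.all(np.isfinite(x)), f"{name} contains NaN or Inf values"
    assert np.std(x) > 0, f"{name} is constant -- did loading fail?"
    return x

raw = check_signal(load_signal(), "raw")
detrended = check_signal(raw - np.mean(raw), "detrended")  # check again after each step
```

Each check is cheap, but it catches a silently corrupted intermediate result at the step that produced it, rather than three steps later.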
  
=== 2. Plan ahead (from raw data to result) ===
  * //Don't repeat yourself//. Implementing each piece of functionality only once means your code will be easier to troubleshoot, re-use, and extend -- as well as easier to read.
  * //Unit testing//. Provide test scenarios with key pieces of code where you know what the expected outcome is. For data analysis this commonly involves generating artificial data, such as white noise or Poisson spike trains with a certain average firing rate. These tests will be extremely helpful in interpreting your data later, and in checking that changes you make to the code have not broken its functionality.
  * //Readability//. Generally, whatever analysis you are doing, you will probably have to do again. Maybe on the same data after you make a change to the code, maybe after you collect more data. Maybe tomorrow, maybe next year. It is tempting to assume you will remember what you did and why, but this will not always be the case! Plus, even if //you// do, it is likely that someone else (such as your adviser, or a collaborator) will have to run and understand your code. Whether or not they can will reflect on you. Write code so that it is easy to understand, even if that takes a few more lines. Good documentation is crucial.
  * //Consistency//. Use consistent naming schemes for different kinds of variables and functions; avoid hard-coding things that are actually parameters (even if you are unlikely to change them, defining them explicitly makes it easier to notice when you might want to change them, and improves readability); always place constants and parameters at the start of each file.
  * //Robustness//. Build in explicit checks for various scenarios that you suspect might cause the code to break. Test assumptions.
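To make the unit-testing point concrete, here is a minimal sketch in Python (the estimator, seed, and tolerance are illustrative): generate a Poisson spike train with a known average firing rate and check that the analysis recovers it.

```python
import numpy as np

def firing_rate(spike_times, t_start, t_end):
    """Mean firing rate (Hz) of the spikes falling within [t_start, t_end]."""
    n_spikes = np.sum((spike_times >= t_start) & (spike_times <= t_end))
    return n_spikes / (t_end - t_start)

# unit test: artificial Poisson spike train with a known average rate
rng = np.random.default_rng(42)
true_rate, duration = 20.0, 100.0  # Hz, seconds
n = rng.poisson(true_rate * duration)
fake_spikes = np.sort(rng.uniform(0.0, duration, n))

estimated = firing_rate(fake_spikes, 0.0, duration)
assert abs(estimated - true_rate) < 2.0, f"estimator looks broken: {estimated:.2f} Hz"
```

At 20 Hz over 100 s, the standard deviation of the estimate is roughly 0.45 Hz, so a 2 Hz tolerance flags real bugs without tripping on ordinary Poisson variability.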
  
=== 4. Write to share ===
  
A desirable endpoint of a successful analysis is that you can share the code and the raw data with anyone, and they will be able to generate all the figures and results in the paper. A nice example is [[http://www.jneurosci.org/content/34/5/1892.full | Bekolay et al. J Neurosci 2014]], where the Notes section gives the link to a %%GitHub%% release with very nicely organized and documented code that reproduces the results.
  
This means, among other things, that:
  
  * Annotate your data. We use two annotation files, %%ExpKeys%% and metadata, which contain a number of mandatory descriptors common across all our lab's tasks, as well as more experiment-specific information. Our current lab-general specification for these files can be found [[https://github.com/mvdm/vandermeerlab/blob/master/doc/HOWTO_ExpKeys_Metadata.md|here]], and task-specific descriptors can be found [[http://ctnsrv.uwaterloo.ca/vandermeerlab/doku.php?id=analysis:dataanalysis#task_descriptions_and_metadata|here]]. As a postdoc in the Redish lab, I used a similar standardized annotation system, which enabled me to analyze and compare three large data sets, recorded by three different people from different brain regions ([[http://www.cell.com/neuron/abstract/S0896-6273(10)00507-6 | van der Meer et al. Neuron 2010]]).
  * Don't hard-code the locations of any files. Follow the [[http://ctnsrv.uwaterloo.ca/vandermeerlab/doku.php?id=analysis:course-w16:week2#data_files_overview|database format and file naming conventions]] so that it is sufficient to specify the root folder where the data are located.
  * Be explicit about which version numbers of the various pieces of software you used to generate the results. Taken to the limit, this means also specifying the exact operating system version and shared libraries -- an issue best addressed by including an image or virtual machine (see e.g. [[http://www.russpoldrack.org/2015_12_01_archive.html|this blogpost]] by Russ Poldrack for discussion). A nice way to handle this with respect to code on %%GitHub%% is to create a [[https://help.github.com/articles/creating-releases/|release]] for a publication (essentially an easily linked-to snapshot of the code on the repository).
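One way to honor the "no hard-coded file locations" rule is to keep the data root as the single machine-specific setting and build every other path from it. A minimal Python sketch (the folder naming scheme below is illustrative; the real conventions are described in the linked pages):

```python
from pathlib import Path

# the ONE machine-specific setting: where the data lives (hypothetical path)
DATA_ROOT = Path("/home/user/data")

def session_dir(subject_id, session_date):
    """Build a session folder path like <root>/R042-2014-08-18.
    The SubjectID-YYYY-MM-DD scheme used here is an assumption for illustration."""
    return DATA_ROOT / f"{subject_id}-{session_date}"

path = session_dir("R042", "2014-08-18")
```

Anyone who receives the code then only has to change `DATA_ROOT` to run everything on their own machine.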
=== 5. Be safe ===
  
Disk, computer, and connection failures happen, usually when you are least prepared. Take steps to ensure that you don't lose more than a couple of hours of work, and that you NEVER lose data!
  
In our lab, we have a data vault which stores (1) incoming data, which you should upload as soon as you have finished collecting it, and (2) [[http://ctnsrv.uwaterloo.ca/vandermeerlab/doku.php?id=analysis:course-w16:week2#data_files_overview|promoted data]], which consists of fully pre-processed and annotated data ready for further analysis and sharing. Instructions for how to access the lab data vault are [[http://ctnsrv.uwaterloo.ca/vandermeerlab/doku.php?id=analysis:course-w16:week1#grab_a_data_session_from_the_lab_database|here]].
  
The data on this data vault is stored on a redundant disk array, and periodically backed up to Amazon Glacier. However, you should always make sure you have at least one other copy of your data in a different location, such as your office computer, an external hard drive, and/or a cloud location.
For automatic backups of your analysis code in progress, I use a combination of Dropbox and GitHub (discussed in more detail in Module 1).
  
=== 6. Statistical concepts: overfitting, cross-validation, resampling, model comparison ===

Many data analysis projects will eventually require the use of statistics; arguably, for most projects, careful consideration of what statistics will eventually be done (such as whether your experimental design and power are appropriate for your question) should begin before you collect any data at all. As such, you should be aware of major statistical concepts, which we will encounter in several places throughout this course. These include, but are not limited to:

  * [[https://www.quora.com/What-is-an-intuitive-explanation-of-overfitting|Overfitting]]: modeling noise instead of the process of interest. Do not do this.
  * [[https://www.quora.com/What-is-an-intuitive-explanation-of-cross-validation|Cross-validation]]: a powerful, general-purpose tool for evaluating the "goodness" of a statistical model (and preventing overfitting).
  * [[https://en.wikipedia.org/wiki/Resampling_(statistics)|Resampling]] (aka bootstrapping, shuffling, permutation testing): generating synthetic data sets based on some known distribution, usually to compare to actual data.
  * Model comparison: the process of determining which model best describes the data.
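To give the resampling idea some substance, here is a minimal permutation-test sketch in Python (the sample sizes and effect size are made up): shuffle the group labels to build a null distribution for the observed difference in means.

```python
import numpy as np

rng = np.random.default_rng(1)

# two made-up samples of firing rates; group b is genuinely shifted upward
a = rng.normal(5.0, 1.0, 50)
b = rng.normal(6.0, 1.0, 50)
observed = np.mean(b) - np.mean(a)

# null distribution: differences obtained after shuffling the group labels
pooled = np.concatenate([a, b])
null = np.empty(5000)
for i in range(5000):
    shuffled = rng.permutation(pooled)
    null[i] = np.mean(shuffled[50:]) - np.mean(shuffled[:50])

p_value = np.mean(np.abs(null) >= abs(observed))  # two-sided
```

The shuffling step is exactly the "generate synthetic data to compare to actual data" idea above: under the null hypothesis, the group labels carry no information, so the observed difference should look unremarkable among the shuffled ones.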
  
=== 7. Test on synthetic data ===

Analysis pipelines can get complicated quickly, to the point where it becomes difficult to track down where things are going wrong. A great way to verify the integrity of single analysis steps, as well as of entire workflows, is to test on data you generate yourself, so that you know what the answer should be. For instance, if you input Poisson (random) spike data with a constant firing rate, totally independent of your experimental conditions, your analysis had better not report a significant difference!
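A sketch of this sanity check in Python (the simple z-test below is a stand-in for whatever analysis you actually run; all numbers are illustrative): feed in Poisson spike counts that are statistically identical across two "conditions" and confirm that the false positive rate stays near the nominal 5%.

```python
import numpy as np
from math import erf, sqrt

def p_two_sample(x, y):
    """Two-sample z-test p-value (normal approximation; fine for large samples)."""
    se = sqrt(np.var(x, ddof=1) / len(x) + np.var(y, ddof=1) / len(y))
    z = (np.mean(x) - np.mean(y)) / se
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

rng = np.random.default_rng(7)
n_experiments, false_positives = 500, 0
for _ in range(n_experiments):
    # both "conditions" come from the SAME 10 Hz Poisson process
    cond_a = rng.poisson(10.0, 100)
    cond_b = rng.poisson(10.0, 100)
    if p_two_sample(cond_a, cond_b) < 0.05:
        false_positives += 1

fp_rate = false_positives / n_experiments  # should hover around 0.05
```

If your pipeline reports "significant" differences on such condition-independent input much more often than 5% of the time, something upstream is broken, and it is far cheaper to discover that here than in a reviewer's hands.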
analysis/nsb2015/week0.1451945470.txt.gz · Last modified: 2018/07/07 10:19 (external edit)