Introduction: Principles of careful data analysis

This module sketches some of the overall principles that inform careful data analysis.

If this is your first contact with analysis of neural data and/or with MATLAB, some of the more technical points below may not resonate with you. This is OK – for now, focus on the ideas as you read through. Later modules will refer back to these principles, so as you build up your experience you can revisit this page.

Conversely, if you are an experienced analyst, what follows will likely be familiar. In either case, please feel free to contribute your thoughts, questions, and suggestions, either by using the Discussion panel below or by editing the wiki directly!

Principles

1. Garbage in, garbage out

Analysis will be meaningless if performed on bad data. Even if you start out with good data, many analysis steps have the power to corrupt it.

An important corollary of this principle is that you need to determine at every step whether you are dealing with garbage or not. Two habits that help with this are visualization (explored in Module 3) and unit testing (put simply, the practice of testing specific pieces of functionality or “units”; employed throughout the modules).
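
For example, a quick sanity check on a freshly loaded LFP might look like the following. This is only a minimal sketch: the field names assume the TSD format introduced under Principle 2 (and in Module 2), and the specific checks that matter will depend on your data.

% sanity-check a freshly loaded LFP (illustrative; assumes a TSD with a
% .tvec time axis and a .data sample vector)
assert(~isempty(LFP.data), 'LFP contains no samples');
assert(~any(isnan(LFP.data(:))), 'LFP contains NaNs');
assert(all(diff(LFP.tvec(:)) > 0), 'LFP timestamps are not strictly increasing');
 
plot(LFP.tvec, LFP.data); % visualize: does this actually look like LFP?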

To see why this principle is critical, consider an analogy: a complex multistep experimental procedure such as surgically implanting a recording probe into the brain. In this setting, the surgeon always verifies the success of the previous step before proceeding. It would be disastrous to attempt to insert a delicate probe without making sure the skull bone is removed first; or to apply dental cement to the skull without first making sure the craniotomy is sealed! Apply the same mindset to analysis and confirm the success of every step before proceeding!

2. Plan ahead (from raw data to result)

Before the start of data collection, you should identify the steps in your data processing “pipeline” – that is, the flow from raw data to the figures in the resulting publication. Doing this can often highlight key dependencies and potentially important controls, helping you collect the data in a way that lets you actually test what you set out to test.

This sort of planning is especially important when performing experiments with long timelines that are not easily changed, such as when chronically implanting animals for in vivo recording, where it may take up to two months to collect data from a single animal. For smaller projects or those with faster iteration times (e.g. a new slice every day) you can be more flexible.

There are two steps to this planning process:

First, think in terms of data, and transformations on those data, to create a schematic that illustrates your analysis workflow at a conceptual level.

For instance, to determine whether the number of sharp wave-ripple complexes (SWRs; these are candidate “replay” events in the hippocampus) that occur depends on an experimental manipulation, a possible analysis workflow might be represented as follows (generated with the DokuWiki plugin for GraphViz):

The above workflow shows how raw local field potential (LFP) data is first loaded (by the LoadCSC() function) and then filtered (FilterLFP()). Note that at this stage, you can simply make up function names, as long as they are descriptive (see Principle 3, below). Next, SWR events are detected from the filtered LFP, and the number of events in each trial is counted before a statistical test is applied.

The square brackets such as [TSD] refer to standardized data types, introduced in Module 2. Briefly, a TSD object describes one or more time-varying signals (such as LFP or videotracker data), an IV object describes interval data (such as SWR events, which have a start and end time as well as some properties such as their power), and a TS object describes timestamps (such as spike times). By standardizing the form in which these data types are handled, we can more easily implement unit tests and write clean, modular code.
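
As a rough sketch (the field names below are illustrative assumptions; the definitive formats are given in Module 2), these data types can be thought of as simple MATLAB structs:

% illustrative skeletons only; the authoritative definitions appear in Module 2
lfp_tsd.tvec = 0:0.001:1;                  % [TSD] time axis: 1 s sampled at 1 kHz
lfp_tsd.data = randn(size(lfp_tsd.tvec));  % ...and the matching signal values
 
swr_iv.tstart = [0.2 0.6];                 % [IV] interval start times (s)...
swr_iv.tend   = [0.25 0.68];               % ...and matching end times
 
spikes_ts.t = {[0.11 0.35 0.82]};          % [TS] spike times, one cell per neuron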

Second, based on a data analysis workflow such as the one above, write out some example pseudocode that would implement the analysis in MATLAB. For the workflow above, this might look something like:

% load a data file
cfg = []; cfg.fc = {'R042-2013-08-18-CSC03a.ncs'}; % specify which file to load
LFP = LoadCSC(cfg);
 
% filter the data in the ripple band (150-220Hz)
cfg = []; cfg.f = [150 220]; % specify passband
LFPfilt = FilterLFP(cfg,LFP);
 
% extract the signal envelope by Hilbert transform
LFPfilt.data = abs(hilbert(LFPfilt.data)); % assumes the filtered signal lives in the TSD's .data field
 
% detect intervals that pass a threshold
cfg = []; cfg.method = 'zscore';
cfg.threshold = 5; cfg.select = '>'; % return intervals where threshold is exceeded
 
SWR = MakeIV(cfg,LFPfilt); % make intervals (corresponding to SWR events)

Note that each analysis step is implemented by a function, with a cfg struct to specify some parameters of the transformation (e.g. the frequency band to filter). The overall workflow is accomplished by calling the appropriate functions on evolving data types. Perhaps some of the functions you need already exist, or you may need to write some of them. Either way, making the analysis steps explicit in this way provides a good starting point for writing well-organized code.
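
The pseudocode above stops at event detection; the remaining workflow steps (counting events in each trial, then comparing counts across conditions) could be sketched as follows. Note that trial_iv and condition are hypothetical variables standing in for your trial intervals and condition labels, the .tstart/.tend fields follow the IV sketch above, and ranksum is just one possible choice of test:

% count SWR events in each trial (trial_iv: hypothetical [IV] of trial start/end times)
nTrials = length(trial_iv.tstart);
nSWR = zeros(nTrials, 1);
for iT = 1:nTrials
    this_trial = SWR.tstart >= trial_iv.tstart(iT) & SWR.tstart <= trial_iv.tend(iT);
    nSWR(iT) = sum(this_trial);
end
 
% compare event counts between the two conditions (condition: hypothetical label vector)
p = ranksum(nSWR(condition == 1), nSWR(condition == 2));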

3. Use good programming practice

There are many resources and opinions on what constitutes good programming practice. A few of the most important are:
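
As one concrete illustration (a sketch of one possible convention, not a prescribed template), an analysis-step function in the style of Principle 2 can be given a descriptive name, a documented cfg input with sensible defaults, and a header comment stating its inputs and outputs. The function below is hypothetical and written only to show this shape:

function iv_out = ThresholdIV(cfg_in, tsd_in)
% THRESHOLDIV Return intervals where a z-scored signal exceeds a threshold.
% (hypothetical example function, written for illustration only)
%
% INPUTS: cfg_in with optional field .threshold (in SDs, default 5); tsd_in, a [TSD]
% OUTPUT: iv_out, an [IV] with .tstart and .tend fields (s)
 
cfg = []; cfg.threshold = 5; % defaults
if ~isempty(cfg_in) % overwrite defaults with any user-supplied fields
    for f = fieldnames(cfg_in)'
        cfg.(f{1}) = cfg_in.(f{1});
    end
end
 
z = (tsd_in.data - mean(tsd_in.data)) ./ std(tsd_in.data); % z-score the signal
above = (z(:)' > cfg.threshold); % force row vector for the edge detection below
 
d = diff([0 above 0]); % rising (+1) and falling (-1) edges of above-threshold stretches
iv_out.tstart = tsd_in.tvec(find(d == 1));
iv_out.tend = tsd_in.tvec(find(d == -1) - 1);

The details will vary, but keeping every analysis step in a consistent shape like this makes individual steps easier to unit test and to swap out later.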

4. Write to share

A desirable endpoint of a successful analysis is that you can share the code and the raw data with anyone, and they will be able to regenerate all the figures and results in the paper. A nice example is Bekolay et al. (J Neurosci, 2014), where the Notes section links to a GitHub release with very nicely organized and documented code that reproduces the results.

This means, among other things, that:
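
In practice, one way to work toward this (a hypothetical layout, offered only as an illustration) is to keep a single top-level script in the shared repository that regenerates every figure from the raw data:

% MASTER_GenerateFigures.m -- hypothetical top-level script for a shared project;
% anyone with the raw data and this repository should be able to run it
% and obtain all figures in the paper
restoredefaultpath;                  % start from a clean MATLAB path
addpath(genpath('shared_code'));     % hypothetical folder containing this project's functions
 
cfg = []; cfg.data_dir = 'raw_data'; % hypothetical location of the raw data
 
GenerateFigure1(cfg);                % hypothetical per-figure functions
GenerateFigure2(cfg);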

5. Be safe

Disk, computer, and connection failures happen, usually when you are least prepared. Take steps to ensure that you don't lose more than a couple of hours of work, and that you NEVER lose data!

In our lab, we have a data vault which stores (1) incoming data, which you should upload as soon as you have finished collecting it, and (2) promoted data, which consists of fully pre-processed and annotated data ready for further analysis and sharing. Instructions for how to access the lab data vault are here.

The data vault stores its contents on a redundant disk array and is periodically backed up to Amazon Glacier. However, you should always keep at least one other copy of your data in a different location, such as your office computer, an external hard drive, or a cloud location.

For automatic backups of your analysis code in progress, I use a combination of Dropbox and GitHub (discussed in more detail in Module 1).

6. Statistical concepts: overfitting, cross-validation, resampling, model comparison

Many data analysis projects will eventually require the use of statistics; arguably, for most projects, careful consideration of what statistics will eventually be done (such as whether your experimental design and power are appropriate for your question) should begin before you collect any data at all. As such, you should be aware of the major statistical concepts we will encounter in several places throughout this course, including, but not limited to, those in this section's title: overfitting, cross-validation, resampling, and model comparison.
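
As a small illustration of one of these (resampling), the sketch below estimates how often a difference in mean SWR counts as large as the observed one arises when the condition labels are shuffled; nSWR and condition are the hypothetical variables from the workflow sketch under Principle 2:

% shuffle (permutation) test on the difference in mean event counts between conditions
obs_diff = mean(nSWR(condition == 1)) - mean(nSWR(condition == 2));
 
nShuf = 1000;
shuf_diff = zeros(nShuf, 1);
for iS = 1:nShuf
    shuf_cond = condition(randperm(length(condition))); % shuffle the condition labels
    shuf_diff(iS) = mean(nSWR(shuf_cond == 1)) - mean(nSWR(shuf_cond == 2));
end
 
p_shuf = mean(abs(shuf_diff) >= abs(obs_diff)); % two-sided shuffle p-value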

7. Test on synthetic data

Analysis pipelines can get complicated quickly, such that it can be difficult to track down where things are going wrong. A great tool for verifying the integrity of single analysis steps, as well as of entire workflows, is to test on data you generate yourself, so that you know what the answer should be. For instance, if you input Poisson (random) spike data with a constant firing rate, totally independent of your experimental conditions, your analysis had better not report a significant difference!
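
A sketch of such a synthetic input for the SWR-style workflow above might look like this (the parameter values and variable names are arbitrary assumptions; poissrnd requires the Statistics Toolbox):

% generate synthetic Poisson spike trains with a constant firing rate,
% independent of condition, then feed them through the full analysis pipeline
nTrials = 100; trial_len = 5; rate = 10;              % arbitrary: 100 trials of 5 s at 10 Hz
condition = [ones(nTrials/2,1); 2*ones(nTrials/2,1)]; % fake condition labels
 
synth_spikes = cell(nTrials, 1);
for iT = 1:nTrials
    nSpikes = poissrnd(rate * trial_len);                  % Poisson spike count for this trial
    synth_spikes{iT} = sort(rand(nSpikes, 1) * trial_len); % spike times uniform within the trial
end
 
% running synth_spikes through the pipeline should NOT yield a significant
% difference between condition 1 and condition 2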