Reference
HowToGuides
Manuals
LabAlumni
DataAnalysis
Advice for...
Admin
~~DISCUSSION~~
Goals:
Resources:
Careful analysis of neural data begins with a thorough understanding of the raw data that is saved by the data acquisition system (usually referred to here as “Neuralynx” or “Cheetah”). However, raw data is only rarely suitable for immediate analysis. At a minimum, freshly acquired data sets must be annotated and the files systematically renamed – for instance, with the ID of the experimental subject and some information about recording locations – so that the analyst can select which files to analyze. More complex pre-processing steps inlcude spike sorting, the process of assigning spike waveforms to putative single neurons to obtain their spike times.
First, make sure you have the folder R042-2013-08-18
from the lab database (or NS&B share), and that this is placed in a sensible location (NOT in a GitHub or project folder! See Module 1 if this is not obvious). This folder contains data from a single recording session that has been pre-processed so that it is ready for analysis. Such a pre-processed data set is referred to as “promoted”; raw data that has just been recorded is “incoming”, data being pre-processed is “inProcess”. The schematic below gives an overview of the major data files and their transformation during pre-processing:
The files you find in a promoted folder such as R042-2013-08-18
are those enclosed in the gray box. They are:
.ncs
file (“Neuralynx Continuously Sampled”) contains a single channel of continuously sampled voltage data. The sampling rate and filters for these channels can be configured in the Cheetah data acquisition software. Typically, as in this data set, the sampling rate and filters are set so that these files are local field potentials (LFPs) sampled at 2kHz and filtered between 1 and 475 Hz. It is also possible to have wide-band, 32kHz .ncs
files suitable for spike extraction, but these are not included in the current dataset..t
file contains a set of times – a spike train from a putative neuron. The qualifier “putative” is used because this is extracellular data and spike-sorting is not perfect, so it's likely there will be some spikes missing and some spikes included that are not from this neuron. Always remember this even if I will omit the “putative” from now on for short! *.t
files are generated by MClust, a spike sorting tool developed by A. David Redish, from the raw *.ntt
(“Neuralynx TeTrode”) files saved by Neuralynx. *.ntt
files do not contain continuously sampled data; instead, a one-millisecond snapshot across the channels of a tetrode is stored whenever any of the four channels exceeds a threshold set in Cheetah by the experimenter.*.nvt
file (“Neuralynx Video Tracking”) contains the location of the rat as tracked by an overhead camera. For Neuralynx systems, this is typically sampled at 30 Hz. Because the raw files are large, they are usually stored in compressed (zip) format, or loaded into MATLAB and saved as a .mat
file. These files are in units of camera pixels (640×480 in this case).*.Nev
file (“Neuralynx EVents”) contains timestamps and labels of events, such as those input by the user during recording, received from experimental components connected to Neuralynx's digital I/O (Input/Output) port, and system messages such as recording start, data loss, et cetera.A critical part of any data set is the following:
*keys.m
file, referred to as “ExpKeys” or “keys”. This file contains experimenter-provided information that describes this data set. This information is stored as a .m
file so that it can be edited and read by standard text editors (rather than having to be loaded into MATLAB to view, as would be the case for a .mat
file). This file and the correct format for ExpKeys will be explored in detail later. In brief, it defines a MATLAB structure named ExpKeys
with fields that describe what was done in the current session.Finally, there is also:
*wv.mat
files. There is one file for each *.t
file, containing the average waveforms for that cell.*ClusterQual.mat
files. Also, one file for each *.t
file, containing some cluster quality statistics.
Both of these files are generated by a MATLAB script (CreateCQFile.m
) or directly from MClust version 4.1 or higher.
☛ Look at the contents of the R042-2013-08-18
folder.
Notice how each file is named: all start with R042-2013-08-18
followed by a suffix indicating the file type and (if necessary) an identifier. Applying this naming scheme consistently is a key part of good data management because it enables provenance tracking – which cells from what animal, what sesssion, and what condition are contributing to each plot, what cluster did it come from, et cetera. The rename steps in the above schematic are a critical first step.
Neuralynx supplies a set of functions that load the raw data into MATLAB (included in your GitHub clone). We will use these one by one in the following subsections. A common theme is that all of these functions will output a Timestamps
variable, indicating when each data sample or event occurred. Data acquisition systems need to solve the engineering challenge of aligning many different kinds of signals (video, neural activity, events) on a common timebase, so that relationships between them can be analyzed. These Timestamps
are what ties the different data files together. By default, Neuralynx data loaders return timestamps in microseconds (us).
Before getting started, create a folder with today's date in your project folder, and create a new file in it named sandbox.m
. These sandbox files are not meant to be re-used or committed to GitHub – as the name indicates, they are just a temporary file that is easier to work with compared to typing everything directly into the MATLAB Command Window.
Next, make sure that your path is set correctly using a Shortcut button. Also, set MATLAB's current directory to the data folder (R042-2013-08-18); you can do this either using the GUI or by using the cd
command.
All instructions that follow should be pasted into a cell in this sandbox file and executed from there (Ctrl-Enter when a cell is selected), unless they are prefaced with »
to indicate the Command Prompt.
The low-level loading function for video data is Nlx2MatVT
. Deploy it as follows:
%% load video data (make sure the VT1.zip file is unzipped first!) [Timestamps, X, Y, Angles, Targets, Points, Header] = Nlx2MatVT('VT1.nvt', [1 1 1 1 1 1], 1, 1, [] );
The abundance of ones in the function call are basically saying, “load everything” (type help Nlx2MatVT
for the gory details).
Notice that the output arguments (with the exception of the Header
) share a common dimension:
>> whos Name Size Bytes Class Attributes Angles 1x131898 1055184 double Header 28x1 4262 cell Points 400x131898 422073600 double Targets 50x131898 52759200 double Timestamps 1x131898 1055184 double X 1x131898 1055184 double Y 1x131898 1055184 double
We appear to have 131898 samples of “X” and “Y”, the main variables of interest, with corresponding timestamps. We can plot X against Y:
>> plot(X,Y);
to get:
You can see the outline of the modified T-maze (rotated 90 degrees). But this way of plotting the position data reveals something strange going on: there are many abrupt jumps to the (0,0) position! As it turns out, these are Neuralynx's way of indicating missing data (samples on which no position data could be acquired).
☛ Plot X against Y again, but this time without the missing data. A good way of doing this is to first define a variable keep_idx
that contains the indices of those samples which you want to keep (i.e. that are not (0,0)).
Inspect the resulting plot. The shape of the T-maze is now more clear; also visible are two roughly circular areas. These are the “pedestals” on which the rat can relax at the beginning and end of the recording session, as well as in between trials.
I plotted my version as follows:
%% plot video data -- use a new cell so that you can rerun this without also reloading the data fh = figure; set(fh,'Color',[0 0 0]); plot(X(keep_idx),Y(keep_idx),'.','Color',[0.7 0.7 0.7],'MarkerSize',1); axis off;
The first line opens a new figure, and uses its handle to set the background to black. The second line uses additional arguments for plot()
to plot the X and Y data points not as a connected line, but as individual points of size 1 in a gray color. The result:
It is useful to know how to save figures to a format that is easy to view:
set(gcf, 'InvertHardCopy', 'off'); print(gcf,'-r75','-dpng','module2_xvsy2.png');
The first line is necessary to preserve the black background. The second line saves a 75dpi PNG image. PNG is a good choice for saving MATLAB images, because it uses lossless compression and therefore will not cause ugly artifacts the way JPEG will.
Let's look at the Timestamps next, by plotting the X data as a function of time:
plot(Timestamps(keep_idx),X(keep_idx),'.r','MarkerSize',3) box off; set(gca,'FontSize',24);
Note the use of some different plotting options here, to give:
The horizontal axis is still in Neuralynx's raw data units (us).
☛ Convert the Timestamps to seconds, and replot.
If you look closely, you can spot some gaps in the data (times when no position data is plotted).
☛ Are these gaps because of (0,0) samples that have been removed? Or because there are no records in the data for those times?
As you should have ascertained, there are in fact two short gaps in the data. These occur on purpose to separate behavior on the T-maze (when you can see the X coordinate changing as the rat runs) from the times when the rat is resting on the pedestal. In the Cheetah software this can be done by simply turning off Recording and then turning it back on. (Sneak preview: although doing this is helpful for some applications, it can be problematic for analyses that assume your data is continuous. We will encounter this when we start using the FieldTrip toolbox in Module 6.)
☛ Determine the video tracker sampling rate from the Timestamps
variable. Watch out for gaps in the data!
This concludes the introduction to Neuralynx video data. The other outputs of Nlx2MatVT
are not used for typical analyses.
The Neuralynx loader for Ncs files is Nlx2MatCSC
. Use it thusly:
clear all; fname = 'CSC17.ncs'; [Timestamps, ~, SampleFrequencies, NumberOfValidSamples, Samples, Header] = Nlx2MatCSC(fname, [1 1 1 1 1], 1, 1, []);
..and inspect the result:
>> whos Name Size Bytes Class Attributes Header 33x1 5182 cell NumberOfValidSamples 1x17193 137544 double SampleFrequencies 1x17193 137544 double Samples 512x17193 70422528 double Timestamps 1x17193 137544 double fname 1x9 18 char
Now we get only 17193 Timestamps, a surprising number because it is substantially less than the number of video tracking timestamps we got (on the order of 10 times less), even though the video tracking data was only sampled at about 30 Hz, and this LFP data is supposed to be sampled at something like 2kHz! As it turns out, Neuralynx Ncs data is stored in blocks of 512 samples, with only the first sample of each block timestamped. Hence the [512 x 17193] size of Samples, which contains the actual time-varying voltage signal. This is not a very convenient format for plotting timestamps against voltage, the way we typically would like to do. This is one reason why we generally don't use these low-level loading functions, but instead wrap them in a function that is more user-friendly. These loading functions are discussed in the next section.
For now, one more point about this data: Samples
is not in units of volts, but on a scale internal to the Neuralynx system. To know how these “A-D bits” (analog-to-digital) correspond to real voltages, we need to look in the Header
:
>> Header Header = '######## Neuralynx Data File Header' '## File Name C:\CheetahData\2013-08-18_09-06-16\CSC17.ncs' '## Time Opened (m/d/y): 8/18/2013 (h:m:s.ms) 9:6:36.401' '## Time Closed (m/d/y): 8/18/2013 (h:m:s.ms) 10:26:2.464' '' '-FileType CSC' '-FileVersion 3.3.0' '-RecordSize 1044' '' '-CheetahRev 5.6.3 ' '' '-HardwareSubSystemName AcqSystem1' '-HardwareSubSystemType DigitalLynxSX' '-SamplingFrequency 2000' '-ADMaxValue 32767' '-ADBitVolts 0.000000061037020770982053' '' '-AcqEntName CSC17' '-NumADChannels 1' '-ADChannel 16' '-InputRange 2000' '-InputInverted True' '' '-DSPLowCutFilterEnabled True' '-DspLowCutFrequency 1' '-DspLowCutNumTaps 0' '-DspLowCutFilterType DCO' '-DSPHighCutFilterEnabled True' '-DspHighCutFrequency 475' '-DspHighCutNumTaps 128' '-DspHighCutFilterType FIR' '-DspDelayCompensation Disabled' '-DspFilterDelay_µs 1984'
Aha, the -ADBitVolts
entry gives us the conversion from the raw data to volts. Another reason to wrap this lowlevel function into something that does the conversion for us! As you can see, the header contains some other information, which will be discussed in more detail in Module 3.
*.Nev
(Neuralynx Event) files contain timestamps of various task events. Use as follows:
fn = FindFile('*Events.nev'); [EVTimeStamps, EventIDs, TTLs, EVExtras, EventStrings, EVHeader] = Nlx2MatEV(fn,[1 1 1 1 1],1,1,[]);
As before, all the ones in the function call make sure we load everything. In return, we get:
>> whos Name Size Bytes Class Attributes EVExtras 8x462 29568 double EVHeader 12x1 1924 cell EVTimeStamps 1x462 3696 double EventIDs 1x462 3696 double EventStrings 462x1 103104 cell TTLs 1x462 3696 double fn 1x44 88 char
Each of the 462 events in this file has a timestamp (EVTimeStamps
) and a description (EventStrings
) as well as some other information we generally don't need. Let's inspect some of the EventStrings
:
>> EventStrings(1:13) ans = 'Starting Recording' 'Stopping Recording' 'Starting Recording' 'TTL Input on AcqSystem1_0 board 0 port 1 value (0x0020).' 'TTL Input on AcqSystem1_0 board 0 port 1 value (0x0000).' 'TTL Input on AcqSystem1_0 board 0 port 1 value (0x0020).' 'TTL Input on AcqSystem1_0 board 0 port 1 value (0x0000).' 'TTL Input on AcqSystem1_0 board 0 port 1 value (0x0080).' 'TTL Input on AcqSystem1_0 board 0 port 1 value (0x0000).' 'TTL Output on AcqSystem1_0 board 0 port 0 value (0x0004).' 'TTL Input on AcqSystem1_0 board 0 port 1 value (0x0080).' 'TTL Output on AcqSystem1_0 board 0 port 0 value (0x0000).' 'TTL Input on AcqSystem1_0 board 0 port 1 value (0x0000).'
The meaning of these cryptic strings depends on the specific experimental setup. “AcqSystem1_0 board 0 port 0” and “1” refer to connectors on the Neuralynx data acquisition mainbox, which can be hooked up to various experimental peripherals such as photobeams, levers, and pellet dispensers.
In this session, Input/Output (I/O) Port 0 was configured as Output, controlling a pellet dispenser and a valve (for sucrose solution delivery). Port 1 was set to be an Input, receiving inputs from three photobeams (one on the central stem of the maze, and one each for each reward site on either end of the maze arms). The EventStrings
above refer to the status of an I/O port, represented as a hexadecimal number (indicated by the prefix “0x”). The activation of each peripheral is associated with a unique number.
☛ Find out which EventString
corresponds to which input or output (food pellet reward on left arm, sucrose water reward on right arm, left reward photobeam, right reward photobeam, central stem photobeam) by plotting the location of the animal at the time of each event.
Hint: example pseudocode for a nice approach to find this out would look like the following:
get list of unique event strings to process -- unique() for each event string find indices of events that match current event string -- strncmp() get timestamps for matched events find indices of position timestamps that are closest in time -- nearest_idx() get x and y coordinates of closest timestamps plot x and y coordinates on top of position plot end
As with the previous low-level loading functions, the Neuralynx loader does not provide us directly with what we want. We'd like a loader that just gives us the times for the events we are interested in, without us having to figure out what hexadecimal number they correspond to and then pull out the matching times.
Before we can proceed to the nice wrapped loading functions, you first need to understand the three main data types used for neural data analysis; this is necessary because the loaders return their data in these formats.
A data type is the computer science term for a standardized format of representing data. Classical data types include things like integers and floating-point numbers, but our data types of interest are essentially all MATLAB structs with particular constraints on field names and formats.
(Note for the connoisseurs: the choice to not implement these data types as MATLAB objects is deliberate.)
The three main data types are (1) timestamped data (TSD), (2) timestamps (TS), (3) and intervals (IV), discussed in turn.
Time-varying signals, such as extracellular potentials recorded by an intracranial electrode, or position data recorded by an overhead camera, are very common in (neuro)science. Such signals are acquired through sampling, that is, a data point is acquired at some specific time, and then some time later, another point, and so on (the idea of sampling and some consequences are explored in detail in Module 3). The result of this is that instead of a truly continuous signal, we have instead a set of points, each with a timestamp and a value:
Thus, what we need to fully describe such a signal is two arrays of the same length: one with the timestamps and the other with the values. This is exactly what the timestamped data (TSD) data type is, as illustrated by the LoadCSC()
function:
cfg = []; cfg.fc = {'CSC17.ncs'}; csc = LoadCSC(cfg); >> csc csc = tvec: [8802816x1 double] data: [1x8802816 double] label: {'CSC17.ncs'} cfg: [1x1 struct]
The TSD data type has the following fields:
tvec
: nSamples x 1 double, timestamps (in seconds)data
: nSignals x nSamples double, values (units can be specified in cfg if needed)label
: nSignals x 1 cell array, filenamescfg
: content depends on specific data, but always has a history
field. For CSC data, there is also hdr
, ExpKeys
, and SessionID
.
In the above example, we only loaded one .Ncs file and therefore there is only one label. To plot this data you can simply do plot(csc.tvec,csc.data)
.
A different data type is needed to describe sets of punctate events (a point process in statistics), such as times of action potentials (spikes) or task events such as reward delivery times. For this we use the TS (timestamp) data type, defined as follows:
t
: nSignals x 1 cell array, timestamps (in seconds)label
: nSignals x 1 cell array, labelscfg
: content depends on specific data, but always has a history
field.
An example is provided by the event loader LoadEvents()
:
%% cfg = []; evt = LoadEvents(cfg); >> evt evt = t: {1x9 cell} label: {1x9 cell} cfg: [1x1 struct]
Because we did not specify a specific configuration, LoadEvents just gives us the times of all EventStrings by default; these have become the entries in the label
field.
Interval data – matched sets of start and end times – is not loaded directly from promoted data files. However, it commonly comes up during analysis, for instance when defining trials, running vs. resting epochs, sharp wave-ripple complexes, et cetera.
Interval data is defined as follows:
tstart
: nIntervals x 1 double, interval start times (in seconds)tend
: nIntervals x 1 double, end times (in seconds)cfg
: content depends on specific data, but always has a history
field.Some common ways of creating an iv object from scratch are the following:
>> a = iv([1 2]) % define a single interval from 1 to 2 a = tstart: 1 tend: 2 cfg: [1x1 struct] >> b = iv([1 2],[3 3]) % define two intervals, 1 to 3 and 2 to 3 b = tstart: [2x1 double] tend: [2x1 double] cfg: [1x1 struct]
You have already seen examples of TSD and TS data types returned by some loading functions. The full set follows below. You will notice that each loading function takes in a cfg
(“configuration”) variable, which is used to specify parameters and options such as the filenames to be loaded. This use of cfg
variables is shared by many other vandermeerlab data analysis functions, and is highly encouraged when you start writing your own code. For a full description and rationale, see here; in brief, it forces well-organized code and enables provenance tracking, two principles of good programming practice.
Remember, this cfg
variable is separate from the cfg field created within the data structure. If you are in doubt about how to use the cfg parameters, use the help
function on each data loader:
>> help LoadCSC function csc_tsd = LoadCSC(cfg) loads Neuralynx .ncs files, handling missing samples correctly INPUTS: only cfg cfg.fc: cell array containing filenames to load if no file_list field is specified, loads all *.Ncs files in current dir cfg.TimeConvFactor = 10^-6; % from nlx units to seconds cfg.VoltageConvFactor = 1; % factor of 1 means output will be in volts OUTPUTS: csc_tsd: CSC data tsd struct MvdM 2014-06-18, 25 (use cfg_in)
Alternatively, you can use doc LoadCSC
to open the same help file with a link to the original code.
This loads raw Neuralynx position data (*.nvt). If no filename is specified in the input cfg, LoadPos()
checks if a single .Nvt file is found in the current directory and loads that one:
>> cfg = []; >> posdata = LoadPos(cfg); >> posdata posdata = tvec: [1x131898 double] data: [2x131898 double] label: {'x' 'y'} cfg: [1x1 struct]
Because the .Nvt files are large, it is often convenient to save this posdata
variable as a .mat file. This should be named Rxxx-yyyy-mm-dd-vt.mat
.
Note that the data
field now has dimensionality [2 x nSamples]; this is because there is both x and y data as indicated by the label
field. So, if you wanted to plot x against y, you could do plot(posdata.data(1,:),posdata.data(2,:),'.');
, but a more general approach that doesn't require knowing which variable is which dimension is plot(getd(posdata,'x'),getd(posdata,'y'));
.
To load .Ncs
files with LFPs:
cfg = []; cfg.fc = {'CSC17.ncs'}; csc = LoadCSC(cfg);
This gives the following TSD:
>> csc csc = tvec: [8802816x1 double] data: [2x8802816 double] label: {'CSC17.ncs' 'CSC18.ncs'} cfg: [1x1 struct]
Note that the format is the same as for the position data above; this is because both LoadPos()
and LoadCSC()
return TSDs.
LoadCSC()
outputs some information about the files being loaded; in particular the number of “bad blocks”. These will be explored in Module 3 (short version: indicates a problem with the recording system and should be fixed).
Finally, the cfg
field has the ExpKeys, the SessionID (R042-2013-08-18
), the headers (.hdr
) for each .Ncs file, and the history
.
By default, LoadEvents()
returns a TS with the labels and timestamps of all unique strings found in the EventStrings:
cfg = []; evt = LoadEvents(cfg); >> evt evt = t: {1x9 cell} label: {1x9 cell} cfg: [1x1 struct]
evt.label(:)
will reveal the familiar list of events introduced above. However, by using the cfg file, we can get something more specific:
cfg = []; cfg.eventList = {'TTL Output on AcqSystem1_0 board 0 port 0 value (0x0004).','TTL Output on AcqSystem1_0 board 0 port 0 value (0x0040).'}; cfg.eventLabel = {'FoodDelivery','WaterDelivery'}; evt = LoadEvents(cfg); >> evt evt = t: {[1x9 double] [1x9 double]} label: {'FoodDelivery' 'WaterDelivery'} cfg: [1x1 struct]
By specifying which EventString is associated with which human-readable event ('FoodDelivery','WaterDelivery') we now have a more user-friendly events variable.
LoadSpikes() is straightforward:
%% cfg = []; S = LoadSpikes(cfg); >> S S = t: {1x67 cell} label: {1x67 cell} cfg: [1x1 struct]
By default, LoadSpikes()
attempts to read the .clusters
file generated by MClust to obtain cluster ratings. This can be overridden by specifying cfg.useClustersFile = 0;
. If you wish to load *._t files, do cfg.load_questionable_cells = 1;
.
The other files of interest are all MATLAB .mat
files which can be loaded directly using the load() function.