Module 2

Goals:

  • Obtain a basic overview of the different file formats saved by a Neuralynx system
  • Become aware of the pre-processing steps typically applied to raw data
  • Get to know the different files in a promoted (pre-processed) data set, and their relationship to the raw data
  • Understand how the data are represented in MATLAB data types (ts, tsd, and iv)
  • Use the wrapped data loaders
  • Visualize the raw data

Resources:

Introductory remarks

Careful analysis of neural data begins with a thorough understanding of the raw data that is saved by the data acquisition system (usually referred to here as “Neuralynx” or “Cheetah”). However, raw data is only rarely suitable for immediate analysis. At a minimum, freshly acquired data sets must be annotated and the files systematically renamed – for instance, with the ID of the experimental subject and some information about recording locations – so that the analyst can select which files to analyze. More complex pre-processing steps include spike sorting, the process of assigning spike waveforms to putative single neurons to obtain their spike times.

Data files overview

The R042-2013-08-18 folder you obtained in the previous module contains data from a single recording session that has been pre-processed so that it is ready for analysis. Such a pre-processed data set is referred to as “promoted”; raw data that has just been recorded is “incoming”, data being pre-processed is “inProcess”. The schematic below gives an overview of the major data files and their transformation during pre-processing:

The files you find in a promoted folder such as R042-2013-08-18 are those enclosed in the gray box. They are:

  • Each .ncs file (“Neuralynx Continuously Sampled”) contains a single channel of continuously sampled voltage data. The sampling rate and filters for these channels can be configured in the Cheetah data acquisition software. Typically, as in this data set, the sampling rate and filters are set so that these files are local field potentials (LFPs) sampled at 2kHz and filtered between 1 and 475 Hz. It is also possible to have wide-band, 32kHz .ncs files suitable for spike extraction, but these are not included in the current dataset.
  • Each .t file contains a set of times – a spike train from a putative neuron. The qualifier “putative” is used because this is extracellular data and spike-sorting is not perfect, so it's likely there will be some spikes missing and some spikes included that are not from this neuron. Keep this in mind, even though “putative” will be omitted from now on for brevity! *.t files are generated by MClust, a spike sorting tool developed by A. David Redish, from the raw *.ntt (“Neuralynx TeTrode”) files saved by Neuralynx. *.ntt files do not contain continuously sampled data; instead, a one-millisecond snapshot across the channels of a tetrode is stored whenever any of the four channels exceeds a threshold set in Cheetah by the experimenter.
  • The *.nvt file (“Neuralynx Video Tracking”) contains the location of the rat as tracked by an overhead camera. For Neuralynx systems, this is typically sampled at 30 Hz. Because the raw files are large, they are usually stored in compressed (zip) format, or loaded into MATLAB and saved as a .mat file. These files are in units of camera pixels (640×480 in this case).
  • The *.Nev file (“Neuralynx EVents”) contains timestamps and labels of events, such as those input by the user during recording, received from experimental components connected to Neuralynx's digital I/O (Input/Output) port, and system messages such as recording start, data loss, et cetera.

A critical part of any data set is the following:

  • The *keys.m file, referred to as “ExpKeys” or “keys”. This file contains experimenter-provided information that describes this data set. This information is stored as a .m file so that it can be edited and read by standard text editors (rather than having to be loaded into MATLAB to view, as would be the case for a .mat file). This file and the correct format for ExpKeys will be explored in detail later. In brief, it defines a MATLAB structure named ExpKeys with fields that describe what was done in the current session.

Finally, there is also:

  • *wv.mat files. There is one file for each *.t file, containing the average waveforms for that cell.
  • *ClusterQual.mat files. Again, one file for each *.t file, containing cluster quality statistics.

Both of these files are generated by a MATLAB script (CreateCQFile.m) or directly from MClust version 4.1 or higher.

☛ Look at the contents of the R042-2013-08-18 folder.

Notice how each file is named: all start with R042-2013-08-18, followed by a suffix indicating the file type and (if necessary) an identifier. Applying this naming scheme consistently is a key part of good data management because it enables provenance tracking – which cells, from which animal, session, and condition, contribute to each plot; which cluster a spike train came from; et cetera. The rename steps in the above schematic are a critical first step.

Introduction to vandermeerlab data types

Before we can proceed to the nice wrapped loading functions, you first need to understand the three main data types used for neural data analysis; this is necessary because the loaders return their data in these formats.

A data type is the computer science term for a standardized format of representing data. Classical data types include things like integers and floating-point numbers, but our data types of interest are essentially all MATLAB structs with particular constraints on field names and formats.

(Note for the connoisseurs: the choice to not implement these data types as MATLAB objects is deliberate.)

The three main data types are (1) timestamped data (TSD), (2) timestamps (TS), and (3) intervals (IV), each discussed in turn below.

Timestamped data (TSD) data-type

Time-varying signals, such as extracellular potentials recorded by an intracranial electrode, or position data recorded by an overhead camera, are very common in (neuro)science. Such signals are acquired through sampling, that is, a data point is acquired at some specific time, and then some time later, another point, and so on (the idea of sampling and some consequences are explored in detail in Module 3). The result of this is that instead of a truly continuous signal, we have instead a set of points, each with a timestamp and a value:

Thus, what we need to fully describe such a signal is two arrays of the same length: one with the timestamps and the other with the values. This is exactly what the timestamped data (TSD) data type is, as illustrated by the LoadCSC() function:

cfg = [];
cfg.fc = {'R042-2013-08-18-CSC03a.ncs'};
csc = LoadCSC(cfg);
 
>> csc
 
csc = 
 
     tvec: [8802816x1 double]
     data: [1x8802816 double]
    label: {'R042-2013-08-18-CSC03a.ncs'}
      cfg: [1x1 struct]

The TSD data type has the following fields:

  • tvec: nSamples x 1 double, timestamps (in seconds)
  • data: nSignals x nSamples double, values (units can be specified in cfg if needed)
  • label: nSignals x 1 cell array, filenames
  • cfg: content depends on specific data, but always has a history field. For CSC data, there is also hdr, ExpKeys, and SessionID.

In the above example, we only loaded one .ncs file, and therefore there is only one label. To plot this data you can simply do plot(csc.tvec,csc.data).
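To make the TSD format concrete, here is a minimal sketch that assembles a TSD-like struct by hand from a synthetic signal (a 2 Hz sine sampled at 100 Hz). This is an illustration only: the cfg contents here are placeholders, not what LoadCSC() actually produces.

```matlab
% sketch: a hand-built struct in TSD format, using a synthetic signal
Fs = 100;                       % sampling rate (Hz)
tvec = (0:1/Fs:2)';             % nSamples x 1 timestamps (s)
data = sin(2*pi*2*tvec)';       % 1 x nSamples values: a 2 Hz sine

sig = struct();
sig.tvec = tvec;
sig.data = data;
sig.label = {'synthetic sine'}; % nSignals x 1 labels
sig.cfg = struct('history', []); % placeholder; the real loaders fill this in

% plot(sig.tvec, sig.data); % same plotting pattern as for real CSC data
```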

Timestamp (TS) data-type

A different data type is needed to describe sets of punctate events (a point process in statistics), such as times of action potentials (spikes) or task events such as reward delivery times. For this we use the TS (timestamp) data type, defined as follows:

  • t: nSignals x 1 cell array, timestamps (in seconds)
  • label: nSignals x 1 cell array, labels
  • cfg: content depends on specific data, but always has a history field.
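For illustration, a TS-like struct could be assembled by hand as follows (toy spike times; the cfg.history contents are left as a placeholder, since the loaders populate that field themselves):

```matlab
% sketch: a hand-built struct in TS format (toy data, not from a loader)
spk = struct();
spk.t = {[0.10 0.52 0.91]'; [0.25 0.47]'}; % spike times (s) for two putative cells
spk.label = {'cell1'; 'cell2'};            % one label per cell
spk.cfg = struct('history', []);           % placeholder; loaders fill this in
```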

An example is provided by the event loader LoadEvents():

%%
cfg = [];
evt = LoadEvents(cfg);

>> evt

evt = 

        t: {1x9 cell}
    label: {1x9 cell}
      cfg: [1x1 struct]

Because we did not specify a specific configuration, LoadEvents just gives us the times of all EventStrings by default; these have become the entries in the label field.

Interval (IV) data-type

Interval data – matched sets of start and end times – is not loaded directly from promoted data files. However, it commonly comes up during analysis, for instance when defining trials, running vs. resting epochs, sharp wave-ripple complexes, et cetera.

Interval data is defined as follows:

  • tstart: nIntervals x 1 double, interval start times (in seconds)
  • tend: nIntervals x 1 double, end times (in seconds)
  • cfg: content depends on specific data, but always has a history field.

Some common ways of creating an iv object from scratch are the following:

>> a = iv([1 2]) % define a single interval from 1 to 2
 
a = 
 
    tstart: 1
      tend: 2
       cfg: [1x1 struct]
 
>> b = iv([1 2],[3 3]) % define two intervals, 1 to 3 and 2 to 3
 
b = 
 
    tstart: [1 2]
      tend: [3 3]
       cfg: [1x1 struct]

Using the wrapped data loaders

You have already seen examples of TSD and TS data types returned by some loading functions. The full set follows below. You will notice that each loading function takes in a cfg (“configuration”) variable, which is used to specify parameters and options such as the filenames to be loaded. This use of cfg variables is shared by many other vandermeerlab data analysis functions, and is highly encouraged when you start writing your own code. For a full description and rationale, see here; in brief, it forces well-organized code and enables provenance tracking, two principles of good programming practice.

☛ The code blocks that follow below are examples to illustrate usage of the loading functions. Feel free to try them out, but you don't need to execute these code blocks in order to proceed.

LoadPos()

This loads raw Neuralynx position data (*.nvt). (In your data folder, this raw file is in a .zip archive by default.) If no filename is specified in the input cfg, LoadPos() checks if a single .Nvt file is found in the current directory and loads that one:

>> cfg = [];
>> posdata = LoadPos(cfg);
>> posdata
 
posdata = 
 
     tvec: [1x131898 double]
     data: [2x131898 double]
    label: {'x'  'y'}
      cfg: [1x1 struct]

Because the .Nvt files are large, it is often convenient to save this posdata variable as a .mat file. This should be named Rxxx-yyyy-mm-dd-vt.mat; from now on in this tutorial, you can simply use the already provided, previously saved file.

Note that the data field now has dimensionality [2 x nSamples]; this is because there is both x and y data as indicated by the label field. So, if you wanted to plot x against y, you could do plot(posdata.data(1,:),posdata.data(2,:),'.');, but a more general approach that doesn't require knowing which variable is which dimension is plot(getd(posdata,'x'),getd(posdata,'y'));.

LoadCSC()

To load .Ncs files with LFPs:

cfg = [];
cfg.fc = {'R042-2013-08-18-CSC03a.ncs'};
csc = LoadCSC(cfg);

This gives the following TSD:

>> csc
 
csc = 
 
     tvec: [8802816x1 double]
     data: [1x8802816 double]
    label: {'R042-2013-08-18-CSC03a.ncs'}
      cfg: [1x1 struct]

Note that the format is the same as for the position data above; this is because both LoadPos() and LoadCSC() return TSDs.

LoadCSC() outputs some information about the files being loaded; in particular, the number of “bad blocks”, which, if they occur, indicate a problem with the recording system that should be fixed!

Finally, the cfg field has the ExpKeys, the SessionID (R042-2013-08-18), the headers (.hdr) for each .Ncs file, and the history.

LoadEvents()

By default, LoadEvents() returns a TS with the labels and timestamps of all unique strings found in the EventStrings of the raw events data:

cfg = [];
evt = LoadEvents(cfg);
 
>> evt
 
evt = 
 
        t: {1x9 cell}
    label: {1x9 cell}
      cfg: [1x1 struct]

evt.label(:) will reveal a list of cryptic event descriptions. However, by using the cfg input, we can get something more sensible:

cfg = [];
cfg.eventList = {'TTL Output on AcqSystem1_0 board 0 port 0 value (0x0004).','TTL Output on AcqSystem1_0 board 0 port 0 value (0x0040).'};
cfg.eventLabel = {'FoodDelivery','WaterDelivery'};
evt = LoadEvents(cfg);
 
>> evt
 
evt = 
 
        t: {[1x9 double]  [1x9 double]}
    label: {'FoodDelivery'  'WaterDelivery'}
      cfg: [1x1 struct]

By specifying which EventString is associated with which human-readable event ('FoodDelivery','WaterDelivery') we now have a more user-friendly events variable.

LoadSpikes()

LoadSpikes() is straightforward:

%%
cfg = [];
S = LoadSpikes(cfg);
 
>> S
 
S = 
 
        t: {1x67 cell}
    label: {1x67 cell}
      cfg: [1x1 struct]

By default, LoadSpikes() attempts to read the .clusters file generated by MClust to obtain cluster ratings. This can be overridden by specifying cfg.useClustersFile = 0;. If you wish to load *._t files, do cfg.load_questionable_cells = 1;.
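Combining both options might look like the following sketch (this requires the vandermeerlab codebase on your path and a promoted data folder as the current directory):

```matlab
cfg = [];
cfg.useClustersFile = 0;         % do not read the MClust .clusters file for ratings
cfg.load_questionable_cells = 1; % also load *._t files
S = LoadSpikes(cfg);
```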

Visualizing the raw data

Let's combine the loaders above with a simple visualization.

☛ From now on, you should execute the code blocks provided. The best way to do this is to create a sandbox.m file, create a new Cell for each block, paste the code into it, and hit Ctrl+Enter.

%% load the data
clear all; pack
cd('C:\data\R042\R042-2013-08-18'); % replace with your data location
 
load(FindFile('*vt.mat')); % position data (previously extracted and filtered from raw data)
load(FindFile('*times.mat')); % trial start and end data (previously generated from raw data)
load(FindFile('*CoordL.mat')); % used for linearizing position data
 
cfg = [];
cfg.load_questionable_cells = 1;
S = LoadSpikes(cfg);
 
cfg = [];
cfg.fc = {'R042-2013-08-18-CSC03a.ncs'};
csc = LoadCSC(cfg);
 
S = restrict(S,S.cfg.ExpKeys.TimeOnTrack,S.cfg.ExpKeys.TimeOffTrack);
csc = restrict(csc,S.cfg.ExpKeys.TimeOnTrack,S.cfg.ExpKeys.TimeOffTrack);

The restrict() function restricts the data to the given start and end times. In this case, we are restricting the data to the times the rat was actually performing the task, specified in the ExpKeys. This excludes the baseline recording blocks taken at the beginning and end of each recording session.
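Conceptually, for a TSD, restrict() does something like the following logical-indexing operation. This is a simplified sketch on toy data, not the actual implementation (which also handles TS and IV inputs):

```matlab
% sketch: restricting a TSD-like signal to [t1, t2] by logical indexing
tvec = (0:0.1:10)';            % toy timestamps (s)
data = randn(1, length(tvec)); % toy signal, 1 x nSamples
t1 = 2; t2 = 5;

keep = tvec >= t1 & tvec <= t2; % samples inside the window
tvec_r = tvec(keep);
data_r = data(:, keep);         % index columns, so multi-signal data also works
```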

If the above executes, you can now plot a raster of the spikes, along with the LFP:

%% rest
t1 = 5030; t2 = 5060;
 
cfg = [];
cfg.lfp = restrict(csc,t1,t2);
PlotSpikeRaster(cfg,restrict(S,t1,t2));

This should give:

The x-axis shows time in seconds, and the y-axis contains the spikes from our neurons, one per row. The local field potential from one of our electrodes is shown underneath.

This particular segment is taken from a time when the rat was resting on a pedestal. You can see some synchronous activation of a large number of cells showing up as a vertical stripe, for instance around time 5045.4. If you zoom in on this, you will see a fast oscillation in the LFP (“ripple”) occurring at the time of this potential replay event.

Now let's look at a raster from when the rat was running on the track:

%% run
close all;
iRun = 7;
t1 = run_start(iRun); t2 = run_end(iRun);
 
cfg = [];
cfg.lfp = restrict(csc,t1,t2);
PlotSpikeRaster(cfg,restrict(S,t1,t2));

You should get:

There are some obvious differences with the previous plot, including the presence of a “theta” (~8Hz) oscillation in the LFP, and the rhythmic firing patterns of some neurons (likely “place cells”).

Next, let's get a better view of what the rat was actually doing by plotting the position data, with the spikes from an example neuron on top:

close all;
plot(getd(pos,'x'),getd(pos,'y'),'.','Color',[0.5 0.5 0.5],'MarkerSize',1);
axis off; hold on;
 
iC = 7;
spk_x = interp1(pos.tvec,getd(pos,'x'),S.t{iC},'linear');
spk_y = interp1(pos.tvec,getd(pos,'y'),S.t{iC},'linear');
 
h = plot(spk_x,spk_y,'.r');

This gives:

The gray trace is the position tracking data, revealing the outline of the T-maze used. The red dots indicate the location of the rat when this particular neuron fired a spike; most spikes are nicely localized to a particular region on the maze (the cell's “place field”).

Note the use of interp1() above: this is a frequent operation when analyzing spike train data. Those statements essentially say: “for a given set of spike times, get me the best guess (linear interpolation) of the time-varying signal defined by pos.tvec and the x- (or y-) position data”.
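As a self-contained illustration of this pattern, consider interpolating a toy position signal at a set of event times (all values here are made up for the example):

```matlab
% sketch: linear interpolation of a time-varying signal at event times
tvec = 0:0.5:10;            % position samples every 0.5 s
x = 2 * tvec;               % toy x-position: moving at 2 units/s
spk_t = [1.25 3.1 7.75];    % toy "spike" times, falling between position samples

spk_x = interp1(tvec, x, spk_t, 'linear'); % best linear guess of x at each spike
% because x is linear in time here, spk_x equals 2*spk_t (up to floating point)
```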

This concludes the data loading and visualization module.

For reference: example usage

This is an example of how the tsd and iv data types are used to detect “sharp wave-ripple complexes” (SWRs) associated with “replay” of place cell sequences:

cfg = [];
cfg.fc = {'R042-2013-08-18-CSC03a.ncs'}; % cell array with filenames to load
csc = LoadCSC(cfg);
 
cfg = []; cfg.decimateFactor = 4; % reduce sampling frequency to speed things up
csc = decimate_tsd(cfg,csc);
 
cfg = []; cfg.f = [140 220];
cscF = FilterLFP(cfg,csc); % filter
cscF = LFPpower([],cscF); % obtain instantaneous signal power (Hilbert transform)
cscF = zscore_tsd(cscF); % normalize
 
cfg = [];
cfg.method = 'raw';
cfg.threshold = 3;
cfg.dcn =  '>'; % return intervals where threshold is exceeded
cfg.merge_thr = 0.05; % merge events closer than this
cfg.minlen = 0.01; % minimum interval length
 
SWR1 = TSDtoIV(cfg,cscF); % detect intervals where z-scored power is > 3 SD above mean
 
cfg = []; cfg.tvec = cscF.tvec; cfg.sigma = 0.025;
MUA = getMUA(cfg,S); % get multi-unit activity (need to load S with LoadSpikes())
MUA = zscore_tsd(MUA);
 
cfg = [];
cfg.method = 'raw';
cfg.threshold = 3;
cfg.dcn =  '>'; % return intervals where threshold is exceeded
cfg.merge_thr = 0.05; % merge events closer than this
cfg.minlen = 0.02; % minimum interval length
 
SWR2 = TSDtoIV(cfg,MUA);
 
SWR = IntersectIV([],SWR2,SWR1); % obtain intersection between intervals (both LFP power and MUA)
analysis/cosmo2014/module2.txt · Last modified: 2018/07/07 10:19 (external edit)