wiki

~~DISCUSSION~~ ===== Anatomy of neural data: sampling, aliasing ===== Goals: * Understand the fundamentals of sampling theory: Nyquist frequency, Nyquist rate, aliasing, anti-aliasing * Gain an awareness of the different ways of representing data (signed, unsigned, integer, floating-point) * Examine the file structure of Neuralynx continuously sampled data, in all its gory detail * Get familiar with object-oriented programming in MATLAB Deliverables: * An improved version of ''LoadCSC()'' Resources: * (required) read Chapter 3 of the Leis book. Skip sections 3.4.3, 3.4.4, 3.4.5, 3.4.6, 3.4.7, 3.7. Skim section 3.6. * (likely helpful) [[http://www.mathworks.com/discovery/object-oriented-programming.html|Object-oriented programming (OOP) in MATLAB]] official documentation, and a [[http://www.cs.ubc.ca/~murphyk/Software/matlabTutorial/html/objectOriented.html|nice tutorial]] that clearly explains the differences with the [[http://www.cs.ubc.ca/~murphyk/Software/matlabTutorial/html/objectOrientedOldStyle.html|old implementation]] of OOP (which is still used for many pieces of the vandermeerlab codebase) ===== Introductory remarks ===== Systems for the acquisition of neural data give us a necessarily imperfect view of the brain. Some of these limitations are obvious, such as a finite number of recording sites and contamination of the data with movement artifacts and system noise. Other limitations are perhaps more subtle, and include the pitfalls associated with limited (temporal) sampling of the underlying process and the properties of specific recording systems. An important phrase to remember in any data analysis workflow is "garbage in, garbage out." Making sure that your input is not garbage starts with a thorough understanding of the recorded signal and how it is stored. (Of course, if you abuse it enough, even the best data can rapidly turn to garbage. But that is beyond the scope of this module, which deals with the raw data.) ===== Step-by-step ===== ==== Experiments in aliasing ==== Subsampling is common in neural data analysis; for instance, the lab's Amplipex system records data sampled at 20kHz, but for some analyses we are only interested in frequencies in the 1-200Hz range. Thus it would be unnecessarily slow to keep 20kHz data around: at this rate, even a basic 64-channel dataset easily exceeds 6GB in size! So, how do we go about subsampling correctly? This section explores the key issue of **aliasing**: the appearance of "phantom" low-frequency signals through the subsampling of frequencies above the **Nyquist frequency**. See the Leis chapter for a thorough discussion of this. We begin by creating a simple 100Hz signal, sampled at 2kHz: <code matlab> fs1 = 2000; tvec = 0:1/fs1:4; % construct time axis, sampled at 2kHz freq1 = 100; y = sin(2*pi*freq1*tvec); % 100Hz signal ax1 = subplot(211); stem(tvec,y); title('original'); </code> Say we wish to subsample down to 500Hz to save space. Naively, we might simply take every 4th sample (in fact, this is what the MATLAB function ''downsample()'' does): <code matlab> subsample_factor = 4; tvec2 = tvec(1:subsample_factor:end); % take every 4th sample y2 = y(1:subsample_factor:end); ax2 = subplot(212); stem(tvec2,y2,'r'); title('subsampled'); xlabel('time (s)'); </code> It's hard to see what is going on when we are so zoomed out. Let's use the //axes handles// introduced last week to rectify this: <code matlab> xl = [1 1.04]; set(ax1,'XLim',xl); set(ax2,'XLim',xl); </code> That's better. Notice that in the subsampled plot, you can detect the presence of the 100Hz signal by noticing that each sample repeats (has the same value) every 10ms. However, it doesn't look much like the original -- as discussed in the Leis book, we can attempt to reconstruct it: <code matlab> hold on; y_interp = interp1(tvec2,y2,tvec,'linear'); p1 = plot(tvec,y_interp,'b'); y_interp2 = interp1(tvec2,y2,tvec,'spline'); p2 = plot(tvec,y_interp2,'g'); legend([p1 p2],{'linear','spline'},'Location','Northeast'); legend boxoff </code> Note carefully the operation of **interp1()**: this is one of the most frequently used MATLAB functions in signal processing. Calling ''interp1()'' here is like saying: I know the values of ''y2'' at times ''tvec2'', please estimate the values of ''y2'' at times ''tvec''. You should now see this: {{ :analysis:course:week3_fig1.png?600 |}} As you can see, the "spline" interpolation method recovers the original signal very well. (Apart from "linear" interpolation, there is also "nearest" which often comes in handy.) Note also the use of the plot object handles ''p1'' and ''p2'' (which you already know you can use to change line properties; see last week) as inputs to ''legend()'' ([[http://www.mathworks.com/help/matlab/ref/legend.html|MATLAB doc page]]). Now, suppose we were oblivious to the idea of the **Nyquist rate** and we attempted to subsample to 200Hz, expecting to see our original 100Hz signal. ☛ Modify the subsampling code above to use a ''subsample_factor'' of 10 and plot the result. You should see the following: {{ :analysis:course:week3_fig2.png?600 |}} Oops! What is going on here? First, it is obvious we are //not// seeing a 100Hz signal. Second, what are these weird small numbers (note the scale)? This is actually a consequence of the finite precision MATLAB uses in representing numbers such as ''pi''. It is important to be aware of this, otherwise you may be extremely puzzled: <code matlab> sin(2*pi) == 0 % ...right? The answer might surprise you. fprintf('Welcome to numerical computing!\n'); </code> [[http://blogs.mathworks.com/loren/2006/08/23/a-glimpse-into-floating-point-accuracy/|This page]] (from Loren on the Art of MATLAB, an excellent blog for MATLAB tips) and the [[http://www.mathworks.com/help/matlab/matlab_prog/floating-point-numbers.html|MATLAB docs]] give the gory details, also discussed in the Leis chapter. For the moment we will not pursue this further and get back to aliasing. But wait, wasn't the idea that sampling at the Nyquist frequency should be sufficient? Let's try again: <code matlab> figure(2) subsample_factor = 10; tvec2 = tvec(2:subsample_factor:end); % best case scenario -- can detect 100Hz signal y2 = y(2:subsample_factor:end); subplot(212) stem(tvec2,y2,'r'); set(gca,'XLim',xl); </code> OK, so here is our 100Hz signal, but as illustrated above, we might miss it if we happen to sample at an unlucky phase. Best to not cut it this fine, also for other reasons that will become clear further on in this module. We first wonder, is this even an acceptable way of subsampling, anyway? Let's do another test: <code matlab> %% 2kHz Fs, 100Hz signal with 450Hz signal superimposed figure(3) fs1 = 2000; tvec = 0:1/fs1:4; freq1 = 100; freq2 = 450; % note, above Nyquist frequency for our target subsampled Fs y = sin(2*pi*freq1*tvec) + 0.5.*sin(2*pi*freq2*tvec); subplot(211) stem(tvec,y) set(gca,'XLim',xl); </code> Suppose we are aware of Nyquist, but decide we don't care about this 450Hz component in the signal, so we go ahead and subsample down to 500Hz again: <code matlab> %% ss -- we don't care about the 450Hz signal, but... subsample_factor = 4; tvec2 = tvec(1:subsample_factor:end); y2 = y(1:subsample_factor:end); subplot(212) stem(tvec2,y2,'r'); set(gca,'XLim',xl); </code> The result: {{ :analysis:course:week3_fig3.png?600 |}} ☛ Inspect the subsampled signal. How does it compare to the previous result where we subsampled the pure 100Hz signal without the 450Hz component? You should notice some points don't have a 100Hz repeat, so some other components crept in. This is **aliasing**, and is very dangerous: you might excitedly conclude you discovered a novel slower frequency in your favorite brain region and try to publish a paper about it. Don't do it! So, what is the correct way? Given that subsampling is such an ubiquitous operation, MATLAB does the hard work for us. ☛ Use the ''decimate()'' function to subsample down to 500Hz. Inspect the difference with the incorrect method (in red). What do you notice? In fact, ''decimate()'' first filters out frequency components that may cause aliasing. This is known as an **anti-aliasing filter**. Always remember to use this, and not ''downsample()'' or a variant thereof, when you want to downsample your data. ==== Neuralynx data in detail ==== An idealized (neural) signal consists of samples acquired at exactly the specified sampling frequency, with no gaps or missing data. Real data, unfortunately, does not always meet this specification. This section illustrates some flaws typical for the Neuralynx system that may affect subsequent analysis. === Using the low-level loader to get the raw data === To get into the guts of actual Neuralynx data, we will not use the sanitized wrapper provided by ''LoadCSC()'' but instead use the loading function provided by Neuralynx. Using cell mode in a sandbox file as usual, ''cd'' into the ''R016-2012-10-08'' data folder you downloaded previously in Week 1. Then deploy the Neuralynx loader: <code matlab> %% % cd to your location here fname = 'R016-2012-10-08-CSC03b.ncs'; [Timestamps, ~, SampleFrequencies, NumberOfValidSamples, Samples, Header] = Nlx2MatCSC(fname, [1 1 1 1 1], 1, 1, []); </code> Note the use of the ''~'' output argument, which indicates that we do not wish to assign the second output of the ''Nlx2MatCSC'' function to a variable. Most data files come with a "header" that describes some properties of the data. Let's look at the header of our file, which is a LFP recorded from the ventral striatum (nucleus accumbens core) of a rat running back and forth along a linear track: <code matlab> >> Header Header = '######## Neuralynx Data File Header' '## File Name C:\CheetahData\2012-10-08_11-24-30\CSC10.ncs' '## Time Opened (m/d/y): 10/8/2012 (h:m:s.ms) 11:25:2.79' '## Time Closed (m/d/y): 10/8/2012 (h:m:s.ms) 12:32:9.61' '-CheetahRev 5.5.1 ' '' '-AcqEntName CSC10' '-FileType CSC' '-RecordSize 1044' '' '-HardwareSubSystemName AcqSystem1' '-HardwareSubSystemType DigitalLynxSX' '-SamplingFrequency 2000' '-ADMaxValue 32767' '-ADBitVolts 4.57778e-008 ' '' '-NumADChannels 1' '-ADChannel 4 ' '-InputRange 1500 ' '-InputInverted True' '-DSPLowCutFilterEnabled True' '-DspLowCutFrequency 1' '-DspLowCutNumTaps 0' '-DspLowCutFilterType DCO' '-DSPHighCutFilterEnabled True' '-DspHighCutFrequency 425' '-DspHighCutNumTaps 128' '-DspHighCutFilterType FIR' '-DspDelayCompensation Disabled' '-DspFilterDelay_µs 1984' >> </code> You should recognize at least a few of the phrases used: for instance, we see that the sampling frequency (often abbreviated as Fs) is given as 2000 (samples per second, or Hz). Important also is the ''InputRange'', given as 1500 (microvolts, uV), to indicate that the maximum value (''ADMaxValue'', or 32767) in the raw data corresponds to an actual signal magnitude of 1500uV. Ignore the other fields for now. ☛ The waveforms of extracellularly recorded action potentials ("spikes") typically contain frequency components in the 1000-5000Hz range. Do you expect to find spikes in this data file? Why not? The ''ADMaxValue'' of 32767 reflects the fact that Neuralynx stores data as 16-bit integers (see the discussion in the Leis chapter if this doesn't mean anything to you). Thus, the smallest value it can store is -32768, for a total range of 65536 (16 bits). This means that what is actually a continuous variable -- voltage -- is stored with finite precision. ☛ Given Neuralynx's 16-bit data format and the specified range of +/- 1500uV for this data session, compute the smallest voltage change (in uV) that can be resolved. ☛ Using this knowledge, convert the ''Samples'' variable, which is specified in AD values, to millivolts. (Note that the MATLAB data type is given as ''double'' but the values are actually ''int16''.) Notice that the ''ADBitVolts'' field in the header can be used for this unit conversion. === Temporal structure of Neuralynx data === Inspect the ''Timestamps'' variable. As with the position data in the previous module, this is given in microseconds (us). ☛ Convert ''Timestamps'' to seconds (s). You may have noticed that the ''Samples'' variable is size [512 x 10761] and ''Timestamps'' is [1 x 10761]. As it turns out, Neuralynx data (of the "continuously sampled" type) is stored in blocks of 512 samples. Only the first sample of each block is timestamped. ☛ Compute the total number of samples, and from this number and the sampling frequency, the total time that would be sampled continuously if all samples were acquired without any gaps. Compare this number with the actual time elapsed between the first and last ''Timestamps''. What do you conclude? In fact, there are several gaps in the data. Standard recording protocol requires a "pre-recording" session, followed by a pause, then the actual recording session, another pause, and a "post-recording" session. This can be seen easily by plotting the difference between each sample and its predecessor (''plot(diff(Timestamps))''). We wish to ''Restrict'' the data to only those samples taken when the rat was running on the track. Promoted (i.e. preprocessed and annotated) data folders always have an "ExpKeys" file with some useful metadata, including ''TimeOnTrack'' and ''TimeOffTrack'' values: <code matlab> >> run(FindFile('*keys.m')) >> ExpKeys ExpKeys = BehaviorOrder: {'Value' 'Risk'} Protocol: 'Hyperdrive' Target: {'Striatum' 'Hippocampus'} Target2: {'Ventral' 'CA1'} TetrodeTargets: [2 2 1 1 1 1] TetrodeDepths: [2000 1960 6300 6380 6460 1920] TimeOnTrack: [1030 2285] TimeOffTrack: [2268 3194] Delay: [0.5000 0.5000] goodGamma: {'R016-2012-10-03-CSC04d.ncs' [1x26 char]} goodSWR: {'R016-2012-10-03-CSC02b.ncs'} goodTheta: {'R016-2012-10-03-CSC02b.ncs'} CueToneMap: {'S3' 'S2' 'S4' 'S1' 'S2' 'S5'} </code> In fact this data contains two recording sessions, called 'Value' and 'Risk' respectively (this refers to the distributions of food outcomes predicted by audio cues presented as the rat crossed the center of the track; we will not use this information for now). These sessions map onto the first and second elements of ''TimeOnTrack'' and ''TimeOffTrack'', which give the times (in seconds) of when the Value and Risk sessions started and ended, respectively. ☛ Use the first element of ''ExpKeys.TimeOnTrack'' and ''ExpKeys.TimeOffTrack'' to find the indices of ''Timestamps'' corresponding to the Value session. Then, use these to create a new set of variables ''TimestampsValue'', ''SamplesValue'' et cetera. (Note that this is essentially what ''Restrict()'' does; If you are confused by this, review the documentation on [[http://www.mathworks.com/help/matlab/math/matrix-indexing.html|Matrix Indexing]].) ☛ Plot the differences between the resulting timestamps. You should see: {{ :analysis:course:week3_fig4.png?600 |}} The most common value between timestamps seems to be about 0.26 s. Recall that these timestamps are for the start of a block of 512 values. ☛ What is the expected difference between 512-sample timestamps if ''Fs'' is 2kHz? Let's test if this indeed the most common value in these data: <code matlab> >> 512.*(1/2000) == mode(diff(TimestampsValue)) ans = 0 </code> Hmm. (You can ask MATLAB ''why'', but don't expect an informative answer!) ☛ Use ''format long'' to change MATLAB's default display, and inspect the above values to determine the source of the difference. Apparently the typical elapsed time between two 512-sample blocks does not correspond //exactly// to what would be expected if Fs equals 2kHz. ☛ Compute Neuralynx's true sampling rate from the observed mode of the timestamp diffs. Close enough for practical purposes. However, what is up with these clearly smaller values in the diff plot? Let's investigate: <code matlab> plot(diff(TimestampsValue)) hold on; plot(NumberOfValidSamplesValue == 512,'r') </code> If you zoom in, you should find that the odd timestamp diffs occur for those sample blocks that have a number of valid samples that is not 512. ☛ Find the indices of these crippled blocks. Look at the ''Samples'' inside a few of these to see what Neuralynx does with invalid samples. Tricky! How would we know to exclude these invalid samples? Fortunately, we can do so using the ''NumberOfValidSamples'' variable, which tells us how many are good and can be included. The current version of ''LoadCSC()'' that you have been using does not deal with invalid samples correctly. ☛ To see this, load the same CSC using ''LoadCSC()''. Plot the diff of the time axis of the resulting ''tsd''; if you zoom in, it will be possible to see there are negative diffs (you can also verify this on the command line with something like ''any(diff(Range(csc)) < 0)''). You may also be able to find some segments of the data that are just filled with zeros as you have seen above. Neither of these behaviors (negative time differences between subsequent samples, and the inclusion of zeros) is desirable. We will fix this in a later section of this module. ==== Interlude: object-oriented programming in MATLAB ==== MATLAB currently supports two different ways to support OOP, the "old" way and the "new" way. Details are discussed in the page linked at the top; for now we will stick with the old way, because that's how the codebase is built until we convert it all over some day. In this scheme, each MATLAB object definition or "class" has a corresponding folder starting with @. This folder contains the //constructor//, i.e. the function called to instantiate a new object, along with whatever other object-specific functions (//methods//) this object supports. ☛ Look at the code for the ''tsd'' constructor. (You can find its location using ''which tsd''; the constructor has to have the same name as the class itself.) Focus on the case where there are 2 input arguments and ignore the rest. Notice that all the constructor seems to be doing is putting the input arguments into corresponding fields. (For a brief reminder about fields and structs in MATLAB, if you don't recognize this notation, see [[http://en.wikibooks.org/wiki/MATLAB_Programming/Struct_Arrays|here]].) There is however a key statement at the end that formally defines the output of the constructor as a ''tsd'' object, ''tsa = class(tsa, 'tsd');''. Without this statement, the constructor would return a ''struct'' rather than a ''tsd'' instance. Recall that when you call the constructor (for instance, by doing ''Xtsd = tsd(Timestamps,X')'' as in the previous module), you **instantiate** a ''tsd'' object. There can be many instances of ''tsd'' objects (''Ytsd = tsd(Timestamps,Y')'' would create another, different one) but only one object **definition** or class (implemented in the constructor). ☛ Now look at the code for ''Range()''. Again, this is pretty basic -- all this does is return whatever is in the ''t'' field of the tsd object. === Overloading === Let's use our knowledge about MATLAB objects to save some time. Currently if we want to plot the data in a ''tsd'' object, we have to do ''plot(Range(csc),Data(csc))''. This is cumbersome; we would like to simply do ''plot(csc)'' ☛ Try ''plot(csc)'', noting that MATLAB returns an error. It says essentially, "Hi, I am ''plot()''. I can only plot doubles, and I don't know how to make those out of this weird tsd thing you gave me." We can define a ''plot()'' method for ''tsd'' objects, which MATLAB will call when we try to ''plot(csc)''. For this to work, recall that we would need to place this function inside the ''@tsd'' folder. This is how MATLAB will know that we don't want to use this function when we do something like ''plot(rand(100,1))''; it will only use our new ''plot()'' when we give it a ''tsd'' object as an argument. This is the idea of **overloading**: there are multiple versions of the ''plot()'' function -- it is overloaded -- with the appropriate one called depending on the class of the input. ☛ Create a plot() function inside the ''@tsd'' folder as follows: <code matlab> function plot(tsd_in) % function plot(tsd_in) % % plots the data in a tsd object plot(tsd_in.t,tsd_in.data); </code> ☛ Test if it works by creating a tsd object and plotting it. The above is a pretty basic function definition. To mention a few shortcomings, it doesn't give back a handle, so that we can easily access the properties of the resulting graphics object. We also can't use any of the optional arguments that work with the standard ''plot()'': something like ''plot(mytsd,'r')'' will give an error (why?). ☛ Modify your tsd ''plot()'' function to accept any arguments also accepted by the standard plot() function, and to return the handle from plot() itself. (Hint: look at the example given in ''help varargin''). If you can do this you are ready for the final section. ==== Assignment ==== Implement an improved version of ''LoadCSC()'' that handles missing data (as flagged by ''NumberOfValidSamples'') correctly, as follows: <code matlab> function csc = LoadCSC(fname) % function csc = LoadCSC(fname) % % load a Neuralynx .ncs file correctly % % INPUTS: % fname: [1 x n] char, i.e. a filename such as 'R042-2013-08-18-CSC01a.ncs' % % OUTPUTS: % % csc: [1 x 1] mytsd </code> Thus, your function should return a ''mytsd'' object; define this to contain the timestamps (in seconds), the values of the corresponding samples (in millivolts), and the file header (called ''csc_info'' in the original ''LoadCSC()''). Unlike the original ''LoadCSC()'', your new function should not return timestamps with negative diffs, and it should not return invalid samples. Recall that this means you will need to create a @mytsd folder with a constructor. Along with ''mytsd()'', also implement ''Range()'', ''Data()'', ''getHeader()'', and ''plot()''. Compare your results with the output of the old ''LoadCSC'' and identify where they diverge. Push your work (the ''@mytsd'' folder, your updated ''LoadCSC()'', and a script that demonstrates functionality. For extra brownie points, consider what additional features may be desirable for a loading function such as ''LoadCSC()''. You can add your thoughts as comments or as actual implementation in the code.

wiki

User Tools

Site Tools

Sidebar

Page Tools