
~~DISCUSSION~~


===== Anatomy of time series data, sampling theory =====


☛ To see this, change ''Fs2'' in the code above to 20, and plot the result. What do you see?

Thus, a (very safe) rule of thumb is to acquire data at a sampling frequency four times the highest signal frequency you are interested in detecting.
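To see this rule in action on a standalone example (the variable names here are my own, not taken from the code above), consider a 10 Hz signal sampled at four times its frequency:

<code matlab>
F = 10; twin = [0 1]; % a 10 Hz test signal, 1 s long

Fs_hi = 1000; % densely sampled version, standing in for the "true" continuous signal
t_hi = twin(1):1/Fs_hi:twin(2);
sig_hi = sin(2*pi*F*t_hi);

Fs_lo = 4*F; % the rule of thumb: sample at 40 Hz, double the 20 Hz Nyquist minimum
t_lo = twin(1):1/Fs_lo:twin(2);
sig_lo = sin(2*pi*F*t_lo);

plot(t_hi,sig_hi); hold on;
stem(t_lo,sig_lo,'.r','MarkerSize',20); % the samples clearly trace the underlying oscillation
</code>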

==== Subsampling (decimating) time series data ====

In the real world, the frequency at which you can acquire data is limited by the properties of your experimental equipment. For instance, the maximum sampling rate on a typical Neuralynx system is 32 kHz, so the highest-frequency signal we can detect is 16 kHz (the Nyquist frequency). Crucially, however, we cannot rule out that frequencies above 16 kHz are present in the signal we are sampling from! We therefore risk **aliasing**: generating "phantom" frequencies in our sampled data that don't exist in the true signal. What to do?

The general solution is to apply an //anti-aliasing filter// to the data before sampling. To illustrate this, let's generate a signal consisting of two frequencies:

<code matlab>
Fs1 = 1200;
F1 = 3; F2 = 10;
twin = [0 1];

tvec1 = twin(1):1/Fs1:twin(2);
signal1a = sin(2*pi*F1*tvec1); signal1b = 0.5*sin(2*pi*F2*tvec1);
signal1 = signal1a + signal1b;
</code>

Now, if we sample this signal at 12 Hz by taking every 100th sample (note ''Fs1'' is set to 1200), we will get aliasing, because our sampling frequency is less than twice the highest frequency in the signal (10 Hz):

<code matlab>
dt = 100;
tvec2 = tvec1(1:dt:end);
signal2 = signal1(1:dt:end); % sample at 12 Hz - every 100th sample

subplot(131)
plot(tvec1,signal1);
hold on;
stem(tvec2,signal2,'.g','MarkerSize',20);
title('without anti-aliasing filter');
</code>

As you can see in the resulting plot, the sampled data (green dots) isn't the clean 3 Hz sine wave we would hope to get: it is contaminated by an alias of the 10 Hz component in the underlying signal. Thus, our naive sampling method of taking every 100th sample is dangerous! Here is how to do it properly:

<code matlab>
% sample at 12 Hz with a different method
tvec1d = decimate(tvec1,dt);
signal2d = decimate(signal1,dt);

subplot(132)
plot(tvec1,signal1a,'b--');
hold on;
stem(tvec1d,signal2d,'.g','MarkerSize',20);
xlabel('time (s)'); ylabel('y');
title('with anti-aliasing filter');

subplot(133)
plot(tvec2,signal2-signal2d,'r','LineWidth',2);
title('difference');
</code>

You should get:

{{ :analysis:course-w16:alias3.png?nolink&900 |}}

Note that after using ''decimate()'' instead of directly taking every 100th sample, we recover a relatively clean 3 Hz sine (except for some edge effects; more on those in later modules). As you can see, the differences between the two sampling methods are substantial!

The way ''decimate()'' works is that it first applies a filter to the data, removing any frequencies that could cause aliases (i.e. anything at or above half the new sampling frequency), and only then subsamples. We will explore filters in more detail in Module 6.
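Conceptually, then, ''decimate()'' is low-pass filtering followed by direct subsampling. Here is a sketch of that two-step process using the signals defined above; the specific filter (a 4th-order Butterworth) is an illustrative choice of mine, whereas ''decimate()'' by default uses an 8th-order Chebyshev Type I filter:

<code matlab>
% manual version of decimate(): low-pass filter first, then subsample
dt = 100; % subsampling factor, as before
Fs_new = 1200/dt; % new sampling frequency: 12 Hz
Wn = (Fs_new/2)/(1200/2); % cutoff at the new Nyquist frequency, normalized to the old one

[b,a] = butter(4,Wn); % 4th-order Butterworth low-pass (illustrative choice)
signal1f = filtfilt(b,a,signal1); % zero-phase filtering, so no time shift is introduced

signal2m = signal1f(1:dt:end); % subsampling is now safe: content above 6 Hz is strongly attenuated
</code>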

Data acquisition systems use the same principle. If you look in the settings of your system, you should be able to find the details of the filter applied to the data by default. It should have an upper cutoff frequency set to prevent aliasing. For instance, on some Neuralynx systems, all data above 9000 Hz is filtered out.

☛ Suppose you acquired some nice neural data at 2 kHz, and now you decide to analyze frequencies in the 50-100 Hz range. To speed up your analysis and save space, you want to downsample your data to 500 Hz: a good choice, since the new Nyquist frequency (250 Hz) is comfortably above the frequencies of interest. What MATLAB command would you use to downsample this data, and why? (Hint: what is wrong with the ''downsample()'' function you may be tempted to use?)

==== Reconstructing a signal from sampled data (optional) ====

You may have noticed that the sampled signals above often look a bit choppy, losing the smoothness of the originals. As discussed in the Leis book, we can try to recover the original using interpolation. Let's illustrate this by first constructing a 100 Hz signal, sampled at 2 kHz:

<code matlab>
fs1 = 2000;
tvec = 0:1/fs1:4; % construct time axis, sampled at 2 kHz

freq1 = 100;
y = sin(2*pi*freq1*tvec); % 100 Hz signal

ax1 = subplot(211);
stem(tvec,y); title('original');
</code>

Say we wish to subsample down to 500 Hz to save space. Naively, we might simply take every 4th sample (in fact, this is what the MATLAB function ''downsample()'' does):

<code matlab>
subsample_factor = 4;

tvec2 = tvec(1:subsample_factor:end); % take every 4th sample
y2 = y(1:subsample_factor:end);

ax2 = subplot(212);
stem(tvec2,y2,'r'); title('subsampled');
xlabel('time (s)');
</code>

☛ Why is it okay to subsample this way in this case?

It's hard to see what is going on when we are this zoomed out. Let's use the axes handles introduced last week to rectify this:

<code matlab>
xl = [1 1.04];
linkaxes([ax1, ax2], 'x');
set(ax1,'XLim',xl); % see what I did there?
</code>

That's better. Notice that in the subsampled plot, you can still detect the presence of the 100 Hz signal: each sample repeats (has the same value) every 10 ms. However, the result doesn't look much like the original. As discussed in the Leis book, we can attempt to reconstruct the original signal:

<code matlab>
hold on;

y_interp = interp1(tvec2,y2,tvec,'linear');
p1 = plot(tvec,y_interp,'b');

y_interp2 = interp1(tvec2,y2,tvec,'spline');
p2 = plot(tvec,y_interp2,'g');

legend([p1 p2],{'linear','spline'},'Location','Northeast'); legend boxoff
</code>

You should obtain something like:

{{ :analysis:course-w16:spline_recover.png?nolink&600 |}}

Notice how the spline-interpolated sampled signal is a pretty good approximation to the original. In cases where you care about detecting the values and/or locations of signal peaks, such as during spike sorting, performing spline interpolation can often improve accuracy substantially!

==== Detailed examination of Neuralynx time series data ====

This section will look in some detail at how raw time series data is stored by the Neuralynx system. Even if you do not use this system in your own work, the lessons to be learned from looking at what can go wrong already at the raw data level are universal!

To get into the guts of actual Neuralynx data, we will not use the sanitized wrapper provided by ''LoadCSC()'' but instead use the loading function provided by Neuralynx. Using cell mode in a sandbox file as usual, ''cd'' into the ''R016-2012-10-08'' data folder you downloaded previously in Week 1. Then deploy the Neuralynx loader:

<code matlab>
%%
% cd to your location here
fname = 'R016-2012-10-08-CSC03b.ncs';
[Timestamps, ~, SampleFrequencies, NumberOfValidSamples, Samples, Header] = Nlx2MatCSC(fname, [1 1 1 1 1], 1, 1, []);
</code>

Note the use of the ''~'' output argument, which indicates that we do not wish to assign the second output of the ''Nlx2MatCSC'' function to a variable.

Most data files come with a "header" that describes some properties of the data. Let's look at the header of our file, which is an LFP recorded from the ventral striatum (nucleus accumbens core) of a rat running back and forth along a linear track:

<code matlab>
>> Header

Header =

    '######## Neuralynx Data File Header'
    '## File Name C:\CheetahData\2012-10-08_11-24-30\CSC10.ncs'
    '## Time Opened (m/d/y): 10/8/2012 (h:m:s.ms) 11:25:2.79'
    '## Time Closed (m/d/y): 10/8/2012 (h:m:s.ms) 12:32:9.61'
    '-CheetahRev 5.5.1 '
    ''
    '-AcqEntName CSC10'
    '-FileType CSC'
    '-RecordSize 1044'
    ''
    '-HardwareSubSystemName AcqSystem1'
    '-HardwareSubSystemType DigitalLynxSX'
    '-SamplingFrequency 2000'
    '-ADMaxValue 32767'
    '-ADBitVolts 4.57778e-008 '
    ''
    '-NumADChannels 1'
    '-ADChannel 4 '
    '-InputRange 1500 '
    '-InputInverted True'
    '-DSPLowCutFilterEnabled True'
    '-DspLowCutFrequency 1'
    '-DspLowCutNumTaps 0'
    '-DspLowCutFilterType DCO'
    '-DSPHighCutFilterEnabled True'
    '-DspHighCutFrequency 425'
    '-DspHighCutNumTaps 128'
    '-DspHighCutFilterType FIR'
    '-DspDelayCompensation Disabled'
    '-DspFilterDelay_µs 1984'

>>
</code>

You should recognize at least a few of the phrases used: for instance, we see that the sampling frequency (often abbreviated as Fs) is given as 2000 (samples per second, or Hz). Important also is the ''InputRange'', given as 1500 (microvolts, uV), indicating that the maximum value in the raw data (''ADMaxValue'', or 32767) corresponds to an actual signal magnitude of 1500 uV. Ignore the other fields for now.

☛ The waveforms of extracellularly recorded action potentials ("spikes") typically contain frequency components in the 1000-5000 Hz range. Do you expect to find spikes in this data file? Why not?

The ''ADMaxValue'' of 32767 reflects the fact that Neuralynx stores data as 16-bit integers (see the discussion in the Leis chapter if this doesn't mean anything to you). Thus, the smallest value it can store is -32768, for a total range of 65536 values (16 bits). This means that what is actually a continuous variable (voltage) is stored with finite precision.

☛ Given Neuralynx's 16-bit data format and the specified range of +/- 1500 uV for this data session, compute the smallest voltage change (in uV) that can be resolved.

☛ Using this knowledge, convert the ''Samples'' variable, which is specified in AD values, to millivolts. (Note that the MATLAB data type is given as ''double'' but the values are actually ''int16''.) The ''ADBitVolts'' field in the header can be used for this unit conversion.

Inspect the ''Timestamps'' variable, which is given in microseconds (us).

☛ Convert ''Timestamps'' to seconds (s).

You may have noticed that the ''Samples'' variable is size [512 x 10761] and ''Timestamps'' is [1 x 10761]. As it turns out, Neuralynx data (of the "continuously sampled" type) is stored in blocks of 512 samples, and only the first sample of each block is timestamped.
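Because only the first sample of each block is timestamped, obtaining a time for //every// sample requires expanding each block timestamp by multiples of the sampling interval. A sketch of how this could be done (variable names other than ''Timestamps'' are my own; this assumes the header value of Fs = 2000 and no missing samples within a block):

<code matlab>
Fs = 2000; % sampling frequency from the header
nPerBlock = 512; % samples per block

offsets = (0:nPerBlock-1)'*(1e6/Fs); % within-block offsets, in microseconds
tsAll = repmat(Timestamps,nPerBlock,1) + repmat(offsets,1,length(Timestamps)); % [512 x nBlocks], aligned with Samples
tsAll = tsAll(:)*1e-6; % unwrap column by column to one timestamp per sample, in seconds
</code>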

☛ Compute the total number of samples, and from this number and the sampling frequency, the total time that would be covered if all samples were acquired continuously, without any gaps. Compare this number with the actual time elapsed between the first and last ''Timestamps''. What do you conclude?

In fact, there are several gaps in the data. Our standard recording protocol requires a "pre-recording" session, followed by a pause, then the actual recording session, another pause, and a "post-recording" session. This can be seen easily by plotting the difference between each timestamp and its predecessor (''plot(diff(Timestamps))'').
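A sketch of how such gaps could be located programmatically (the threshold of twice the expected block spacing is my own arbitrary choice):

<code matlab>
Fs = 2000; % sampling frequency from the header
expected_diff = 512/Fs; % expected time between block timestamps: 0.256 s
ts_sec = Timestamps*1e-6; % convert timestamps from microseconds to seconds

gap_idx = find(diff(ts_sec) > 2*expected_diff); % indices of blocks followed by a gap
fprintf('%d gaps found\n',length(gap_idx));
</code>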

We wish to restrict the data to only those samples taken when the rat was running on the track. Promoted (i.e. preprocessed and annotated) data folders always have an "ExpKeys" file with some useful metadata, including ''TimeOnTrack'' and ''TimeOffTrack'' values:

<code matlab>
>> LoadExpKeys
>> ExpKeys

ExpKeys =

     BehaviorOrder: {'Value' 'Risk'}
          Protocol: 'Hyperdrive'
            Target: {'Striatum' 'Hippocampus'}
           Target2: {'Ventral' 'CA1'}
    TetrodeTargets: [2 2 1 1 1 1]
     TetrodeDepths: [2000 1960 6300 6380 6460 1920]
       TimeOnTrack: [1030 2285]
      TimeOffTrack: [2268 3194]
             Delay: [0.5000 0.5000]
         goodGamma: {'R016-2012-10-03-CSC04d.ncs' [1x26 char]}
           goodSWR: {'R016-2012-10-03-CSC02b.ncs'}
         goodTheta: {'R016-2012-10-03-CSC02b.ncs'}
        CueToneMap: {'S3' 'S2' 'S4' 'S1' 'S2' 'S5'}
</code>

In fact this data contains two recording sessions, called 'Value' and 'Risk' respectively (this refers to the distributions of food outcomes predicted by audio cues presented as the rat crossed the center of the track; we will not use this information for now, but the full task is described in the [[http://onlinelibrary.wiley.com/doi/10.1111/ejn.13069/full | paper]]). These sessions map onto the first and second elements of ''TimeOnTrack'' and ''TimeOffTrack'', which give the times (in seconds) at which the Value and Risk sessions started and ended, respectively.

☛ Use the first element of ''ExpKeys.TimeOnTrack'' and ''ExpKeys.TimeOffTrack'' to find the indices of ''Timestamps'' corresponding to the Value session. Then, use these to create a new set of variables ''TimestampsValue'', ''SamplesValue'', et cetera. (Note that this is essentially what ''restrict()'' does; if you are confused by this, review the documentation on [[http://www.mathworks.com/help/matlab/math/matrix-indexing.html|Matrix Indexing]].)

☛ Plot the differences between the resulting timestamps (hint: MATLAB's ''diff()'' function is useful here!). You should see:

{{ :analysis:course:week3_fig4.png?600 |}}

The most common value between timestamps seems to be about 0.26 s. Recall that these timestamps are for the start of a block of 512 samples.

☛ What is the expected difference between 512-sample timestamps if ''Fs'' is 2 kHz?

Let's test if this is indeed the most common value in these data:

<code matlab>
>> 512.*(1/2000) == mode(diff(TimestampsValue))

ans =

     0
</code>

Hmm. (You can ask MATLAB ''why'', but don't expect an informative answer!)

☛ Use ''format long'' to change MATLAB's default display, and inspect the above values to determine the source of the difference.

Apparently the typical elapsed time between two 512-sample blocks does not correspond //exactly// to what would be expected if Fs equals 2 kHz.

☛ Compute Neuralynx's true sampling rate from the observed mode of the timestamp diffs.

Close enough for practical purposes, but the differences could become significant for very long recording sessions!

Next: what is up with the clearly smaller values in the diff plot? Let's investigate:

<code matlab>
plot(diff(TimestampsValue))
hold on;
plot(NumberOfValidSamplesValue == 512,'r')
</code>

If you zoom in, you should find that the odd timestamp diffs occur for those sample blocks that have fewer than 512 valid samples.

☛ Find the indices of these incomplete blocks. Look at the ''Samples'' inside a few of these to see what Neuralynx does with invalid samples.

Tricky! How would we know to exclude these invalid samples? Fortunately, we can do so using the ''NumberOfValidSamples'' variable, which tells us how many samples in each block are good and can be included.
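As a sketch of one possible approach (not necessarily what ''LoadCSC()'' does; that is the subject of the next question), you could keep only the valid samples from each block when unwrapping the data:

<code matlab>
% unwrap Samples block by block, keeping only the valid samples in each
data = [];
nBlocks = size(Samples,2);
for iB = 1:nBlocks
    nValid = NumberOfValidSamples(iB); % how many samples in this block are good
    data = cat(1,data,Samples(1:nValid,iB)); % growing an array in a loop is slow, but clear
end
</code>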

☛ How does the ''LoadCSC()'' function handle these cases?

:!: NOTE: the above missing-sample weirdness was a rare occurrence for our lab's Neuralynx system, one that was traced to a faulty framegrabber board driver which caused the computer to lock up periodically. Thanks to the warning and error reporting system in Neuralynx's acquisition software, we were immediately alerted that something unexpected was happening. In addition, the ''*events.Nev'' file contains event strings indicating suspect data blocks.

==== Challenges ====

★ If you have your own time series data, find out how it is stored: with what precision? In blocks? Does the reported sampling rate match up with what is in the data? How can you convert from the raw data format to voltage (or whatever quantity you are measuring)?

★ If you implemented your own file loader(s) back in Module 2, implement checks for missing samples and possible sampling frequency misalignments.

★ Important! If you have your own idea of something you'd like to accomplish in this course, even if it isn't listed as an official challenge, ask me and we can make it count as one. What you do in this course should be as relevant as possible to your work!
