analysis:course-w16:week4

This shows you the differences between two versions of the page.

analysis:course-w16:week4 [2016/01/21 13:10] mvdm [Subsampling (decimating) time series data] |
analysis:course-w16:week4 [2018/07/07 10:19] |
||
---|---|---|---|

Line 1: | Line 1: | ||

- | ~~DISCUSSION~~ | ||

- | :!: **Under construction, do not use!** :!: | ||

- | |||

- | ===== Anatomy of time series data, sampling theory ===== | ||

- | |||

- | Goals: | ||

- | |||

- | * Understand the fundamentals of sampling theory: Nyquist frequency, aliasing | ||

- | * Learn why and how to use anti-aliasing filters | ||

- | * Reconstruct a signal from sampled data | ||

- | * Examine the file structure of Neuralynx continuously sampled data in detail | ||

- | |||

- | Resources: | ||

- | |||

- | * (intuitive background) [[http://redwood.berkeley.edu/bruno/npb261/aliasing.pdf | nice, quick intro]] to aliasing by Bruno Olshausen, with some connections to the human visual system | ||

- | * (more technical background, optional) read Chapter 3 of the [[analysis:course-w16|Leis book]]. Skip sections 3.4.3, 3.4.4, 3.4.5, 3.4.6, 3.4.7, 3.7. Skim section 3.6. | ||

- | ==== Introductory remarks ==== | ||

- | |||

- | Systems for the acquisition of neural data give us a necessarily imperfect view of the brain. Some of these limitations are obvious, such as a finite number of recording sites and contamination of the data with movement artifacts and system noise. Other limitations are perhaps more subtle, and include the pitfalls associated with limited (temporal) sampling of the underlying process and the properties of specific recording systems. | ||

- | |||

- | An important phrase to remember in any data analysis workflow is [[analysis:nsb2015:week0|"garbage in, garbage out."]] Making sure that your input is not garbage starts with a thorough understanding of the recorded signal and how it is stored. (Of course, if you abuse it enough, even the best data can rapidly turn to garbage. But that is beyond the scope of this module, which deals with the raw data.) | ||

- | |||

- | ==== Motivating example: aliasing ==== | ||

- | |||

- | Before you begin, do a ''git pull'' from the course repository. Also, to reproduce the figures shown here, change the default font size (''set(0,'DefaultAxesFontSize',18)'' -- a good place to put this is in your path shortcut). | ||

- | |||

- | Let's start with an example that illustrates what can go wrong if you are not aware of some basic sampling theory ideas. To do so, we will first construct a 10Hz signal, sampled at 1000Hz. Recalling that the frequency //f// of a sine wave is given by $y = sin(2 \pi f t)$: | ||

- | |||

- | <code matlab> | ||

- | Fs1 = 1000; % Fs is the conventional variable name for sampling freq | ||

- | F1 = 10; twin = [0 1]; % use a 1-second time window (from 0 to 1s) | ||

- | |||

- | tvec1 = twin(1):1/Fs1:twin(2); % timebase for signal | ||

- | signal1 = sin(2*pi*F1*tvec1); | ||

- | </code> | ||

- | |||

- | Notice the general approach in defining time series data: we first construct a timebase (conventionally named ''tvec'', or ''t'') and then the signal. | ||

- | |||

- | โ Plot the ''signal1'' variable and verify that the result matches your expectation: for instance, it should have a specific number of peaks. | ||

- | |||

- | Let's say we are going to sample this signal at **12Hz**: | ||

- | |||

- | <code matlab> | ||

- | Fs2 = 12; | ||

- | tvec2 = twin(1):1/Fs2:twin(2); | ||

- | signal2 = interp1(tvec1,signal1,tvec2,'nearest'); | ||

- | </code> | ||

- | |||

- | Note the use of the ''interp1()'' function. This is a very important and frequently used command with some interesting options we'll explore later. For now, we are telling it something like, "we have an input signal, specified by ''tvec1'' and ''signal1''; return those values of ''signal1'' for those values of ''tvec1'' closest to those in ''tvec2''. An intuition for that is illustrated in the figure below -- basically you want to get a value for some time (the red dot) that isn't explicitly present in your signal (the blue stems): | ||

- | |||

- | {{ :analysis:course-w16:interp_example.png?nolink&600 |}} | ||

- | |||

- | Now that we know how the ''signal2'' variable -- the signal we are seeing by sampling at 12 Hz -- is obtained, we can plot it: | ||

- | |||

- | <code matlab> | ||

- | plot(tvec1,signal1); | ||

- | hold on; | ||

- | plot(tvec2,signal2,'.g','MarkerSize',20); | ||

- | plot(tvec1,-sin(2*pi*2*tvec1),'r--','LineWidth',2); | ||

- | xlabel('time (s)'); ylabel('y'); | ||

- | </code> | ||

- | |||

- | You should see: | ||

- | |||

- | {{ :analysis:course-w16:alias1.png?nolink&600 |}} | ||

- | |||

- | The individual data points we obtain with our 12Hz sampling are shown in green. Notice that it describes a periodic signal (red dashed line) with a frequency **nothing like** the 10Hz in the original signal (blue sine wave)! We might be tempted to conclude that we are seeing a 2Hz signal, when in fact there is no such thing in the original data. This effect is known as **aliasing** (i.e. the existence of multiple underlying signals which could produce a given set of samples), and it can happen when certain conditions are met, which we will explore next. | ||

- | |||

- | ==== The Nyquist limit ==== | ||

- | |||

- | The **Nyquist sampling theorem** states that the minimum sampling frequency required to avoid aliasing is **twice** the frequency of the signal, i.e. | ||

- | |||

- | $$ F_s \geq 2*F_{orig} $$ | ||

- | |||

- | This **Nyquist criterion** is intuitive, when you plot the signal that can be obtained by sampling our original signal at 20Hz, i.e. twice the frequency: | ||

- | |||

- | {{ :analysis:course-w16:alias2.png?nolink&600 |}} | ||

- | |||

- | Note that we obtain a signal which is periodic (repeating) at 10Hz, as we know to be correct. | ||

- | |||

- | An important consequence of the Nyquist criterion is that | ||

- | |||

- | $$ F_{orig} \leq F_s / 2 $$ | ||

- | |||

- | In other words, the highest-frequency signal that can be recovered when sampling is //half// the sampling frequency. This frequency is sometimes referred to as the **Nyquist frequency** or **Nyquist limit** (the precise terminology gets a bit confused, but in neuroscience they are interpreted as referring to the same thing). So, for instance, digital audio on CDs and most digital file formats is typically represented with Fs = 44100 Hz, which means that the highest frequency that theoretically can be reproduced is 22050 Hz, corresponding approximately to the upper limit of human hearing. | ||

- | |||

- | However, it is best not to cut things too fine: say you are interested in detecting a "gamma" frequency of 75Hz with your EEG recording system. If you acquire data at Fs = 150 Hz, you might run into problems. | ||

- | |||

- | โ To see this, change ''Fs2'' in the code above to 20, and plot the result. What do you see? | ||

- | |||

- | Thus, a (very safe) rule of thumb is to acquire data at a sampling frequency four times the minimum signal frequency you are interested in detecting. | ||

- | ==== Subsampling (decimating) time series data ==== | ||

- | |||

- | In the real world, the frequency at which we can acquire data will be limited the properties of your experimental equipment. For instance, the maximum sampling rate on a typical Neuralynx system is 32 kHz. Thus, the highest-frequency signal we can detect is 16 kHz (the Nyquist frequency). Crucially, however, we cannot rule out the possibility that frequencies above 16 kHz are present in the signal we are sampling from! Thus, we risk **aliasing**: generating "phantom" frequencies in our sampled data that don't exist in the true signal. What to do? | ||

- | |||

- | The general solution is to apply an //anti-aliasing filter// to the data before sampling. To illustrate this, let's generate a signal consisting of two frequencies: | ||

- | |||

- | <code matlab> | ||

- | Fs1 = 1200; | ||

- | F1 = 3; F2 = 10; | ||

- | twin = [0 1]; | ||

- | |||

- | tvec1 = twin(1):1/Fs1:twin(2); | ||

- | signal1a = sin(2*pi*F1*tvec1); signal1b = 0.5*sin(2*pi*F2*tvec1); | ||

- | signal1 = signal1a + signal1b; | ||

- | </code> | ||

- | |||

- | Now, if we sample this signal at 12 Hz, by taking every 100th sample (note ''Fs'' is set to 1200), we will get aliasing due to the fact that our sampling frequency is less than twice that of the highest frequency in the signal (10 Hz): | ||

- | |||

- | <code matlab> | ||

- | dt = 100; | ||

- | tvec2 = tvec1(1:dt:end); | ||

- | signal2 = signal1(1:dt:end); % sample at 12 Hz - every 100th sample | ||

- | |||

- | subplot(131) | ||

- | plot(tvec1,signal1); | ||

- | hold on; | ||

- | stem(tvec2,signal2,'.g','MarkerSize',20); | ||

- | title('without anti-aliasing filter'); | ||

- | </code> | ||

- | |||

- | As you can see in the resulting plot, the sampled data (green dots) isn't a clean 3Hz sine wave as we would hope to get -- it is contaminated by an alias of the 10Hz component in the underlying signal. Thus, our naive sampling method of taking every 100th sample is dangerous! Here is how to do it properly: | ||

- | |||

- | <code matlab> | ||

- | % sample at 12 Hz with different method | ||

- | signal2d = decimate(signal1,dt); | ||

- | |||

- | subplot(132) | ||

- | plot(tvec1,signal1a,'b--'); | ||

- | hold on; | ||

- | stem(tvec1d,signal2d,'.g','MarkerSize',20); | ||

- | xlabel('time (s)'); ylabel('y'); | ||

- | title('with anti-aliasing filter'); | ||

- | |||

- | subplot(133) | ||

- | plot(tvec2,signal2-signal2d,'r','LineWidth',2); | ||

- | title('difference'); | ||

- | </code> | ||

- | |||

- | You should get: | ||

- | |||

- | {{ :analysis:course-w16:alias3.png?nolink&600 |}} | ||

- | |||

- | Note that after using ''decimate()'' instead of directly taking each 100th sample, we recover a relatively clean 3Hz sine (except for some edge effects; more on that in later modules). You can see the difference between the two sampling methods are substantial! | ||

- | |||

- | The way ''decimate()'' works is that it first applies a filter to the data, removing any frequencies that could cause aliases (i.e. anything with a frequency of at least half the new sampling frequency). We will explore filters in more detail in Module 6. | ||

- | |||

- | Data acquisition systems use the same principle. If you look in the settings of your system, you should be able to find the details of the filter applied to the data by default. It should have an upper cut-off frequency set to prevent aliasing. For instance, for some Neuralynx systems, all data above 9000 Hz is filtered out. | ||

- | |||

- | โ Suppose you acquired some nice neural data at 2 kHz, and now you decide to analyze frequencies in the 50-100 Hz range. To speed up your analysis and save space, you want to downsample your data to 500 Hz (a good choice, more than adequately above the Nyquist frequency). What MATLAB command would you use to downsample this data, and why? (Hint: what is wrong with the ''downsample()'' function you may be tempted to use?) | ||

- | ==== Reconstructing a signal from sampled data ==== | ||

- | |||

- | ==== Detailed examination of Neuralynx time series data ==== |

analysis/course-w16/week4.txt ยท Last modified: 2018/07/07 10:19 (external edit)

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International