wiki

~~DISCUSSION~~ ==== Module 4: spike sorting ==== Goals: * Understand the basic principles of the spike sorting process * Become familiar with the workflow for spike sorting using %%KlustaKwik%% and %%MClust%% * Use tools for the quantification of cluster quality and appropriate propagation through a larger data analysis pipeline Resources: * (a quick read) A [[http://www.scholarpedia.org/article/Spike_sorting|nice introduction]] to the spike sorting problem and some popular approaches (from Rodrigo Quian Quiroga, University of Leicester, UK) * (background) A [[http://www.jneurosci.org/content/31/24/8699.long|review paper]] (Hill et al. J Neurosci 2011) with comprehensive treatment of topics in spike sorting * (background) A [[http://www.ncbi.nlm.nih.gov/pubmed/22023727| perspective towards the future]] by Einevoll et al. (2011), "Towards reliable spike-train recordings from thousands of neurons with multielectrodes" * (reference) [[http://redishlab.neuroscience.umn.edu/MClust/MClust.html|official release]] of %%MClust%% ==== Introductory remarks ==== Spike sorting is the process of assigning experimentally recorded voltage waveforms to putative single neurons. Extracellular recordings generally contain contributions from multiple neurons, as well as noise, and therefore cannot be directly converted to spike trains. The spike sorting process relies on the idea that extracellularly recorded waveforms are a function of (1) a neuron's morphological (size, shape) and electrical properties (channel distributions, states), and (2) the location of the recording electrode relative to the neuron (see [[http://www.nature.com/nrn/journal/v13/n6/abs/nrn3241.html | Buzsaki et al. 2012]]). Thus, waveforms generated by different neurons are expected to result in differently sized and shaped waveforms on a given electrode. If "features" extracted from these waveforms, such as their peak voltage, width, or energy (area under the curve squared) form distinct clouds in feature space, these clouds are inferred to correspond to distinct, isolated neurons. Spike sorting can be a laborious and frustrating process. It is surprising that despite decades of research, no satisfactory automated method has been found! One reason for this is that correct spike sorting needs to rely on just the right amount of domain knowledge of what is likely and unlikely: refractory periods, electrode instability, adaptation of action potential amplitude, and variation in where on the dendritic tree synaptic inputs arrive all affect the shape of clusters in ways that current automated methods do not capture well. Yet, there is definitely satisfaction in spike sorting electrodes with plentiful clusters, in anticipation of the beautiful data set that will result! ==== Overview of the spike sorting process ==== A conceptual overview of the spike sorting process for tetrode recordings is shown below (from Einevoll et al. 2011): {{ :analysis:nsb2014:overview.jpg?800 |}} This image illustrates the process from extracellular potentials recorded by a tetrode (**b**) to spike trains (**h**). For our purposes, the most important steps are: - filtering of the raw data in the spike band (**c**) - detection of potential spikes (**d**; typically done by simply setting a threshold) - aligning potential spikes by the location of their peak (**e**) - extracting "features" such as the peak voltages or principal components (**f**) - clustering features into putative neurons (**g**) - obtaining the spike times for each cluster (**i**) Data acquisition systems (such as our Neuralynx Digital Lynx) can be configured to perform steps **b**-**e** on-line during recording, but recording only the raw data (**b**) is also an option. Recording the raw data affords opportunities to optimize steps **c**-**e** but is more storage- and time-intensive. Assuming you collect aligned spike data (.ntt files for Neuralynx), further analysis for these data proceeds as follows: <graphviz> digraph G { node [style=rectangle]; rd1 [label="data .ntt files"]; rd1 -> ff; subgraph cluster_0 { ff [label="waveform feature .fet files"]; ff -> kkclu [label="KlustaKwik"]; label = "RunClustBatch()"; } kkclu [label="KlustaKwik .clu files"]; kkclu -> kdw; rd1 -> kdw; subgraph cluster_1 { kdw [label = "KlustaKwik Decision Window"]; mc [label = "Manual touch-up"]; ccv [label = "Cluster quality verification"]; kdw -> mc -> ccv; ccv -> mc [label="until acceptable"]; ccv -> rc; rc [label = "Rated clusters"]; label = "MClust"; } rc -> tfiles [label=" Write Files"]; tfiles [label = "spike train .t files"]; wv [label = "waveform WV files"]; cq [label = "ClusterQual CQ files"]; rc -> wv; rc -> cq; } </graphviz> (Note that the resulting files are part of a **promoted** data set, discussed in [[analysis:nsb2015:week2|Module 2]].) The next section provides more detailed information on how to implement each step of the above pipeline. ==== Spike sorting with MClust ==== === Preparation of raw data files === The rest of this module assumes you are using the ''R042-2013-08-14'' example data set. If you want to work with your own data, create a local copy of your Incoming or InProcess data (see [[analysis:nsb2015:week2|Module 2]] for a reminder of what these stages are) before proceeding. If you recorded ''.ntt'' files that you want to spike-sort (or ''.nst'' in the case of stereotrodes), rename them into the standard format (Rxxx-yyyy-mm-dd-TTnn.ntt for rats, Mxxx-yyyy-dd-TTnn.ntt for mice; note that you should use 3 digits for the number, e.g. 001 instead of 1!). You can do this manually or write a simple batch script to do the renaming. This has already been done for the example data set. === Obtaining MClust and adding it to your path === If you have already cloned the ''nsb2016'' repository on %%GitHub%% (see [[analysis:nsb2016:week1|Module 1]] for instructions) you already have the %%MClust%% data files. However, some of the %%MClust%% .m files conflict with those in the main codebase. Therefore, you need a different path for using %%MClust%%. Create a new shortcut as follows: <code matlab> restoredefaultpath; clear classes; % start with a clean slate cd('D:\My_Documents\GitHub\nsb2016\code-matlab\toolboxes\MClust-4.3'); % replace with your own path p = genpath(pwd); % create list of all folders from here addpath(p); </code> After running this shortcut, you should be able to type ''MClust'' and have the main window appear. If you do, make sure to close %%MClust%% again before proceeding to the next step. === Running KlustaKwik === %%KlustaKwik%% is an unsupervised clustering algorithm that sorts the waveform data into clusters. This makes spike sorting much faster compared to doing everything manually from scratch! %%RunClustBatch()%% is a script that automatically runs %%KlustaKwik%% on all tetrode files in the current working directory. It has some default settings, defined in the ''.m'' file that you can look at, such as the maximum number of clusters, which waveform features to use, et cetera. For the sample dataset we will use the defaults, except for one thing: the third wire of the tetrode was disabled, so we want to exclude it from processing. This we do as follows: <code matlab> RunClustBatch('channelValidity',[1 1 0 1]); </code> If you have your own data that you want to spike sort which has all channels of a tetrode working, you can simply do ''RunClustBatch'' without the optional channelValidity argument. Remember to run this inside your data folder! Note that by default, %%RunClustBatch%% calls %%KlustaKwik.exe%%, included by default in the %%MClust%% distribution. This is fine for Windows machines, but if you are running something else you will need to visit %%KlustaKwik%%'s home on %%SourceForge%%. If you get a message that %%MClust%% is already running, do a ''clear all''. As shown in the above diagram, %%RunClustBatch%% generates feature files (.fet) containing peak, energy, etc. values for each waveform, and cluster files (.clu) containing the cluster assignments. These are stored in the %%FD%% folder. The next step needs these .clu files to work. === Loading the data into MClust === Type ''MClust'' to obtain the main window as shown below: (Note, the current version of ''MClust'' you are using may look slightly different, but the main functions are the same.) {{ :analysis:nsb2014:mclust_main.png?500 |}} Because we will be loading a tetrode which has the third lead disabled, uncheck the "Ch 3" box. Then, click "Create/Load FD files" and select "R042-2013-08-14-TT13.ntt". The gray rectangle should turn green and show the name of the loaded tetrode file. If you load a tetrode without correct channel validity settings, you may get errors at some later point. To re-load the tetrode correctly, you should exit MClust, do a ''clear all'', and restart, remembering to load with the correct settings this time. You can see why it is a good idea to take screenshots of your Cheetah spike window, so you can see at a glance which settings to use. === Using the KlustaKwik decision window === The goal of the %%KlustaKwik%% decision window is to decide which (pre)clusters you want to keep and which you want to toss. Once you have your preclustered tetrode loaded, click the "Select from KKwik" button in the %%MClust%% main window. You will see some figures open, like this: {{ :analysis:nsb2014:mclust_kk.png?800 |}} (You may have to move and/or resize your windows to get them to look like this.) Apart from the main control window ("Cutting Control Window"), several other figures are provided to show information about the currently selected cluster: * The "Cluster Cutting Window" shows a scatterplot of one feature against another (e.g. Peak 1 vs. Peak 2). Each data point corresponds to one set of spike waveforms across the tetrode leads. Which features are shown can be controlled from the main control window (under "Axes"). As you click through a few feature combinations you should see some distinct clusters of points, forming clusters of spikes from putative single neurons. The currently selected cluster is shown in a different color. * Average waveforms across the four tetrode leads for the currently selected cluster. * An interspike interval (ISI) histogram for the currently selected cluster. Any histogram counts less than 2ms or so indicate poor cluster quality. For each cluster, the information displayed in these windows determines your decision of whether to keep or toss it. The default is to toss; to toggle, click on the "toss" button. The first cluster is selected by default. As you can see from the scatterplot, this is a cluster with a small amount of diffuse spikes that did not fit clearly in any cluster. We can toss this one. To move on to the next cluster, locate the radio button next to "Focus" in the control window and click on the one directly below it. This selects cluster 2. If your %%KlustaKwik%% run returned the same results as mine, you should see that this cluster appears to consist of a tight cloud of spikes, plus some others that are clearly separate: {{ :analysis:nsb2014:mclust_kk_splitcluster.png?600 |}} (Note that this is showing Peak 1against Peak 4.) We want to keep the tight cloud, which we will need to manually separate from the other spikes later. Click on the "2" to select a different color than black for this cluster, and toggle to "keep" this one. Proceed one by one through the other clusters, deciding whether to keep or toss. Assign colors to keeps; I use a specific gray color to indicate questionable clusters, and strong primary colors for ones that seem good. Related clusters can be given similar colors to facilitate manual touch-up later. If you want to merge two (pre)clusters, rename one of them to match the name of the other by clicking on the "KKxx" labels. All clusters with the same name will be merged. Because it is easier to merge than to split clusters, it is often better to instruct %%KlustaKwik%% to "over-cluster" the data to avoid cases like the above example (this example is likely "under-clustered"). The Decision Window also has some other features that are described in the %%MClust%% manual. When you are satisfied, click Exit (Export) to return to the %%MClust%% main window. You will see the black panel in the top left now indicates a certain number of clusters. === Manual touch-up in MClust === Click the %%ManualCut%% button in the main %%MClust%% window to open the manual touch-up process. You will see a list of clusters with the names and colors exported from the %%KlustaKwik%% decision window. In the main cutting window, check the "Redraw Axes" box to make a scatterplot of all clusters appear. By default, this is plotted with small dots, but clusters will be easier to see if we change the marker size to size 2 circles. Do this using the drop-down menus indicated in red below. {{ :analysis:nsb2014:mclust_manual.png?800 |}} Let's first clean up the split cluster we encountered above. To see only this cluster, first click "Hide", and then uncheck the checkbox for KK02. You should now see only this cluster in the scatterplot. To see the waveforms for this cluster, click on the drop-down menu to the right of the KK02 label and select Show Waveforms (you may have to scroll down for this). The slider at the bottom controls the number of waveforms plotted; things are usually easier to see by plotting a small number. In this case, inspecting the waveforms should reveal that there are two clearly distinct waveforms visible on lead 4 (the rightmost subplot). These are unlikely to be from the same neuron. By default the scatterplot shows Peak 1 against Peak 2. Change this to show Peak 1 against Peak 4; this should result in two clearly distinct clusters. Select "SplitSpikesByCvxHull". You should get a cursor (crosshair) in the cluster cutting window. Clicking multiple times around the top cluster to define a convex shape enclosing the spikes. The shape will automatically be completed once you press Enter. You should see that KK02 now contains only the spikes at the top, and a new cluster is appended to the list of clusters (named KK02-split for me) containing the spikes outside the enclosed region: {{ :analysis:nsb2014:mclust_manual_split.png?800 |}} (Note, by the way, that if you still have the waveforms window open, these did not get updated.) By clicking through a few different feature projections and inspecting the new waveforms, it now appears that we have a tight cluster that is no longer split into multiple pieces. The %%01_CheckCluster%% option provides a convenient summary of a cluster's properties, to be reviewed before accepting: {{ :analysis:nsb2014:mclust_check.png?800 |}} Important indicators of cluster quality are: * the waveforms: look for suspect shapes such as triangle or square waves (indicating non-biological artifacts from e.g. movement), "pinch points" where the standard deviation is very small at the peak (this indicates the waveforms only just pass the detection threshold and the cluster is likely "cut off") * the %%ISI%% histogram (below the waveforms): %%ISIs%% below a couple of ms are likely incompatible with typical refractory periods and indicate contamination with noise or spikes from another neuron. (Note, however, that a clean %%ISIh%% does not imply good cluster isolation!) * autocorrelograms (below the %%ISIh%%, see Module 9 for details): can indicate suspect frequency components (e.g. %%60Hz%% electrical line noise) * peaks over time (bottom): should be stable over time. Drift over time indicates instability of the electrode position relative to the neuron and makes it harder to obtain well isolated clusters. * peak plot (top right): should be a single, tight cluster of points. These are small peak vs. peak plots as in the scatterplots of the main cutter. * Mahalanobis distance plot: indicates the distance from the center of the cluster to all points within and outside the cluster. In a good cluster, these groups of points are far apart. Two statistical quantities are shown, computed from these distributions (discussed in the next section). === Cluster quality metrics === Clusters are sets of points in multidimensional feature space. The L-ratio, indicates how "tight" the points in a cluster are, and the isolation distance (ID) indicates how well isolated the points in a cluster are from those outside it. Lower L-ratios and higher %%IDs%% are good. Typical values for including a cluster are L-ratio < 0.1 and ID > 15. However, these statistical measures are not perfect. For instance, a cell that is cut off by the detection threshold can have an artificially high detection threshold. Thus it is helpful to also manually rate each cluster on a scale from 1 (good) to 5 (bad). Do this by renaming each cluster you want to keep and simply having an 1-5 number as the first character. For example, a typical cluster name might be "5, cutoff" or "4, instability". The %%LoadSpikes()%% function can take an optional input argument to only load, for instance, clusters with at least a 4 rating. These ratings should therefore NOT be based solely on the LR and ID. Guidelines for these are: * 1: "gold standard" cluster, extremely unlikely that even a single spike is missing or contaminating the cluster. * 2: very good cluster, a few missing or contaminated spikes, with clear white space separating the cluster from the others. * 3: good cluster, but edges of cluster may be touching another cluster, rendering the boundary somewhat unclear. Some contamination or missing spikes * 4: acceptable cluster, but boundaries are hard to determine, leading to * 5: cluster with defects such as being cut off or majorly unstable. These will generally not be included for analysis, but may be useful for certain questions where spikes are not important (e.g. those involving decoding). By clicking the "Use *._t files" option in the main %%MClust%% window, 5 rated clusters can be automatically saved with a *._t extension so that it is easy to see at a glance how many usable cells a given data folder contains (see the next section on saving). === Writing files === When you are satisfied that you have only clusters left that you want to keep, click Export Clusters. Note that by default, this overwrites the clusters associated with the main %%MClust%% window. The main %%MClust%% window does not automatically update unless you Export! Then, from the main %%MClust%% window, click "Write .t, WV, CQ" which saves a .t file for each cluster containing the spike times (see [[analysis:nsb2014:week2|Module 2]] for more information on data formats) as well as waveform and cluster quality files for later use. Any "Write files" option also saves a .clusters file that allows you to load the cluster definitions. The writing of these files concludes the spike sorting process! === MClust gotchas (please add if you encounter some!) === * When you are done cutting a tetrode, do a Clear Workspace from the main window before loading another. * Configure your taskbar so that you can see what MClust windows are open. Close windows you do not need. * Keep an eye on any error messages that may occur; enabling sound helps with this. If you encounter an error, save your clusters and reload MClust and your saved file. (Do a ''clear all'' before opening a new one.) === Advanced topics === == Cluster types == %%MClust%% supports three different cluster types: * %%CvxHullCluster%%: a cluster defined by convex hull limits only. This is useful when you want a cluster definition to be applied to a new data set: all spikes that fall within the limits are assigned to this cluster. * %%SpikeListCluster%%: a cluster defined only by the spikes (points) that are in it (i.e. no information about limits is saved). No limits are shown or saved. * %%MCCluster%%: a cluster that saves limit information but can also exclude points for other reasons (e.g. by cutting waveforms or removing double-counted spikes)

wiki

User Tools

Site Tools

Sidebar

Page Tools