Classifying and Visualizing Musical Pitch with K-means Clustering
The Galvanize data science curriculum includes a collection of machine learning topics popular among data scientists in the tech industry, but the skills students learn at Galvanize are not limited to only the most popular tech industry applications. For example, audio signal and musical analysis is a less frequently discussed but interesting application of the machine learning concepts taught in Galvanize’s Data Science Intensive. Using topics from Galvanize’s curriculum, this tutorial will demonstrate how to classify and visualize musical pitches from recordings using k-means clustering, implemented with NumPy/SciPy, Scikit-learn, and Plotly.
What is k-means Clustering?
k-means clustering is a popular technique for identifying groups of related items in an unlabeled dataset. Given a number k, the algorithm divides a dataset into k groups so that the total squared distance between each item and the center of its group is as small as possible. k-means can be used for a wide range of applications, such as identifying the efficient placement of cell phone towers or selecting the sizes of clothing a manufacturer should produce. As this tutorial will show, k-means can also be used to group audio segments by pitch.
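To make the idea concrete, here is a minimal sketch of the usual iterative procedure (often called Lloyd's algorithm) written in plain NumPy. The function name `kmeans_sketch`, the toy data, and the fixed iteration count are assumptions made for illustration; the tutorial itself relies on Scikit-learn's implementation.

```python
import numpy as np

def kmeans_sketch(points, k, n_iterations=20, seed=0):
    """Minimal Lloyd's algorithm: alternate between assigning points and moving centers."""
    rng = np.random.default_rng(seed)
    # start from k distinct points chosen at random from the data
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iterations):
        # distance from every point to every center, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # move each center to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# toy data (hypothetical): two well-separated blobs in 2-D
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(20, 2))
points = np.vstack((blob_a, blob_b))

labels, centers = kmeans_sketch(points, k=2)
print(labels)
```

The integer labels are arbitrary: what matters is that the first 20 points share one label and the last 20 share the other.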
A Brief Primer on Musical Pitch
A musical note is a collection of superimposed sine waves with different frequencies, and identifying the pitch of a note requires identifying the frequencies of the most aurally salient of those sine waves.
The simplest musical note contains only one sine wave:
Plotting the “power spectrum,” the magnitude of each component frequency, reveals a single frequency from the above waveform:
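As a rough illustration, the sketch below synthesizes a pure sine wave and measures its spectrum with NumPy's FFT routines, which are covered later in this tutorial. The sample rate, duration, and 82 Hz tone are assumed values chosen so that the tone falls exactly on one frequency bin; they are not taken from the recording.

```python
import numpy as np

sample_rate = 8000      # samples per second (illustrative value)
duration = 1.0          # seconds
tone_frequency = 82.0   # Hz, close to the low E used in this tutorial

t = np.arange(int(sample_rate * duration)) / sample_rate
waveform = np.sin(2 * np.pi * tone_frequency * t)

# magnitudes of the non-negative frequency components
magnitudes = np.abs(np.fft.rfft(waveform))
frequencies = np.fft.rfftfreq(len(waveform), 1.0 / sample_rate)

peak_frequency = frequencies[np.argmax(magnitudes)]
# share of the total spectral energy that sits in the single peak bin
peak_share = magnitudes.max() ** 2 / np.sum(magnitudes ** 2)
print(peak_frequency, peak_share)
```

Nearly all of the energy lands in the bin at the tone's frequency, which is the single spike described above.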
Sounds produced by typical musical instruments comprise many component sine waves, and as a result they sound more complex than the pure sine wave shown above. The waveform of the same note (E3) played by a guitar looks and sounds like this:
Plotting its power spectrum reveals a much larger collection of component frequencies:
k-means can use the power spectra of sample audio segments to group the segments by pitch. Given a collection of power spectra, each measured at the same n frequencies, k-means treats every spectrum as a point in n-dimensional space and groups the spectra so that the sum of squared Euclidean distances between each spectrum and the center of its group is minimized.
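The sketch below suggests why Euclidean distance between spectra is a reasonable proxy for pitch. It uses a hypothetical `synthetic_note` helper (a fundamental plus two weaker overtones) in place of real guitar audio, and it scales each spectrum to unit length so that loudness does not affect the comparison. All names and values here are assumptions for illustration.

```python
import numpy as np

def normalized_spectrum(waveform):
    # magnitude spectrum scaled to unit Euclidean length
    magnitudes = np.abs(np.fft.rfft(waveform))
    return magnitudes / np.linalg.norm(magnitudes)

def synthetic_note(fundamental, amplitude, sample_rate=8000, duration=0.5):
    """Hypothetical stand-in for a recorded note: a fundamental plus two weaker overtones."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    harmonics = [(1, 1.0), (2, 0.5), (3, 0.25)]  # (multiple of fundamental, relative strength)
    return amplitude * sum(w * np.sin(2 * np.pi * m * fundamental * t) for m, w in harmonics)

quiet_e = normalized_spectrum(synthetic_note(82.0, amplitude=0.2))
loud_e = normalized_spectrum(synthetic_note(82.0, amplitude=1.0))
g_sharp = normalized_spectrum(synthetic_note(104.0, amplitude=1.0))

same_pitch_distance = np.linalg.norm(quiet_e - loud_e)
different_pitch_distance = np.linalg.norm(quiet_e - g_sharp)
print(same_pitch_distance, different_pitch_distance)
```

Two notes of the same pitch end up almost on top of each other regardless of volume, while notes of different pitch are far apart, which is exactly the structure k-means looks for.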
Creating a Dataset from a Recording Using NumPy/SciPy
This tutorial will use a sample recording of 3 distinct pitches, each played for exactly 2 seconds on a guitar.
Converting a .wav file into a NumPy array is easy using SciPy’s wavfile module.
import scipy.io.wavfile as wav

filename = 'Guitar - Major Chord - E Gsharp B.wav'

# wav.read returns the sample rate and a NumPy array containing each audio sample from the .wav file
sample_rate, recording = wav.read(filename)
The recording should be split into short segments, so that each segment’s pitch can be classified independently.
def split_recording(recording, segment_length, sample_rate):
    # number of samples in each segment; slice indices must be integers
    samples_per_segment = int(segment_length * sample_rate)
    segments = []
    index = 0
    while index < len(recording):
        segment = recording[index:index + samples_per_segment]
        segments.append(segment)
        index += samples_per_segment
    return segments

segment_length = .5  # length in seconds
segments = split_recording(recording, segment_length, sample_rate)
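If a recording's length is not an exact multiple of the segment length, the loop above yields a shorter final segment, and its spectrum would not have the same number of frequencies as the others. One possible variant, sketched here under the assumption of a mono (1-D) recording, discards the trailing partial segment and returns a 2-D array. The function name `split_recording_fixed` and the test data are hypothetical.

```python
import numpy as np

def split_recording_fixed(recording, segment_length, sample_rate):
    """Split into equal-length segments, discarding any trailing partial segment."""
    samples_per_segment = int(segment_length * sample_rate)
    number_of_segments = len(recording) // samples_per_segment
    usable = recording[:number_of_segments * samples_per_segment]
    # one row per segment
    return usable.reshape(number_of_segments, samples_per_segment)

# stand-in data: a little over 6 seconds at an illustrative 1000 samples per second
sample_rate = 1000
recording = np.arange(6 * sample_rate + 123)
segments = split_recording_fixed(recording, 0.5, sample_rate)
print(segments.shape)  # (12, 500)
```

Iterating over the rows of the returned array gives the same segments as the list-based version, minus the leftover samples.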
The power spectrum of each segment can be obtained by applying the Fourier transform, which converts the waveform data from the time domain to the frequency domain. The code below demonstrates how to use NumPy’s Fourier transform module.
import numpy as np

def calculate_normalized_power_spectrum(recording, sample_rate):
    # np.fft.fft returns the discrete Fourier transform of the recording
    fft = np.fft.fft(recording)
    number_of_samples = len(recording)
    # sample_length is the length of each sample in seconds
    sample_length = 1./sample_rate
    # fftfreq is a convenience function which returns the list of frequencies measured by the fft
    frequencies = np.fft.fftfreq(number_of_samples, sample_length)
    positive_frequency_indices = np.where(frequencies>0)
    # positive frequencies returned by the fft
    frequencies = frequencies[positive_frequency_indices]
    # magnitudes of each positive frequency in the recording
    magnitudes = abs(fft[positive_frequency_indices])
    # some segments are louder than others, so normalize each segment
    magnitudes = magnitudes / np.linalg.norm(magnitudes)
    return frequencies, magnitudes
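For real-valued audio, NumPy also offers np.fft.rfft and np.fft.rfftfreq, which return only the non-negative frequencies. The following is a sketch of a more compact variant, not the function used in the rest of this tutorial; the name `normalized_power_spectrum_rfft` and the 100 Hz test tone are assumptions for illustration.

```python
import numpy as np

def normalized_power_spectrum_rfft(segment, sample_rate):
    # rfft returns only the non-negative frequency terms of a real-valued signal
    magnitudes = np.abs(np.fft.rfft(segment))
    frequencies = np.fft.rfftfreq(len(segment), 1.0 / sample_rate)
    # drop the zero-frequency (DC) term to keep strictly positive frequencies
    frequencies, magnitudes = frequencies[1:], magnitudes[1:]
    return frequencies, magnitudes / np.linalg.norm(magnitudes)

sample_rate = 1000
t = np.arange(500) / sample_rate        # one half-second segment
segment = np.sin(2 * np.pi * 100 * t)   # 100 Hz test tone
frequencies, magnitudes = normalized_power_spectrum_rfft(segment, sample_rate)
print(frequencies[np.argmax(magnitudes)], len(frequencies))
```

One difference to keep in mind: for an even number of samples this variant also keeps the Nyquist frequency, which fftfreq reports as negative, so it returns one more value per segment than the function above. The two should not be mixed when filling a single array.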
Some helper functions will create an empty NumPy array and fill it with our sample power spectra, one row per segment.
def create_power_spectra_array(segment_length, sample_rate):
    number_of_samples_per_segment = int(segment_length * sample_rate)
    time_per_sample = 1./sample_rate
    frequencies = np.fft.fftfreq(number_of_samples_per_segment, time_per_sample)
    positive_frequencies = frequencies[frequencies>0]
    power_spectra_array = np.empty((0, len(positive_frequencies)))
    return power_spectra_array

def fill_power_spectra_array(splits, power_spectra_array, fs):
    filled_array = power_spectra_array
    for segment in splits:
        freqs, mags = calculate_normalized_power_spectrum(segment, fs)
        filled_array = np.vstack((filled_array, mags))
    return filled_array

power_spectra_array = create_power_spectra_array(segment_length,sample_rate)
power_spectra_array = fill_power_spectra_array(segments, power_spectra_array, sample_rate)
“power_spectra_array” is our training dataset, containing a power spectrum for each 1/2 second segment of the recording.
Performing k-means with Scikit-learn
Scikit-learn has an easy-to-use implementation of k-means. Our audio sample contains 3 distinct pitches, so set k equal to 3.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters = 3, max_iter = 1000, n_init = 100)
# fit the model to the spectra, then look up the group assigned to each segment
kmeans.fit(power_spectra_array)
predictions = kmeans.predict(power_spectra_array)
“predictions” is a NumPy array containing the group label (an arbitrary integer) for each of the 12 audio segments.
print(predictions)
# => [2 2 2 2 0 0 0 0 1 1 1 1]
This array shows that consecutive audio segments are being correctly grouped together as one would expect from listening to the recording.
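For readers without the sample recording, the whole pipeline can be tried on synthetic data. The sketch below is an assumption-laden stand-in, not the tutorial's dataset: `synthetic_segment` is a hypothetical helper that builds a tone from three harmonics plus faint noise, and the fundamentals are rounded so they fall on FFT bins. It also reads each cluster's pitch from the peak of its center, since every center is itself an average spectrum.

```python
import numpy as np
from sklearn.cluster import KMeans

sample_rate = 8000
segment_length = 0.5
samples_per_segment = int(segment_length * sample_rate)
t = np.arange(samples_per_segment) / sample_rate

def synthetic_segment(fundamental, rng):
    """Hypothetical stand-in for a half-second guitar segment: three harmonics plus faint noise."""
    tone = sum(w * np.sin(2 * np.pi * m * fundamental * t)
               for m, w in [(1, 1.0), (2, 0.5), (3, 0.25)])
    return tone + 0.01 * rng.normal(size=t.size)

rng = np.random.default_rng(0)
# four consecutive segments for each of three pitches, mirroring the 12 segments above
fundamentals = [82.0] * 4 + [104.0] * 4 + [124.0] * 4
spectra = []
for fundamental in fundamentals:
    magnitudes = np.abs(np.fft.rfft(synthetic_segment(fundamental, rng)))[1:]
    spectra.append(magnitudes / np.linalg.norm(magnitudes))
spectra = np.array(spectra)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(spectra)

# each cluster center is an average spectrum; its peak indicates the cluster's pitch
frequencies = np.fft.rfftfreq(samples_per_segment, 1.0 / sample_rate)[1:]
peak_frequencies = frequencies[kmeans.cluster_centers_.argmax(axis=1)]
print(labels, peak_frequencies)
```

As with the real recording, the label values themselves are arbitrary; the useful checks are that each run of four consecutive segments shares a label and that the three cluster peaks match the three fundamentals.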
Visualizing the Results with Plotly
To better understand the predictions, plot the power spectrum of each sample, color-coded by the k-means classification.
# assumed imports for the legacy Plotly API used below (plotly < 4);
# in later releases the online plotting module moved to the chart_studio package
import plotly.plotly as py
from plotly.graph_objs import Scatter, Line, Layout, XAxis, YAxis, Data, Figure

# find x-values for plot (frequencies)
number_of_samples = int(segment_length*sample_rate)
sample_length = 1./sample_rate
frequencies = np.fft.fftfreq(number_of_samples, sample_length)
# keep only the positive frequencies so the x-values line up with the stored spectra
frequencies = frequencies[frequencies > 0]

# create plot
traces = []
for pitch_id, color in enumerate(['red','blue','green']):
    for power_spectrum in power_spectra_array[predictions == pitch_id]:
        trace = Scatter(x=frequencies[0:500], y=power_spectrum[0:500],
                        mode='lines', showlegend=False,
                        line=Line(shape='linear', color=color, opacity = .01, width = 1))
        traces.append(trace)

layout = Layout(xaxis=XAxis(title='Frequency (Hz)'),
                yaxis=YAxis(title = 'Amplitude (normalized)'),
                title = 'Power Spectra of Sample Audio Segments')
data_to_plot = Data(traces)
fig = Figure(data=data_to_plot, layout=layout)

# py.iplot plots inline using IPython Notebook
py.iplot(fig, filename = 'K-Means Classification of Power Spectrum')
Each thin colored line in the plot below represents the power spectrum of one of the 12 audio segments produced from the sample .wav file. The lines are color-coded by the k-means prediction of each segment’s pitch. The blue, green, and red spectra have peaks near 82.41 Hz (E), 103.83 Hz (G#), and 123.47 Hz (B), respectively, which are the notes in the sample recording. With half-second segments the FFT frequencies are spaced 2 Hz apart, so each measured peak falls on the bin nearest the true pitch. The strongest frequencies in the sample recording are the low frequencies, so only the lowest 500 frequencies measured by the FFT (up to about 1000 Hz) are included in the plot below.
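The note names quoted above can be checked with a small calculation. This sketch assumes twelve-tone equal temperament with A4 tuned to 440 Hz; the helper name `pitch_class` is hypothetical, and only the letter name is returned because octave-numbering conventions vary.

```python
import math

NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def pitch_class(frequency_hz, reference_a4=440.0):
    """Nearest equal-tempered pitch class for a frequency, assuming A4 = 440 Hz."""
    # MIDI note number: 69 is A4, and each semitone is a factor of 2**(1/12)
    midi_number = round(69 + 12 * math.log2(frequency_hz / reference_a4))
    return NOTE_NAMES[midi_number % 12]

for peak in (82.41, 103.83, 123.47):
    print(peak, pitch_class(peak))
```

The three peak frequencies map to E, G#, and B, the notes of an E major chord.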
The natural clustering is evident from plotting the amplitudes of 2 of the strongest overtones shared between the 3 sample pitches.
Learn More at Galvanize!
k-means is one of many machine learning topics taught in Galvanize’s Data Science Intensive program. If you found this interesting, you can learn more here.