Marcelo Cicconet

Visual Pitch Class Profile: A Video-Based Method for Real-Time Guitar Chord Identification

We describe here a method for real-time guitar chord identification using only Computer Vision methods. It is analogous to the state-of-the-art audio-based approach, which combines an audio feature (called Pitch Class Profile, PCP for short) with a supervised Machine Learning algorithm. We keep the Machine Learning part and replace the PCP with a visual analogue, which we call the Visual PCP (or VPCP).

We use an infrared camera to capture the scene, which is properly illuminated with infrared light. The following picture shows the camera surrounded by four infrared light sources.

Special markers are attached to the guitar in order to easily locate the instrument. These markers are made of a material with good reflective properties.

For the fingers, special reflective gloves cover the middle phalanges.

So, after a thresholding operation, this is what we get:

Using the contour detection algorithm and contour data structure provided by OpenCV, the guitar and finger markers can be separated.
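
For concreteness, here is a minimal sketch of the thresholding and contour steps in Python with OpenCV; the brightness threshold and the blob area used to tell guitar markers from finger markers are hypothetical, and depend on the camera, lighting and marker sizes:

    import cv2

    # Grab one frame from the (infrared) camera and convert it to grayscale.
    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Keep only the bright reflective markers.
    THRESH = 200  # hypothetical brightness threshold
    _, binary = cv2.threshold(gray, THRESH, 255, cv2.THRESH_BINARY)

    # Extract one contour per marker blob (OpenCV 4 signature).
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)

    # Separate guitar markers from finger markers by blob area,
    # assuming the guitar markers are the larger ones.
    GUITAR_MIN_AREA = 400  # hypothetical, in pixels
    guitar_markers = [c for c in contours if cv2.contourArea(c) >= GUITAR_MIN_AREA]
    finger_markers = [c for c in contours if cv2.contourArea(c) < GUITAR_MIN_AREA]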

With four known points in the scene, and knowing their relation to the guitar geometry, a projective transformation can be applied to map the guitar to a canonical position in the image and identify the Region of Interest.
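
A minimal sketch of estimating this transformation, assuming the centroids of the four guitar markers have already been computed and that we know where each marker should land in the rectified fretboard image (the coordinates below are hypothetical):

    import numpy as np
    import cv2

    # Centroids of the four guitar markers in the captured image (pixels),
    # and their target positions in fretboard coordinates; both hypothetical.
    src = np.float32([[120, 80], [500, 95], [510, 260], [115, 250]])
    dst = np.float32([[0, 0], [400, 0], [400, 100], [0, 100]])

    # 3x3 homography mapping image points to fretboard coordinates.
    H = cv2.getPerspectiveTransform(src, dst)

    # Optionally warp the frame to inspect the rectified Region of Interest.
    # roi = cv2.warpPerspective(gray, H, (400, 100))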

The projective transformation is applied to the topmost extreme of each finger marker in order to roughly locate the fingertips in guitar-fretboard coordinates.

The chord a musician plays is viewed by the system as an eight-dimensional vector composed of the coordinates (after the projective transformation) of the four fingertips, ordered from the little finger to the index finger. We call this eight-dimensional vector the Visual Pitch Class Profile (VPCP).
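
Continuing the sketch above (reusing finger_markers and the homography H), the fingertips can be estimated from the topmost point of each finger marker and stacked into the 8-dimensional VPCP; the assumption that the little finger has the largest x coordinate after rectification is ours, and depends on the camera orientation:

    import numpy as np
    import cv2

    # Topmost point (smallest y) of each finger marker, as a rough fingertip.
    tips = []
    for c in finger_markers:
        pts = c.reshape(-1, 2)
        tips.append(pts[np.argmin(pts[:, 1])])

    # Map the fingertips into guitar-fretboard coordinates.
    tips = np.float32(tips).reshape(-1, 1, 2)
    tips_fb = cv2.perspectiveTransform(tips, H).reshape(-1, 2)

    # Order from little to index finger (assumed: decreasing x) and flatten.
    order = np.argsort(-tips_fb[:, 0])
    vpcp = tips_fb[order].flatten()  # the 8-dimensional VPCP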

The proposed algorithm for real-time guitar chord identification has two phases. In the first (the training phase), the musician chooses the chords to be identified and records some samples of each, where by sample we mean the VPCP. In the second (the identification phase), the system receives the vector corresponding to the chord to be identified and classifies it using the k-Nearest-Neighbors algorithm.
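
A minimal, self-contained sketch of the two phases with a plain k-nearest-neighbors classifier; the training data below is synthetic, just to make the example runnable:

    import numpy as np
    from collections import Counter

    def knn_classify(train_X, train_y, x, k=20):
        # Majority vote among the k training samples nearest to x.
        d = np.linalg.norm(train_X - x, axis=1)  # Euclidean distances
        nearest = np.argsort(d)[:k]              # indices of the k nearest
        return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

    # Training phase (toy data): 100 VPCP samples per chord label.
    rng = np.random.default_rng(0)
    train_X = np.vstack([rng.normal(0.0, 0.1, (100, 8)),   # "C" cluster
                         rng.normal(1.0, 0.1, (100, 8))])  # "G" cluster
    train_y = ["C"] * 100 + ["G"] * 100

    # Identification phase: classify an incoming VPCP.
    print(knn_classify(train_X, train_y, rng.normal(1.0, 0.1, 8)))  # -> G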

Seeking quantitative comparisons, we took 100 samples of each of the 14 major and minor chords in the keys of C, D, E, F, G, A and B, choosing just one shape per chord (on the guitar there are many realizations of the same chord). The video samples were taken by holding a given chord and, while slightly moving the guitar, waiting until 100 samples were saved. For the audio samples, we recorded for each chord about 10 seconds of a track consisting of strumming in some rhythm while holding the chord fixed. The audio data was then pre-processed to remove the parts corresponding to strumming (where the noise is high). Then, at regular intervals of about 12 milliseconds, an audio chunk of about 45 milliseconds was processed to obtain its Pitch Class Profile, a 12-dimensional audio feature also known as the Chroma Vector.
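
For reference, a minimal sketch of computing a Pitch Class Profile from one audio chunk: each FFT bin's energy is accumulated into the pitch class of its frequency, with pitch class 0 anchored at C (f_ref = 261.63 Hz, i.e. C4); the frequency-range cutoffs are our assumptions:

    import numpy as np

    def pitch_class_profile(chunk, sr, f_ref=261.63):
        # Magnitude spectrum of the chunk.
        spectrum = np.abs(np.fft.rfft(chunk))
        freqs = np.fft.rfftfreq(len(chunk), d=1.0 / sr)
        pcp = np.zeros(12)
        for f, mag in zip(freqs, spectrum):
            if f < 27.5 or f > 5000:  # skip bins outside the musical range
                continue
            # Pitch class of this bin's frequency (0 = C, since f_ref = C4).
            pc = int(round(12 * np.log2(f / f_ref))) % 12
            pcp[pc] += mag ** 2
        return pcp / (np.linalg.norm(pcp) + 1e-12)

    # Example: a ~45 ms chunk of a synthetic C major triad at 44.1 kHz.
    sr = 44100
    t = np.arange(int(0.045 * sr)) / sr
    chunk = sum(np.sin(2 * np.pi * f * t) for f in (261.63, 329.63, 392.0))
    print(pitch_class_profile(chunk, sr).round(2))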

These audio and video samples tend to form clusters in 12- and 8-dimensional space, respectively. In the following figures we present an analysis of the audio and video sample clusters. A square (respectively, a triangle) represents the average (respectively, the maximum) distance between the class samples and the class mean vector. The asterisk represents the distance between the cluster mean vector and the nearest other cluster mean vector. Note that the clusters of video samples are better defined than those of audio samples.
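
A minimal sketch of how these per-cluster statistics can be computed, assuming samples maps each chord label to an array with one sample per row (the toy data is synthetic):

    import numpy as np

    def cluster_stats(samples):
        means = {c: X.mean(axis=0) for c, X in samples.items()}
        stats = {}
        for c, X in samples.items():
            d = np.linalg.norm(X - means[c], axis=1)
            nearest = min(np.linalg.norm(means[c] - means[o])
                          for o in means if o != c)
            stats[c] = {"mean_dist": d.mean(),    # square in the figures
                        "max_dist": d.max(),      # triangle
                        "nearest_mean": nearest}  # asterisk
        return stats

    # Toy data: two well-separated 8-dimensional clusters of 100 samples each.
    rng = np.random.default_rng(1)
    samples = {"C": rng.normal(0.0, 0.1, (100, 8)),
               "G": rng.normal(1.0, 0.1, (100, 8))}
    print(cluster_stats(samples))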

Regarding classification performance, both methods behaved similarly in the tests we conducted. The difference is that the audio-based algorithm is sensitive to the noise caused by strumming, while the video-based method is unaffected by it. This is illustrated in the following figures, where the same chord sequence (played twice) was performed and analyzed by the two methods, using 20 nearest neighbors for classification. It can also be seen in the figures that both algorithms have problems with chord transitions.

This work was carried out as part of a Computer Vision course I took at PUC-Rio, with professor Marcelo Gattass. I would like to acknowledge IMPA's Visgraf Lab for providing the hardware.