Introduction

Last time I only told do and re apart. This time I classify all seven notes of the scale: do, re, mi, fa, so, la and si.

What I made

It's up on GitHub.

Details

Dataset

Generated with: the Steinway Grand Piano instrument in GarageBand
Pitch range: C0 to C5 (octaves 0 to 5)
Velocity (loudness): 86, 96, 106, 116
Number of clips: 24 per note (168 in total)
Split: training : tuning : test = 17 : 5 : 2 (per note)
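Where does 24 per note come from? Presumably there is one clip for each combination of 6 octaves (0 to 5) and 4 velocities. The sketch below just enumerates that grid. The octave/velocity order and the mapping from file index (do0.wav to do23.wav) to settings are my assumptions for illustration, not something stated above.

```python
from itertools import product

# Assumption: one clip per (octave, velocity) pair; the index order is hypothetical.
octaves = range(6)                  # C0 ... C5 -> octaves 0 to 5
velocities = [86, 96, 106, 116]     # the four velocities used in GarageBand

clips = list(product(octaves, velocities))
print(len(clips))                   # → 24 clips per note
print(len(clips) * 7)               # → 168 clips across the 7 notes
print(17 + 5 + 2 == len(clips))     # → True: train + tune + test covers every clip

# hypothetical mapping from file index to (octave, velocity)
index_to_setting = dict(enumerate(clips))
print(index_to_setting[0], index_to_setting[23])   # → (0, 86) (5, 116)
```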

import random

random.seed(5)

piano_notes = ['do', 're', 'mi', 'fa', 'so', 'ra', 'si']
piano_all_sounds = list(range(24))

piano_train_sounds = random.sample(piano_all_sounds, 17)

set_tune = set(piano_all_sounds) - set(piano_train_sounds)
piano_tune_sounds = random.sample(list(set_tune), 5)

set_test = set(piano_all_sounds) - set(piano_train_sounds) - set(piano_tune_sounds)
piano_test_sounds = random.sample(list(set_test), 2)

print("all_sounds : {}".format(sorted(piano_all_sounds)))
print("train_sounds : {}".format(sorted(piano_train_sounds)))
print("tune_sounds : {}".format(sorted(piano_tune_sounds)))
print("test_sounds : {}".format(sorted(piano_test_sounds)))
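The three random.sample calls above produce disjoint training, tuning and test index sets. A slightly shorter alternative is to shuffle once and slice the result. This is a sketch of that alternative, not the code I actually ran. Even with the same seed it will give a different split from the one above.

```python
import random

def split_indices(n, n_train, n_tune, n_test, seed=5):
    """Shuffle indices 0..n-1 once and slice them into three disjoint groups."""
    if n_train + n_tune + n_test > n:
        raise ValueError("split sizes exceed the number of samples")
    rng = random.Random(seed)          # local RNG, leaves the global random state alone
    indices = list(range(n))
    rng.shuffle(indices)
    train = indices[:n_train]
    tune = indices[n_train:n_train + n_tune]
    test = indices[n_train + n_tune:n_train + n_tune + n_test]
    return sorted(train), sorted(tune), sorted(test)

train, tune, test = split_indices(24, 17, 5, 2)
print("train:", train)
print("tune :", tune)
print("test :", test)
```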

Preprocessing

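I used random.seed so that the randomly drawn dataset split stays the same on every run.
I use LibROSA to extract MFCCs (mel-frequency cepstral coefficients). Feeding in the raw audio directly would make the dimensionality very large and the processing heavy. The idea is to compress it into a smaller representation that is easier to work with. MFCCs also tend to bring out the characteristics of a sound, which is another reason I chose them.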
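Here is a rough figure for how much the data shrinks. LibROSA's defaults are sr=22050 on load, n_mfcc=20, hop_length=512 and center=True. With those settings, a clip of n samples produces an MFCC matrix of shape (20, 1 + n // 512). The 4-second clip length below is only an assumption for illustration. It happens to give 173 frames.

```python
# Frame-count arithmetic for librosa.feature.mfcc with its default settings.
SR = 22050          # librosa.load resamples to 22050 Hz by default
N_MFCC = 20         # default number of MFCC coefficients
HOP_LENGTH = 512    # default hop between frames, in samples

def mfcc_shape(n_samples, n_mfcc=N_MFCC, hop_length=HOP_LENGTH):
    """Shape of the MFCC matrix when center=True (librosa's default)."""
    return (n_mfcc, 1 + n_samples // hop_length)

n_samples = 4 * SR                      # assumption: a 4-second clip
shape = mfcc_shape(n_samples)
print(n_samples)                        # → 88200 raw samples
print(shape)                            # → (20, 173)
print(shape[0] * shape[1])              # → 3460 MFCC values
```

Each frame then becomes a 20-dimensional vector. That is much easier for an SVC to handle than a window of raw samples.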

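import librosa
import numpy
from sklearn.svm import SVC

def get_mfcc(fname):
    # load as mono, resampled to librosa's default 22050 Hz
    y, sr = librosa.load(fname)
    # pass y and sr as keywords; newer librosa releases reject them positionally
    return librosa.feature.mfcc(y=y, sr=sr)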

if __name__  ==  '__main__':

    piano_note_training = []
    piano_sound_training = []

    for piano_note in piano_notes:
        print('Reading data of {}...'.format(piano_note))
        for piano_sound in piano_train_sounds:
            
            # get MFCCs: shape (20 coefficients, 173 frames)
            mfcc = get_mfcc('{}/{}{}.wav'.format(piano_note, piano_note, piano_sound))
            piano_sound_training.append(mfcc.T)
            
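            # every frame of this clip gets the clip's note index as its label
            # (plain int instead of numpy.int, which was removed in NumPy 1.24)
            label = numpy.full((mfcc.shape[1], ),
                               piano_notes.index(piano_note), dtype=int)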
            piano_note_training.append(label)
    
    piano_sound_training = numpy.concatenate(piano_sound_training)
    piano_note_training = numpy.concatenate(piano_note_training)
    print('done.\n')
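In the loop above, every MFCC frame becomes its own 20-dimensional training sample, and each frame is labelled with its clip's note. The sketch below reproduces only that reshaping step. Random arrays stand in for real MFCCs and the clip lengths are made up, so the shapes can be checked without any audio files.

```python
import numpy as np

rng = np.random.default_rng(0)
piano_notes = ['do', 're', 'mi', 'fa', 'so', 'ra', 'si']

features, labels = [], []
# Fake MFCC matrices of shape (20, n_frames); real ones come from librosa.
for note_index, n_frames in [(0, 173), (1, 150), (2, 173)]:
    mfcc = rng.normal(size=(20, n_frames))
    features.append(mfcc.T)                                  # (n_frames, 20): one row per frame
    labels.append(np.full(n_frames, note_index, dtype=int))  # same label for every frame

X = np.concatenate(features)
y = np.concatenate(labels)
print(X.shape, y.shape)   # → (496, 20) (496,)
```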

Training, tuning & testing

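I do all three of these steps in a single loop.
The reason is that the only model I use is an SVC, so the results change mainly with the gamma value.
So for each candidate gamma I train a model and immediately score it on the tuning set and on the test set.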
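Why does gamma matter so much? SVC uses the RBF kernel by default, K(x, x') = exp(-gamma * ||x - x'||^2). Gamma therefore controls how quickly the similarity between two MFCC frames falls off with distance. Below is a minimal sketch of that kernel. The two example vectors are arbitrary.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2), the default kernel of sklearn's SVC."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

x = [0.0, 0.0]
z = [30.0, 40.0]       # squared distance = 900 + 1600 = 2500

for gamma in [1e-1, 1e-3, 1e-7]:
    print(gamma, rbf_kernel(x, z, gamma))
```

With a large gamma, two frames that are not almost identical look unrelated, which risks overfitting. With a very small gamma, every frame looks alike, which leads to underfitting. That is at least consistent with accuracy peaking in the middle of the gamma sweep.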

gamma_list = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7]

for gamma in gamma_list:
    print('\n----- gamma={} -----\n'.format(gamma))
    svc = SVC(gamma = gamma)
    svc.fit(piano_sound_training, piano_note_training)
    print('----- Learning Done -----\n')

    # accuracy counters (tuning set)
    sounds_num = 0
    correct_sounds = 0
    correct_rate = 0.0

    print("----- tune -----")
    for piano_note in piano_notes:
        for piano_sound in piano_tune_sounds:
            sounds_num += 1
            mfcc = get_mfcc('{}/{}{}.wav'.format(piano_note, piano_note,piano_sound))
            prediction = svc.predict(mfcc.T)
            counts = numpy.bincount(prediction) 
            result = piano_notes[numpy.argmax(counts)] # the most frequent frame label decides the note
            original_title = '{}'.format(piano_note)

            if result == original_title:
                correct_sounds += 1
                
    correct_rate = correct_sounds / sounds_num
    print('{} : correct rate : {}%.'.format(gamma,correct_rate*100))
    
    # accuracy counters (test set)
    sounds_num = 0
    correct_sounds = 0
    correct_rate = 0.0
    
    print("----- test -----")
    for piano_note in piano_notes:
        for piano_sound in piano_test_sounds:
            sounds_num += 1
            mfcc = get_mfcc('{}/{}{}.wav'.format(piano_note, piano_note,piano_sound))
            prediction = svc.predict(mfcc.T)
            counts = numpy.bincount(prediction) 
            result = piano_notes[numpy.argmax(counts)] # the most frequent frame label decides the note
            original_title = '{}'.format(piano_note)

            if result == original_title:
                correct_sounds += 1

    correct_rate = correct_sounds / sounds_num
    print('{} correct rate : {}%\n\n.'.format(gamma, correct_rate*100))
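At prediction time the SVC labels each frame separately. The clip's note is then the most frequent frame label, found with numpy.bincount followed by numpy.argmax. Here is that voting step on its own, using made-up frame predictions. The minlength argument is my addition; it keeps the count array the same length for every clip. On a tie, argmax picks the lowest note index.

```python
import numpy as np

piano_notes = ['do', 're', 'mi', 'fa', 'so', 'ra', 'si']

def vote(frame_predictions):
    """Return the note predicted for the most frames of a clip."""
    counts = np.bincount(frame_predictions, minlength=len(piano_notes))
    return piano_notes[int(np.argmax(counts))]

# made-up per-frame predictions for one clip (indices into piano_notes)
frames = np.array([2, 2, 1, 2, 4, 2, 1])
print(vote(frames))                   # → mi
print(vote(np.array([0, 3, 3, 0])))   # tie between 'do' and 'fa' → do
```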

Output

----- gamma=0.1 -----

----- Learning Done -----

----- tune -----
0.1 : correct rate : 48.57142857142857%.
----- test -----
0.1 correct rate : 42.857142857142854%

.

----- gamma=0.01 -----

----- Learning Done -----

----- tune -----
0.01 : correct rate : 74.28571428571429%.
----- test -----
0.01 correct rate : 50.0%

.

----- gamma=0.001 -----

----- Learning Done -----

----- tune -----
0.001 : correct rate : 94.28571428571428%.
----- test -----
0.001 correct rate : 100.0%

.

----- gamma=0.0001 -----

----- Learning Done -----

----- tune -----
0.0001 : correct rate : 91.42857142857143%.
----- test -----
0.0001 correct rate : 92.85714285714286%

.

----- gamma=1e-05 -----

----- Learning Done -----

----- tune -----
1e-05 : correct rate : 77.14285714285715%.
----- test -----
1e-05 correct rate : 64.28571428571429%

.

----- gamma=1e-06 -----

----- Learning Done -----

----- tune -----
1e-06 : correct rate : 42.857142857142854%.
----- test -----
1e-06 correct rate : 21.428571428571427%

.

----- gamma=1e-07 -----

----- Learning Done -----

----- tune -----
1e-07 : correct rate : 20.0%.
----- test -----
1e-07 correct rate : 0.0%
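One way to read the log above is to pick gamma using only the tuning accuracies, then report the test accuracy for that single gamma. The numbers below are copied from the output, rounded to two decimals. The selection code sketches that protocol. It is not what the script does, since the script prints test accuracy for every gamma.

```python
# Tuning and test accuracies (%) from the output above, rounded to two decimals.
tune_accuracy = {
    1e-1: 48.57, 1e-2: 74.29, 1e-3: 94.29, 1e-4: 91.43,
    1e-5: 77.14, 1e-6: 42.86, 1e-7: 20.0,
}
test_accuracy = {
    1e-1: 42.86, 1e-2: 50.0, 1e-3: 100.0, 1e-4: 92.86,
    1e-5: 64.29, 1e-6: 21.43, 1e-7: 0.0,
}

# Choose gamma on the tuning set only; look at the test set once, afterwards.
best_gamma = max(tune_accuracy, key=tune_accuracy.get)
print(best_gamma, test_accuracy[best_gamma])   # → 0.001 100.0
```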

Thoughts

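gamma=1e-3 gave very good results (94.3% on the tuning set, 100% on the test set). The test set is only 14 clips (2 per note), so please go easy on me about that 100%...
I think splitting the data into three sets made the evaluation more reliable.
At first I ran training, tuning and testing as separate steps, but someone pointed out that this was not a good approach, so I switched to doing everything in one run. This approach won't work for NNs or DNNs, so next I'd like to look into saving models.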
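On saving models: scikit-learn estimators such as SVC can be saved with Python's pickle (or joblib). Training could then run once, with tuning and testing run later from the saved file. Below is a minimal sketch using hypothetical helpers save_model and load_model. A plain dict stands in for a trained model, so the example runs without scikit-learn.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Serialize a trained model (e.g. a fitted sklearn SVC) to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Load a model saved with save_model. Only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in for a trained SVC; with scikit-learn you would pass the fitted svc instead.
model = {"gamma": 1e-3, "classes": ['do', 're', 'mi', 'fa', 'so', 'ra', 'si']}

path = os.path.join(tempfile.mkdtemp(), "svc_gamma_0.001.pkl")
save_model(model, path)
restored = load_model(path)
print(restored == model)   # → True
```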

hama-matcha’s blog

A place to write down whatever I think

(Continued) I tried classifying piano notes
