the labeler that started by guessing
from twenty-one spectral measurements to listening — how the labeler learned to hear.
logged
the first version of the labeler couldn’t hear.
what it did was measure. spectral centroid. zero-crossing rate. RMS energy. the shape of the sound on a graph, not the sound itself. you’d give it a kick drum and a snare and it would tell you which one had more high-frequency energy. that’s a real thing it could tell you. it just wasn’t a thing anyone needed to know.
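for a sense of what those measurements are, here’s a minimal sketch of three of them. it is not the labeler’s actual feature code: the function names, the 8 kHz sample rate, and the two toy signals (a low sine standing in for a kick, white noise standing in for a hat) are assumptions made for illustration, and the naive DFT is only sensible on short frames.

```python
import cmath
import math
import random


def rms(frame):
    """Root-mean-square energy of the frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency in Hz, using a naive DFT."""
    n = len(frame)
    total = 0.0
    weighted = 0.0
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        mag = abs(coeff)
        total += mag
        weighted += mag * k * sample_rate / n
    return weighted / total if total else 0.0


SAMPLE_RATE = 8000  # assumed for this toy example
N = 256
rng = random.Random(0)

# toy stand-ins, not real drum samples
kick_like = [math.sin(2 * math.pi * 62.5 * i / SAMPLE_RATE) for i in range(N)]
hat_like = [rng.uniform(-1.0, 1.0) for _ in range(N)]

for name, frame in (("kick-like", kick_like), ("hat-like", hat_like)):
    print(name,
          round(spectral_centroid(frame, SAMPLE_RATE)),
          round(zero_crossing_rate(frame), 3),
          round(rms(frame), 3))
```

the numbers separate the two toy signals cleanly, which is the point of the paragraph above: the measurements say where the energy sits, not what the sound is.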
twenty-one of those measurements, fed to a classifier, got us about seventy-three percent accuracy across six classes. kicks, snares, hats, claps, percs, atmospheres. the classifier was honest about its confusions. it confused some hats with some snares. it confused most claps with some hats. inside each class the variation was wider than the gap between classes. measurements alone weren’t enough.
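the confusions above are the kind of thing a confusion matrix makes visible. a minimal sketch, assuming nothing about the real classifier: the helper names are hypothetical and the ten predictions are made up for illustration, not taken from the labeler’s results.

```python
from collections import Counter

CLASSES = ["kick", "snare", "hat", "clap", "perc", "atmosphere"]


def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are the true class, columns are the predicted class."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def accuracy(matrix):
    """Share of samples on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


def worst_confusions(matrix, classes, top=3):
    """Largest off-diagonal cells as (true, predicted, count)."""
    cells = [(classes[i], classes[j], matrix[i][j])
             for i in range(len(classes))
             for j in range(len(classes))
             if i != j and matrix[i][j] > 0]
    return sorted(cells, key=lambda cell: cell[2], reverse=True)[:top]


# made-up labels for illustration only
true_labels = ["kick", "kick", "snare", "hat", "hat",
               "clap", "clap", "clap", "perc", "atmosphere"]
predicted = ["kick", "kick", "snare", "hat", "snare",
             "hat", "hat", "clap", "perc", "atmosphere"]

matrix = confusion_matrix(true_labels, predicted, CLASSES)
print(accuracy(matrix))
print(worst_confusions(matrix, CLASSES))
```

reading the off-diagonal cells, rather than the single accuracy number, is what shows which pairs of classes the features can’t pull apart.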
the gap closed when we stopped measuring and started listening. a model called CLAP (contrastive language-audio pretraining) produces embeddings that sit closer to what the sound is than to what the spectrum looks like. swapping in CLAP embeddings as features pushed the classifier from seventy-three percent on six classes to eighty-seven percent on nine: kicks, snares, hats, claps, percs, atmospheres, vocals, fx, loops. the new misses look like ones a human would also second-guess.
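the shape of that swap, embeddings in and a label out, can be sketched without the model itself. everything here is an assumption for illustration: the three-dimensional vectors are hand-made stand-ins (real CLAP embeddings are much longer), and the nearest-centroid rule is a swapped-in stand-in for whatever classifier the labeler actually uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def class_centroids(embeddings, labels):
    """Mean embedding per label."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc]
            for label, acc in sums.items()}


def predict(vec, centroids):
    """Label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# toy vectors standing in for embeddings, not CLAP output
train = [
    ([0.9, 0.1, 0.0], "kick"), ([0.8, 0.2, 0.1], "kick"),
    ([0.1, 0.9, 0.1], "vocal"), ([0.0, 0.8, 0.2], "vocal"),
    ([0.1, 0.1, 0.9], "loop"), ([0.2, 0.0, 0.8], "loop"),
]
centroids = class_centroids([v for v, _ in train], [l for _, l in train])
print(predict([0.7, 0.2, 0.1], centroids))
```

the classifier on top can stay simple; the work is being done by where the embedding puts each sound.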
there’s a step here that didn’t work and i want to name it. between the spectral version and the CLAP version there were three weeks of trying to add more spectral features. mel-frequency cepstral coefficients. chroma. spectral contrast. accuracy went up by maybe two points and then flatlined. you could see the ceiling: spectral features, six classes, this is what they buy you. the lesson wasn’t “try more measurements.” the lesson was “the measurement is wrong.” the spectrum doesn’t know what the sound is, and stacking measurements doesn’t change that.
what’s next: the labels are correct, on average, but the gap between correct and useful in a session is still wide. you don’t search for “vocal” at 2 a.m. you search for “the warm tape one with the breath in it.” that’s the thing the labels don’t yet know. so the next pass is conditioning labels on the things the user actually says when they’re looking for something. correction-as-data. the system gets better when you tell it what you meant.
the seris loom workstation is the place this all lands. the labeler is the listener. the loom is the hand.
— frank