A DIY Guide To Separating Sounds From Their Backgrounds

Unpicking audio elements from an already-mixed file is the holy grail of audio production. Joe Albano looks at the state of play, and what's currently possible in this field.  

Ever since digital audio editing reached the point where it allowed for the kinds of adjustments - like time and pitch shifting, or de-clipping an audio file - that were only dreamt of in the past, the holy grail of digital audio editors has been the ability to fully and cleanly separate individual instruments and voices that are already mixed down and embedded in a stereo audio file.

The holy grail of audio separation:

Well, we’re not all the way there yet, although current technology has come tantalizingly close, and some limited degree of separation is possible even at this point, though not reliably and certainly not artifact-free. It seems we’ll have to wait a little longer to enjoy the ability to routinely pull apart mixes (for whatever nefarious purposes we may have in mind). But what can be done today - just how far can you reasonably expect to get in the quest to remove or extract individual elements from a mix? Here’s a brief look at the current state of affairs.

M-S to the rescue

One method for trying to isolate particular elements of a stereo mix is, ironically, one of the oldest - M-S, or Mid-Side, processing. M-S is traditionally a stereo miking technique that was widely employed during the transition from mono to stereo (and is still used today).

An M-S stereo recording utilizes two mics run through a matrix - one directional mic points at the source - let’s assume a group on stage - the Mid; a second, coincident Figure-8 mic points left & right, with its null aimed at the source - the Side(s). After running through the matrix, which duplicates and polarity-flips a copy of the Side signal, the resulting stereo recording consists of a mono Mid channel and a stereo Side channel. In mono playback, the Sides cancel, leaving a pure mono recording free of the kind of phase artifacts that might plague other stereo files. Normal L-R stereo recordings can be converted to Mid-Side by running the Left and Right channels through the same matrix. This can easily be done in any DAW, though plenty of plug-ins offer the feature at the push of a button.
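As a sketch of how that matrix works - assuming simple floating-point sample arrays, with the common 0.5 scaling so the round trip is level-neutral - the sum/difference math looks like this in Python (NumPy):

```python
import numpy as np

def lr_to_ms(left, right):
    """Sum/difference matrix: Mid = (L+R)/2, Side = (L-R)/2.
    The 0.5 scaling keeps levels consistent when decoding back."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side

def ms_to_lr(mid, side):
    """Inverse matrix: L = M + S, R = M - S (the polarity-flipped
    copy of the Side signal feeds the Right channel)."""
    return mid + side, mid - side

# A centered (mono) source lands entirely in the Mid channel:
l = r = np.array([0.2, -0.5, 0.9])
mid, side = lr_to_ms(l, r)
# side is all zeros; mid equals the source
```

Note that decoding M-S back to L-R is just the same sum/difference matrix again, which is why any DAW with gain and polarity-flip controls can do the conversion without a dedicated plug-in.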

Two mics set up for recording M-S stereo


If a stereo mix is in Mid-Side format, then pulling the Mid channel all the way down will remove everything panned dead center in the mix. This can effectively remove a vocal from a mix (sometimes!), although with a caveat (and it’s a big one) - you’ll also lose everything else panned dead center, which typically includes the kick, snare, and bass. Still, the technique can effectively strip away a vocal to some degree, with some recordings. This is the method used by many of those “Vocal Eliminator” products that used to be (still are?) marketed to DJs and Karaoke fans. With a relatively dry mix that has the drums recorded in natural stereo (where the kick and snare are often a little off-center), I’ve heard this work surprisingly well, completely eliminating the vocal in a few cases, but more often than not, the results are nowhere near that clean.
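In practice, the “Vocal Eliminator” trick boils down to a single subtraction. Here’s a toy sketch - the three-sample “vocal” and “guitar” arrays are synthetic stand-ins, not real recordings:

```python
import numpy as np

def remove_center(left, right):
    """Subtract Right from Left, cancelling anything panned dead
    center. The result is the mono difference (Side) signal:
    everything identical in both channels vanishes."""
    return left - right

# A centered vocal plus a hard-left guitar:
vocal = np.array([0.3, -0.2, 0.5])
guitar = np.array([0.1, 0.4, -0.3])
left = vocal + guitar   # guitar panned hard left
right = vocal           # vocal identical in both channels
out = remove_center(left, right)
# out == guitar: the centered vocal cancels, the off-center part survives
```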

One reason is that most recordings have a fair bit of ambience and reverb on the drums and vocals, so even if you succeed in removing a centered vocal, you’re likely to hear a ghostly remainder - the vocal reverb - which, like all reverb signals, is decorrelated: the reverberation is different in the Left and Right channels, so it’s really part of the Side(s) signal.

Now, depending on the panning choices made by the mixer, you may have some luck extracting wide-panned parts via the M-S technique. If someone has, say, doubled a rhythm guitar part and panned the original and delayed copy hard left and right, then killing the center Mid channel may leave a reasonably well-isolated stereo doubled part. However, if it’s a busy mix, there’ll likely be other stuff in the Sides as well, not to mention more ghostly ambience of elements in the center. But in a simpler, somewhat drier mix, M-S techniques might work to a usable degree. As always with M-S tricks, Your Mileage May Vary. 


But what about the high-tech software behind all that amazing editing capability I mentioned up top - can’t those algorithms manage to isolate individual bits in a busy audio file? Well, yes, but again, not as neatly and cleanly as in the dream. But there is a lot that can be done.

People often look to Melodyne for this kind of thing, thanks to its (so far) unique DNA feature, which allows for time and pitch editing of individual notes in a polyphonic recording (all the other pitch editors are monophonic-only).

Melodyne’s DNA polyphonic pitch editing feature


But Celemony goes out of its way to point out that separating different instruments mixed together is not possible - only notes from the same instrument can be individually tweaked. The DNA algorithm performs Fourier Analysis on the file, identifying the various harmonics and associating them with the different pitches of the instrument in the recording. Since a single instrument can’t play more than one instance of the same note at the same time, this works; but if a few instruments were mixed together and played the same pitch simultaneously, DNA has no way to separate them (at least, not at this point).
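You can see why overlapping pitches defeat the analysis with a quick synthetic demonstration (the two “instruments” here are made-up harmonic series, not real recordings): when both play the same fundamental, every harmonic of one lands exactly on a harmonic of the other, so a single set of spectral bins holds both instruments’ energy and no analysis of that spectrum alone can apportion it between them.

```python
import numpy as np

sr, f0, n = 8000, 200.0, 8000  # one second at 8 kHz; both parts play 200 Hz
t = np.arange(n) / sr

# Two hypothetical 'instruments' with different harmonic weights
# but the same fundamental pitch:
inst_a = sum(a * np.sin(2 * np.pi * f0 * k * t) for k, a in [(1, 1.0), (2, 0.5), (3, 0.25)])
inst_b = sum(a * np.sin(2 * np.pi * f0 * k * t) for k, a in [(1, 0.8), (2, 0.1), (3, 0.6)])

spec = np.abs(np.fft.rfft(inst_a + inst_b))
freqs = np.fft.rfftfreq(n, 1 / sr)

# Every significant peak in the mix sits on a multiple of 200 Hz -
# one shared set of bins holds both instruments' energy:
peaks = freqs[spec > 0.1 * spec.max()]
# peaks cluster at 200, 400, 600 Hz
```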

That said, I did succeed somewhat in a little experiment where I tried to separate a voice and guitar—it sort of worked, with a bit (actually a fair bit) of tweaking, but when they did play overlapping identical pitches, I got bizarre artifacts - a vocal note with a guitar’s tone, or vice versa - entertaining but not usable in any practical way. Still, you can see the potential. Remember “they” also said that what DNA does wasn’t possible...until it was. 

Spectral tech

Most of the real action - at least the most promising action - in this area is in Spectral Processing. This utilizes the most up-to-date take on the same underlying techniques - Fourier Analysis and the like - identifying time, amplitude, and frequency and mapping them in a 2D or 3D graph. Algorithms then use this information to identify and either remove or extract individual components of a complex audio file, or they allow the user to visually ID specific elements and graphically select something for processing. As with the DNA limitation, even these displays can’t adequately separate multiple simultaneous notes enough to extract different musical parts - their harmonics may be visible, but they overlap, preventing effective isolation. Currently, Spectral Processing excels at separating different kinds of sounds, like non-tonal (noise) vs tonal (music) content in a recording.
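As a toy illustration of the tonal-vs-noise idea - emphatically not any commercial product’s algorithm, just the underlying principle, assuming NumPy and a synthetic test signal - a sine buried in hiss can be pulled out with a crude spectral mask:

```python
import numpy as np

rng = np.random.default_rng(0)
sr = n = 8192                           # one second of audio
t = np.arange(n) / sr
tone = np.sin(2 * np.pi * 440.0 * t)    # tonal content: a single sine
noise = 0.1 * rng.standard_normal(n)    # non-tonal content: broadband hiss
mix = tone + noise

# Fourier analysis: tonal energy piles up in a few bins, while
# noise spreads thinly across the whole spectrum.
spectrum = np.fft.rfft(mix)
mag = np.abs(spectrum)

# Crude spectral mask: keep only bins far above the median level.
mask = mag > 10 * np.median(mag)
tonal_est = np.fft.irfft(spectrum * mask, n=n)    # the 'music'
noise_est = np.fft.irfft(spectrum * ~mask, n=n)   # the 'noise'
```

Real spectral processors work frame by frame with overlapping windows and far smarter masks, but the principle is the same: tonal energy concentrates in a handful of bins, while noise doesn’t.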

As manufacturers have gotten more sophisticated with this, they’ve begun to apply machine learning and other advanced internal analysis methods, which try to learn the sonic signature of a specific sound and then identify it in a complex wave even when the user wouldn’t be able to do so visually. iZotope’s latest iteration of its RX audio repair software includes a number of modules that do a surprisingly effective job of isolating a voice and extracting, removing, or at the very least rebalancing it from within a busy audio file.

Some of RX6’s tools for extracting or removing voice and other audio signals from a complex audio file


Several modules automatically remove unwanted artifacts from vocal recordings (De-noise, De-rustle, De-wind), surface noise from old vinyl (De-click, De-crackle), and even excessive ambience (De-reverb); Dialogue Isolate can sometimes succeed in extracting a voice from a busy background; and I’ve used the De-bleed, Deconstruct, and Spectral Repair modules to remove fairly prominent leakage (headphone bleed, a creaky chair, a distant siren) from various recordings. Clearly, this technology has the potential to dig in a little deeper and learn to identify different instruments, but at this point that’s still a hit-or-miss proposition.

Future tech

Now, there are a few companies promising that they’ve managed to achieve the goal of voice and instrument isolation/extraction. However, as far as I can tell, the results are still a bit spotty. Some demos are, in fact, quite impressive, but others exhibit all the familiar artifacts usually encountered with this type of processing at its current level. But when it does work it’s amazing, and it does seem like this capability could be just over the horizon. 

So while I wouldn’t hold my breath, I would keep an eye open for emerging developments in the area of audio extraction. Like all that other advanced processing we currently take for granted and that we were at one point promised “couldn’t be done”, full audio separation could very well be appearing pretty soon in a DAW near you.

Learn more about audio repair and forensics in the Ask Audio Academy.

Joe is a musician, engineer, and producer in NYC. Over the years, as a small studio operator and freelance engineer, he's made recordings of all types from music & album production to v/o & post. He's also taught all aspects of recording and music technology at several NY audio schools, and has been writing articles for Recording magaz...

