Dr. Ming Ma, Winona State University
Decades of music exist in the world with no way to obtain the original multitrack recordings they were composed from, whether through lack of access or because the recordings have been lost to time. Music source separation is the process of using technology to isolate the individual components of a song after they have been arranged and mixed together. These components, or stems, can be the vocals, drums, bass, or any other instrument in the song. By performing complex analysis on the structure of songs to determine which frequencies or waveform patterns belong to particular sources, a convolutional neural network can be trained to cleanly separate these sources from the full mix. This separation can operate on either a spectrogram or the raw waveform. We examine the existing methods used by Spleeter, Demucs, and Wave-U-Net (f90). In the first phase of the experiment we implement three different U-Net architectures based on these methods, using both one-dimensional and two-dimensional convolution. By examining the metrics SDR (Signal-to-Distortion Ratio), SAR (Signal-to-Artifacts Ratio), and ISR (source Image-to-Spatial distortion Ratio), we determine the best architecture. In phase two we build on this architecture, adding a secondary U-Net to further refine the results.
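As an illustration of how separation quality is scored, the sketch below computes a simplified SDR: the ratio, in decibels, of the reference stem's energy to the energy of the residual error in the estimate. This is a minimal NumPy approximation for intuition only; the full BSS Eval metrics (SDR, SAR, ISR) used in the paper decompose the error into interference, noise, and artifact terms.

```python
import numpy as np

def sdr(reference, estimate):
    """Simplified signal-to-distortion ratio in dB.

    Ratio of reference-signal energy to residual (error) energy;
    higher values indicate a cleaner separation.
    """
    error = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

# Toy "stem": a 440 Hz sine, plus a slightly noisy estimate of it
rng = np.random.default_rng(0)
ref = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
est = ref + 0.01 * rng.standard_normal(ref.shape)
print(f"SDR: {sdr(ref, est):.1f} dB")
```

A noisier estimate yields a lower score, which is why these metrics can rank candidate architectures against each other on a common test set.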
Henning, Ryan; Choudhry, Abdullah; and Ma, Ming, "Deep Learning Based Music Source Separation," SCSU Journal of Student Scholarship: Vol. 1, Iss. 2, Article 3.
Available at: https://repository.stcloudstate.edu/joss/vol1/iss2/3