Smart Speaker: Implementing Far-Field Microphone Array for Bespoke Audio Experience
The challenge was to develop a ceiling-mounted smart speaker that functioned as both a speaker and a microphone. The core issue was the proximity between the speaker and the microphones, about 5 cm, which let the speaker's output overpower distant sounds such as human voices. Solving this required a far-field microphone array.
We needed an off-the-shelf solution we could put on the board and chose the XMOS XVF3000-series DSP chip. The chip is designed for digital audio and ships with far-field processing software. It is built on the XCORE processor, a multi-core architecture with 16 logical cores that differs from traditional microcontrollers: it has no fixed-function hardware peripherals, so everything, including external interfaces such as I2C and USB, has to be implemented in software.
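To give a feel for what "peripherals in software" means, here is a minimal bit-banged I2C master write in C. It is an illustrative sketch only: gpio_drive_low(), gpio_release(), gpio_read(), and delay_quarter_bit() are hypothetical HAL helpers, not XMOS API calls, and real XCORE firmware would use the chip's port primitives instead.

```c
/* Sketch: bit-banged I2C master write on a peripheral-less chip.
 * All GPIO/delay helpers below are hypothetical, not a real HAL. */
#include <stdbool.h>
#include <stdint.h>

void gpio_drive_low(int pin);   /* actively pull the line low          */
void gpio_release(int pin);     /* release; pull-up resistor raises it */
bool gpio_read(int pin);        /* sample the current line level       */
void delay_quarter_bit(void);   /* ~2.5 us at 100 kHz I2C              */

enum { SDA = 0, SCL = 1 };

/* Clock one bit out on SDA while toggling SCL in software. */
static void i2c_write_bit(bool bit)
{
    if (bit) gpio_release(SDA); else gpio_drive_low(SDA);
    delay_quarter_bit();
    gpio_release(SCL);          /* clock high: slave samples SDA */
    delay_quarter_bit();
    delay_quarter_bit();
    gpio_drive_low(SCL);        /* clock low: ready for next bit */
    delay_quarter_bit();
}

/* Shift a byte out MSB-first and return the slave's ACK. */
static bool i2c_write_byte(uint8_t byte)
{
    for (int i = 7; i >= 0; i--)
        i2c_write_bit((byte >> i) & 1);

    gpio_release(SDA);          /* hand SDA to the slave for ACK */
    delay_quarter_bit();
    gpio_release(SCL);
    delay_quarter_bit();
    bool ack = !gpio_read(SDA); /* slave pulls SDA low to ACK */
    delay_quarter_bit();
    gpio_drive_low(SCL);
    return ack;
}
```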
We selected PDM (pulse-density modulation) microphones because they are digital and don't require additional analog components such as amplifiers. These microphones paired well with the XCORE chip for audio processing. However, they need a high clock rate (about 3 MHz), and recovering usable audio from the resulting 1-bit stream requires precise digital signal processing, namely decimation filtering.
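As an illustration of that recovery step, the sketch below decimates a 1-bit PDM stream to 16 kHz PCM with a single boxcar stage. The 3.072 MHz clock and decimation factor of 192 are assumptions for the example; a production pipeline (such as the XMOS mic-array library) uses multi-stage CIC and FIR filters instead.

```c
/* Sketch: recover PCM audio from a 1-bit PDM stream by decimation.
 * One boxcar stage only; real designs cascade CIC + FIR filters. */
#include <stdint.h>

#define DECIMATION 192   /* assumed: 3.072 MHz / 192 = 16 kHz */

/* Convert DECIMATION raw PDM bits (packed into 24 bytes)
 * into one signed 16-bit PCM sample. */
int16_t pdm_decimate(const uint8_t bits[DECIMATION / 8])
{
    int ones = 0;
    for (int i = 0; i < DECIMATION / 8; i++) {
        uint8_t b = bits[i];
        while (b) {              /* popcount of each byte */
            ones += b & 1;
            b >>= 1;
        }
    }
    /* Map the ones-density (0..192) to a signed sample centred
     * at 0 and scaled toward INT16 full scale. */
    int centered = ones * 2 - DECIMATION;        /* -192 .. +192 */
    return (int16_t)(centered * (32767 / DECIMATION));
}
```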
Figure: PDM in practice (top trace: clock; bottom trace: PDM data).
The PDM microphones also had to be sampled simultaneously: any skew between channels would distort the phase information the downstream audio processing depends on. We addressed this by driving all microphones from a single clock and latching their data lines on the same edge, keeping the channels phase-aligned.
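A rough sketch of that phase-aligned capture, assuming four microphones and hypothetical read_pdm_bus() / wait_for_pdm_clock_edge() helpers, where one atomic read returns every data line at once:

```c
/* Sketch: latch one bit per microphone on the same shared clock
 * edge, so no channel leads or lags another. Helpers are assumed. */
#include <stdint.h>

#define NUM_MICS 4

uint8_t read_pdm_bus(void);          /* bit i = data line of mic i */
void    wait_for_pdm_clock_edge(void);

void capture_bits(uint8_t bit_out[NUM_MICS])
{
    wait_for_pdm_clock_edge();       /* single shared ~3 MHz clock */
    uint8_t bus = read_pdm_bus();    /* one atomic read: zero skew */
    for (int m = 0; m < NUM_MICS; m++)
        bit_out[m] = (bus >> m) & 1;
}
```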
With the microphones working, we implemented acoustic echo cancellation (AEC) and noise suppression. The AEC filters the speaker's own output out of the microphone signal, while noise suppression cleans up background sounds. The VocalFusion software handled both, but we fine-tuned it to match the room acoustics and improve voice clarity.
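VocalFusion's AEC is closed-source, so the following is not its algorithm; it is a textbook NLMS (normalized least-mean-squares) canceller that illustrates the underlying principle: model the speaker-to-microphone echo path with an adaptive FIR filter and subtract the predicted echo. The tap count and step size are arbitrary example values.

```c
/* Sketch: NLMS echo canceller, illustrating the AEC principle only. */
#include <string.h>

#define TAPS 256          /* echo-path model length (example value) */
#define MU   0.1f         /* adaptation step size   (example value) */

static float w[TAPS];     /* adaptive filter coefficients     */
static float x[TAPS];     /* recent far-end (speaker) samples */

/* far_end: sample sent to the speaker; mic: sample captured.
 * Returns the echo-cancelled microphone sample. */
float aec_process(float far_end, float mic)
{
    /* shift the far-end delay line */
    memmove(&x[1], &x[0], (TAPS - 1) * sizeof(float));
    x[0] = far_end;

    /* predicted echo = FIR of the far-end signal */
    float echo = 0.0f, energy = 1e-6f;
    for (int i = 0; i < TAPS; i++) {
        echo   += w[i] * x[i];
        energy += x[i] * x[i];
    }

    float err = mic - echo;   /* residual after cancellation */

    /* NLMS update: step normalized by far-end energy */
    float g = MU * err / energy;
    for (int i = 0; i < TAPS; i++)
        w[i] += g * x[i];

    return err;
}
```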
While the final hardware was being developed, we built a 3D-printed prototype to test the microphone array.
This allowed us to check the positioning and performance of the microphones. A small headphone speaker was used as a stand-in for the larger speaker that would be in the final product.
Despite limitations in the devkit’s DAC, the prototype helped us refine the system before production.
The VocalFusion system processed audio at a 16 kHz sample rate, but most modern audio systems work at 44.1 kHz or 48 kHz. We needed a DAC that could handle both, which led us to the Cirrus Logic CS43L21, as it manages both sample rates without issues. We also explored up-sampling the 16 kHz audio to 48 kHz for compatibility.
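As a sketch of that up-sampling path, the function below converts 16 kHz samples to 48 kHz by linear interpolation, exploiting the exact 3:1 ratio. A production resampler would use a polyphase anti-imaging FIR; this only illustrates the rate relationship.

```c
/* Sketch: 3x up-sampling from 16 kHz to 48 kHz by linear
 * interpolation (48000 / 16000 = 3). Illustrative only. */
#include <stdint.h>
#include <stddef.h>

/* Writes 3 * n output samples; `prev` carries the last sample of
 * the previous block so interpolation is continuous across calls
 * (initialize it to 0 before the first call). */
void upsample_16k_to_48k(const int16_t *in, size_t n,
                         int16_t *out, int16_t *prev)
{
    int32_t last = *prev;
    for (size_t i = 0; i < n; i++) {
        int32_t cur = in[i];
        out[3 * i + 0] = (int16_t)((2 * last + cur) / 3);
        out[3 * i + 1] = (int16_t)((last + 2 * cur) / 3);
        out[3 * i + 2] = (int16_t)cur;
        last = cur;
    }
    *prev = (int16_t)last;
}
```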
The I2S protocol used in the system required precise clocking: the master clock has to be an exact integer multiple of the sample rate (commonly 256x or 384x) or the interface fails. We solved this with a Silicon Labs Si53xx-series clock generator, which let us program the clock ratios needed for seamless audio processing.
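The check below captures the clocking constraint we configured in the clock generator. The 256x/384x/512x ratios are standard I2S master-clock multiples; the helper itself is illustrative, not product code.

```c
/* Sketch: verify an I2S master clock is a valid multiple of Fs. */
#include <stdbool.h>
#include <stdio.h>

static bool mclk_valid(unsigned mclk_hz, unsigned fs_hz)
{
    if (mclk_hz % fs_hz != 0)
        return false;                 /* not an integer multiple */
    unsigned ratio = mclk_hz / fs_hz;
    return ratio == 256 || ratio == 384 || ratio == 512;
}

int main(void)
{
    printf("%d\n", mclk_valid(12288000, 48000)); /* 256 x 48 kHz  -> 1 */
    printf("%d\n", mclk_valid(12288000, 44100)); /* not a multiple -> 0 */
    printf("%d\n", mclk_valid(11289600, 44100)); /* 256 x 44.1 kHz -> 1 */
    return 0;
}
```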
Next, we configured the microphone array geometry. The microphones had to be positioned and phased correctly for far-field processing to work: the algorithms rely on the small time differences between the same sound arriving at each microphone, so the geometry must be known precisely.
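To show the scale of those time differences, here is an illustrative calculation of the far-field time difference of arrival (TDOA) between two microphones. The 5 cm spacing and 45-degree angle are example values, not the product's actual geometry.

```c
/* Sketch: far-field TDOA between two mics depends on spacing and
 * source direction; beamforming inverts this relationship. */
#include <math.h>
#include <stdio.h>

#define SPEED_OF_SOUND 343.0            /* m/s at room temperature */
#define PI 3.14159265358979

/* TDOA (seconds) for two mics `spacing_m` metres apart and a
 * far-field source at `angle_deg` from the array's broadside. */
double tdoa(double spacing_m, double angle_deg)
{
    return spacing_m * sin(angle_deg * PI / 180.0) / SPEED_OF_SOUND;
}

int main(void)
{
    /* mics 5 cm apart, source 45 degrees off-axis: ~103 us, i.e.
     * under two sample periods at 16 kHz, so even small phase
     * errors between channels would swamp the signal */
    printf("TDOA: %.1f us\n", tdoa(0.05, 45.0) * 1e6);
    return 0;
}
```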
After multiple iterations, we fine-tuned the array so the algorithms could filter out background noise and capture voices accurately.
Integrating the VocalFusion software with the hardware was tricky, as it was a closed-source library. We had to adjust many parameters to optimize AEC, beamforming, and noise suppression.
After testing, we still had to make the system work in real-world environments. We fine-tuned the AEC to adapt to different room acoustics and made the noise suppression adaptive so it prioritizes human voices in noisy spaces. We also implemented power management to keep the XCORE chip's sustained high performance from causing overheating; a rough sketch of that throttling approach follows.
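The product's actual power-management code is proprietary; this generic sketch, with hypothetical read_temp_c() and set_active_cores() helpers, only illustrates the idea: shed worker cores when the die runs hot and restore them with hysteresis.

```c
/* Sketch: thermal throttling loop. All helpers are hypothetical. */
#include <stdbool.h>

float read_temp_c(void);         /* on-die temperature sensor */
void  set_active_cores(int n);   /* park or wake worker cores */
void  sleep_ms(int ms);

#define TEMP_HIGH_C 80.0f  /* start shedding load (example value)    */
#define TEMP_OK_C   70.0f  /* restore performance, with hysteresis   */

void thermal_task(void)
{
    int cores = 16;              /* XCORE: 16 logical cores */
    for (;;) {
        float t = read_temp_c();
        if (t > TEMP_HIGH_C && cores > 8)
            set_active_cores(--cores);   /* throttle gradually */
        else if (t < TEMP_OK_C && cores < 16)
            set_active_cores(++cores);   /* recover headroom   */
        sleep_ms(500);
    }
}
```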