Smart Speaker: Implementing Far-Field Microphone Array for Bespoke Audio Experience

About the Client

The client is a company that develops IoT sensor networks for various industries. Their products focus on improving communication and efficiency in healthcare, warehouse management, and corporate environments, where clear and reliable data exchange is needed.
Customer
Confidential
Location
USA
Industry
IoT

Company’s Request

The client required a solution for a ceiling-mounted conferencing speaker that could effectively capture audio from multiple speakers across a room.

Technology Set

XMOS XVF3000 series DSP
Handled far-field voice processing, echo cancellation, and noise suppression.
Far-field microphone array
Captured voices from any point in the room, isolating them from the speaker's output.
PDM Microphones
Digital microphones used for direct audio signal capture without analog components.
VocalFusion software
Managed echo cancellation, noise reduction, and beamforming for clear audio processing.
I2S and PDM Clocking
Used for precise microphone sampling and synchronized audio data transfer.
Altium Designer
Used for designing the printed circuit board (PCB) layout and integrating the microphones with the speaker system.
Cirrus Logic CS43L21 DAC
Selected for handling audio sample rates of 16 kHz and 48 kHz, ensuring smooth playback and audio compatibility.
Si53xx Clock Generator
Provided precise clocking for audio processing at different sample rates.
Python + XMOS SDKs
Used for scripting and automating testing environments, along with debugging tasks.
C/C++
Employed for low-level firmware development and real-time audio processing on the XMOS platform.

The challenge was to develop a ceiling-mounted smart speaker that functioned as both a speaker and a microphone. The main issue was the proximity of the speaker to the microphones, about 5 cm, which caused the speaker’s output to overpower distant sounds such as human voices. Solving this required a far-field microphone array.

We needed an off-the-shelf far-field implementation that could be placed on the board, and chose the XMOS XVF3000 series DSP chip. This chip was designed for digital audio and came with far-field processing software. It used the XCORE processor, a multi-core architecture with 16 cores, which is different from traditional microcontrollers. It lacked hardware peripherals, so everything, including external connections like I2C and USB, had to be handled in software.

We selected Pulse Density Modulation (PDM) microphones because they are digital and don’t require additional analog components such as amplifiers.

These microphones worked well with the XCORE chip for processing audio. However, they needed a high clock rate (3 MHz), and the resulting one-bit stream required precise digital signal processing to recover the audio.

PDM in practice. Top trace: clock. Bottom trace: PDM data.
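To make the recovery step more concrete, here is a minimal sketch of how a PDM bitstream relates to ordinary audio samples. It is an illustration only, not the XMOS decimator: the function names, the first-order modulator, the simple averaging filter, and the oversampling ratio of 64 are all assumptions. Production decoders typically use multi-stage decimation filters instead of a plain average.

```python
def pdm_encode(samples, oversample=64):
    """First-order sigma-delta modulator: floats in [-1, 1] -> list of 0/1 bits."""
    bits = []
    integrator = 0.0
    feedback = 0.0  # value of the previous output bit, as +1.0 or -1.0
    for s in samples:
        for _ in range(oversample):
            integrator += s - feedback
            bit = 1 if integrator >= 0.0 else 0
            feedback = 1.0 if bit else -1.0
            bits.append(bit)
    return bits


def pdm_decode(bits, decimation=64):
    """Average each block of bits (a crude low-pass) and map back to [-1, 1]."""
    out = []
    for start in range(0, len(bits) - decimation + 1, decimation):
        ones = sum(bits[start:start + decimation])
        out.append(2.0 * ones / decimation - 1.0)
    return out


# Hypothetical usage: a constant input of 0.5 comes back as roughly 0.5.
bits = pdm_encode([0.5] * 8)
decoded = pdm_decode(bits)
```

The density of ones in the bitstream carries the signal level, which is why the microphone clock has to run far above the audio sample rate.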

PDM microphones needed to be sampled at the same time, as any delay between them could affect the audio processing. We addressed this by configuring the clock to maintain the correct phase between microphones, ensuring proper audio capture.

With the microphones working, we implemented Acoustic Echo Cancellation (AEC) and Noise Suppression. The AEC filtered out sounds from the speaker, while noise suppression cleaned up background sounds. The VocalFusion software handled this, but we fine-tuned it to match room acoustics and improve voice clarity.
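The principle behind echo cancellation can be sketched with a normalized least-mean-squares (NLMS) adaptive filter. This is a generic textbook technique shown for illustration, not the VocalFusion implementation, which is closed source. The function name, tap count, step size, and the simulated speaker-to-microphone path are all assumptions.

```python
import random


def nlms_echo_cancel(mic, ref, taps=16, mu=0.5, eps=1e-6):
    """Subtract an adaptive estimate of the loudspeaker echo from the mic signal."""
    weights = [0.0] * taps
    history = [0.0] * taps  # recent reference (speaker) samples, newest first
    residual = []
    for m, r in zip(mic, ref):
        history = [r] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = m - echo_estimate  # what remains after removing the echo
        power = sum(x * x for x in history) + eps
        step = mu * error / power
        weights = [w + step * x for w, x in zip(weights, history)]
        residual.append(error)
    return residual, weights


# Hypothetical test signal: the mic hears only the speaker through a short echo path.
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
echo_path = [0.6, 0.3, -0.2]  # made-up speaker-to-mic response
mic = [
    sum(h * ref[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
    for n in range(len(ref))
]
residual, weights = nlms_echo_cancel(mic, ref)
```

In this simulation the filter weights converge toward the made-up echo path, so the residual shrinks as the filter adapts. A real room has a much longer response and also contains near-end speech, which is why tuning to the room acoustics matters.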

While the final hardware was being developed, we built a 3D-printed prototype to test the microphone array. 

This allowed us to check the positioning and performance of the microphones. A small headphone speaker was used as a stand-in for the larger speaker that would be in the final product. 

Despite limitations in the devkit’s DAC, the prototype helped us refine the system before production.

The VocalFusion system processed audio at a 16 kHz sample rate, but most modern audio systems work with 44.1 kHz or 48 kHz. We needed a DAC that could handle both 16 kHz and 48 kHz, which led us to choose the Cirrus Logic CS43L21 DAC, as it could manage both sample rates without issues. We also explored up-sampling 16 kHz audio to 48 kHz for compatibility.
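As an illustration of what up-sampling from 16 kHz to 48 kHz involves, the sketch below inserts two zeros after every sample and then low-pass filters the result. It is a minimal example under stated assumptions, not the method used in the product: the function names, the Hamming-windowed sinc filter, and the 49-tap length are all hypothetical choices.

```python
import math


def lowpass_taps(num_taps, cutoff):
    """Hamming-windowed sinc FIR; cutoff is a fraction of the sample rate (0 to 0.5)."""
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2.0 * cutoff
        else:
            sinc = math.sin(2.0 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    return taps


def upsample(samples, factor=3, num_taps=49):
    """Zero-stuff by `factor`, then low-pass filter to remove spectral images."""
    stuffed = []
    for s in samples:
        stuffed.append(s)
        stuffed.extend([0.0] * (factor - 1))
    # Cut off at the original Nyquist frequency; gain of `factor` restores the level.
    taps = [factor * t for t in lowpass_taps(num_taps, 0.5 / factor)]
    delay = (num_taps - 1) // 2  # compensate the filter's group delay
    out = []
    for n in range(len(stuffed)):
        acc = 0.0
        for k, t in enumerate(taps):
            i = n + delay - k
            if 0 <= i < len(stuffed):
                acc += t * stuffed[i]
        out.append(acc)
    return out


# Hypothetical usage: a 1 kHz tone sampled at 16 kHz, converted to 48 kHz.
tone_16k = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(64)]
tone_48k = upsample(tone_16k)
```

Because 48 kHz is exactly three times 16 kHz, the conversion is an integer-ratio problem. Converting to 44.1 kHz would need a fractional ratio and a more involved resampler.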

The I2S protocol used in the system required precise clocking. The master clock had to be an exact multiple of the sample rate to avoid failures. We solved this using a Si53xx series clock generator from Silicon Labs, which allowed us to adjust the clock ratios for seamless audio processing.
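The exact-multiple constraint is easy to check with a little arithmetic. The sketch below is illustrative: the helper name is hypothetical, and the two master clock values are common audio frequencies chosen as examples, not figures taken from this project.

```python
def mclk_ratio(mclk_hz, sample_rate_hz):
    """Return MCLK / fs if it is a whole number, otherwise None."""
    ratio, remainder = divmod(mclk_hz, sample_rate_hz)
    return ratio if remainder == 0 else None


# Example master clocks (assumed values): 12.288 MHz and 11.2896 MHz.
for fs in (16_000, 44_100, 48_000):
    print(fs, mclk_ratio(12_288_000, fs), mclk_ratio(11_289_600, fs))
```

A 12.288 MHz clock divides evenly into both 16 kHz and 48 kHz but not into 44.1 kHz, which illustrates why a programmable clock generator is useful when more than one sample-rate family has to be supported.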

Next, we configured the microphone array geometry. The microphones had to be positioned and phased correctly to ensure that far-field processing worked. Correct phasing matters because the far-field algorithms rely on the small differences in the time at which a sound reaches each microphone.
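The size of these time differences can be estimated with a simple far-field model. The sketch below is an assumption-laden illustration: the function name is hypothetical, the 4 cm microphone spacing is a made-up value, and the source is treated as a plane wave arriving at an angle measured from the direction perpendicular to the microphone pair.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate value for air at about 20 °C


def arrival_delay_samples(mic_spacing_m, angle_deg, sample_rate_hz):
    """Delay between two mics, in samples, for a far-field source at angle_deg."""
    extra_path_m = mic_spacing_m * math.sin(math.radians(angle_deg))
    return extra_path_m / SPEED_OF_SOUND_M_S * sample_rate_hz


# Hypothetical geometry: two mics 4 cm apart, sampled at 16 kHz.
for angle in (0, 30, 90):
    print(angle, round(arrival_delay_samples(0.04, angle, 16_000), 3))
```

With spacings of a few centimetres, the delays are only a sample or two at 16 kHz, so even a small timing error between microphones is significant relative to the quantity being measured.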

After multiple iterations, we fine-tuned the array so the algorithms could filter out background noise and capture voices accurately.

Integrating the VocalFusion software with the hardware was tricky, as it was a closed-source library. We had to adjust many parameters to optimize AEC, beamforming, and noise suppression. 

After testing, we had to make the system work in real-world environments. We fine-tuned AEC to adapt to different room acoustics and tuned the adaptive noise suppression to prioritize human voices in noisy spaces. We also implemented power management so that the XCORE chip could run at high performance without overheating.

Value Delivered

Better communication experience
The solution made it possible for people in a conference room to be heard clearly, no matter where they were sitting. This improved overall communication during meetings and reduced the need to repeat information.
Cost savings in hardware
By choosing digital microphones and the XMOS chip, we simplified the design and reduced the number of components needed. This lowered the overall production cost for the client without sacrificing quality.
Increased product reliability
The smart speaker could handle background noise and echo automatically, making it dependable in different environments. Users get clear sound even in noisy or echo-filled rooms, making the product more professional and appealing to buyers.
Faster time to market
We used ready-made components and a quick prototype process to speed up development. This helped the client launch the product sooner, giving them a competitive edge in the market.
Flexibility for future products
The solution was built so that it could be easily adapted for future products or upgrades. This gave the client a flexible platform for scaling their product line without starting from scratch.