Filtering out breath noise in real time

I have a requirement to build something that filters breath noise out of a diver’s audio stream in real time. (Think “getting rid of Darth Vader’s breath noise”.) I think we can get by with a 300-3000 Hz vocal frequency range, which by the Nyquist criterion means a sampling rate above 6000 Hz; 8 kHz, the usual telephony rate, would leave some margin. Bit depth is a separate choice: it sets dynamic range rather than frequency range, so 12 or 13 bits would not limit the bandwidth.
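To sanity-check the sampling numbers, here is a small sketch of the two separate calculations: the minimum sample rate for a given top frequency, and the theoretical dynamic range of an ideal N-bit converter. The function names are made up for illustration, and the 6.02 dB-per-bit figure assumes an ideal quantizer driven by a full-scale sine, which a real ADC will not quite reach.

```python
def min_sample_rate(f_max_hz):
    """Nyquist criterion: the sample rate must exceed twice the highest
    frequency we want to keep. This returns that lower bound in Hz."""
    return 2.0 * f_max_hz


def ideal_dynamic_range_db(bits):
    """Theoretical signal-to-quantization-noise ratio of an ideal N-bit
    converter for a full-scale sine wave (about 6 dB per bit)."""
    return 6.02 * bits + 1.76


if __name__ == "__main__":
    print("Lower bound on sample rate for 3000 Hz:", min_sample_rate(3000), "Hz")
    for bits in (12, 13, 16):
        print(bits, "bits ->", round(ideal_dynamic_range_db(bits), 1), "dB")
```

So the 3000 Hz upper limit pins down the sample rate, while the number of bits only decides how quiet a sound can be before it disappears into quantization noise.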

We probably need an FFT-based function that detects bursts of broad-spectrum, white-noise-like sound and either attenuates those buffers by 60 dB or mutes them entirely. The result has to become a digital audio stream that is merged into a video feed, so converting it back to analog may not be necessary. The audio and video do not need to be tightly synchronized, so a delay of up to 250 ms in the audio stream is acceptable, although lower is better.
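One way the detect-and-attenuate idea might look, as a rough Python sketch rather than a tested design: split the audio into short frames, take an FFT of each, and measure spectral flatness (geometric mean over arithmetic mean of the power) inside the 300-3000 Hz band. Noise-like frames score high and voiced speech, which is harmonic, scores low. Everything named here is an assumption for illustration: the 8 kHz rate, the 256-sample frame (32 ms, well inside the 250 ms budget), the 0.3 threshold, and the choice of flatness as the detector. Real breath noise through a regulator may need a different feature or threshold, and a microcontroller version would be a fixed-point C port rather than this.

```python
import cmath
import math

SAMPLE_RATE = 8000               # Hz (assumed)
FRAME_LEN = 256                  # samples per frame, power of two (32 ms at 8 kHz)
BAND = (300.0, 3000.0)           # Hz, voice band from the post
FLATNESS_THRESHOLD = 0.3         # assumed starting point; tune on real recordings
ATTENUATION = 10 ** (-60 / 20)   # 60 dB expressed as a linear gain (0.001)


def fft(x):
    """Plain recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def spectral_flatness(frame):
    """Geometric mean / arithmetic mean of in-band power.
    Values near 1 mean noise-like, values near 0 mean tonal."""
    n = len(frame)
    # Hann window to limit spectral leakage between bins
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, s in enumerate(frame)]
    spectrum = fft(windowed)
    lo = int(math.ceil(BAND[0] * n / SAMPLE_RATE))
    hi = int(BAND[1] * n / SAMPLE_RATE)
    # Small floor keeps log() defined for empty bins
    power = [abs(spectrum[k]) ** 2 + 1e-12 for k in range(lo, hi + 1)]
    log_mean = sum(math.log(p) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power))


def gate_frame(frame):
    """Attenuate the frame by 60 dB if it looks noise-like, else pass it through."""
    if spectral_flatness(frame) > FLATNESS_THRESHOLD:
        return [s * ATTENUATION for s in frame]
    return list(frame)
```

Feeding it successive 256-sample frames and writing out `gate_frame(frame)` for each gives a hard gate. Overlapping frames and fading the gain in and out over a few frames would probably sound less choppy, at the cost of a little more delay.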

Maybe a Tiny 2040 with an ADC module can handle that?

I’m mainly a software jockey with a ham license, and my math skills don’t extend to FFT stuff.

Is anybody familiar enough with what’s needed to offer any advice?