Voice Activity Detector — transform

Voice Activity Detector. Similar to the SoX implementation.

transform_vad(
  sample_rate,
  trigger_level = 7,
  trigger_time = 0.25,
  search_time = 1,
  allowed_gap = 0.25,
  pre_trigger_time = 0,
  boot_time = 0.35,
  noise_up_time = 0.1,
  noise_down_time = 0.01,
  noise_reduction_amount = 1.35,
  measure_freq = 20,
  measure_duration = NULL,
  measure_smooth_time = 0.4,
  hp_filter_freq = 50,
  lp_filter_freq = 6000,
  hp_lifter_freq = 150,
  lp_lifter_freq = 2000
)

Arguments

sample_rate: (int): Sample rate of audio signal.
trigger_level: (float, optional): The measurement level used to trigger activity detection. This may need to be changed depending on the noise level, signal level, and other characteristics of the input audio. (Default: 7.0)
trigger_time: (float, optional): The time constant (in seconds) used to help ignore short bursts of sound. (Default: 0.25)
search_time: (float, optional): The amount of audio (in seconds) to search for quieter/shorter bursts of audio to include prior to the detected trigger point. (Default: 1.0)
allowed_gap: (float, optional): The allowed gap (in seconds) between quieter/shorter bursts of audio to include prior to the detected trigger point. (Default: 0.25)
pre_trigger_time: (float, optional): The amount of audio (in seconds) to preserve before the trigger point and any found quieter/shorter bursts. (Default: 0.0)
boot_time: (float, optional): The algorithm (internally) uses adaptive noise estimation/reduction in order to detect the start of the wanted audio. This option sets the time for the initial noise estimate. (Default: 0.35)
noise_up_time: (float, optional): Time constant used by the adaptive noise estimator for when the noise level is increasing. (Default: 0.1)
noise_down_time: (float, optional): Time constant used by the adaptive noise estimator for when the noise level is decreasing. (Default: 0.01)
noise_reduction_amount: (float, optional): Amount of noise reduction to use in the detection algorithm (e.g. 0, 0.5, ...). (Default: 1.35)
measure_freq: (float, optional): Frequency of the algorithm’s processing/measurements. (Default: 20.0)
measure_duration: (float, optional): Measurement duration. (Default: NULL, which selects twice the measurement period, i.e. consecutive measurements overlap.)
measure_smooth_time: (float, optional): Time constant used to smooth spectral measurements. (Default: 0.4)
hp_filter_freq: (float, optional): "Brick-wall" frequency of high-pass filter applied at the input to the detector algorithm. (Default: 50.0)
lp_filter_freq: (float, optional): "Brick-wall" frequency of low-pass filter applied at the input to the detector algorithm. (Default: 6000.0)
hp_lifter_freq: (float, optional): "Brick-wall" frequency of high-pass lifter used in the detector algorithm. (Default: 150.0)
lp_lifter_freq: (float, optional): "Brick-wall" frequency of low-pass lifter used in the detector algorithm. (Default: 2000.0)
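
The relationship between measure_freq and the default measure_duration can be worked through in base R. This is only a sketch: it assumes measure_freq is in Hz, so that the measurement period is 1 / measure_freq seconds, and default_measure_duration() is a hypothetical helper written for illustration, not a function exported by the package.

```r
# Hypothetical helper (not part of the package). Assumes the measurement
# period is the reciprocal of measure_freq and the default duration is
# twice that period, as stated for measure_duration.
default_measure_duration <- function(measure_freq = 20) {
  measure_period <- 1 / measure_freq
  2 * measure_period
}

measure_freq <- 20
period <- 1 / measure_freq                          # seconds between measurements
duration <- default_measure_duration(measure_freq)  # seconds covered by one measurement
overlap <- duration - period                        # seconds shared by consecutive measurements
```

Under these assumptions, each measurement window spans two periods, so every window shares half of its length with the next one.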

Value

torch::nn_module()

Details

Attempts to trim silence and quiet background sounds from the ends of recordings of speech. The algorithm currently uses a simple cepstral power measurement to detect voice, so it may be fooled by other sounds, especially music.

The effect can trim only from the front of the audio, so in order to trim from the back, the reverse effect must also be used.
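
The trim, reverse, trim, reverse pattern described above can be sketched in base R on a plain numeric vector. Note that trim_front() here is a hypothetical amplitude-threshold stand-in for any front-only trimmer; it is not the VAD algorithm, and both function names and the threshold value are assumptions made for illustration.

```r
# Hypothetical stand-in for a front-only trimmer: drops leading samples
# whose absolute amplitude is below `threshold`. This is NOT the VAD.
trim_front <- function(x, threshold = 0.1) {
  active <- which(abs(x) >= threshold)
  if (length(active) == 0) {
    return(x[0])  # nothing above the threshold: return an empty vector
  }
  x[active[1]:length(x)]
}

# Trim both ends with a front-only trimmer:
# trim the front, reverse, trim the (former) back, then reverse again.
trim_both <- function(x, trim = trim_front) {
  rev(trim(rev(trim(x))))
}

signal <- c(0, 0.01, 0.5, -0.7, 0.02, 0.6, 0, 0)
trimmed <- trim_both(signal)
```

With the module returned by transform_vad(), the same pattern would presumably apply the transform, flip the waveform along its time dimension (for example with torch::torch_flip()), apply the transform again, and flip back.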

forward param: waveform (Tensor): Audio tensor of dimension (..., time)

References

https://sox.sourceforge.net/sox.html