Saturday, February 29, 2020

toolbox :: voice-over

contents
1. voice: a) detection* b) levels and record [alsamixer, audacity] c) cut and polish [goldwave]
2. tts
3. sync: solution is to slow the video down (to about 0.4) for the clap, then work in pieces
4. notes

* typically requires ~/.asoundrc authorship


This post concerns two voiceover methods -- one human, the other using Text-To-Speech (TTS). Both of course have to be synced to video. Hopefully, the post also helps those doing audio w/o video.

1. Voice

input

I currently use a USB-connected, non-XLR microphone, the Fifine K053 ($25, USB ID 0c76:161f).

Cleaner options are available at higher price points, e.g., phantom power supplies to prevent hum, or XLR pre-amping. Either way, what we want is to see the mic's input levels as we speak, and to adjust those levels in real time. The first step, of course, is detection.

detection (and .asoundrc)

Links: mic detection primer :: asoundrc primer :: Python usage


If using ALSA, we need to know how the system detects and categorizes the mic. Then we'll put that information in ~/.asoundrc, so it can be used by any sound software. However, if you have the dreaded two-sound-card situation, which always occurs with HDMI, you've got to straighten that out first and reboot. The HDMI situation works best when I unplug the mic before booting, then plug it in after login; typically the mic then becomes Device 2, after Device 0 and HDMI 1. Everything else with Audacity and alsamixer below is the same.

$ arecord -l
**** List of CAPTURE Hardware Devices ****
card 0: SB [HDA ATI SB], device 0: ALC268 Analog [ALC268 Analog]
Subdevices: 1/1
Subdevice #0: subdevice #0
card 1: Device [USB PnP Audio Device], device 0: USB Audio [USB Audio]
Subdevices: 1/1
Subdevice #0: subdevice #0
...indicating the input is plughw:1,0. To double-check...
$ cat /proc/asound/cards
0 [SB ]: HDA-Intel - HDA ATI SB
HDA ATI SB at 0xf6400000 irq 16
1 [Device ]: USB-Audio - USB PnP Audio Device
USB PnP Audio Device at usb-0000:00:12.0-3, full speed
...appears good. If we'd also like to know which kernel module was detected...
$ lsusb -t
[snip]
|__ Port 1: Dev 2, If 0, Class=Human Interface Device, Driver=usbhid, 1.5M
|__ Port 3: Dev 3, If 0, Class=Audio, Driver=snd-usb-audio, 12M
|__ Port 3: Dev 3, If 1, Class=Audio, Driver=snd-usb-audio, 12M
|__ Port 3: Dev 3, If 2, Class=Audio, Driver=snd-usb-audio, 12M
|__ Port 3: Dev 3, If 3, Class=Human Interface Device, Driver=usbhid, 12M
[snip]
...so we know the kernel is using snd-usb-audio for the mic.
Test the hardware config with a 15-second recording:
$ arecord -f cd -D plughw:1,0 -d 15 test.wav
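
With the card numbering established, a minimal ~/.asoundrc can point default capture at the USB mic while leaving playback on the onboard card. The following is only a sketch assuming the layout above (mic on card 1, onboard audio on card 0); adjust the hw references if the numbering differs on a given boot.

# ~/.asoundrc -- sketch: USB mic on card 1, onboard playback on card 0
pcm.usbmic {
    type plug
    slave {
        pcm "hw:1,0"
    }
}
pcm.!default {
    type asym
    capture.pcm "usbmic"
    playback.pcm "plughw:0,0"
}

With default capture pointed at the mic, the -D plughw:1,0 flag becomes optional, and any ALSA-aware program will find the mic without per-application settings.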

real-time monitoring

When recording, we need to see the waveform and adjust levels as we go; otherwise we waste a half hour on arecord test WAVs and alsamixer tweaks just to set levels. Audacity is the only application I currently know of that shows the waveform in real time during recording. I open Audacity in the GUI, then overlay alsamixer in a terminal, from which I can set mic levels while recording a test WAV.

There are typically problems during playback. The sample rate may need to be set to 48000 depending on the chipset. Audacity uses PortAudio, not PulseAudio or ALSA, and I've found it's critical to select accurately from the dozen or so playback devices it offers -- "default" almost never works. Often the best playback device will have some oddball name like "vdownmix", but that doesn't matter; just use it.

2. TTS

tts - colab account

Create a notebook in Colab, then go to Google Cloud, enable the Text-to-Speech API, and get a corresponding JSON credential. When the TTS Python code runs in Colab, the script calls that API using the JSON credential. Of course, it's the API that does the heavy work of translating the text into an MP3 (mono, 32 kbps, 22000 Hz). A minimal sketch follows the links below.

Voiceover in Colab (1:39) DScience4you, 2020. Download and watch at 40% speed. Excellent.
The Json API process (8:14) Anders Jensen, 2020.
Google TTS API (page) Google, continuously updated. Explains the features of the API, for example the settings for language and pitch.
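
Putting that together, here is a minimal sketch of the Colab cell, assuming the google-cloud-texttospeech client library and a service-account credential uploaded to the notebook as key.json (both the filename and the voice settings are placeholders; the API page above lists the real options):

import os
from google.cloud import texttospeech

# point the client at the JSON credential from the Cloud console
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "key.json"
client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Text of the voiceover script."),
    voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3),
)

with open("output.mp3", "wb") as out:
    out.write(response.audio_content)

The MP3 lands in the Colab file pane, from which it can be downloaded for syncing.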

tts - local system

The TTS workflow has four major steps, some pieces of which are reusable: 1) producing the text to be read, 2) writing the code that translates the text to speech, 3) in the code, selecting the translation settings for that project, and 4) editing the output WAVs or MP3s. All four must be tweaked to match formats, described in another post:
  • script :: any text editor will do.
  • code :: typically Python, as described here (11:25), using gTTS (a minimal sketch follows this list).
    • # pacman -S python-pip python-pipreq pipenv to get pip. Pipenv is a virtual-environment (VE) sandbox, necessary so that pip doesn't break pacman's index or otherwise disturb the system. Also, pip with --user doesn't require sudo, e.g.: $ pip install --user gtts.
  • Name the script anything besides "gtts.py", since that shadows the library and spawns import errors. It is the gtts library which interfaces with the Cloud Google TTS API.
  • The "=" is used for variables. The output file is not a variable, so just put the parens near the method: output.save("output.mp3")
  • TTS engine :: varies from eSpeak to the Google API gtts, from which we import gTTS.
  • audio :: levels, pauses, file splitting, and so on. WAV is the easiest format to work with, e.g., in SoX, GoldWave, Jokosher, etc.
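
The sketch referenced above, assuming gTTS was installed with the pip line above and that the script text sits in a placeholder file named script.txt:

# name this file something other than gtts.py so it doesn't shadow the library
from gtts import gTTS

with open("script.txt") as f:    # the voiceover text
    text = f.read()

output = gTTS(text, lang="en")   # the gtts library calls Google's TTS service
output.save("output.mp3")        # method call, not an assignment

Run it with plain $ python (or pipenv run python inside the sandbox) and the MP3 appears in the working directory.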


hardware

Some current and planned audio and video hardware

video

  • GoPro Hero
  • various old cell phones, if they charge up and have room for an SD card, I can use.

  • public domain films - archive.org
Video editing. The first problem is the inputs. Video clips arriving from any of, say, 10 devices has meant that organizing video, audio, software, and hardware seemed impossible to me. For years. Appropriating UML toward this chaos became the first relief, then other pieces eventually fell into place.

A second problem with TTS is that it relies on Python for something with as much speech as a script. That means either doing a separate, non-Arch Python install and using pip to get packages, or sticking with Arch and avoiding pip in favor of yay. Blending pip updates with Arch means disaster the next time you attempt a simple pacman -Syu. The problem with yay/AUR is that it may not have the modules needed to run TTS.

I do separate installs for anything that has its own updaters: Python, LaTeX, Git. It's a slight pain in the ass making sure all the paths are updated, but I can keep them current without harming my Arch install and updates.

UML

Unified Modeling Language ("UML") is for software, but we know its various diagrams work well for, e.g., workflow, files, and concepts. Derek, per usual, has a concise conceptual overview (12:46, 2012).

We can create UML in UMLet or Dia, though we'll eventually want to do these in LaTeX. Lists, especially with links and so on, are best created in HTML (Geany/GTK). It's good to keep backups of these overarching UML and HTML documents on Drive.
  • Dia: # pacman -S dia/$ dia :: select the UML set of graphic icons, switch to letter-size from A4. Files are natively saved with DIA extensions, but can be exported to PNG.
  • UMLet: $ yay -S umlet/$ umlet :: UMLet is the software in Derek's video above. Sadly, it lets Oracle into one's system by requiring Java headers and some other files. UMLet's native extension is UXF (their XML schema), but files can be exported to graphical formats.

audio

(UML on-file) Like video, audio arrives in any number of formats and must be standardized prior to remixing with video. Some say to sample at 24-bit, some say 16 is just fine. Capturing old CDs, maybe 24/192kb, but it's also fine to do 16/192kb. For video or backup DVD, 16/192kb at 44100 seems to be a good output for remixing. A scripted conversion sketch follows the tool list below.
  • GoldWave: (proprietary) :: installs easily in a Wine bottle. Does well with WAV.
  • Removed audio: audio source; audio separated from raw video for further processing.
  • TTS: audio source; creating a script and using some method to convert it to audio for the video.
  • Mixxx: combine music, TTS, and anything else at proper levels.
  • Qjackctl: # pacman -S qjackctl/$ qjackctl :: JACK manager to work alongside ALSA.
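
The conversion sketch mentioned above. It assumes the pydub library (not otherwise discussed in this post) plus ffmpeg on the system, and placeholder filenames; SoX or GoldWave do the same job interactively.

# normalize an arbitrary clip to 16-bit / 44100 Hz WAV before remixing
from pydub import AudioSegment

clip = AudioSegment.from_file("input.m4a")              # any format ffmpeg reads
clip = clip.set_frame_rate(44100).set_sample_width(2)   # 44100 Hz, 16-bit
clip.export("standardized.wav", format="wav")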

pulseaudio notes

Links: PulseAudio configuration :: PulseAudio examples

Go to /etc/pulse/ for the two main configuration files: default.pa and client.conf. Any sink you add in ALSA will be detected in PulseAudio; e.g., # modprobe snd_aloop will turn on at least one loopback device.

Multiple Audio Collection (11:29) Kris Occhipinti, 2017. How to do multiple audio sources.
