Side Note: TIME CALCULATOR. This post is FFMPEG, but there's a UAV guy who has a YT channel, who works nearly exclusively with SHOTCUT. Some of the effects are amazing, eg his video on smoothing. There's also an FFMPEG webpage with pan tilt and zooming info not discussed this post. For smoother zooming, look here at pzoom possibilities. Finally, for sound and video sync, this webpage is the best I've seen. Sync to the spike.
Suppose I use a phone for a couple short videos, maybe along the beach. One is 40 seconds, the other 12. On the laptop I later attempt to trim away any unwanted portions, crossfade them together, add some text, and maybe add audio (eg. narration, music). This might take 2 hours the first or second attempt: it takes time for workflows to be refined/optimized for one's preferences. Although production time decreases somewhat with practice, planning time is difficult to eliminate entirely: every render is lossy and the overall goal (in addition to aesthetics) is to accomplish editing with a minimum number of renders.
normalizing
The two most important elements to normalize before combining clips of different sources are the timebase and the fps. Ffmpeg can handle most other differing qualities: aspect ratios, etc. There are other concerns for normalizing depending on what the playback device is. I've had to set YUV on a final render to get playback on a phone before. But this post is mostly about editing disparate clips.
Raw video from the phone is in 1/90000 timebase (ffmpeg -i), but ffmpeg natively renders at 1/11488. Splicing clips with different timebases fails, eg crossfades will exit with the error...
First input link main timebase (1/90000) do not match the corresponding second input link xfade timebase (1/11488)
Without a common timebase, the differing "clocks" cannot achieve a common outcome. It's easy to change the timebase of a clip, however it's a render operation. For example, to 90000...
$ ffmpeg -i video.mp4 -video_track_timescale 90000 output.mp4
If I'm forced to change timebase, I attempt to do other actions in the same command, so as not to waste a render. As always, we want to render our work as few times as possible.
separate audio/video
Outdoor video often has random wind and machinery noise. We'd like to turn it down or eliminate it. To do this, we of course have to separate the audio and video tracks for additonal editing. Let's take our first video, "foo1.mp4", and separate the audio and video tracks. Only the audio is rendered, if we remember to use "-c copy" on the video portion, to prevent video render.
$ ffmpeg -i foo1.mp4 -vn -ar 44100 -ac 2 audio.wav
$ ffmpeg -i foo1.mp4 -c copy -an video1.mp4
cropping*
*CPU intensive render, verify unobstructed cooling.
This happens a lot with phone video. We want some top portion but not the long bottom portion. Most of my stuff is 1080p across the narrow portion, so I make it 647p tall for a 1.67:1 golden ratio. 2:1 would also look good.
$ ffmpeg -i foo.mp4 -vf "crop=1080:647:0:0" -b 5M -an cropped.mp4
The final zeroes indicate to start measuring pixels in upper left corner for both x and y axes respectively. Without these, the measurement starts from center of screen. Test the settings with ffplay prior to the render. Typically anything with action will require 5M bitrate on the render, but this setting isn't needed during the ffplay testing, only the render.
cutting
Cuts can be accomplished without a render if the "-c copy" flag is used. Copy cuts occur on the nearest keyframe. If a cut requires the precision of a non-keyframe time, the clip needs to be re-rendered. The last one in this list is an example.
- no recoding, save tail, delete leading 20 seconds. this method places seeking before the input and it will go to the closest keyframe to 20 seconds.
$ ffmpeg -ss 0:20 -i foo.mp4 -c copy output.mp4
- no recoding, save beginning, delete tailing 20 seconds. In this case, seeking comes after the input. Suppose the example video is 4 minutes duration, but I want it to be 3:40 duration.
$ ffmpeg -i foo.mp4 -t 3:40 -c copy output.mp4
- no recoding, save an interior 25 second clip, beginning 3:00 minutes into a source video
$ ffmpeg -ss 3:00 -i foo.mp4 -t 25 -c copy output.mp4
- a recoded precision cut
$ ffmpeg -i foo.mp4 -t 3:40 -strict 2 output.mp4
2. combining/concatentation
Also see further down the page for final audio and video recombination. The section here is primarily for clips.
codec and bitrate
If using ffmpeg, then mpeg2video is the fastest lib, but also creates the largest files. Videos with a mostly static image, like a logo, may only require 165K video encoding. A 165K example w/125K audio, 6.5MB for about 5 minutes. That said, bitrate is the primary determiner of rendered file size. Codec is second but important, eg, libx264 can achieve the same quality at a 3M bitrate for which mpeg2video would require a 5M bitrate.
simple recombination 1 - concatenate (no render)
The best results come from combine files with least number of renders. This way does it without rendering... IF files are the same pixel size and bit rate, this way can be used. Put the names of the clips into a new TXT file, in order of concatenation. Absolute paths is a way to be sure. Each clip takes one line. The one here shows both without and with absolute path.
$ nano mylist.txt
# comment line(s)
file video1.mp4
file '/home/foo/video2.mp4'
The command is simple.
$ ffmpeg -f concat -safe 0 -i mylist.txt -c copy output.mp4
Obviously, transitions which fade or dissolve require rendering; either way, starting with a synced final video, with a consistent clock (tbr), makes everything after easier.
simple recombination 2 - problem solving
Most problems come from differing tbn, pixel size, or bit rates. TBN is the most common. It can be tricky though, because the video after the failing one appears to cause the fail. Accordingly comment out files in the list to find the fail, then try replacing the one after it.
- tbn: I can't find in the docs whether the default ffmpeg clock is 15232, or 11488, I've seen both. Most phones are on a 90000 clock. If the method above "works", but reports many errors and the final the time stamp is hundreds of minutes or hours long, then it must be re-rendered. Yes it's another render, but proper time stamps are a must. Alternatively, I suppose a person could re-render each clip with the same clock. I'd rather do the entire file. As noted higher up in the post, raw clips from a phone usually use 1/90000 but ffmpeg uses 1/11488. It's also OK to add a gamma fix or anything else, so as not to squander the render. The example here I added a gamma adjustment
$ ffmpeg -i messy.mp4 -video_track_timescale 11488 [or 15232] -vf "eq=gamma=1.1:saturation=0.9" output.mp4
combine - simple, one file audio and video
I still have to specify the bitrate or it defaults too low. 3M for sports, 2M for normal person talking.
$ ffmpeg -i video.mp4 -i audio.wav -b:v 3M output.mp4
combine - simple, no audio/audio (render)
If the clips are different type, pixel rate, anything -- rendering is required. Worse, mapping is required. Leaving out audio makes it slightly less complex.
ffmpeg -i video1.mp4 -i video2.flv -an -filter_complex \
"[0:v][1:v] concat=n=2:v=1 [outv]" \
-map "[outv]" out.mp4
Audio adds an additional layer of complexity
ffmpeg -i video1.mp4 -i video2.flv -filter_complex \
"[0:v][0:a][1:v][1:a] concat=n=2:v=1:a=1 [outv] [outa]" \
-map "[outv]" -map "[outa]" out.mp4
combining with effect (crossfade)
*CPU intensive render, verify unobstructed cooling.
If you want a 2 second transition, run the offset number 1.75 - 2 seconds back before the end of the fade-out video. So, if foo1.mp4 is a 12 second video, I'd run the offset to 10, so it begins fading in the next video 2 seconds prior to the end of foo1.mp4. Note that I have to use filter_complex, not vf, because I'm using more than one input. Secondly, the offset can only be in seconds. This means that if the first video were 3:30 duration, I'd start the crossfade at 3:28, so the offset would be "208".
$ ffmpeg -i foo1.mp4 -i foo2.mp4 -filter_complex xfade=transition=fade:duration=2:offset=208 output.mp4
If you want to see how it was done prior to the xfade filter, look here, as there's still a lot of good information on mapping.
multiple clip crossfade (no audio)
Another scenario is multiple clips with the same transition, eg a crossfade. In this example 4 clips (so three transitions), each clip 25 seconds long. A good description.
$ ffmpeg -y -i foo1.mp4 -i foo2.mp4 \
-i foo3.mp4 -i foo4.mp4 -filter_complex \
"[0][1:v]xfade=transition=fade:duration=1:offset=24[vfade1]; \
[vfade1][2:v]xfade=transition=fade:duration=1:offset=48[vfade2]; \
[vfade2][3:v]xfade=transition=fade:duration=1:offset=72" \
-b:v 5M -s wxga -an output.mp4
Some additional features in this example:
y to overwrite prior file,
5M bitrate, and size
wxga, eg if reducing quality slightly from 4K to save space. Note that the duration mesh time increases the total offset cumulatively. I touched a TXT file and entered the values for each clip and its offset. Then just "cat"ted the file to see all the offset values when I built my command. Suppose I had like 20 clips? The little 10ths and so on might add up. Offset numbers off by more than a second will not combine with the next clip, even though syntax is otherwise correct.
$ cat fooclips.txt
foo1 25.07 - 25 (24)
foo2 25.10 - 50 (48)
foo3 25.10 - 75 (72)
foo4 (final clip doesn't matter)
multiple clip crossfade (with audio)
This is where grown men cry. Have a feeling if I can get it once, won't be so bad going forward but, for now, here's some information. It appears some additional programs beide crossfade and xfade.
fade-in/out*
*CPU intensive render, verify unobstructed cooling.
If you want a 2 second transition, run the offset number 1.75 - 2 seconds back before the end of the fade-out video. Let's say we had a 26 second video, so 24 seconds.
$ ffmpeg -i foo.mp4 -max_muxing_queue_size 999 -vf "fade=type=out:st=24:d=2" -an foo_out.mp4
color balance
Recombining is also a good time to do even if just a basic eq. I've found that general color settings (eg. 'gbal' for green) have no effect, but that fine grain settings (eg. 'gs' for green shadows) has effects.
$ ffmpeg -i video.mp4 -i sound.wav -vf "eq=gamma=0.95:saturation=1.1" codec:v copy recombined.mp4
There's an easy setting called "curves", like taking an older video and moving midrange from .5 to .6 helps a lot. Also, if bitrate is specified, give it before any filters; bitrate won't be detected after the filter.
$ ffmpeg -i video.mp4 -i sound.wav -b:v 5M -codec:v mpeg2video -vf "eq=gamma=0.95:saturation=1.1" recombined.mp4
Color balance intensity of colors. There are 9 settings - 1 each for RGB (in that order) for shadows, midtones, and highlights, separated by colons. For example, if I wanted to decrease the red in the highlights and leave all others unchanged...
$ ffmpeg -i video.mp4 -i -b:v 5M -vf "colorbalance=0:0:0:0:0:0:-0.4:0:0" output.mp4
A person can also add -pix_fmt yuv420p, if they want to make it most compatible with Windows
FFMpeg Color balancing(3:41) The FFMPEG Guy, 2021. 2:12 color balancing shadows, middle, high for RGB
why films are shot in 2 colors (7:03) Wolfcrow, 2020. Notes that skin is the most important color to get right. The goal is often to go with two colors on oppposite ends of the color wheel or which are complementary.
PBX - on-site or cloud (35:26) Lois Rossman, 2016. Cited mostly for source 17:45 breaks down schematically.
PBX - true overhead costs (11:49) Rich Technology Center, 2020. Average vid, but tells hard facts. Asteriks server ($180) discussed.
adding text
*CPU intensive render, verify unobstructed system cooling.
For one or more lines of text, we can use the "drawtext" ffmpeg filter. Suppose we want to display the date and time of a video, in Cantarell font, for six seconds, in the upper left hand corner. If we have a single line of text, we can use ffmpeg's simple filtergraph (noted by "vf"). 50 pt font should be sufficient size in 1920x1080 video.
$ ffmpeg -i video.mp4 -vf "[in]drawtext=fontfile=/usr/share/fonts/cantarell/Cantarell-Regular.otf:fontsize=50:fontcolor=white:x=100:y=100:enable='between(t,2,8)':text='Monday\, January 17, 2021 -- 2\:16 PM PST'[out]" videotest.mp4
Notice that a backslash must be added to escape special characters: Colons, semicolons, commas, left and right parens, and of course apostrophe's and quotation marks. For this simple filter, we can also omit the [in] and [out] labels. Here is a screenshot of how it looks during play.
Next, supposing we want to organize the text into two lines. We'll need one filter for each line. Since we're still only using one input file to get one output file, we can still use "vf", the simple filtergraph. 10pixels seems enough to separate the lines, so I'm placing the second line down at y=210.
$ ffmpeg -i video.mp4 -vf "drawtext=fontfile=/usr/share/fonts/cantarell/Cantarell-Regular.otf:fontsize=50:fontcolor=white:x=100:y=150:enable='between(t,2,8)':text='Monday\, January 18\, 2021'","drawtext=fontfile=/usr/share/fonts/cantarell/Cantarell-Regular.otf:fontsize=50:fontcolor=white:x=100:y=210:enable='between(t,2,8)':text='2\:16 PM PST'" videotest2.mp4
We can continue to add additional lines of text in a similar manner. For more complex effects using 2 or more inputs, this 2016 video is the best I've seen.
Ffmpeg advanced techniques pt 2 (19:29) 0612 TV w/NERDfirst, 2016. This discusses multiple input labeling for multiple filters.
PNG incorporation
If I wanted to do several lines of information, an easier solution than making additional drawtexts, is to create a template the same size as the video, in this case 1980x1080. Using, say GiMP, we could create picture with an alpha template with several ines that we might use repeatedly, and save in Drive. There is then an ffmpeg command to superimpose a PNG over the MP4.
additional options (scripts, text files, captions, proprietary)
We of course have other options for skinning the cat: adding calls to text files, creating a bash script, or writing python code to call and do these things.
The simplest use of a text files are calls from the filter in place of writing the text out each filter.
viddyoze: proprietary,online video graphics option. If no time for Blender, pay a little for the graphics and they will rerender it on the site.
viddyoze review (14:30) Jenn Jager, 2020. Unsponsored review. Explains most of the 250 templates. Renders to quicktime (if alpha), or MP4 is not.~12 minute renders
adding text 2
Another way to add text is by using subtitles, typically with an SRT file. As far as I know so far, these are controlled by the viewer, meaning not "forced subtitles" which override viewer selection. here's a page. I've read some sites on forced subtitles but haven't yet been able to do this with ffmpeg.
audio and recombination
Ocenaudio makes simple edits sufficient for most sound editing. It's user friendly along the lines of the early Windows GoldWave app 25 years ago. I get my time stamp from the video first.
$ ffmpeg -i video.mp4
Then I can add my narration or sound after being certain that the soundtrack is exactly a map for the timestamp of the video. I take the sound slightly above neutral "300" when going to MP3 to compensate for transcoding loss. 192K is typically clean enough.
$ ffmpeg -i video.mp4 -i sound.wav -acodec libmp3lame -ar 44100 -ab 192k -ac 2 -vol 330 -vcodec copy recombined.mp4
I might also resize it for emailing, down to VGA or SVGA size. Just change it thusly...
$ ffmpeg -i video.mp4 -i sound.wav -acodec libmp3lame -ar 44100 -ab 192k -ac 2 -vol 330 -s svga recombined.mp4
$ ffmpeg -i video.mp4 -i sound.wav -acodec libmp3lame -ar 44100 -ab 192k -ac 2 -vol 330 -vcodec copy recombined.mp4
I might also resize it for emailing, down to VGA or SVGA size. Just change it thusly...
$ ffmpeg -i video.mp4 -i sound.wav -acodec libmp3lame -ar 44100 -ab 192k -ac 2 -vol 330 -s svga recombined.mp4
For YouTube, there's a recommended settings page, but here's a typical setup:
$ ffmpeg -i video.mp4 -i sound.wav -vcodec copy recombined.mp4
ocenaudio - no pulseaudio
If a person is just using alsa, without any pulse, they may have difficulty (ironically) using ocenaudio, if HDMI is connected. A person has to go into Edit->preferences, select the ALSA backend, and then play a file. Keep trying your HDMI ports until you luck on the one with EDID approved.
audio settings for narration
To separately code the audio in stereo 44100, 192K, ac 2, some settings below for Ocenaudio: just open it and hit the red button. Works great. Get your video going, then do the audio.
Another audio option I like is to create a silent audio file exactly the length of the video, and then start dropping in sounds into the silence, hopefully in the right place with audio I may have. Suppose my video is 1:30.02, or 90.02 seconds
$ sox -n -r 44100 -b 16 -c 2 -L silence.wav trim 0.0 90.02
Another audio option is to used text to speeech (TTS) to manage some narration points. The problem is how to combine all the bits into a single audio file to render with the audio. The simplest way seems to be to create the silence file then blend. For example, run the video in a small window, open ocenaudio and paste at the various time stamps. By far the most comprehensive espeak video I've seen.
How I read 11 books(6:45) Mark McNally, 2021. Covers commands for pitch, speed, and so on.
phone playback, recording
The above may or may not playback on a phone. If one wants to be certain, record a 4 second phone clip and check its parameters. A reliable video codec for video playback on years and years of devices:
-vcodec mpeg2video -or- codec:v mpeg2video
Another key variable is yuv. Eg,...
$ ffmpeg -i foo.mp4 -max_muxing_queue_size 999 -pix_fmt yuv420p -vf "fade=type=out:st=24:d=2" -an foo_out.mp4
To rotate phone video, often 90 degree CCW or CW, requires the "transpose" vf. For instance this 90 degree CCW rotation...
$ ffmpeg -i foo.mp4 -vf transpose=2 output.mp4
Another issue is shrinking the file size. For example, mpeg2video at 5M, will handle most movement and any screen text but creates a file that's 250M for 7 minutes. Bitrate is the largest determiner of size, but also check the frame rate (fps), which sometimes can be cut down to 30fps ( -vf "fps=30") if it's insanely high for the action. Can always check stuff with ffplay to get these correct before rendering. Also, if the player will do H264, then encoding in H264 (-vcodec libx264) at 2.5M bitrate looks similar to 5M in MPEG2. Which means about 3/5 the file size.