Saturday, September 29, 2012

EXIF data, renumbering, cropping, rotating (lossless)

Links: ExifTool :: "rename" command notes :: PDF merging

A common activity is stripping annoying JPEG EXIF data and/or renumbering shots into a reasonable schema. Of course, pulling the EXIF with a general image tool, say with imagemagick, typically means reducing the file quality, since any program which decodes a JPG and saves it again recompresses the image and loses a little quality each time. (Merely opening a file for viewing changes nothing.) As for renaming, for many users this means the prospect of a Perl or Bash script built around the mv command.

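Instead, most batch stripping and renaming can be done without a script. Two CLI applications, exiftool and rename, handle most situations. (The rename used here is the util-linux version, which takes "rename from to files". The Perl rename shipped on Debian-based systems uses a different, expression-based syntax.)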

renumbering

A typical problem is a set of files with the same prefix, say, "blue", but a mixture of places in their numbers. So, eg, "blue9.jpg", "blue109.jpg", "blue43.jpg". To put these in order, we want the same number of digits, say three digits. We want blue009.jpg, blue043.jpg and blue109.jpg (as-is). Suppose we have hundreds of these files. We could write a script, but are there simple commands instead? Yes. In a terminal, cd to the folder with the files and:

$ rename blue blue00 blue?.jpg
The single question mark matches only the files with exactly one character after "blue", ie the one-digit files, and the command replaces "blue" with "blue00" in their names. This means blue9 will be changed to blue009. What about the two digit file, blue43.jpg? We only need to add one zero to files with two digits:
$ rename blue blue0 blue??.jpg

Now we have blue043.jpg. This is the template for any numbering system: decide on a number of digits and then one or two commands should manage it.
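
For folders where several different digit counts are mixed together, the same idea can be wrapped in a small shell function. This is only a sketch: the function name pad_numbers is made up for this post, and it assumes the files are named prefix, then digits, then ".jpg", with nothing else in between.

```shell
# pad_numbers PREFIX WIDTH (hypothetical helper, not part of rename)
# Renames PREFIX<digits>.jpg in the current directory so the number is
# zero-padded to WIDTH digits, e.g. blue9.jpg -> blue009.jpg.
pad_numbers() {
    prefix="$1"
    width="$2"
    for f in "$prefix"*.jpg; do
        [ -e "$f" ] || continue            # no matches: the glob stays literal
        num="${f#"$prefix"}"               # strip the prefix
        num="${num%.jpg}"                  # strip the extension
        case "$num" in
            ''|*[!0-9]*) continue ;;       # skip names whose middle part is not all digits
        esac
        num=$(printf '%s' "$num" | sed 's/^0*//')   # drop leading zeros (printf would read them as octal)
        [ -n "$num" ] || num=0
        new=$(printf "%s%0${width}d.jpg" "$prefix" "$num")
        [ "$f" = "$new" ] && continue      # already padded
        if [ -e "$new" ]; then
            echo "skipping $f: $new already exists" >&2
            continue
        fi
        mv -- "$f" "$new"
    done
}
```

Running "pad_numbers blue 3" in the folder should cover the one-digit and two-digit cases in a single pass.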

Exif - ExifTool

Available at the link at top. Easily extracts, writes, deletes EXIF data. I don't even recall compiling. I think all I had to do was move a copy of it into /usr/bin or some such. Maybe I had to compile.

ExifTool can do several actions, but this post concerns stripping EXIF data for all photos in a directory. Go to that directory, filled with JPG files:
$ exiftool -all= *.jpg
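ExifTool keeps each original as a "_original" backup file next to the stripped one; add -overwrite_original to skip the backups. Or, if you have exiv2 and want to losslessly remove IPTC, XMP, and EXIF data...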
$ exiv2 -d a *.jpg

Renaming

Go to the directory with the files and use the rename command. Let's say they all start with "IMG" and then some number. To swap the "IMG" prefix for "Cam01_" while keeping each file's existing number:
$ rename IMG Cam01_ IMG*

Done.
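
The rename command above only swaps the prefix, so the numbers stay whatever the camera assigned. If a fresh 001, 002, 003... sequence is wanted, a short loop can do it. The following is a sketch under assumptions: the function name renumber_sequence is invented here, only *.jpg files are touched, and the order is simply the shell's sorted glob order (so pad the existing numbers first if they have mixed digit counts).

```shell
# renumber_sequence PREFIX [WIDTH] (hypothetical helper)
# Renames every .jpg in the current directory to PREFIX001.jpg, PREFIX002.jpg, ...
# following the shell's sorted glob order. WIDTH defaults to 3 digits.
renumber_sequence() {
    prefix="$1"
    width="${2:-3}"
    n=1
    for f in *.jpg; do
        [ -e "$f" ] || continue            # no matches: the glob stays literal
        new=$(printf "%s%0${width}d.jpg" "$prefix" "$n")
        if [ -e "$new" ]; then
            echo "refusing to overwrite $new" >&2
            return 1
        fi
        mv -- "$f" "$new"
        n=$((n + 1))
    done
}
```

Eg, "renumber_sequence Cam01_" turns three IMG files into Cam01_001.jpg through Cam01_003.jpg.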

Cropping

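Scanning at 150 lines typically makes a 1256x1752 image. If I forget to set the scan size, the bottom 110 lines just show the scanner bed. To get rid of this excess, copy those JPGs (or use the originals if you don't mind them being changed) into a directory and run the command below. Note that mogrify overwrites the files in place and re-encodes each JPG, so unlike the EXIF stripping above, this crop is not lossless.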

$ mogrify -crop 1256x1642+0+0 *.jpg

The +0+0 is the offset, in other words the image will begin in the top left corner (0,0) and go 1256 pixels horizontally and 1642 vertically. Similarly, if I scan at 75 lines, I get a 624x877 and typically crop down to 624x822.
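
The geometry is just the full width, the height minus the rows to drop, and a zero offset. A small sketch of that arithmetic follows; the function name crop_geometry is made up for illustration, and it assumes the excess is always at the bottom of the scan.

```shell
# crop_geometry WIDTH HEIGHT TRIM (hypothetical helper)
# Prints the ImageMagick geometry that keeps the full width and drops
# TRIM rows of pixels from the bottom of a WIDTH x HEIGHT image.
crop_geometry() {
    width="$1"
    height="$2"
    trim="$3"
    printf '%sx%s+0+0\n' "$width" "$((height - trim))"
}

# Usage with the 150-line scan from the text (prints 1256x1642+0+0):
#   mogrify -crop "$(crop_geometry 1256 1752 110)" *.jpg
```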

Rotating

$ jpegtran -rotate 270 -trim infile.jpg > outfile.jpg

The -rotate switch (it can be abbreviated to -rot) takes 90, 180 or 270, being degrees cw rotation. Eg, use "270" for 90° ccw rotation, as above. "Trim" drops edge pixels which can't properly rotate. Also, jpegtran takes only one input file at a time, so wildcards don't work and a script is necessary for batches with it. For quick batches I use mogrify instead, after cd'ing into that folder, but be aware that mogrify decodes and recompresses each JPG, so unlike jpegtran it is not lossless. Eg, for 90° ccw...

$ mogrify -rotate 270 *.jpg
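
Since jpegtran handles one file per call, a lossless batch needs a loop around it. A minimal sketch, assuming jpegtran is on the PATH and that writing the results into a separate folder is acceptable; the function name rotate_all and the default folder name "rotated" are inventions for this example.

```shell
# rotate_all DEGREES [OUTDIR] (hypothetical helper)
# Losslessly rotates every .jpg in the current directory with jpegtran.
# DEGREES is 90, 180 or 270 (clockwise). Results go to OUTDIR
# (default: ./rotated); the originals are left untouched.
rotate_all() {
    degrees="$1"
    outdir="${2:-rotated}"
    mkdir -p "$outdir" || return 1
    for f in *.jpg; do
        [ -e "$f" ] || continue            # no matches: the glob stays literal
        jpegtran -rotate "$degrees" -trim "$f" > "$outdir/$f" || return 1
    done
}
```

Eg, "rotate_all 270" gives a 90° ccw rotation of the whole folder, with the rotated copies in ./rotated.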

add OCR layer

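This makes a scanned PDF searchable. Recognition is not 100%. The command below only installs ocrmypdf (on Arch-based systems, via yay); the basic invocation afterwards is "ocrmypdf input.pdf output.pdf".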

$ yay -S ocrmypdf

pdf conversion

Edit - 2016: ImageMagick (like most apps) has become less and less intuitive to the average user over the years. 1) Users should now specify "compress" to avoid a PDF file 10 times larger than the sum of the JPGs being concatenated. 2) Users should attempt to match density to the scanned density. Unlike "compress", density does not change the size of the resulting PDF file in megabytes, but it does change the size of the page inside the reader. If your scans were at 200 DPI but your density is 100, each page will come out at twice its real width and height in the PDF reader and will not match well if concatenated with other docs:

$ convert *.jpg -compress JPEG -density 200 somefile.pdf

A rough yardstick for the density number is to use what you'd want for say, "dpi", on a printer. If one wants to change the page orientation to landscape, then add: -rotate 90 . There's more on all this here. At the bottom of that page is a link to yet a more elaborate page, etc etc.
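
The page size the reader shows is just pixels divided by density, which makes it easy to check a density before building the PDF. A sketch of that arithmetic, using awk for the division; the function name page_inches is made up, and the 1700x2200 image in the usage note is a hypothetical letter-size scan at 200 DPI, not a figure from my scanner.

```shell
# page_inches WIDTH_PX HEIGHT_PX DENSITY (hypothetical helper)
# Prints the page size in inches that an image of WIDTH_PX x HEIGHT_PX
# pixels occupies when the PDF is written at DENSITY dots per inch.
page_inches() {
    awk -v w="$1" -v h="$2" -v d="$3" \
        'BEGIN { printf "%.2f x %.2f\n", w / d, h / d }'
}
```

Eg, "page_inches 1700 2200 200" prints 8.50 x 11.00, while the same image at density 100 gives 17.00 x 22.00, which is the doubled page described above.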

policy file modification

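Around 2019, distributions began shipping an ImageMagick security policy that blocks PDF (and PostScript) conversion, apparently in response to vulnerabilities in the Ghostscript delegate that ImageMagick uses for those formats. The symptom is convert refusing with a "not authorized" or "not allowed by the security policy" error. To fix: change rights="none" to rights="read|write" (or "all") on the PDF coder line in /etc/ImageMagick-6/policy.xml, or in the matching directory for your installed version, eg /etc/ImageMagick-7/. The line must be inside the "policymap" element of the file. Once it's fixed, it may remain that way for a year or two but may eventually be overwritten during an update and need to be redone.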

# nano /etc/ImageMagick-7/policy.xml
<policy domain="coder" rights="all" pattern="PDF,PS" />

The effect is immediate and requires no restart or logout. Any older versions still on the disk, eg version 6, will have zero effect on the current version.
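
After an update, it is handy to check whether the policy has been reset without opening the file. The following is a rough sketch that just greps the policy file; the function name pdf_policy_rights is invented here, and it assumes each policy sits on one line. It does not understand XML comments, so a commented-out line mentioning PDF would be reported too.

```shell
# pdf_policy_rights POLICY_FILE (hypothetical helper)
# Prints the rights attribute of every coder policy line that mentions PDF,
# e.g. "none" when conversion is blocked or "read|write" when it is allowed.
pdf_policy_rights() {
    policy_file="$1"
    grep 'domain="coder"' "$policy_file" \
        | grep 'PDF' \
        | sed -n 's/.*rights="\([^"]*\)".*/\1/p'
}

# Usage:
#   pdf_policy_rights /etc/ImageMagick-7/policy.xml
```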

Prior to 2013, ImageMagick's "convert" automatically read the density from the JPG, as well as auto-detected the compression format. A much simpler command was performed in those days:

$ convert *.jpg somefile.pdf

Concatenate/extract PDF pages

Combining several PDFs into a larger PDF (common with letter attachments) is best with Ghostscript. Imagine the first input file is in1, second in2, etc...
$ gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=out.pdf in1.pdf in2.pdf

To cleanly extract one or more pages from a PDF, use Ghostscript to avoid the rasterization of extracting as a JPG:
$ gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -dFirstPage=22 -dLastPage=36 -sOutputFile=outfile.pdf 100p-inputfile.pdf

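Once a PDF page is isolated, make it into a reasonably clean JPG for editing. The density goes before the input file so the page is rasterized at 300 DPI; the other operations go after the input so they apply to the rasterized image. (In current ImageMagick versions "RGB" means linear RGB, so sRGB is the usual choice for a normal-looking JPG.)
$ convert -density 300 input.pdf -colorspace sRGB -resize 800 -interlace none -quality 90 someoutput.jpg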

However, in this final "convert" command, one is liable to encounter the XML security policy, which requires the changes noted above to /etc/ImageMagick-7/policy.xml. The changes must go inside the "policymap" section to be effective.

evince pdf reader

Evince is a reliable GNOME-based reader, however it has an incomprehensibly stupid flaw: Evince requires that gvfs be installed to display a sidebar index in the PDF. Nothing related to file systems and volume management should have been made a dependency for an application that displays PDFs. PDF readers are not file managers and can easily index themselves internally for a sidebar without an external indexer, especially an intrusive configuration and memory hog like gvfs. I now use "Okular". The interface is not as nice looking, but it generally works as well and without gvfs.