PDF Notes


Latest revision as of 14:04, 10 December 2021


Bash Script for converting Magazine

for fn in *.pdf ; do
	#Skip the literal pattern when no pdf files are present
	[ -e "$fn" ] || continue
	echo "$fn"
	#Cleanup from prior runs
	mkdir -p tmp
	rm -f tmp/images*tif
	#Split pdf pages into individual tif files
	pdfimages -tiff "$fn" ./tmp/images

	#Combine into a single multi-page tif file
	tiffcp tmp/images*tif "$(basename "$fn" .pdf).tif"

	#Convert the combined tif to pdf in the docker folder for OCRMyPDF
	#(this script does not wait for the OCR output)
	tiff2pdf -o "../../../OCRMyPDF/Input/$fn" "$(basename "$fn" .pdf).tif"
done
rm -f tmp/images*tif
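
The loop above derives each combined TIF name from the PDF name with basename. A minimal sketch of the same derivation using only shell parameter expansion follows; the function name tif_name_for is hypothetical and not part of the original script. It shows why the result needs quoting when a magazine file name contains spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original notes): derive the combined
# TIF name from a PDF name, like "$(basename "$fn" .pdf).tif".
tif_name_for() {
	local pdf=$1
	# ${pdf##*/} drops any leading directory part.
	local base=${pdf##*/}
	# ${base%.pdf} drops a trailing .pdf extension if present.
	printf '%s\n' "${base%.pdf}.tif"
}

# Quote the result so names containing spaces stay a single argument.
out="$(tif_name_for "My Magazine 2021.pdf")"
echo "$out"
```

Parameter expansion avoids starting a basename process per file, though for a handful of magazines either form is fine.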

Bash command to rename yyyymmdd_hhmmss to yyyy-mm-dd_hh.mm.ss

find . -regextype posix-extended  -regex ".*/[0-9]{8}_[0-9]{6}.*" -exec rename -v 's/(\d{4})(\d{2})(\d{2})_(\d{2})(\d{2})(\d{2})/$1-$2-$3_$4.$5.$6/' {} \;
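
Before renaming in bulk it can help to preview what the substitution does to a single name. The sketch below applies the same pattern with sed instead of the Perl rename command, and only prints the result; the function name reformat_timestamp is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the new name for one file name without
# touching any file. Uses the same capture groups as the rename
# expression, written with [0-9] because sed -E does not support \d.
reformat_timestamp() {
	printf '%s\n' "$1" |
		sed -E 's/([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})([0-9]{2})/\1-\2-\3_\4.\5.\6/'
}

reformat_timestamp "20211210_140400.jpg"   # prints 2021-12-10_14.04.00.jpg
```

Names that do not contain a yyyymmdd_hhmmss stamp are printed unchanged. The Perl rename command also has a -n option that should list the planned renames without performing them, assuming the Perl version of rename is the one installed.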

Bash Script for processing scans

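cd <sourcefolder>
#tiffcp writes TIFF, so combine first and then convert to pdf with tiff2pdf
tiffcp <list of tif files> <output.tif>
tiff2pdf -o <output.pdf> <output.tif>
cp <output.pdf> //wormhole/Media/OCRMyPDF/Input
#Once OCRMyPDF has written its result:
cp //wormhole/Media/OCRMyPDF/Output/<output.pdf> <destination folder>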
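
The last copy only works after OCRMyPDF has finished writing to the Output folder. One possible way to wait for that is to poll until the file exists and its size stops changing. This is a sketch, not something from the original notes: the function name wait_for_file, the default timeout and the poll interval are all assumptions, and an unchanged size is only a heuristic for "finished".

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait until a file exists, is non-empty, and has
# the same size on two consecutive polls.
# usage: wait_for_file <path> [timeout_seconds] [poll_interval_seconds]
wait_for_file() {
	local path=$1 timeout=${2:-600} interval=${3:-5}
	local waited=0 last_size=-1 size
	while (( waited < timeout )); do
		if [[ -f $path ]]; then
			size=$(wc -c < "$path")
			# Same non-zero size twice in a row: assume the writer is done.
			if (( size > 0 && size == last_size )); then
				return 0
			fi
			last_size=$size
		fi
		sleep "$interval"
		(( waited += interval ))
	done
	# Timed out without seeing a stable file.
	return 1
}
```

It would be called between the two cp commands, with the Output path of the expected pdf as the first argument, and the final cp run only if it returns success.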