Willus.com's Blog

Useful tips, benchmarks, computing experiences, policy views, etc.

6 JAN 2023 -- TESSERACT ACCURACY

Since 2018, I have been testing how the accuracy of Tesseract's OCR engine depends on the resolution of the text. I wrote a script to auto-generate a test PDF file (here is an example using the Helvetica Narrow font) with text at different resolutions in six different fonts (Helvetica, Times-Roman, Courier, Palatino, Bookman, and Helvetica-Narrow). I then run Tesseract on the different PDFs and determine the accuracy of the OCR. I characterize the resolution by the height of a typical capital letter in pixels. It turns out that Tesseract has a sweet spot at a capital-letter height of about 30 pixels (it seems strange to me that accuracy would not continue to improve at higher and higher resolutions, but okay). See the plot below. My software k2pdfopt uses this result and tries to scale the text it sends to OCR so that it falls in this "sweet spot."
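To make the "sweet spot" idea concrete, here is a minimal Python sketch of the arithmetic involved. It is not the code k2pdfopt actually uses. The function names are made up for illustration, and the default cap-height ratio of 0.7 (capital-letter height as a fraction of the font size) is only a rough assumption that varies from font to font. The one number taken from the results above is the target of about 30 pixels.

```python
# Illustrative sketch only: NOT k2pdfopt's implementation.
# All function names and the 0.7 cap-height ratio are assumptions.

TARGET_CAP_HEIGHT_PX = 30.0  # approximate Tesseract "sweet spot" from the tests above


def cap_height_px(font_size_pt, dpi, cap_height_ratio=0.7):
    """Estimate the height in pixels of a capital letter.

    font_size_pt     -- nominal font size in points (1 point = 1/72 inch)
    dpi              -- rendering resolution in pixels per inch
    cap_height_ratio -- assumed cap height as a fraction of the font size
    """
    return font_size_pt * cap_height_ratio * dpi / 72.0


def scale_for_ocr(measured_cap_height_px, target=TARGET_CAP_HEIGHT_PX):
    """Scale factor that brings a measured cap height to the target height."""
    if measured_cap_height_px <= 0:
        raise ValueError("cap height must be positive")
    return target / measured_cap_height_px


def dpi_for_target(font_size_pt, cap_height_ratio=0.7, target=TARGET_CAP_HEIGHT_PX):
    """Rendering resolution at which the capitals come out at the target height."""
    if font_size_pt <= 0 or cap_height_ratio <= 0:
        raise ValueError("font size and cap-height ratio must be positive")
    return target * 72.0 / (font_size_pt * cap_height_ratio)


if __name__ == "__main__":
    height = cap_height_px(12, 300)
    print("12 pt text at 300 dpi: cap height of about %.1f px" % height)
    print("scale factor to reach the target: %.2f" % scale_for_ocr(height))
    print("dpi that hits the target directly: %.0f" % dpi_for_target(12))
```

Under these assumptions, text whose capitals measure 60 pixels would be shrunk by half before OCR, and text whose capitals measure 15 pixels would be doubled in size.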

This page last modified 12-29-2023