/*
** k2pdfopt.c   Optimize 1 and 2-column PDFs for Kindle-2 by displaying
**              columns separately and removing margins and
**              excess white space.
**
** http://willus.com
**
** v1.40 4-5-2012
**     - This is my most substantial update so far.
**       I did a re-write of many parts of the code, including
**       all of the text wrapping functions. I also put this
**       version through many hours of regression testing.
**     - Major new features:
**       * Does true word wrap (brings words up from the
**         next line if necessary).
**       * Preserves indentation, justification, and vertical
**         spacing more faithfully. Overall, particularly for
**         cases with text wrapping, I think the output looks
**         much better.
**       * Ignores defects in scanned documents.
**       * Compiled with all of the very latest third party
**         libraries, including mupdf 0.9.
**       * v1.40 is about 5% faster than v1.35 on average
**         (Windows 64 version).
**     - New justification command-line option is:
**           -j [-1|0|1|2][+/-]
**       Using -1 tells k2pdfopt to use the document's own
**       justification. A + after will attempt to fully
**       justify the text. A - will force no full justification.
**       With nothing after the number, k2pdfopt will attempt to
**       determine whether or not to use full justification based
**       on whether the source document is fully justified.
**     - The defect size to ignore in scanned documents is
**       user-specified (default is 1 point). The
**       command-line option is -de (user menu option "de").
**     - Command line options -vls, -vb, and -vs control
**       vertical spacing, breaks, and gaps. They are all
**       in the interactive user menu under "v".
**     - Line spacing is controlled by -vls.
**       Example: -vls -1.2 (the default) will preserve
**       the default document line spacing up to 1.2 x
**       single-spaced. If line spacing exceeds 1.2 x in the
**       source document, the lines are spaced at 1.2 x.
**       The negative value (-1.2) tells k2pdfopt to use it
**       as a limit rather than forcing the spacing to be
**       exactly 1.2 x.
**       A positive value, on the other hand,
**       forces the spacing. E.g. -vls 2.0 will force line
**       spacing to be double-spaced.
**     - Regions are broken up vertically using the new -vb
**       option. It defaults to 2 which breaks up regions
**       separated by a gap 2 X larger than the median line gap.
**       For behavior more like v1.35, or to not break up the
**       document into vertical regions, use -vb -1. Vertical
**       breaks between regions are shown with green lines when
**       using -sm.
**     - The new -vs option sets the maximum gap between regions
**       in the source document before they are truncated.
**       Default is -vs 0.25 (inches).
**     - Added menu option for -cg under "co".
**     - Reduced default min column gap from 0.125 to 0.1 inches.
**     - The -ws (word spacing threshold) value is now specified
**       as a fraction of the lowercase letter height (e.g. a
**       small 'o'). The new default is 0.375.
**
** v1.35 2-16-2012
**     - Changed how the columns in a PDF file are interpreted
**       when the column divider moves around some. The column
**       divider is now allowed to move around on the page
**       but still have the columns be considered contiguous.
**       This is controlled by the -comax option. Use
**       -comax -1 to revert to v1.34 and before. The
**       default is -comax 0.2. See example at:
**       http://willus.com/k2pdfopt/help/column_divider.shtml
**     - Added nice debugging tool with the -sm command-line
**       option ("sm" on interactive menu) which shows marked
**       source pages so you can clearly see how k2pdfopt
**       is interpreting your PDF file and what effect the
**       options are having.
**     - The last line in a paragraph, if shorter than the
**       other lines significantly, will be split differently
**       and not fully justified.
**     - Modified the column search function to better find
**       optimal gaps.
**     - The height of a multi-column region is calculated
**       more correctly now (does not include blank space,
**       and both columns must exceed the minimum height
**       requirement).
**     - Text immediately after a large rectangular block
**       (typically a figure) is now appended to the
**       figure region, since it is often the axis labels
**       for the figure.
**     - Fixed array-out-of-bounds bugs in
**       bmpregion_wrap_and_add() and break_point().
**     - Added some performance enhancements regarding how
**       regions are trimmed (rowcount[] and colcount[]
**       arrays).
**     - The file name to be processed is now listed with the
**       interactive menu, and a wildcard can now be specified
**       as the file name on the interactive menu.
**
** v1.34a 12-30-2011
**     - Some build corrections after the first release of
**       v1.34 which had issues in Linux and Windows.
**     - Fixed interpretation of -jpg flag when it's the last
**       command-line option specified.
**
** v1.34 12-30-2011
**     - I've collected enough bug reports and new feature
**       requests that I decided to do an update.
**     - Added -cgr and -crgh options to give more control
**       over how k2pdfopt selects multi-column regions.
**     - Don't switch to Ghostscript on DJVU docs.
**     - Continues processing files even if it has an error on
**       one page.
**     - Fixed bug in orientation detection (minimum returned
**       value is now 0.01 so as not to kill the average).
**     - Added document scale factor (-ds or "ds" in menu)
**       which allows users to correct PDF docs that are the
**       wrong size (e.g. if your PDF reader says your
**       document is 17 x 22 inches when it should be
**       8.5 x 11, use -ds 0.5).
**     - Fixed bug in break_point() where bp1 and bp2 did not
**       get initialized correctly.
**
** v1.33 11-11-2011
**     - Added autodetection of the orientation of the PDF
**       file. This is somewhat experimental and comes with
**       several caveats, but I have made it the default
**       because I think it works pretty well.
**       Caveat #1: It assumes the PDF/DJVU file is mostly
**       lines of text and looks for regularly spaced lines
**       of text to determine the orientation.
**       Caveat #2: If it determines that the page is
**       sideways, it rotates it 90 degrees clockwise, so it
**       may end up upside down.
**     - The autodetection is set with the -rt command-line
**       option (or the "rt" menu option):
**       1. Set it to a number to rotate your PDF/DJVU file
**          that many degrees counter-clockwise.
**       2. Set it to "auto" and k2pdfopt will examine up
**          to 10 pages of the file to determine the
**          orientation it will use.
**       3. Set it to "aep" to auto-detect the rotation of
**          every page. If you have different pages that
**          are rotated differently from each other within
**          one file, you can use this option to try to
**          auto-rotate each page.
**       4. To revert to v1.32 and turn off the orientation
**          detection, just put -rt 0 on the command line.
**     - Added option to attempt full justification when
**       breaking lines of text. This is experimental and
**       will only work well if the output dpi is chosen so
**       that rows break approximately evenly. To turn on,
**       use the "j" option in the interactive menu or the
**       -j command-line option with a + after the selection,
**       e.g.
**           -j 0+ (left/full justification)
**           -j 1+ (center/full justification)
**           -j 2+ (right/full justification)
**
** v1.32 10-25-2011
**     - Make sure locale is set so that decimal marker is
**       a period for numbers. This was causing problems
**       in locales where the decimal marker is a comma,
**       resulting in unreadable PDF output files. This
**       was introduced by having to compile for the DJVU
**       library in v1.31.
**     - Slightly modified compile of DJVU lib (re: locale).
**     - Remove "cd" option from interactive menu (it was
**       obsoleted in v1.27).
**     - Warn user if source bitmap is excessively large.
**     - Print more info in header (compiler, O/S, chip).
**
** v1.31 10-17-2011
**     - Now able to read DJVU (.djvu) files using ddjvuapi
**       from djvulibre v3.5.24. All output is still PDF.
**     - Now offer generic i386 versions for Win and Linux
**       which are more compatible w/older CPUs, and fixed
**       issue with MuPDF so it doesn't crash on older CPUs
**       when compiled w/my version of MinGW gcc.
**
** v1.30 10-4-2011
**     - Just after I posted v1.29, I found a bug I'd introduced
**       in v1.27 where k2pdfopt didn't quit when you typed 'q'.
**       I fixed that.
**     - Made user menu a little smarter--allows different
**       entries depending on whether a source file has already
**       been specified.
**
** v1.29 10-4-2011
**     - Input file dpi now defaults to twice the output dpi.
**       (See -idpi option.)
**     - Added option to break input pages at the end of each
**       output page. ("Break pages" in menu or -bp option.)
**     - Set dpi minimums to 50 for input and 20 for output.
**
** v1.28 10-1-2011
**     - Fixed bug that was causing vertical stripes to show
**       up on Mac and Linux version output.
**     - OSX 64-bit version now available.
**
** v1.27 9-25-2011
**     - Changed default max columns to two. There were
**       too many cases of false detection of sub-columns.
**       Use -col 4 to detect up to 4 columns (or select
**       the "co" option in the user menu).
**     - The environment variable K2PDFOPT now can be
**       used to supply default command-line options. It
**       replaces all previous environment variables,
**       which are now ignored. The options on the
**       command line override the options in K2PDFOPT.
**     - Added -rt ("rt" in menu) option to rotate the source
**       pages by 90 (or 180 or 270) degrees if desired.
**     - Default startup is now to show the user menu rather
**       than command line usage. Type '?' for command line
**       usage or use the -? command line option to see usage.
**     - Added three new "expert-mode" options for controlling
**       detection of gaps between columns, rows, and words:
**       -gtc, -gtr, -gtw. The -gtc option replaces
**       the -cd option from v1.26. These can all be set
**       with the "gt" menu option. Use the "u" option for
**       more info (to see usage).
**     - In conjunction with the new "expert-mode" options,
**       I adjusted how gaps between columns, rows, and words
**       are detected and adjusted the defaults to hopefully
**       be more robust.
**     - You can now enter all four margin settings (left,
**       top, right, bottom) from the user input menu for
**       "m" and "om".
**     - Added -x option to get k2pdfopt to exit without asking
**       you to press <ENTER> first.
**
** v1.26 9-18-2011
**     - Added column detection threshold input (-cd). Set
**       higher to make it easier to detect multiple columns.
**     - Adjusted the default column detection to make column
**       detection a bit easier on scanned docs with
**       imperfections.
**
** v1.25 9-16-2011
**     - Smarter detection of number of TTY rows.
**
** v1.24 9-12-2011
**     - Input on user menu fixed not to truncate file names
**       longer than 32 chars for Mac and Linux.
**
** v1.23 9-11-2011
**     - Added right-to-left (-r) option for scanning pages.
**
** v1.22 9-10-2011
**     - First version compiled under Mac OS X.
**     - Made some changes to run on OS X. Kludgey, but works.
**       You have to double-click the icon and then drag a file
**       to the display window and press <ENTER>. I've made
**       linux work similarly.
**     - Since Mac and Linux shells default to black on white,
**       I've made the text colors more friendly to that
**       scheme for linux and Mac. Use -a- to turn off text
**       coloring altogether, or set the env variable
**       K2PDFOPT_NO_TEXT_COLORING.
**
** v1.21 9-7-2011
**     - Moved some bmp functions to standard library.
**     - JPEG images always done at 8 bpc (no dithering).
**     - Fixed dithering of 1-bit-per-colorplane images.
**
** v1.20 9-2-2011
**     - Added dithering for bpc < 8. Use -d- to turn off.
**     - Adjusted gamma correction algorithm slightly (so that
**       pure white stays pure white).
**
** v1.19 9-2-2011
**     - Added gamma adjust. Setting to a value lower than 1.0
**       will darken the font some and appear to thicken it up.
**       Default is 0.5. Thanks to PaperCrop for the idea.
**     - Interactive menu now uses letters for the options.
**       This should keep the option choices the same even if
**       I add new ones, and now the user can enter a page range
**       as the final entry.
**
** v1.18 8-30-2011
**     - break_point() function now uses same white threshold
**       as all other functions.
**     - Added "-wt" option to manually specify "white threshold"
**       value above which all pixels are considered white.
**     - Tweaked the contrast adjustment algorithm and changed
**       the max to 2.0 (was much higher).
**     - Added "-cmax" option to limit contrast adjustment.
**
** v1.17 8-29-2011
**     - Min region width now 1.0 inches. Bug fixed when
**       output dpi set too large--it is now reduced so that
**       the output display has at least 1-inch of display.
**
** v1.16 8-29-2011
**     - Now queries user for options when run (just press
**       <ENTER> to go ahead with the conversion).
**       Use -ui- to disable this (it is automatically disabled
**       when run from the command line in Windows).
**     - Fixed bug in MuPDF calling sequence that results in
**       more robust reading of PDF files. (Fixes the parsing
**       of the second two-column example on my web page.)
**     - Fixed bug in MuPDF library that prevented it from
**       correctly parsing encrypted sections in PDF files.
**       (This bug is not in the 0.8.165 tarball but it
**       was in the version that I got via "git".)
**       This only affected a small number of PDF files.
**     - New landscape mode (not the default) is enabled
**       with the -ls option. This turns the output sideways
**       on the kindle, resulting in a more magnified display
**       for typical 2-column files. Thanks to Taesoo Kwon
**       for this idea.
**     - Default PDF output is now much smaller--about half
**       the original size. This is because the bitmaps are
**       saved with 4 bits per colorplane (same as the Kindle).
**       You can set this to 1, 2, 4, or 8 with the -bpc option.
**       Thanks to Taesoo Kwon and PaperCrop for this idea.
**     - Default -m value is now 0.25 inches (was 0.03 inches).
**       This ignores anything within 0.25 inches of the edge
**       of the source page.
**     - Now uses precise Kindle 2 (and 3?) display resolution
**       by default. Thanks to the PaperCrop forum for pointing
**       out that Shift-ALT-G saves screenshot on Kindle.
**       The kindle is a weird beast, though--after lots of
**       testing, I figured out that I have to do the
**       following to get it to display the bitmaps with
**       a 1:1 mapping to the Kindle's 560 x 735 resolution:
**       (a) Make the actual bitmap in the PDF file
**           563 x 739 and don't use the excess pixels.
**           I.e. pad the output bitmap with 3 extra
**           columns and 4 extra rows.
**       (b) Put black dots in the corners at the 560x735
**           locations, otherwise the kindle will scale
**           the bitmap to fit its screen.
**       This is accomplished with the new -pr (pad right), -pb
**       (pad bottom), and -mc (mark corners) options. The
**       defaults are -pr 3 -pb 4 -mc.
**     - New -as option will attempt to automatically straighten
**       source pages. This is not on by default since it slows
**       down the conversion and is somewhat experimental, but I've
**       found it to be pretty reliable and it is good to use on
**       scanned PDFs that are a bit tilted since the pages need
**       to be straight to accurately detect cropping regions.
**     - Reads 8-bit grayscale directly from PDF now for faster
**       processing (unless -c is specified for full color).
**     - Individual bitmaps created only in debug mode.
**       k2_src_dir and k2_dst_dir folders no longer needed.
**
** v1.15 8-3-2011
**     - Substantial code re-write, mostly to clean things up
**       internally. Hopefully won't introduce too many bugs!
**     - Can handle up to 4 columns now (see -col option).
**     - Added -c for full color output.
**     - If column width is close to destination screen width,
**       the column is fit to the device. Controlled with -fc
**       option.
**     - Optimized much of code for 8-bit grayscale bitmaps--
**       up to 50% faster than v1.14.
**     - Added -wrap- option to disable text wrapping.
**     - Can convert specific pages now--see -p option.
**     - Added margin ignoring options: -m, -ml, -mr, -mt, -mb.
**     - Added options for margins on the destination device:
**       -om, -oml, -omr, -omt, -omb.
**     - Min column gap now 0.125 inches and min column height
**       now 1.5 inches. Options -cg and -ch added to control
**       this.
**     - Min word spacing now 0.25. See -ws option.
**
** v1.14 7-26-2011
**     - Smarter line wrapping and text sizing based on custom options.
**       (e.g. should work better for any size destination screen
**       --not just 6-inch.)
**     - Bug fix. -w option fixed.
**     - First page text doesn't butt right up against top of page.
**
** v1.13 7-25-2011
**     - Added more command-line options: justification, encoding
**       type, source and destination dpi, destination width
**       and height, and source margin width to ignore.
**       Use -ui to turn on user input query.
**     - Now applies a sharpening algorithm to the output images
**       (can be turned off w/command-line option).
**
** v1.12 7-20-2011
**     - Fixed a bug in the PDF output that was ignored by some readers
**       (including the kindle itself), but not by Adobe's reader.
**       PDF files should be readable by all software now.
**
** v1.11 7-5-2011
**     - Doesn't put "Press <ENTER> to exit." if launched as a
**       command in a console window (in Windows). No change to
**       Linux version.
**
** v1.10 7-2-2011
**     - Integrated with mupdf 0.8.165 so that Ghostscript is
**       no longer required! Ghostscript can still be used/
**       will be tried if mupdf fails to decipher the pdf file.
**     - PDF page number count now much more reliable.
**
** v1.07 7-1-2011
**     - Fixed bugs in the pdf writing that were making the
**       pdf files incompatible with the kindle.
**     - Compiled w/gcc 4.5.2.
**     - Added smarter determination of # of PDF pages in source,
**       though it doesn't always work on newer PDF formats.
**       This can cause an issue with the win32 version because
**       calling Ghostscript on a page number beyond what is in
**       the PDF file seems to sometimes result in an exception.
**
** v1.06 6-23-2011
**     - k2pdfopt now first tries to find Ghostscript using the registry
**       (Windows only). If not found, searches path and common folders.
**     - Compiled w/turbo jpeg lib 1.1.1, libpng 1.5.2, and zlib 1.2.5.
**     - Correctly sources single bitmap files.
**
** v1.05 6-22-2011
**     Fixed bug in routine that looks for Ghostscript.
**     Also, Win64 version now looks for gsdll64.dll/gswin64c.exe
**     before gsdll32.dll/gswin32c.exe.
**
** v1.04 6-6-2011
**     No longer requires Imagemagick's convert program.
**
** v1.03 3-29-2010
**     Made some minor mods for Linux compatibility.
**
** v1.02 3-28-2010
**     Changed rules for two-column detection to hopefully avoid
**     false detection. At least 0.1 inches must separate columns.
**
** v1.01 3-22-2010
**     Fixed some bugs with file names having spaces in them.
**     Added program icon. Cleaned up screen output some.
**
** v1.00 3-20-2010
**     First released version. Auto adjusts contrast, clears
**     edges.
**
*/