{"id":2312,"date":"2012-07-28T00:00:00","date_gmt":"1970-01-01T00:00:00","guid":{"rendered":"http:\/\/johnjohnston.info\/blog\/?e=2312"},"modified":"2012-07-28T00:00:00","modified_gmt":"1970-01-01T00:00:00","slug":"ocr-via-dropbox-with-tesseract","status":"publish","type":"post","link":"https:\/\/johnjohnston.info\/blog\/ocr-via-dropbox-with-tesseract\/","title":{"rendered":"OCR via dropbox with Tesseract"},"content":{"rendered":"\n<p>This is going to be another slightly geeky post. The previous one, <a href=\"http:\/\/johnjohnston.info\/blog\/?e=2311\">Testing a new system<\/a>, was about a way to blog using dropbox and AppleScript folder actions, and it had me thinking about other things that could be done using this sort of system. The way I am doing this relies on having dropbox and a mac that is on when you want it. If you don&#8217;t have a mac you might like <a href=\"http:\/\/beta.wappwolf.com\/\">Wappwolf<\/a>, which is a web service that can do a lot of things with files in your dropbox automatically.<\/p>\n<p>So I already have a <a href=\"http:\/\/johnjohnston.info\/blog\/?e=2311\" title=\"Testing a new system - John's World Wide Wall Display\">system for blogging by dropping files into a folder on my dropbox<\/a> and was looking around for another idea to play with. There seem to be a few OCR apps for iPhones, but I had noticed that <a href=\"http:\/\/code.google.com\/p\/tesseract-ocr\/\">Tesseract<\/a> was available on <a href=\"http:\/\/code.google.com\/\">Google Code<\/a> and googled around to see how it could be installed and run on a mac. One I found was <a href=\"http:\/\/www.malcolmhardie.com\/ocr\/index.html\">TesseractOCR Mac<\/a>, a Cocoa front end to the Tesseract OCR program. I downloaded this and gave it a try. It worked well on my desktop. 
I then struck gold: <a href=\"http:\/\/blog.bobkuo.com\/2011\/02\/installing-and-using-tesseract-2-04-on-mac-os-x-10-6-6-with-homebrew\/\">Installing and using Tesseract 2.04 on Mac OS X 10.6.6 with Homebrew | Ramble On<\/a>. This post explains clearly how to install Tesseract on a mac so that it can be used on the command line. It is also a good intro to <a href=\"http:\/\/mxcl.github.com\/homebrew\/\">homebrew<\/a>.<\/p>\n<h3>Homebrew<\/h3>\n<blockquote><p>Homebrew is the easiest and most flexible way to install the UNIX tools Apple didn&#8217;t include with OS X.<\/p><\/blockquote>\n<p> For someone who has struggled with this sort of thing before,  homebrew is pretty straightforward. Installing homebrew is just a case of copying a line of code from the <a href=\"https:\/\/github.com\/mxcl\/homebrew\/wiki\/installation\">installation<\/a> page, pasting it into the terminal and pressing return.<\/p>\n<h3>imagemagick<\/h3>\n<p>Following the <a href=\"http:\/\/blog.bobkuo.com\/2011\/02\/installing-and-using-tesseract-2-04-on-mac-os-x-10-6-6-with-homebrew\" title=\"Installing and using Tesseract 2.04 on Mac OS X 10.6.6 with Homebrew&nbsp;|&nbsp;Ramble On\">instructions<\/a> from Ramble On I just typed  <em>brew install imagemagick<\/em> in the terminal and hit return. 
Lots of scary text scrolls by:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/johnjohnston.info\/blog\/images\/2012-07\/2012-07-28_install_imagemagick.jpg\" alt=\"Install Imagemagick\" height=\"176\" width=\"498\"><\/p>\n<h3>Installing Tesseract<\/h3>\n<p>Once imagemagick was installed I repeated the process for Tesseract with <em>brew install tesseract<\/em>.<\/p>\n<h3>Testing Tesseract<\/h3>\n<p>As I wanted to figure out how to use my phone for OCR, I took a photo of a bit of newspaper. I used <a href=\"http:\/\/itunes.apple.com\/gb\/app\/camera+\/id329670577?mt=8\" title=\"App Store - Camera+\">Camera+<\/a> with the clarity filter, then cropped the image and made it black and white:<\/p>\n<p><a href=\"http:\/\/www.flickr.com\/photos\/troutcolor\/7662446808\/\" title=\"fr_160 by troutcolor, on Flickr\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/farm8.staticflickr.com\/7125\/7662446808_84dd91345b.jpg\" width=\"480\" height=\"500\" alt=\"fr_160\"><\/a><br  \/>Click the image to see it full size on flickr<\/p>\n<p>I used <a href=\"http:\/\/itunes.apple.com\/gb\/app\/wifi-photo-transfer\/id380326191?mt=8\" title=\"App Store - WiFi Photo Transfer\">Wifi Photo Transfer<\/a> to grab the photo from my camera and put it on the desktop.\n<\/p>\n<p>The OCR process takes two steps, using the terminal and the newly installed applications:<\/p>\n<ol>\n<li>Convert to a 200dpi tiff:<em><br  \/>cd Desktop<br  \/>convert -density 200 -units PixelsPerInch -type Grayscale +compress fr_160.jpg fr_160.tif<\/em><\/li>\n<li>Perform OCR on the tif:<br  \/> <em>tesseract fr_160.tif fr_160 -l eng<\/em><\/li>\n<\/ol>\n<p>\tI now have two extra files on my desktop, fr_160.tif and fr_160.txt. The txt file contains the OCR text:<\/p>\n<pre>(_;oogle is facing fresh criticism after\nadmitting that it has not deleted all of the\nprivate data, including emails and pass-\nwords, it secretly collected from internet\nusers around the UK as it gathered data for\nits Street View maps.\nThe search &#xFB01;rm was ordered in 
Decem-\nber 2010 to delete the private information\nhoovered up by its Street View cars from\nopen Wi-Fi networks. r\nBut yesterday Google told the Infor-\nmation Commissioner&rsquo;s Of&#xFB01;ce &ldquo;human\nerror&rdquo; had prevented it from erasing the\ndata, which could include the millions of\nemails and passwords .\nGoogle admitted in May 2010 its Street\nView cars had &ldquo;mistakenly&rdquo; collected pri-<\/pre>\n<p>Which is pretty good.<\/p>\n<h3>OCR for dropbox<\/h3>\n<p>Now that I could see that tesseract works well, I needed to make it work on images added to a particular dropbox folder.<\/p>\n<p>A few folder action scripts come with a mac; they are in \/Library\/Scripts\/Folder Action Scripts\/. Several of these deal with image files and contain routines for handling the dropping of files. These &#8216;standard&#8217; routines move added files of the correct file type to a subfolder and then pass them on to a subroutine that deals with the files. I could just duplicate one of these and edit the <em>process_item<\/em> subroutine. Basically I just scripted the process tested above. I&#8217;ve uploaded the script <a href=\"http:\/\/johnjohnston.info\/pmwiki\/uploads\/Software\/ocr_folder_action.html\" title=\"\">ocr folder action<\/a> as html, in case anyone finds it useful or fun. <\/p>\n<p>To use the script you put it in the Folder Action Scripts folder (copy the text of the html file and paste it into the AppleScript script editor). 
Add a folder to dropbox and attach the script to it (right click on the folder and choose <strong>Folder Actions Setup&#8230;<\/strong>).<\/p>\n<p>Most of my bit of the script just uses <strong>do shell script<\/strong> to run the commands above. The only gotcha was that although I can use <em>convert<\/em> and <em>tesseract<\/em> by name in the terminal, in a script I have to use the full path to each command:<br  \/>set ocrscript to <\/p>\n<p><em>&quot;\/usr\/local\/Cellar\/tesseract\/3.01\/bin\/tesseract &#x27;&quot; &amp; tif_file &amp; &quot;&#x27; &#x27;&quot; &amp; tif_file &amp; &quot;&#x27; -l eng&quot;<br  \/><br \/>\n\t\tdo shell script ocrscript<\/em><\/p>\n<p>This is to do with the way homebrew installs applications and the fact that AppleScript&#8217;s <strong>do shell script<\/strong> doesn&#8217;t look for commands in <strong>\/usr\/local\/&#8230;<\/strong>.<\/p>\n<p><a href=\"http:\/\/johnjohnston.info\/pmwiki\/uploads\/Software\/ocr_folder_action.html\" title=\"\">My script<\/a> is fairly crude, especially about file endings: if I add <strong>Photo 28-07-2012 12 35 55.jpg<\/strong> to the dropbox folder, it is moved into the processed files folder and <strong>Photo 28-07-2012 12 35 55.jpg.tif<\/strong> and <strong>Photo 28-07-2012 12 35 55.jpg.tif.txt<\/strong> are created. Not elegant.<\/p>\n<p>The whole process, from taking a photo to opening the txt file in dropbox, only takes a couple of minutes when using 3G. The system will not deal with columns or more than a single block of text, but it handles a single block fairly well. Mostly it was fun to figure out how to do.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is going to be another slightly geeky post. The previous one, Testing a new system, was about a way to blog using dropbox and AppleScript folder actions, and it had me thinking about other things that could be done using this sort of system. 
The way I am doing this relies on having dropbox and a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"mf2_syndication":[],"advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"webmentions_disabled_pings":false,"webmentions_disabled":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[6],"tags":[93,94,120,127,128],"post_format":[],"class_list":{"0":"post-2312","1":"post","2":"type-post","3":"status-publish","4":"format-standard","6":"category-wwwd","7":"tag-applescript","8":"tag-commandline","9":"tag-dropbox","10":"tag-ocr","11":"tag-osx","12":"kind-","13":"h-entry","14":"hentry"},"better_featured_image":null,"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p57zFQ-Bi","jetpack_likes_enabled":false,"jetpack_sharing_enabled":true,"kind":false,"_links":{"self":[{"href":"https:\/\/johnjohnston.info\/blog\/wp-json\/wp\/v2\/posts\/2312","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/johnjohnston.info\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/johnjohnston.info\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/johnjohnston.info\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/johnjohnston.info\/blog\/wp-json\/wp\/v2\/comments?post=2312"}],"version-history":[{"count":0,"href":"https:\/\/johnjohnston.info\/blog\/wp-json\/wp\/v2\/posts\/2312\/revisions"}],"wp:attachment":[{"href":"https:\/\/johnjohnston.info\/blog\/wp-json\/wp\/v2\/media?parent=2312"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/johnjohnston.info\/blog\/wp-json\/wp\/v2\/c
ategories?post=2312"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/johnjohnston.info\/blog\/wp-json\/wp\/v2\/tags?post=2312"},{"taxonomy":"post_format","embeddable":true,"href":"https:\/\/johnjohnston.info\/blog\/wp-json\/wp\/v2\/post_format?post=2312"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}