Correcting OCR Errors

Optical Character Recognition, commonly referred to as OCR, is the process of converting scanned images of letters and words into an electronic version. For example, you can use the Recognize Text feature in Acrobat DC to convert an image of a page into a searchable version in which you can select text, comment on it, and even edit it.

OCR is an imperfect process. While some very good originals will process at or near 100% accuracy, if you feed Acrobat a poor-quality document, results will suffer. So, yes, a fax of a fax of a fax is not going to OCR well. Scanned documents may also contain handwriting, which is seldom recognized as text.

OCR affects search quality and that should be a concern to legal professionals. Consider a contract that may be part of your case. Perhaps the only place your client’s name can be found in the document is in handwritten Name and Signature fields.

If you use Acrobat (or other tools) to search for your client’s name, no result will be returned. Since your client’s name is an important term for most cases, you might want to consider correcting key documents to enhance search results.
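The same gap can be checked outside Acrobat once a document's recognized text is available as plain text. The sketch below is only an illustration in Python, not an Acrobat feature: it assumes you already have the OCR text as a string (for example, from a text export), and the function name `missing_terms` and the sample names are hypothetical.

```python
import re

def normalize(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()

def missing_terms(ocr_text: str, key_terms: list[str]) -> list[str]:
    """Return the key terms that a plain text search would NOT find."""
    haystack = normalize(ocr_text)
    return [term for term in key_terms if normalize(term) not in haystack]

# Hypothetical example: the client's name appears only in handwriting,
# so it never made it into the recognized text.
ocr_text = """SERVICES AGREEMENT
This agreement is made between Acme Corp. and the undersigned.
Name: ____________   Signature: ____________"""

print(missing_terms(ocr_text, ["Acme Corp.", "Jane Doe"]))  # prints ['Jane Doe']
```

Any term that comes back in the list is a candidate for manual correction in the key documents.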

Fortunately, Acrobat DC includes tools to help you audit OCR quality and correct OCR errors.

Auditing OCR Quality

Acrobat offers a feature in Preflight called “Make OCR Text Visible” which can help you audit OCR quality. Here’s how to use it:

  1. OCR the document or open a previously OCR’d document.
    Tip: Choose the Enhance Scans option in the Right Hand Pane, then choose Recognize Text.
  2. In the Right Hand Pane:
    1. Enter Preflight in the search field
    2. Click the Preflight tool
      000_find_preflight
  3. The Preflight window opens.
    1. In the search field, enter Make OCR
    2. Select the Make OCR text visible fixup function
    3. Click Analyze and Fix
      001_find_preflight
  4. Acrobat will ask you to rename the file. I suggest adding “_QA” to the file name.

Looking at the Results

To QA the document, first open the Layers Panel in the file:

002_open_layers_panel

The Layers panel shows two layers:

  • Invisible text
  • Visible Page Content

In the image below, both layers are turned on which means that the original scanned image is showing.

I added a red outline to some handwritten text in the document. Do you think Acrobat will recognize the handwriting? Let’s see . . .

Click the Visible Page Content eyeball to turn the layer off:

003a_visible_layer

Now, only the OCR text is visible in the document. I’ve added a red outline to show you that Acrobat did not recognize the handwritten text.

004_invisible_text_only

Correcting OCR Text in Acrobat

Acrobat makes it possible to correct OCR errors to enhance search quality. This can be a time-consuming process, but may be worthwhile when archiving high-value documents or in situations where you can identify certain documents in a case for which you want to ensure good search results.

To correct OCR in a document:

  1. OCR the document or open a previously OCR’d document
  2. In the Right Hand Pane:
    1. Click in the Search field and type “Correct”
    2. Click Correct Recognized Text
      005_find_correction_tool
    3. The Correct Text function appears
      1. Enable Review Recognized text
      2. Select a suspect on the page. It will be highlighted in red.
      3. Enter the correct text for the error
      4. Click the Accept button
        006a_correction_steps

Your Corrections are Found

Tap CMD/CTRL-F to open the Find widget.

Once corrections are made, Acrobat will find the corrected text, even the text you have assigned to handwritten portions of the document:

008_it_is_found

Tips for Correcting Text

  • You can toggle “Review Recognized Text” on or off to see the original scanned text
  • You can make all corrections “mouse free”. Simply hit TAB to move the cursor to the correction text field and Enter to Accept.
  • Your document may contain artifacts such as smudges or marks which Acrobat could see as text. Simply clear the correction text field and Acrobat will show “This is not text” in the correction field:
    007_not_text
  • You can assign Preflight steps such as “Make OCR Text Visible” and other steps mentioned in this article to Actions, which let you automate multi-step processes.

How do I hear about Adobe Security Issues?

Hackers like to target products that are ubiquitously installed, so products such as Adobe Reader, Adobe Acrobat and Adobe Flash, which are installed on millions of devices around the world, are likely candidates.

For solo practitioners and small organizations, Acrobat, Reader and Flash automatically turn on auto-update which helps keep your software up to date. In essence, your machine checks for updates automatically, downloads and applies them.

Enterprises (very large organizations) often prefer other means of updating our products. As a result, they want to plan ahead as much as possible and may even have dedicated security staff who assess risk to the firm.

If you are with a larger law firm or other organization, I recommend you sign up for Adobe’s Security Notification service.

Finally, I will mention that at Adobe, we take security very seriously.

All security issues are posted proactively on Adobe’s Security Bulletins and Advisories page on our website.

What Acrobat or Reader do I have?

Every so often, I get a question through my blog where it is clear that folks aren’t sure if they are using Reader or Acrobat or what version.

Although this sounds like a simple question to answer, when I’ve thought about how I would create a post to answer it, well, it made my head hurt.

Depending on how you purchase Acrobat, you will also receive access to different tracks (Classic, Continuous or potentially both). Only the Continuous track receives interim, feature-bearing updates, like the ones I referred to in a recent blog post for the October 2015 release.

Fortunately, the Adobe Support folks just posted a Knowledge Base article which is extremely thorough.

Here it is!

https://helpx.adobe.com/acrobat/kb/identify-product-version.html?t1

Exporting a Multipage TIFF from Acrobat

I will be the first to admit that the title of this blog post is misleading. Acrobat has never been able to export a multipage (mtiff) file and still can’t.

However, I recently had to help a customer troubleshoot MTIFF conversion and I needed some multipage tiff files.

A bit of background, first.

TIFF is a bitmap file format used for images. A multipage TIFF file is a single TIFF file which contains multiple TIFF images. MTIFF files are a bit like PDF in that they contain multiple pages, but the similarity ends there.
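To make "multiple images in one file" concrete, here is a minimal Python sketch that counts the pages in a TIFF by walking its chain of image file directories (IFDs). It relies on the classic TIFF layout (an 8-byte header followed by linked IFDs) and assumes each top-level IFD is one page; BigTIFF files and sub-IFDs are not handled. The function name `count_tiff_pages` is my own.

```python
import struct

def count_tiff_pages(data: bytes) -> int:
    """Count the image file directories (IFDs) chained in a classic TIFF."""
    byte_order = data[:2]
    if byte_order == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif byte_order == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")

    # Bytes 2-3 hold the magic number 42; bytes 4-7 hold the first IFD offset.
    magic, offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")

    pages = 0
    seen = set()
    while offset != 0:
        if offset in seen:
            raise ValueError("IFD chain loops back on itself")
        seen.add(offset)
        # Each IFD starts with a 2-byte entry count, followed by 12-byte
        # entries, followed by the 4-byte offset of the next IFD (0 = last).
        (entry_count,) = struct.unpack(endian + "H", data[offset:offset + 2])
        next_pos = offset + 2 + 12 * entry_count
        (offset,) = struct.unpack(endian + "I", data[next_pos:next_pos + 4])
        pages += 1
    return pages
```

To try it on a real file, read the file in binary mode and pass the bytes in. A single-page TIFF should report 1 and a multipage TIFF should report its page count.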

Over the years, there have been some attempts to add other features to TIFF, but there has been a lack of industry agreement and since PDF was available (and superior IMHO), nothing really came of it.

In this blog post, I’ll show you how to export individual TIFFs of each page of a PDF file and then combine the TIFFs into a multipage (mtiff) file.

Exporting TIFF files from Acrobat

Follow these steps to export each page of your Acrobat file as a separate TIFF. Later, we will combine them.

  1. Open a PDF document in Acrobat DC
  2. Choose File > Export To > Image > TIFF
    1. Choose a destination folder
    2. Name the file
    3. OPTIONAL: Click the Settings button
      001_export_window
    4. Click the Save button

Acrobat will export each page in the PDF and number them sequentially:

03_export_list

002_settings_window

About the Export Settings

If you don’t click the Settings button, Acrobat will determine the colorspace of the file for you. So, if you have a color PDF, it will output a color TIFF file. Color and grayscale files are bigger than monochrome (black and white) files. Generally speaking, legal professionals convert the file to monochrome.

In the settings window, you can change several aspects of your document.

  • Monochrome (Black and White)
    CCITT G4 compression is the default and generally produces the smallest file size. This compression setting is compatible with just about anything, but ZIP compression may produce almost as small a file. Some applications cannot open TIFF files that are saved with JPEG or ZIP compression. In these cases, LZW compression is recommended.
  • RGB/CMYK/Grayscale/Other
    Specifies the type of color management for the output file. For legal workflows, you can ignore this.
  • Colorspace/Resolution
    This section lets you direct Acrobat to convert the file from (e.g.) color to black and white (monochrome) or from color to shades of gray (grayscale).  You can also set the resolution of the file in dots per inch. I recommend 300 dpi for monochrome files.

NOTE: The settings are sticky so the next time you export, the file will convert the same way.
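To see why colorspace matters so much for file size, here is a rough Python sketch of the uncompressed size of one page. It assumes a US Letter page (8.5 x 11 inches) at 300 dpi and the usual bit depths of 1, 8 and 24 bits per pixel. Real files will be smaller once compression such as CCITT G4 is applied, so treat the numbers as relative, not absolute.

```python
def raw_page_bytes(width_in: float, height_in: float, dpi: int,
                   bits_per_pixel: int) -> int:
    """Uncompressed size in bytes of one page image (rows padded to whole bytes)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px * bits_per_pixel + 7) // 8
    return bytes_per_row * height_px

# Assumed page: US Letter at 300 dpi in three common colorspaces.
for label, bits in [("monochrome", 1), ("grayscale", 8), ("RGB color", 24)]:
    size = raw_page_bytes(8.5, 11, 300, bits)
    print(f"{label:>10}: {size / 1_000_000:.1f} MB uncompressed")
```

Before compression, the grayscale page is about 8 times the size of the monochrome page and the color page about 24 times.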

 

Combining the TIFFs to create a MTIFF

The next step is to combine the single-page TIFFs into a multipage TIFF. As mentioned previously, Acrobat can’t do this, but you can use the freeware program IrfanView.

IrfanView is free for non-commercial use and works on Windows Vista, Windows 7, Windows 8, and Windows 10. Just click the URL for IrfanView and then find the download link to install it.

Once you have installed IrfanView, follow these steps to combine the TIFFs output from Acrobat into an MTIFF:

  1. Choose Options > Multipage Images > Create Multipage Images
  2. In the next window:
    1. Click the Add images button to grab the TIFF files you previously exported from Acrobat
    2. Click Compression to choose Compression settings.
      I recommend CCITT Fax 4 for most monochrome legal documents.
    3. Give the file a name
    4. Click Create TIF image to save it.
      04_irfanview_options

Can’t I convert PDF to MTIFF directly? What about file size?

There are products that purport to do this. My experience with most products is that the fidelity of the file suffers. I think Acrobat does a superior job converting PDF to other formats.

IrfanView has a PDF plug-in, too, which requires Ghostscript, a PostScript and PDF interpreter. I wasn’t successful in getting it to work, but perhaps you’ll have better luck.

One thing you may find when converting PDF to TIFF is that the file size gets a lot larger. TIFF is only an image format while PDF can be vector, text, and image, with each area compressed optimally.  Depending on your source file, the MTIFF may be 2 to 100 times larger.

New Acrobat DC October Update introduces Tabbed Interface and More

Earlier this week, we shipped the Adobe Acrobat DC 2015 (October release). This new release includes some really nice new features.

Below is a run-down on just a few of the features, but you can click here for a complete list.

Tabbed Interface

I love this new feature! Acrobat now groups your documents into tabs like your web browser:

01_tabs

If you don’t like the tabbed view, you can turn it off in Preferences (CTRL/CMD-K):

02_tab_preferences

Better Sticky Notes

Sticky Notes work more smoothly than in previous versions and they look a bit nicer.

It’s easier now to reply to the comment in a Sticky Note. Previously, you had to go to the fly-out menu of the Sticky Note to reply.

03_sticky_reply

Nicer Combine Files

The Combine Files option has been updated with a nicer user interface. It’s easier now to expand or collapse the pages in a file and to delete the pages you don’t want to combine.

Just select the file and the controls appear on top of it.

04_combine

PDF Editing Improvements

Acrobat DC introduced really robust PDF editing enhancements and we’ve added more in the October Release.

In the October 2015 release, you can now select different bullet types and even convert bullet lists to numbered lists and vice-versa.

05_lists

Cool Keyboard Shortcut Guide for Acrobat DC

I came across this very nice 2015 Adobe Acrobat DC Keyboard Shortcuts Cheat Sheet on the Setup a Blog Today website.

The link is to a JPEG, but you could convert it to a PDF by simply opening it in Acrobat or printing it to the PDF printer.

Nice work, so I thought I would share it. Enjoy!

shortcut_guide

 

Update: Dynamic Paid and Received Stamps

I was speaking with author David Blatner at the 2015 Adobe Max conference. David is a top speaker and author on many Adobe creative products. I was surprised to hear that he was using some Stamps from this blog, but he also informed me that a previous post on Dynamic Paid and Received Stamps was missing.

Mea culpa. I had meant to update the article, but had set it to Draft status.

Here’s your fix, David!

Unlike static stamps, Dynamic Stamps use a bit of JavaScript to enter variable information.

Via this article, you can download a set of four Paid and Received stamps:

000_sample_stamps

Four Types of Stamps

I included four types of stamps in this set:

  • Received Stamp with current date
  • Enter your own info Received Stamp
  • Paid Stamp with current date
  • Enter your own info Paid Stamp

Below, I cover:

  • Download
  • Installation
  • How to use the stamp

Enjoy!

Download the File

Received and Paid Stamps (68K)

Make sure you download the file; don’t just view it in your browser.

Install the Stamp File

You must INSTALL the Stamps file to use it. Opening it in Acrobat won’t do anything!

You will need to be an admin on your computer to install the file.

  1. Quit Acrobat if it is already open.
  2. Copy the Review Stamps.pdf file to the User Stamps folder:

Windows
Acrobat DC
C:\Users\USERNAME\AppData\Roaming\Adobe\Acrobat\DC\Stamps

Acrobat XI
C:\Users\USERNAME\AppData\Roaming\Adobe\Acrobat\11.0\Stamps

Mac OS X

Acrobat DC
/Macintosh HD/Users/USERNAME/Library/Application Support/Adobe/Acrobat/DC/Stamps/

Acrobat XI
/Macintosh HD/Users/USERNAME/Library/Application Support/Adobe/Acrobat/11.0/Stamps/
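If you need these paths for several users or machines, the substitution of USERNAME and the version folder can be sketched in Python. This is only an illustration built from the paths listed above. The function name `stamps_folder` and the user name `jsmith` are hypothetical, and the Mac path omits "Macintosh HD" on the assumption that it is the name of the startup volume, so the path on disk begins at /Users.

```python
from pathlib import PurePosixPath, PureWindowsPath

def stamps_folder(username: str, platform: str, version: str = "DC"):
    """Build the user Stamps folder path for one user.

    platform is "windows" or "mac"; version is "DC" or "11.0" (Acrobat XI).
    """
    if platform == "windows":
        return PureWindowsPath("C:/Users", username, "AppData", "Roaming",
                               "Adobe", "Acrobat", version, "Stamps")
    if platform == "mac":
        return PurePosixPath("/Users", username, "Library",
                             "Application Support", "Adobe", "Acrobat",
                             version, "Stamps")
    raise ValueError("platform must be 'windows' or 'mac'")

print(stamps_folder("jsmith", "windows"))
print(stamps_folder("jsmith", "mac", version="11.0"))
```

The pure path classes only build the text of the path; they do not check that the folder exists on your computer.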

The folders might be hidden . . .

These folder locations may be hidden on your computer, so don’t freak out if you don’t see them at first.

Here are some tips for finding them:

WIN: Open an Explorer window and paste the path into it. Change the USERNAME to your user name and hit enter.
MAC: Open your Home folder, then go to the View menu and choose Show View Options. Check Show Library Folder.

On the Mac, you will need to show your Library folder

Another way to find the Stamps folder

An alternate way to find your stamps folder is to have Acrobat tell you where it is located. You can do this from the JavaScript debugger. Here’s how:

  1. Hit CTRL-J (Win) or CMD-J (Mac)
  2. Enter app.getPath("user", "stamps");
  3. Hit CTRL-ENTER (Win) or CMD-Enter (Mac) to see the stamps path

debugger

Using the Dynamic Paid and Received Stamps

The instructions below are for Acrobat DC. For instructions for Acrobat XI, see Adobe Help.

  1. In the Right Hand Pane, choose Comment
    001a_see_the_stamps
  2. Click the Stamp tool in the Stamps bar above the document window:
    002a_stamp_bar
  3. From the dropdown menu, choose the Received and Paid Stamps category:
    003a_pull_down
  4. Stamp the document by clicking where you want the stamp to go.
    NOTE: If you chose one of the stamps which add custom text, a pop-up window will appear in which you can add your text:
    002_js_window

 

Sorry, No Custom Versions

Unfortunately, these stamps cannot be edited or changed. There’s “special sauce” in building them.

If you are really interested in building a custom dynamic stamp, check out http://www.pdfscripting.com/ which has several dynamic stamps available and instructions for building them. Note that this is a paid website.

A new way to buy Acrobat DC: Subscription

Before going further, I need to make sure that you know that you certainly can continue to buy and upgrade Acrobat as you have in the past without buying a subscription.

Subscription is a new, additional purchase option for Acrobat.

Adobe has other software subscription offerings such as the Creative Cloud. The idea of subscription software is new to some folks, so I thought I would offer some background here and discuss some factors you might consider in making a decision of Buy versus Subscribe.

Note that purchase considerations will vary quite a bit between an individual or small firm and a large enterprise, and that the opinions below are my own.

Software Licensing Models

Software is licensed using different models.

A Perpetual software license offers you the right to use the software subject to the terms of the End User License Agreement, in perpetuity.

A Subscription software license typically allows you to use the software during the term that the subscription is valid (or paid). Subscription offerings generally include all updates and upgrades that take place during the subscription term. Many subscription software offerings, including Acrobat DC and Creative Cloud, include access to other online services and products not included with the desktop software.

Perpetual May not be Forever

Note that although perpetual licenses may be used forever, practical factors often dictate a shorter realistic lifespan. If you upgrade your operating system, hardware or companion software, you may need to purchase an upgrade for compatibility.

Examples

  • Acrobat 8 isn’t compatible with Windows 7
  • Acrobat XI or higher is required for compatibility with Office 2013

Another issue is support at End of Life (EOL). Adobe, like other software companies, does not support software forever. Acrobat 9 went EOL on June 26, 2013, which means Adobe is no longer providing security fixes for Acrobat 9. Although you may continue to use it, you could put yourself at risk if hackers come up with a new attack method not addressed in previous updates.

I’ll note that Acrobat X goes EOL in November 2015.

When is a Subscription better than a Perpetual license?

A software subscription typically makes sense if you like to keep your software current or don’t want to pay the up-front cost of a perpetual license. If you value the added product offerings, then subscriptions are even more attractive. A subscription can be convenient in that your spending is predictable. You won’t have to suddenly find money to pay for a needed upgrade. Larger enterprises often prefer to have software as an operating rather than capital expense.

Dollars and Sense

I do not have a clever analogy or a calculator for the lease versus buy decision. It really depends on what is important to you. Subscriptions may be more expensive than buying perpetual software, particularly if you don’t usually buy each upgrade.

However, subscriptions offer many benefits as outlined previously and, at least in the case of Acrobat DC, bundled services which may be valuable to you.

I have provided pricing and a product release history below.

License Cost

Acrobat Pro DC costs $449 (perpetual license). An upgrade from a previous version is $199.

Product Release History

  • Acrobat 9.0 shipped on June 25, 2008
  • Acrobat X shipped on November 15, 2010
  • Acrobat XI shipped on October 15, 2012
  • Acrobat DC shipped on April 7, 2015

Feature and Price Comparison

Pricing listed is for single unit individual licenses. Larger organizations often buy under an Adobe licensing program. In that case, subscriptions can be even more advantageous.

Acrobat DC: Subscription vs Perpetual Comparison

| Feature | Acrobat Pro DC Subscription | Acrobat Pro DC Perpetual |
|---|---|---|
| New Purchase | $14.99/month paid annually ($179.88 per year) | $449 up front |
| Upgrade from Previous Version | N/A | $199 |
| Install Acrobat on your desktop Mac or PC | Yes | Yes |
| Use Adobe Send and Track to send large files | Yes | No |
| Collect e-signatures from others and track responses in real time | Yes | No |
| Create PDFs on the go in a browser or on a mobile device | Yes | No |
| In a browser, merge multiple documents in one PDF | Yes | No |
| Export PDFs to Microsoft Office formats in a browser or on a mobile device | Yes | No |
| Add or edit text, or rearrange pages, in a PDF on your iPad in Acrobat Mobile | Yes | No |
| Release Cycle | Continuous release cycle with new features delivered throughout | Every two years. You must purchase an upgrade for access to new features |

Here’s a simple analysis of the cost of buying a perpetual license vs a subscription license. This assumes you want to keep your software current throughout the product life cycle.

| | Perpetual | Subscription |
|---|---|---|
| Year 1 | $449 | $179.88 |
| Year 2 | - | $179.88 |
| Year 3 | $199 Upgrade | $179.88 |
| Year 4 | - | $179.88 |
| Year 5 | $199 Upgrade | $179.88 |
| Total | $847 | $899.40 |
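If your time horizon or prices differ, the same comparison is easy to rerun. The Python sketch below is only a back-of-the-envelope model under stated assumptions: it uses the prices quoted in this post ($449 purchase, $199 upgrade, $14.99 per month, so $179.88 per subscription year), assumes you buy an upgrade every two years starting in year 3, and ignores taxes and volume licensing. Amounts are kept in cents to avoid rounding surprises.

```python
def perpetual_cost_cents(years: int, purchase: int = 44_900,
                         upgrade: int = 19_900, upgrade_every: int = 2) -> int:
    """Total perpetual cost: buy in year 1, then upgrade every `upgrade_every` years."""
    total = purchase
    for year in range(2, years + 1):
        if (year - 1) % upgrade_every == 0:  # years 3, 5, 7, ... by default
            total += upgrade
    return total

def subscription_cost_cents(years: int, monthly: int = 1_499) -> int:
    """Total subscription cost at a fixed monthly price."""
    return monthly * 12 * years

for years in (1, 3, 5):
    p = perpetual_cost_cents(years)
    s = subscription_cost_cents(years)
    print(f"{years} year(s): perpetual ${p / 100:,.2f}, subscription ${s / 100:,.2f}")
```

With these assumptions the two options end up within about $50 of each other over five years, so the bundled services and release cycle are likely to matter more than the raw price.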

Final Thoughts

Subscription software isn’t a new idea, but it is relatively new to Adobe. The world is increasingly mobile, so having desktop software that connects to services sold as an all-in-one offering can be appealing. In addition, with subscription software, Adobe can deliver new features throughout the release cycle, meaning you don’t have to wait to get access to new productivity enhancements.

One area that definitely merits your consideration is the bundled eSignature capabilities. Acrobat DC with Services allows you to gather legal electronic signatures on your PDF. This allows you to on-board new clients and get agreements signed anywhere, on any device with a web browser. Instead of printing, signing, scanning or mailing, your clients can move the signature ceremony securely to the cloud, which improves client satisfaction. Younger clients, increasingly, expect to do business without paper.

Resources

Acrobat DC Plans and Pricing

Version Comparison (DC vs X and XI)

Main Acrobat DC Page

Acrobat DC ends the dreaded “Renderable Text” Error for Scanned Docs

Acrobat (XI and earlier) sometimes confounded legal professionals during the scanning and OCR process with “renderable text” errors.

In older versions of Acrobat, if vector text was found outside of the page boundaries, Acrobat would refuse to OCR the document. Here’s the error message you would typically see:

renderable_text_error

Over the years, I found a variety of odd PDFs from fax systems or other systems that would add vector text or graphics in odd places on the page which would cause errors. At one time, I even helped a small law firm discover that the other side had deliberately embedded vector text to prevent OCR. Ah, the games that get played in discovery, but, I digress . . .

Adobe implemented a partial resolution and I wrote about the fix for the issue in Acrobat 8. This specific fix resolved the problem as long as the renderable vector elements were found within 20% of the page boundaries. However, we still found users that ran into this issue, especially with federal court files which contained vector stamps which sometimes were placed right in the middle of the page.

The good news is that Acrobat DC can segment image layers from text layers in existing PDFs and OCR the image layer only.

To test this, I created a text comment on top of a scanned PDF, then flattened the file. Note that the text I placed is directly in the middle of the page (see below).

OCRs Just Fine!

Acrobat OCR’d the scanned image layer and the document is completely searchable.

You won’t find this listed among the Acrobat DC new features, but here’s to progress.

Well, uh, it’s almost gone . . .

You might still run into the Renderable Text error if you try to OCR a document which is completely vector-based (an electronic PDF if you will).

An example of a document that will still trigger the error when you try to OCR is a text-only document created in Word and directly output to PDF.

From time to time, a customer will send me a PDF which generates the error. I often discover that the document isn’t a scanned document at all. In that case, you don’t need to OCR the document because all the text is already searchable.

 

Acrobat DC New Feature: Tools Search

Acrobat, like other business software, has a lot of tools. In most software, you have to know where to access a tool to use it. That can be frustrating if you don’t use the tool frequently.

One of my favorite features of Acrobat DC is Tools Search. Now, you can type in the name of the tool to find it.

Here’s how it works . . .

Let’s say you have some confidential information in a document which needs to be redacted (permanently removed).

Redaction tools aren’t part of the default panels in Acrobat DC, and maybe you don’t use them very frequently.

Just click your cursor in the Search Tools field:

000_ui_start

 

Then, type a few characters of the tool name. Boom! You just found the tool!

001_search_entered

 

Even though I use Acrobat all the time, I still search for tools from time to time. It’s fast and it means I don’t have to remember where a tool is to actually use it.