Digital Library Center
Smathers Libraries
University of Florida
P.O. Box 117003
Gainesville, FL 32611 USA
P: 352.273.2900
F: 352.846.3702
DLC@uflib.ufl.edu
Workflow in the DLC, Featuring Prime Recognition™ Software
Overview
Going from letters on the printed page to online searchable text involves the following steps:
- Digital Scanning: Converting the original materials into image files
- Quality Control (QC): Inspecting the images and creating metadata
- Optical Character Recognition (OCR): Converting image files into text files
- Markup: Applying machine-readable metadata code to our content
Image Creation and Quality Control
Once scanning is complete and the digital images have passed quality control for image quality and skew, PrimeOCR performs image zoning when the target data is arranged in columns or tables.
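To make the idea of zoning concrete, here is a minimal sketch of one common way to find column zones: a vertical projection profile over a binarized page. This is an illustration only, not a description of how PrimeOCR zones images. The function name `find_column_zones`, the `min_gap` parameter, and the list-of-rows image format are assumptions made for the example.

```python
from typing import List, Tuple


def find_column_zones(page: List[List[int]], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges of text columns; end is exclusive.

    `page` is a binarized image given as rows of 0 (background) and 1 (ink).
    A run of at least `min_gap` empty pixel columns separates two zones.
    Hypothetical helper for illustration, not part of any OCR product.
    """
    if not page:
        return []
    width = len(page[0])
    # Vertical projection profile: amount of ink in each pixel column.
    profile = [sum(row[x] for row in page) for x in range(width)]

    zones: List[Tuple[int, int]] = []
    start = None  # x position where the current zone began
    gap = 0       # length of the current run of empty pixel columns
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The zone ended just before this run of empty columns.
                zones.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close a zone that runs to (or near) the right edge of the page.
        zones.append((start, width - gap))
    return zones


if __name__ == "__main__":
    # Two narrow text columns separated by three empty pixel columns.
    sample = [
        [1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
        [0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
    ]
    print(find_column_zones(sample))
```

Real pages would need a larger `min_gap`, tuned to the scan resolution, so that the spaces between words are not mistaken for column gutters.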
Text Creation
Plain-text files are created from the TIFF image files by means of optical character recognition (OCR). The alternative to OCR is manual transcription: lots of typing.
| Original Image File (TIFF) | Plain Text File (TXT) |
|---|---|
| (image not reproduced) | Shingles- Manufacturers of. DIXON NICHOLAS, First av c Miller (for ad. see index) Silver and Silver Plated Ware. AYRES C. L., Franklin c Jackson (for ad. see index) Skating Rinks- Roller. Jackson c Morgan Charles Parcell, prop. |
- About OCR: The branch of computer science that deals with extracting text from an electronic image file.
  - Preprocessing (deskew, despeckle)
  - Algorithms for character modeling
  - Lexical checking
  - Newsgroup: comp.ai.doc-analysis.ocr
- About & Running PrimeOCR™
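As an illustration of the lexical checking step listed above, the sketch below flags OCR tokens that are missing from a word list and suggests the closest known word. It uses Python's standard `difflib` module and is an assumption-laden example, not the method PrimeOCR uses. The function name `lexical_check`, the `cutoff` value, and the tiny lexicon are hypothetical.

```python
import difflib
import re
from typing import Iterable, List, Optional, Tuple


def lexical_check(
    text: str, lexicon: Iterable[str], cutoff: float = 0.8
) -> List[Tuple[str, Optional[str]]]:
    """Return (token, suggestion) pairs for tokens not found in the lexicon.

    The suggestion is the closest lexicon word by difflib similarity,
    or None when nothing scores at or above `cutoff`.
    """
    known = {word.lower() for word in lexicon}
    candidates = sorted(known)  # stable order for reproducible suggestions
    flagged: List[Tuple[str, Optional[str]]] = []
    for token in re.findall(r"[A-Za-z]+", text):
        word = token.lower()
        if word in known:
            continue
        matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
        flagged.append((token, matches[0] if matches else None))
    return flagged


if __name__ == "__main__":
    # Hypothetical lexicon and a line with two simulated OCR errors.
    words = ["shingles", "manufacturers", "of", "silver", "plated", "ware"]
    for token, suggestion in lexical_check("Shingles Manufacturcrs of Silver Platcd Ware", words):
        print(token, "->", suggestion)
```

A checker like this only proposes corrections. Proper names, which are common in directories and newspapers, will often be flagged even when the OCR output is correct, so the suggestions still need human review.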
Markup
Applying markup to the text produced by OCR involves three steps, applied in this order:
- Named Entity Recognition
- TEI Header
- DLXS XPAT SGML
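For readers unfamiliar with the TEI Header step, the following sketch builds a bare-bones header containing the three parts that a TEI file description requires: a title statement, a publication statement, and a source description. It is a generic example under stated assumptions. The fields the DLC actually records and the SGML DTD used for DLXS are not described here, and this code emits XML rather than SGML. The function name `build_tei_header` and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_tei_header(title: str, publisher: str, source_note: str) -> str:
    """Serialize a minimal teiHeader element as a string.

    Structure: teiHeader > fileDesc > (titleStmt, publicationStmt, sourceDesc).
    Illustrative only; a production header would carry far more metadata.
    """
    header = ET.Element("teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")

    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title

    publication_stmt = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication_stmt, "publisher").text = publisher

    source_desc = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source_desc, "p").text = source_note

    # encoding="unicode" returns a str with no XML declaration.
    return ET.tostring(header, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values, not taken from a real record.
    print(
        build_tei_header(
            title="Example City Directory",
            publisher="Digital Library Center, University of Florida",
            source_note="Transcribed by OCR from scanned page images.",
        )
    )
```

Building the header with an XML library instead of string concatenation means that characters such as `&` in a title are escaped automatically.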
Text Quality Control
Prime Recognition™ output is more than 99% accurate, which reduces the time needed for quality control. Even so, we currently proofread the tables of contents in the SGML file.
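The page does not say how the accuracy figure is measured. One common convention is character accuracy: one minus the edit distance between the OCR output and a proofread reference, divided by the length of the reference. The sketch below computes it that way, purely as an assumption for illustration. The function names `edit_distance` and `character_accuracy` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete char_a
                    curr[j - 1] + 1,     # insert char_b
                    prev[j - 1] + cost,  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Return 1 - (edit distance / reference length), floored at 0.0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # One substituted character in an 18-character reference line.
    print(character_accuracy("Silver Plated Ware", "Silver Platcd Ware"))
```

Measured this way, 99% accuracy still allows about one wrong character in every hundred, which is one reason proofreading high-visibility text such as tables of contents remains worthwhile.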
