News

EEBO now in TypeWright

We are pleased to announce that the Mellon-funded Early Modern OCR Project – eMOP – has completed running Optical Character Recognition Software on the 138,538 documents in ProQuest’s Early English Books Online (EEBO), and we are now making almost all of them available in 18thConnect.org for correcting the OCR. Some document images were too poor to run through the software, but we have loaded the resulting “dirty OCR” for 113,909 documents into the TypeWright tool at 18thConnect.org for crowd-sourced correction (http://www.18thconnect.org/typewright/documents).

eMOP Mellon Final Report

eMOP Releases its Full Set of Early Modern Typeface Training for Tesseract

In accordance with Andrew W. Mellon Foundation grant requirements and IDHMC guiding principles, the Early Modern OCR Project has released all of the Early Modern Typeface Training we created for use with the Tesseract OCR engine.

More Early Modern Word Lists Released by eMOP on GitHub

The eMOP team is happy to announce the release of more early modern word lists, which we have compiled, cleaned, and combined over the last 2 years. Our sources include Ted Underwood, Martin Mueller, Loretta Auvil, the VARD project, and the TCP transcriptions of EEBO and ECCO. Please see our GitHub page for more information.

SAA 2014 Pre-Conference Workshop - OCRing with Open-Source Tools

The slides for our 1-day pre-conference workshop on OCRing with Open-Source Tools, given at the Society of American Archivists 2014 Annual Conference in Washington, DC, on August 12.

Or download the original PowerPoint slides.

Early Modern Word List with Variant Spellings

The eMOP team is happy to release the early modern word list we've compiled by parsing the 46,000 TCP transcriptions of EEBO & ECCO documents and combining it with the alternate spelling list available via the VARD tool.
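
As a rough illustration of the kind of process described above, and not eMOP's actual code, the sketch below counts word forms in a handful of transcribed texts and then merges in a list of variant spellings. The function names, the in-memory inputs, and the shape of the variant mapping (modern form to list of variants) are all assumptions made for the example; the real TCP and VARD data formats are not shown here.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# Matches runs of Unicode letters only (no digits, underscores, or punctuation).
WORD_RE = re.compile(r"[^\W\d_]+")


def build_word_list(texts: Iterable[str]) -> Counter:
    """Count lowercased word forms across a collection of transcriptions."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts


def add_variants(words: Iterable[str], variants: Dict[str, List[str]]) -> List[str]:
    """Merge a word list with variant spellings.

    `variants` is a hypothetical mapping of modern form -> variant spellings.
    """
    merged = set(words)
    for modern, alternates in variants.items():
        merged.add(modern)
        merged.update(alternates)
    return sorted(merged)


if __name__ == "__main__":
    counts = build_word_list(["Loue is loue", "love"])
    print(add_variants(counts, {"love": ["loue", "lov"]}))
```

Keeping the counts alongside the merged list is a design choice for the sketch: frequencies make it easier to spot rare forms that may be transcription noise rather than genuine variants.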

eMOP @ DH2014-Lausanne: eMOP and the Cobre Tool

A presentation from DH2014-Lausanne discussing distributed reading, crowdsourcing and the Cobre tool as used in eMOP.

eMOP @ DH2014-Lausanne: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

A presentation at DH2014-Lausanne discussing some of the problems faced during the eMOP project, and by eMOP project collaborators on other large projects. We discuss how changes are dealt with in large DH projects.

eMOP @ DH2014-Lausanne: eMOP Poster

The poster we presented on eMOP at DH2014-Lausanne. It was a huge success and we talked to many people interested in the work we are doing and the tools we've created.

eMOP @ DH2014-Lausanne: eMOP Post-OCR Triage

A presentation at DH2014-Lausanne by eMOP (at the IDHMC at Texas A&M) on our post-processing triage method, along with our expanded treatment and diagnosis queues for correcting and analysing Tesseract OCR results.

eMOP @ DH2014-Lausanne: eMOP Book History Tools

A presentation at DH2014-Lausanne of the Tesseract training methods and tools developed by eMOP (at the IDHMC at Texas A&M), and their potential application for other book history and typeface history research projects.

TCDL 24x7 Presentation on eMOP Workflows (April 28, 2014)

This is a 24x7 presentation given at the Texas Conference on Digital Libraries in Austin, TX: 24 slides in 7 minutes. A very fast presentation of eMOP based on our workflows.

TxDHC Presentation of eMOP Workflows (April 11, 2014)

A text outline of a presentation given at the Texas Digital Humanities Consortium's (TxDHC) Conference at the University of Houston, April 10-12, 2014. This presentation provides an overview of the OCR training, OCRing, and post-processing analysis and correction processes being done by eMOP through a series of workflow diagrams created over the life of the project.

Follow eMOP at the 1st annual Texas Digital Humanities Consortium Conference via Twitter @ Storify: https://storify.com/EMGrumbach/emop-at-txdhc

Historical Typemaking and its Artifacts

In late 2013, Todd Samuelson traveled to Europe in search of typographical specimens for the eMOP initiative. In a series of dispatches, he will highlight his findings and discuss the significance of historical research in the development of the project.

eMOP Mellon Interim Report

Prepared by PI and IDHMC Director, Dr. Laura Mandell, and eMOP Co-Project Managers for year two, Matthew Christy and Elizabeth Grumbach, the following post contains the Mellon Interim Report for the Early Modern OCR Project.

Special Characters, Unicode, and Early Modern English

With a dataset of 45 million page images, the eMOP team is dealing with a lot of text output, and that means dealing with Unicode. As an early modern English project, we're also working with ligatures and other special characters specific to the period, and that means considering MUFI (the Medieval Unicode Font Initiative).
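
To make the Unicode issue concrete, here is a minimal sketch using only Python's standard `unicodedata` module. It is an illustration, not part of the eMOP pipeline. It shows two things: standard NFKC normalization folds compatibility characters such as the "fi" ligature (U+FB01) and the long s (U+017F) into plain letters, while characters in the Private Use Area have no standard decomposition and pass through unchanged. The function names are hypothetical, and U+E000 is used only as a stand-in for a private-use code point.

```python
import unicodedata
from typing import List


def fold_compatibility_forms(text: str) -> str:
    """Apply NFKC normalization, which replaces compatibility characters
    (e.g. ligatures, long s) with their plain-letter equivalents."""
    return unicodedata.normalize("NFKC", text)


def private_use_chars(text: str) -> List[str]:
    """Return the characters in `text` whose general category is 'Co'
    (private use); normalization leaves these untouched."""
    return [ch for ch in text if unicodedata.category(ch) == "Co"]


if __name__ == "__main__":
    # U+FB01 LATIN SMALL LIGATURE FI, U+017F LATIN SMALL LETTER LONG S
    sample = "\ufb01r\u017ft"
    print(fold_compatibility_forms(sample))

    # U+E000 is an arbitrary private-use code point chosen for the example.
    text_with_pua = "a\ue000b"
    print(["U+%04X" % ord(ch) for ch in private_use_chars(text_with_pua)])
```

Whether to fold such characters at all is a project decision: folding helps dictionary lookup and search, while preserving them keeps the transcription closer to the printed page.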

October OCR Testing & Training

eMOP progress continues as our team experiments to find the best method for training Tesseract to recognize various early modern fonts. The new Franken+ tool, developed by eMOP graduate student Bryan Tarpley, has passed through the alpha testing phase and dramatically improves our ability to create a variety of training sets for Tesseract. Now we're hard at work investigating various methods for creating “training sets” for Tesseract to see which will give us the best OCR results.

eMOP's Zotero Page of OCR Readings

An eMOP library exists under the IDHMC Group in Zotero. It contains a variety of readings related to OCR in general and Tesseract in particular. Come check it out (at eMOP Zotero Library) and peruse our collection of OCR-related readings. You'll never want to know more than this about OCR.

This Fall on eMOP: Post Processing

 

In the near future, we intend to write up a post detailing our successes and goals for this fall, but we'd like to immediately share an interesting development at the beginning of Year Two. As our team and collaborators begin thinking towards the post-processing and triage stage of this project, we've been having a series of meetings here to rethink the granularity of our diagnostics and triage approach.

KB National Library of the Netherlands posts on eMOP

KB National Library of the Netherlands has recently given the Early Modern OCR Project some publicity on the other side of the Atlantic. Koninklijke Bibliotheek (KB) coordinates one of our international partner projects, IMPACT: Improving Access to Text.

eMOP Featured in Library Journal


Matt Enis, Associate Editor of Technology for the Library Journal, asks "OCR [optical character recognition] works great for paperbacks—but what about 15th Century texts set by hand?"

ProQuest Joins Forces with TAMU Scholars to Make 15th Century Books Behave Like Born-Digital Text


ANN ARBOR, Mich., November 6, 2012 - Information powerhouse ProQuest is participating in a project that will vastly accelerate research of 15th through 17th Century cultural history. The company will provide access to page images from the venerable Early English Books Online and newcomer Early European Books to the Early Modern OCR Project (eMOP) at Texas A&M. eMOP will use the content to create a database of typefaces used in the early modern era, train OCR software to read them and then apply crowd-sourcing for editing. The project will turn the rich corpus of works from this pivotal historical period into fully searchable digital documents.

eMOP Receives Funding from Andrew W. Mellon Foundation

English Professor Laura Mandell, Director of the Initiative for Digital Humanities, Media, and Culture (IDHMC), and her two co-PIs, Professor Ricardo Gutierrez-Osuna and Professor Richard Furuta, are very pleased to announce that Texas A&M has received a 2-year, $734,000 development grant from the Andrew W. Mellon Foundation for the Early Modern OCR Project (eMOP, http://emop.tamu.edu). The two other project leaders, Anton DuPlessis and Todd Samuelson, are book historians from Cushing Rare Books Library.