eMOP Quick Start Guide for Aletheia and Franken+
Aletheia and Franken+ are key tools used by the eMOP team to create early-modern typeface training for Tesseract. Here are some quick tips on getting started with them. Both tools offer much more functionality than is covered here, so keep exploring them.
Aletheia Quick Start
Aletheia is a ground truth creation tool developed by eMOP collaborators PRImA Research Labs at the University of Salford, UK. For eMOP we are using it to identify the glyphs on a set of high-quality page images printed in a typeface with which we want to train Tesseract.
- Open the Aletheia tool
- Click on Start New Document - Image File
- Find the TIFF image file
- Continue without B&W
Process Image
- Click on Image tab
- Click on Threshold in Binarization window
- Go with default, or try experimenting with different settings
- Click on Remove Noise
- Go with default, or try experimenting with different settings
- Click Save B/W. Save this file in a separate folder from the original image file, for example a subfolder.
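If you are wondering what the Threshold setting actually changes, here is a minimal sketch of a generic global threshold in Python. This is not Aletheia's own code, and the guide doesn't say which binarization method Aletheia uses; the `binarize` function, the cutoff of 128, and the tiny sample page are all made up for illustration.

```python
# Illustrative only: a generic global threshold, NOT Aletheia's actual algorithm.
def binarize(gray_rows, threshold=128):
    """Map each 0-255 grayscale pixel to 0 (ink) or 255 (paper)."""
    return [[0 if px < threshold else 255 for px in row] for row in gray_rows]


# A tiny made-up "page": low values are dark ink, high values are light paper.
page = [
    [250, 240, 30, 245],
    [248, 20, 25, 251],
    [90, 200, 180, 60],
]

bw = binarize(page)                    # default cutoff
lower = binarize(page, threshold=80)   # lower cutoff keeps less faint ink

for row in bw:
    print(row)
```

With the lower cutoff, the faint pixel valued 90 turns white instead of black. That is the kind of trade-off you are making when you experiment with the settings: too low and thin strokes disappear, too high and stains or show-through become "ink".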
Find Glyphs
- Click on Regions tab
- Click on Analyze Page in Auto window
- Select Glyphs, words, lines, and regions with text in Analysis Depth window. You may have to close and reopen the B/W image to run analysis
- Click on Glyphs tab to see individual glyphs
Correcting Text
- Check Text Overlay box in Text window
- Move the cursor (hand) over a line to see its text, or click Select all in the Actions window to see all of it
- If you see something wrong, click on the box around that glyph
- Click on Text Content in Text Window (or F11)
- Erase the old value and enter the correct text in this window. Common ligatures and special characters are available for selection at the bottom or you can enter your own Unicode value. Use Google or check here for Unicode values: http://www.russellcottrell.com/greek/utilities/UnicodeRanges.htm or the MUFI data set
- Click Save
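If you need to find or double-check a Unicode value, Python's standard `unicodedata` module can help. The sketch below is just one way to do it; the `describe` helper is our own name, and the long s and ligatures are only sample characters. Characters in the Private Use Area (which some MUFI characters use) have no standard Unicode name, so they show up as UNNAMED here.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
        for ch in text
    ]


# Sample characters: long s, fi ligature, ff ligature (written as escapes).
for ch, code, name in describe("\u017f\ufb01\ufb00"):
    print(ch, code, name)

# Going the other way: turn a hex value from a chart into a character.
long_s = chr(int("017F", 16))
```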
Correcting Boxes
- Click Edit in Basic Tools window to resize or move existing boxes
- Click Select in Basic Tools to delete existing boxes
- Click Rectangle in Draw Contour window to draw a new box
- Use steps above to assign text value to new glyph box
Saving XML
- Click on the Save As... icon at the top left. Be sure to save with the same name as the B/W image, in the same folder, with an .xml extension
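Once you have a folder full of these files, it's easy to lose track of which images have a matching XML file. Here is a small Python sketch that lists any file without a partner. The `unpaired_files` helper is hypothetical (it is not part of Aletheia or Franken+), and it assumes the images end in .tif or .tiff. The demo builds a throwaway folder so it can run anywhere; point the function at your own folder instead.

```python
import tempfile
from pathlib import Path

TIFF_SUFFIXES = {".tif", ".tiff"}  # assumed image extensions


def unpaired_files(folder):
    """Return (images with no XML, XML files with no image) as sorted base names."""
    images, xmls = set(), set()
    for path in Path(folder).iterdir():
        suffix = path.suffix.lower()
        if suffix in TIFF_SUFFIXES:
            images.add(path.stem)
        elif suffix == ".xml":
            xmls.add(path.stem)
    return sorted(images - xmls), sorted(xmls - images)


# Demo with made-up file names in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("page1.tif", "page1.xml", "page2.tif", "page3.xml"):
        (Path(tmp) / name).touch()
    missing_xml, missing_image = unpaired_files(tmp)
    print("Images without XML:", missing_xml)
    print("XML without image:", missing_image)
```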
Franken+ Quick Start
Franken+ is a tool developed at the IDHMC for eMOP which allows us to select just the best exemplars of the glyphs discovered by Aletheia to create ideal early-modern typeface training for Tesseract.
Create a Language
This is not related to the actual language you're using; it's just a label for your font training.
- Click on Create Language and give yours a name. 3-4 letters is the standard
Create a Font
- Click the Create button under Font:
- In the new window enter a name in the Font Name box
- If it's an italic font, click the Italic box. If it's a blackletter (gothic) font, click the Fraktur box
- Click Save Font
Ingest Aletheia files (TIF/XML pairs)
- Make sure that your Language and Font fields at top are selected
- In Aletheia TIF/XML box enter the path to a folder that contains the set of Aletheia files you created
- Click Ingest Glyphs button
Edit Font
- Set the Language and Font fields
- Click on the Edit button
- In the new window click on the pull down menu under Glyphs
- Use the Display Size slider at the bottom to get a better view of the letters
- Usually it's easiest to click the Remove All button and then click on the exemplars you like to add them back. We've found 5 is a good number (when available: I'm looking at you Q, X and Z)
- But if you see some exemplars that are simply wrong, or so bad they're not worth keeping, then it's best to first choose those and hit Delete Removed. A smaller set of glyphs is easier to edit, and cleaning out the DB helps its performance
- Alternatively, you can reclassify glyphs that are of good quality but have been assigned the wrong Unicode value
- Select the glyph, then click the Reclassify Glyph button above the editing window
- You can either assign it to a new glyph that already exists in your set with the pull down menu, or assign it a Unicode value if it isn't in your set
- You can also edit the images if you have Photoshop on your computer, but be careful about doing that
- Right click on the exemplar and select Edit Image
- When you're done click the Save button
Create TIFF/Box File Pairs
In the Synthesize TIF/Box Pair area, select a text file to transcribe using your new training set. We use one that includes all of the special characters and ligatures we've encountered in our work with early modern documents. It's here: F+TraininigText.txt. But you may want to make your own.
- Put in the path to the text file.
- Click Create TIF/Box Pairs
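If you make your own training text, it can help to list the special characters it contains so you can compare them against the glyphs in your font. The Python sketch below is only a rough illustration: the `special_characters` helper and the sample sentence are made up, and "special" here simply means any non-ASCII character. To check a real file, read it with `open(path, encoding="utf-8").read()` and pass the result in.

```python
from collections import Counter


def special_characters(text):
    """Count every non-ASCII, non-whitespace character in the text."""
    return Counter(ch for ch in text if ord(ch) > 127 and not ch.isspace())


# Made-up sample line containing long s and an fi ligature (written as escapes).
sample = "The \ufb01r\u017ft \u017fhip \u017failed pa\u017ft."

for ch, count in sorted(special_characters(sample).items()):
    print(f"U+{ord(ch):04X} {ch} x{count}")
```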
Create Tesseract Training
- Click the Train Tesseract button
- Pick the font (or fonts for a combo) from the list
- Add any dictionary or ambiguity files
- Click Make Library button
- Go get a coffee or see a baseball game