New Search Feature:
Optical Character Recognition (OCR)

With more than 92 million pages of digitized records available to search in the National Archives Catalog, we are always working on ways to improve search results to better help you find what you’re looking for.

That’s why we’re excited to share a new feature in the Catalog: Optical Character Recognition, or OCR. 

PV2 Ruben Gutierrez, a general mechanic assigned to the Headquarters and Headquarters Battery, 32nd Battalion, 31st Brigade, uses his binoculars on sentry duty to scan the horizon for approaching vehicles. The binocular lens reflects a fellow soldier's image at the unit's deployed camp. The soldiers are supporting the Air Defense Artillery's Patriot missile batteries during the world's largest joint service, multi-national tactical air operations exercise, 4/21/1997. National Archives Identifier 6504747
What is OCR? 
OCR converts images that contain typed, handwritten, or printed text into text that can be read and searched by a computer. 

Previously, records in the Catalog were only searchable based on the titles, descriptions, and other fields entered by archivists, or by tags and transcriptions entered by citizen archivists. Now, with OCR capability, text from some images in the Catalog can be extracted, making that text searchable and more likely to come up in your search results. 
 
Currently, the Catalog’s new OCR engine is applied to records in either JPG or PDF format added to the Catalog since June 2019. NARA is exploring how to retroactively process records from before that point, but right now this feature applies to millions of pages! 

Here’s what you can expect and how it works:

A search for “Melvin H. Coulston” returns this Bureau of Indian Affairs record with OCR data. The search term is bolded.  

SEARCH TIP - If you are searching for a name or phrase, surround it in quotation marks to do an exact phrase search.
Image of search results in Catalog
To explore the results further, click on the blue description title. On the description page that follows, you can see the pages where your search term is found. They are listed below the description title and to the left of the image viewer:
Image of Item description in Catalog displaying OCR search results
Page thumbnails that contain your search term may also be highlighted beneath the image viewer. Clicking any page in the list, or any highlighted thumbnail, will take you to that page.
Image of item description in Catalog showing highlighted thumbnail image
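For readers curious how an exact phrase search differs from a plain keyword search over OCR text, here is a minimal Python sketch. It is an illustration only, not how the Catalog's search is built: the function names and the sample page text are made up for this example, and it recognizes only straight double quotes.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def pages_matching(pages, query):
    """Return 1-based page numbers whose text matches the query.

    A query wrapped in double quotes must appear as a consecutive run of
    words; otherwise every word only has to appear somewhere on the page.
    """
    query = query.strip()
    exact = len(query) >= 2 and query.startswith('"') and query.endswith('"')
    terms = tokenize(query)
    if not terms:
        return []
    hits = []
    for number, text in enumerate(pages, start=1):
        words = tokenize(text)
        if exact:
            n = len(terms)
            found = any(words[i:i + n] == terms
                        for i in range(len(words) - n + 1))
        else:
            found = set(terms) <= set(words)
        if found:
            hits.append(number)
    return hits

# Hypothetical OCR output for a three-page record (invented sample text).
pages = [
    "Letter signed Melvin H. Coulston",
    "Coulston, H. Melvin and others are listed here",
    "No names appear on this page",
]

print(pages_matching(pages, '"Melvin H. Coulston"'))  # exact phrase search
print(pages_matching(pages, 'Melvin H. Coulston'))    # keyword search
```

With the sample pages above, the quoted query matches only the first page, while the unquoted query also matches the second page, where the same words appear in a different order.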
Try it out!
You can test the capability yourself by running one of the following searches and clicking the first result returned for each:
We still have work to do! Right now, we are investigating options to re-process items for OCR that were in the Catalog prior to June 2019. Additionally, records that are available only in PDF format do not currently support page jumping or highlighting.

OCR is not perfect! While this technology helps make records more searchable, human-entered transcriptions are still more accurate than OCR. We still need your help as citizen archivists to transcribe records in the Catalog and decipher that tricky handwriting!
In case you were interested...
Technical specs: NARA’s new OCR engine is powered by the open source Tesseract software. As records are added to NARA’s Amazon Web Services (AWS) S3 cloud storage, they are run through image processing powered by a series of AWS Lambda functions.
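For the technically curious, the general shape of such a pipeline can be sketched in Python. This is a rough illustration under stated assumptions, not NARA's actual code: the function names, the extension list (including .jpeg), and the run_ocr callable are all hypothetical. The only thing taken as given is the standard layout of an S3 event notification, in which object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus

# JPG and PDF are the formats named above; also accepting ".jpeg" is an assumption.
OCR_EXTENSIONS = (".jpg", ".jpeg", ".pdf")

def objects_to_ocr(event):
    """Yield (bucket, key) for each S3 object in the event that is a JPG or PDF."""
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        # Keys in S3 event notifications are URL-encoded ("+" stands for a space).
        key = unquote_plus(s3.get("object", {}).get("key", ""))
        if bucket and key.lower().endswith(OCR_EXTENSIONS):
            yield bucket, key

def handler(event, context, run_ocr=None):
    """Lambda-style entry point.

    run_ocr is a hypothetical callable (bucket, key) -> text standing in for
    the real download-and-Tesseract step, which is not shown here.
    """
    results = []
    for bucket, key in objects_to_ocr(event):
        text = run_ocr(bucket, key) if run_ocr else None
        results.append({"bucket": bucket, "key": key, "text": text})
    return results

# Hypothetical event: one scanned page and one file that should be skipped.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "scans/page+1.JPG"}}},
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "notes.txt"}}},
    ]
}

print(handler(sample_event, None, run_ocr=lambda bucket, key: "extracted text"))
```

Passing the OCR step in as a callable keeps the sketch runnable without AWS credentials or a Tesseract install; a real function would fetch the object from S3 and hand the image to Tesseract at that point.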

News from the Innovation Hub:

500,000 pages scanned by Citizen Archivists!

Citizen Scanners in the National Archives Innovation Hub have officially scanned 500,000 pages of historical records! 

After scanning, each digitized page is added to the National Archives Catalog, where anyone can view and download it. Currently, nearly 490,000 pages of records scanned in the Hub are available in the Catalog. You can view all records scanned by citizen contributors in the Catalog so far.

Here is the 500,000th page scanned in the Innovation Hub:
This is a form that Antonio Dardell filled out to apply for an increase in his Civil War veteran's pension. Dardell was one of the small number of Chinese American soldiers who fought in the American Civil War.
 
Dardell had likely been adopted as a young boy by an American whaling ship captain and brought to Connecticut. He enlisted in the 27th Connecticut Infantry Regiment at age 19. After the war he worked as a tinsmith, and in 1880 he became a naturalized citizen, showing his honorable discharge certificate in lieu of a Declaration of Intent. More information can be found in this "The Blue, the Gray, and the Chinese" blog post.
 
This pension file was scanned as part of a two-week project by two citizen scanners who were interested in digitizing records relating to Asian American history. In addition to Antonio Dardell's pension, the citizen scanners also worked on his Compiled Military Service Record, and the CMSRs and pensions of other Chinese American soldiers.
 
Many thanks to everyone who's worked in or adjacent to the Hub now or in the past! Here's to the next 500,000 pages.

Now that these pages are scanned and added to our Catalog, help transcribe them to make them more searchable! Our latest citizen archivist mission features records scanned by citizen contributors in our Innovation Hub.
Get started transcribing!
Interested in Citizen Scanning? Learn more about the National Archives Innovation Hub in Washington, DC.
Questions or comments? Email us at catalog@nara.gov.