Searching OCR text

Nearly all 180,000 UAM Herbarium Vascular Plants (ALA) specimens have been imaged, but only about half have pre-existing, parsed data. OCR (Optical Character Recognition) processing is well underway, and the results are available as a single text string per image, containing all of the text the OCR program recognized in that image. (Experimentation with parsing this text into standard fields is also underway.) All specimens have at least a crude "folder name," recorded as a taxonomic name in their standard data.

You can now locate specimens (and their images) in Arctos by searching the raw OCR results, which is particularly useful for the half of the collection for which parsed data is unavailable. Many specimens have only taxonomic information, so combining OCR criteria with other criteria (such as geography or collectors) is likely to exclude possible matches. Additionally, the raw OCR text contains many uncorrected errors, so short queries are more likely to succeed. In other words, a taxonomic criterion plus a short OCR criterion is most likely to produce useful records.
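To illustrate why short queries fare better against uncorrected OCR text, here is a minimal Python sketch. It assumes a plain case-insensitive substring match, which may differ from how Arctos actually matches OCR text, and the OCR string is a made-up example, not a real record.

```python
# Hypothetical raw OCR string with typical recognition errors
# ("Fairbanks" and "College" misread). Not a real Arctos record.
ocr_text = "Hieracium aurantiacum L. ALASKA: Fairhanks, roadside nr. Colleqe Rd."


def ocr_matches(query: str, text: str) -> bool:
    """Case-insensitive substring match, a stand-in for an OCR text search."""
    return query.lower() in text.lower()


# A long query must survive every OCR error it spans; a short one only has
# to hit a single correctly recognized word.
long_query = "Fairbanks, roadside near College Road"
short_query = "roadside"

print(ocr_matches(long_query, ocr_text))   # False
print(ocr_matches(short_query, ocr_text))  # True
```

The long query fails because it spans two misread words and an abbreviation, while the single-word query hits text the OCR program happened to read correctly.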

From the Arctos SpecimenSearch page (http://arctos.database.museum/SpecimenSearch.cfm), click "Show More Options" in the Biological Individual pane.


Enter your search criteria in the OCR Text box, and click Search.

All matching specimens will be returned. Click the catalog number to go to Specimen Detail, where you may view the raw OCR text.

Data Loans

[ moved to https://arctosdb.wordpress.com/documentation/loans/#dataloan ]
Data loans document data usage, and are generally used when a project downloads data from Arctos without examining specimens. Data loans form a special relationship between a loan and a cataloged item, rather than a loan and a specimen part. Data loans are not meant as a replacement for "digital" loans, in which a specimen part is imaged (or otherwise digitized), as "digital" loans concern physical objects and handling specimens. Subsequent usage of digital media (including that generated in "digital" loans) may best be recorded as data loans. Curators may wish to create a new loan number series for data loans, although this is not required.

This entry documents creation of a data loan for illustrative purposes.

  1. Found publication vaguely citing Arctos
  2. Created publication agents in Arctos
  3. Since the available PDF was a reprint, used the DOI to look up original publication information (http://www.google.com/search?q=DOI%3A+10.1111%2Fj.1472-4642.2008.00547.x)
  4. Created Publication in Arctos
  5. Added Media to the publication
  6. Created Arctos loan of type "data"
  7. Downloaded data loan template
  8. Searched Arctos for scientific names cited in publication
  9. Downloaded results, copied catalog numbers to data loan template
  10. Filled in the remaining values in the data loan template by copying and pasting into all cells, then saved as CSV
  11. Uploaded to data loan loader, clicked OK a couple times
  12. Created project; added the loan, the publication, and the media created for the publication
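
Steps 9 and 10 above (copying catalog numbers into the template and repeating the remaining values in every row) could also be scripted rather than done by hand in a spreadsheet. The following is only a sketch: the column names and example values are hypothetical placeholders, and the real headers must be taken from the downloaded data loan template.

```python
import csv
import io

# Hypothetical column names; the real data loan template defines its own headers.
TEMPLATE_COLUMNS = ["guid_prefix", "cat_num", "loan_number", "item_remarks"]


def build_loan_rows(catalog_numbers, shared_values):
    """Return one row per catalog number, repeating the shared values in each."""
    rows = []
    for cat_num in catalog_numbers:
        row = dict(shared_values)   # copy so rows do not share one dict
        row["cat_num"] = cat_num
        rows.append(row)
    return rows


def write_loan_csv(rows, handle):
    """Write the rows as CSV with a header line, in template column order."""
    writer = csv.DictWriter(handle, fieldnames=TEMPLATE_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder values for illustration only.
shared = {
    "guid_prefix": "EXAMPLE:Herb",
    "loan_number": "EXAMPLE-D-001",
    "item_remarks": "data loan",
}
rows = build_loan_rows(["101", "102", "103"], shared)

buffer = io.StringIO()
write_loan_csv(rows, buffer)
print(buffer.getvalue())
```

To produce a file for upload, pass a handle from `open("data_loan.csv", "w", newline="")` instead of the in-memory buffer; the file name here is likewise just an example.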

Total time: ~10 minutes, mostly spent researching and creating Agents.

Result: http://arctos.database.museum/project/different-climatic-envelopes-among-invasive-populations-may-lead-to-underestimations-of-current-and-future-biological-invasions

Even though there was no formal loan request and no physical specimen usage, the collections used receive quantifiable credit for the specimen data used. Hieracium specimens added to Arctos in the future will not be included in this loan, so it will be possible to quickly identify specimens that could not have been used, even though the lack of citations in the paper makes it impossible to determine which specimens were actually used. Additionally, if current Hieracium specimens are later determined to be some other species, those data will remain part of the loan, perhaps explaining as-yet-undetected anomalies in the publication.