Lately, I've been doing a lot of reading about the origins of the "public museum": an institution open to visitation by a general audience and sensitive to the societal needs of recreation, education, inspiration, and relevance. One of the largest categories of such museums, early on, was akin to a natural history museum: full of biological specimens, mineral samples, and other evidence of the marvels of the natural world. A major player within this particular subset of museums is the Smithsonian's National Museum of Natural History. In fact, as of 2015, this institution was the third most-visited museum in the world. It houses five million plant specimens in its Herbarium, and has recently launched a herculean project to digitize them all, along with their documentation, to be put online in a publicly searchable database.
The result is a massive, mostly untapped dataset. Convinced that this collection could reveal big, important things when analyzed in the aggregate (as "big data"), the Smithsonian engaged data scientists to employ deep learning techniques on the digitized collection that would enable automation of sorting tasks. The published findings suggest that computers are well-equipped to handle the sorts of tedious, time-consuming tasks that have normally been performed by human beings: in this case, identifying specimens that contain mercury stains, and distinguishing between two physically similar yet distinct plants.
"The just-published findings are a striking proof of concept. Generated by a team of nine headed up by research botanist Eric Schuettpelz and data scientists Paul Frandsen and Rebecca Dikow, the study aims to answer two large-scale questions about machine learning and the herbarium. The first is how effective a trained neural network can be at sorting mercury-stained specimens from unsullied ones. The second, the highlight of the paper, is how effective such a network can be at differentiating members of two superficially similar families of plants namely, the fern ally families Lycopodiaceae and Selaginellaceae."
The potential for this type of automated sorting seems pretty far-reaching. Archives, like the Herbarium's collections, often require a great deal of physical processing and organization (by archives staff) and eventual comparison and sorting (by researchers). It is intriguing to speculate about how the deep learning techniques employed at the Smithsonian might translate over from a curatorial to an archival context. Could they be used to establish document authenticity or provenance? Might they make it easier to sort documents into artificial collections of datasets that reveal connections and new insights, without interrupting the physical and intellectual organization of archival collections?
1. Hetter, Katia. "And the World's Top Museum Is..." CNN.com, 16 June 2016. http://www.cnn.com/travel/article/world-top-10-museums-2016/index.html
2. Smith, Ryan P. "How Artificial Intelligence Could Revolutionize Archival Museum Research." Smithsonian.com, 3 November 2017. https://www.smithsonianmag.com/smithsonian-institution/how-artificial-intelligence-could-revolutionize-museum-research-180967065/