AI System Uses a Huge Database of 10 Million Biological Images

Scientists have assembled the most comprehensive collection of images of life forms ever prepared for machine learning and developed an advanced vision-based artificial intelligence system that learns from it.

The team’s recent paper outlines significant advances in how researchers can use artificial intelligence to analyze images of living organisms, including plants, animals, and fungi, to explore new scientific questions, explained Samuel Stevens, principal investigator and doctoral student in computer science and engineering at Ohio State University. “This ground-breaking tool facilitates research spanning a wide range of biological diversity,” Stevens said. “It allows for types of research that were previously unthinkable.”

Stevens and his team first assembled TreeOfLife-10M, the largest and most diverse collection of images processed for machine learning applications, boasting more than 10 million visual representations of life forms spanning more than 454,000 different groups in the biological hierarchy. This data set far exceeds the size of the previous largest machine learning-ready collection, which contained only 2.7 million images in 10,000 groups. Its unparalleled diversity is the cornerstone of the innovative algorithm.

They subsequently introduced BioCLIP, a machine-learning model that became available to the scientific community in December. The model learns from both the visual content of each image and the text that accompanies it, including taxonomic classifications and additional data. The researchers tested BioCLIP’s ability to correctly identify images across a large biological classification network, including several unusual species that it had not encountered during training. The results showed that it outperformed existing models by 17-20% on this task.
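As an illustration only: one common way a model that pairs images with text can label a picture is to place the image and each candidate name in the same vector space and pick the closest name. The Python sketch below assumes that approach, using cosine similarity over small hand-written vectors. The article does not describe BioCLIP's internals at this level, so the function names and all the numbers here are hypothetical stand-ins, not the team's code or data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings):
    """Score every candidate label against the image and return the best one.

    Assumption: image and text embeddings already live in a shared space,
    as they would after being produced by a jointly trained image/text model.
    """
    scores = {
        label: cosine(image_embedding, vector)
        for label, vector in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical toy embeddings; a real model would produce vectors with
# hundreds of dimensions from the image pixels and the label text.
label_embeddings = {
    "Canis lupus": [0.9, 0.1, 0.0],
    "Canis familiaris": [0.7, 0.6, 0.1],
    "Quercus alba": [0.0, 0.1, 0.95],
}
image_embedding = [0.85, 0.2, 0.05]

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
for label, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{label}: {score:.3f}")
```

Because the candidate names are supplied at query time, a scheme like this can in principle score a species name that never appeared in training, which is the kind of flexibility the researchers describe.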

The BioCLIP model is publicly available and can identify species from a single image, whether it’s an organism from the Serengeti, an animal at a nearby zoo, or local wildlife in your backyard. Traditional computational methods for sorting through large databases of biological images are typically tailored to specific tasks and adapt poorly to new queries or data sets; Stevens’ tool promises greater flexibility. In addition, the broad applicability of the team’s AI benefits biologists across many areas of study, rather than only those working in one specific field, he added.

The effectiveness of their technique, according to Yu Su, a co-author of the study and associate professor of computer science and engineering at Ohio State, lies in the model’s exceptional ability to pick up the small details that distinguish organisms that look alike, whether they belong to the same family or mimic each other. While standard vision models can compare familiar species such as dogs and wolves, they lack the refinement needed to detect subtle differences between closely related plant species, as previous studies have shown.

Due to its fine-grained understanding, the team’s model is particularly adept at making inferences about unidentified and rare species, Su emphasized. “BioCLIP surpasses the species and taxonomic diversity offered by currently available vision models,” he stated. “It can make rational connections between unfamiliar species based on visual similarities.”

As the field of artificial intelligence advances, the study suggests that machine learning models like BioCLIP may soon play a vital role in answering biological questions that have traditionally taken much longer to resolve. While this iteration of BioCLIP relied primarily on public science platforms for images and data, Stevens noted that future versions could be improved by including more comprehensive datasets from scientific institutions and museums. Such institutions can offer detailed descriptions and characteristics of species, providing rich data for further refinement of the AI model. In addition, science labs store fossil records of extinct species, which the team hopes will further enhance the value of the model.

“Because classifications are constantly changing with nomenclature updates and new species discoveries, our goal is to better integrate this evolving information,” Stevens noted. “In artificial intelligence, data augmentation usually produces excellent results, and we expect to develop a more powerful and advanced model through continuous learning.”
