Harvard Releases a Dataset of Nearly One Million Books, Comprising 394 Million Scanned Pages
- Harvard University has released a dataset called Institutional Books 1.0, containing more than 394 million scanned pages from library books.
- The dataset includes nearly one million books in 254 languages, dating back to the 15th century, with the largest concentration from the 19th century.
- The work is funded by Microsoft and OpenAI and is part of the Institutional Data Initiative’s efforts to help libraries and museums prepare their collections for AI use and public access.
- Librarians have been stewards of data and information for generations, and the initiative is collaborating with libraries and museums worldwide.
- The dataset was shared on Hugging Face, a platform that hosts open-source AI models and datasets that anyone can download.
IBL News | New York
Harvard University has released a dataset of library books for researchers, named Institutional Books 1.0, which contains more than 394 million scanned pages, according to the AP.
These materials, preserved and organized by generations of librarians, comprise nearly one million books in 254 languages, dating back to the 15th century.
The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law, and agriculture.
Supported financially by Microsoft and OpenAI, the maker of ChatGPT, the Harvard-based Institutional Data Initiative is collaborating with libraries and museums worldwide on how to prepare their collections for AI use and for the public.
“Librarians have always been the stewards of data and the stewards of information,” said Aristana Scourtas, who manages research at Harvard Law School’s Library Innovation Lab.
The dataset was shared this month on the Hugging Face platform, which hosts open-source AI models and datasets that anyone can download.