Harvard Releases a Dataset of Nearly One Million Books, with Over 394 Million Records

Harvard University has released a dataset called Institutional Books 1.0, containing over 394 million records of library books.
The dataset includes nearly one million books in 254 languages, dating back to the 15th century, with the largest concentration from the 19th century.
The release comes from the Harvard-based Institutional Data Initiative, which is supported financially by Microsoft and OpenAI and works with libraries and museums on preparing their collections for AI use in ways that benefit the public.
Librarians have been stewards of data and information for generations, and this release aims to facilitate collaboration between libraries, museums, and researchers worldwide.
The dataset was shared on the Hugging Face platform, which hosts open-source AI models that can be downloaded by anyone.

IBL News | New York

Harvard University has released Institutional Books 1.0, a dataset of library books for researchers that contains over 394 million records, according to the AP.

These materials, preserved and organized by generations of librarians, comprise nearly one million books in 254 languages, dating back to the 15th century.

The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law, and agriculture.

Supported financially by Microsoft and OpenAI, the maker of ChatGPT, the Harvard-based Institutional Data Initiative is collaborating with libraries and museums worldwide on how to prepare their collections for AI use in ways that benefit the public.

“Librarians have always been the stewards of data and the stewards of information,” said Aristana Scourtas, who manages research at Harvard Law School’s Library Innovation Lab.

The dataset was shared this month on the Hugging Face platform, which hosts open-source AI models and datasets that anyone can download.

Q. What is the name of the dataset released by Harvard University?

A. The dataset is named Institutional Books 1.0.

Q. How many records does the dataset contain?

A. The dataset contains over 394 million records.

Q. How many books and languages does the dataset cover?

A. The dataset comprises nearly one million books in 254 languages.

Q. How far back do the books in the dataset date?

A. The books in the dataset date back to the 15th century.

Q. Which period has the largest concentration of works in the dataset?

A. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law, and agriculture.

Q. Who are the financial supporters of the Institutional Data Initiative?

A. Microsoft and OpenAI are supporting the Institutional Data Initiative financially.

Q. What is the purpose of the Harvard-based Institutional Data Initiative?

A. The initiative is collaborating with libraries and museums worldwide on how to prepare their collections for AI use in ways that benefit the public.

Q. Who manages research at Harvard Law School’s Library Innovation Lab?

A. Aristana Scourtas manages research at Harvard Law School’s Library Innovation Lab.

Q. Where was the dataset shared this month?

A. The dataset was shared on the Hugging Face platform.

Q. What does the Hugging Face platform host?

A. The Hugging Face platform hosts open-source AI models that anyone can download.