Harvard Open-Sources AI Training Dataset "Institutional Books 1.0", Covering 983,000 Books from Its Collections

By 漾仔, 2025-06-16 10:27

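The Harvard Law School Library has open-sourced its first AI training dataset, "Institutional Books 1.0", which covers 983,000 books and 242 billion tokens across 245 languages. About 40% of the books are in English and 60% are in other languages, and most were published in the 19th and 20th centuries. Digitized content from millions of historical newspapers is to be added in the future. #AI training# #Open-source data#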

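With support from Microsoft and OpenAI, the Harvard Law School Library last week officially open-sourced "Institutional Books 1.0", its first open dataset for AI training. The dataset reportedly includes 983,000 books from Harvard's collections, spans 245 languages, and contains 242 billion tokens in total. AI在线 provides the project address here: https://huggingface.co/datasets/institutional/institutional-books-1.0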

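According to the library, 40% of the books in the dataset are in English, and most were published in the 19th and 20th centuries. The books are classified into 20 topics. The dataset also provides complete metadata for each book, including author, year of publication, language, and original source.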

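The Harvard Law School Library said that researchers will continue to expand the dataset. The project team is already working with the Boston Public Library to add "millions" of historical newspapers to the dataset in digitized form.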

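Going forward, the Harvard Law School Library also plans to develop a series of AI tools to make organizing and opening up its collections more efficient, and to promote "responsible data use practices".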
