AI在线

Harvard Open-Sources AI Training Dataset "Institutional Books 1.0," Covering 983,000 Books from Its Library Collection

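The Harvard Law School Library has open-sourced its first AI training dataset, "Institutional Books 1.0," which covers 983,000 books, 242 billion tokens, and 245 languages. About 40% of the books are in English and the remaining 60% are in other languages; most were published in the 19th and 20th centuries. Millions of digitized historical newspaper items are slated to be added in the future.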

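With support from Microsoft and OpenAI, the Harvard Law School Library last week officially open-sourced its first open dataset for AI training, "Institutional Books 1.0." The dataset reportedly contains 983,000 books from Harvard's library collection, spans 245 languages, and totals 242 billion tokens. The project is hosted on Hugging Face (https://huggingface.co/datasets/institutional/institutional-books-1.0).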
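To put the headline figures in perspective, the short Python sketch below derives a few back-of-the-envelope numbers from them. It uses only the totals reported in this article. The function name `summarize` and the result keys are illustrative, and applying the roughly 40% English share (reported later in the article) to the book count is an assumption, so the results are rough estimates rather than official statistics.

```python
# Headline figures as reported in the article.
TOTAL_BOOKS = 983_000
TOTAL_TOKENS = 242_000_000_000
ENGLISH_SHARE = 0.40  # assumption: the reported 40% share applies to the book count


def summarize(total_books: int, total_tokens: int, english_share: float) -> dict:
    """Derive rough estimates from the reported totals (illustrative helper)."""
    english_books = round(total_books * english_share)
    return {
        "avg_tokens_per_book": round(total_tokens / total_books),
        "english_books": english_books,
        "non_english_books": total_books - english_books,
    }


if __name__ == "__main__":
    for name, value in summarize(TOTAL_BOOKS, TOTAL_TOKENS, ENGLISH_SHARE).items():
        print(f"{name}: {value:,}")
```

Under these assumptions, the average works out to roughly 246,000 tokens per book, with about 393,000 English-language volumes and about 590,000 in other languages.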


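Reportedly, 40% of the books in the dataset are in English, and most were published in the 19th and 20th centuries. The books are classified into 20 topics. The dataset also provides complete metadata for each book, including author, year of publication, language, and original source.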

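The Harvard Law School Library says researchers will continue to expand the dataset. The project team is already working with the Boston Public Library to add "millions" of historical newspaper items to it in digitized form.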

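Going forward, the Harvard Law School Library also plans to develop a series of AI tools to make organizing and opening up its collections more efficient, and to promote "responsible data use norms."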
