Institutional Data Initiative at Harvard Law School Library

How Knowledge Institutions Can Build a Promethean Moment

Why we’re launching the Institutional Data Initiative to work with libraries, government agencies, and other knowledge institutions to develop data collections and best practices for artificial intelligence.

By Greg Leppert — December 12 2024, Cambridge MA.

Advances in AI are creating immense interest in high-quality data found only deep within archives. This new interest can help institutions make such data available to everyone. Today we’re launching the Institutional Data Initiative (IDI), a research initiative at the Harvard Law School Library. IDI is dedicated to supporting our peers as they steward humanity’s knowledge and seek to provide the broadest access to it in the age of AI, just as they have done for so many forms of media across centuries of technological revolutions.

IDI comprises a growing team of data scientists and community builders, first incubated at the Library Innovation Lab. We’ll collaborate with knowledge institutions—from libraries and universities to cultural groups and government agencies—to help structure, analyze, and publish their collections as data for all uses, including AI. We’ll work to develop AI-driven tools to scale and accelerate this work, evaluations to study its impacts, and best practices to foster responsible data use while affirming institutional stewardship.

Our initial activities include refining a collection of nearly one million public domain books, scanned at Harvard Library; a collaboration with Boston Public Library to make available millions of pages from hard-to-find historical newspapers; and a spring symposium hosted at Harvard Law School to build connections and explore areas of alignment between the institutional and AI communities.

Our goals

By developing broadly accessible and well-understood collections of data, we aim to align two interests: model-makers’ interest in leaving no data behind, and knowledge institutions’ interest in offering thorough and representative windows into their collections that anyone can peer through.

Institutions steward vast and unique collections of knowledge, but much of it is still waiting to be made accessible. With the emerging capabilities of AI and contributions from the AI community, we believe significant progress can be made in opening up this information, including for traditional patron access. This type of collaborative work also creates an opportunity for knowledge communities to make the most of their deep experience in areas where the AI community is sometimes newly arrived, from the political nature of information classification, to accounting for cultural context, to privacy and attribution frameworks.

For the AI community, including those working on open source models, there’s immense benefit to increasing access to these collections. Increased access can lower the barrier to entry for model creation, allowing more diverse groups to have a hand in building and tuning models. It can increase language and cultural representation, allowing models to serve a broader range of humanity. It can open the door to new capabilities, including scientific and medical discovery. And access to knowledge, coupled with ongoing stewardship, may hold a key to safe and transparent AI systems.

Our projects

At launch, we have data from nearly one million public domain books, scanned at Harvard Library as part of the Google Books project. Our structuring and analysis of the corpus are complete, and we’re working with Google to release this treasure trove far and wide.

We’re also collaborating with Boston Public Library as they scan millions of pages from public domain newspapers. The layouts of newspapers make extracting their text notoriously difficult, so we’re applying new methods to increase accuracy and accessibility. Once the text is extracted, we’ll research the impact this data has on the behavior and information recall of AI models so that other institutions can better understand the potential of their own collections.

These are collections of longform text, but we’re actively seeking collaborations across all forms of data, including scientific and biomedical data. We’re prioritizing open releases, but institutional missions are not homogeneous, nor are data types, cultures, or rights frameworks, and finding principled ways to navigate them is part of our mission.

Our community

This spring, we’re hosting a symposium at Harvard Law School to help bridge the institutional and AI communities. We’re inviting knowledge institutions, model-makers, AI researchers, and other academics to explore the potential of AI to advance the role of institutions as anchors of knowledge access and stewardship, now and in the future.

If you’re part of a knowledge institution, we’d like to hear how we can help in your mission. If you’re a model-maker or AI/ML researcher, we invite your contributions, including to our work currently in flight. And if you’re an academic, such as a digital humanist, we’d welcome the chance to integrate your work and perspective.

Reach out and join us.

Our support

IDI’s launch is generously supported by gifts from Microsoft and OpenAI, and for the longer term we’re assembling a mosaic of philanthropic and industry supporters as diverse as the knowledge we seek to advance. As AI grows, it should sustainably empower our longstanding societal knowledge institutions to thrive along with it.

We formed the Institutional Data Initiative because the means by which we, as a society, understand what we know and how we know it are at a crossroads. Centuries ago, we built trusted institutions to begin stewarding humanity’s most important knowledge, much of it now represented as data. Today, as the world looks for ways to guide the path of AI toward human thriving, data is everything.

Greg Leppert
Executive Director, Institutional Data Initiative

“Libraries and other stewards of humanity's aggregated knowledge often think in terms of centuries — preserving and providing access to their treasures both for well-known uses and for aims completely unanticipated in ancient times or even recently. AI training falls into the latter category, including the large language models that draw upon nearly countless artifacts. With IDI, we aim to address newly-energized interest from those quarters in otherwise-obscure and sometimes-forgotten texts in ways that keep knowledge institutions', and society's, values front and center.”

Jonathan Zittrain — Faculty Director, Institutional Data Initiative

“Libraries are uniquely positioned to provide open and public access to the knowledge that powers scholarship and innovation. As stewards of the public domain and curators of diverse, trustworthy collections, we have the foundational materials needed to train inclusive AI systems. Through initiatives like IDI, we aim to partner in shaping the ethical use of those materials in emerging systems, to ensure they reflect the breadth and depth of human knowledge for the benefit of all.”

Martha Whitehead — Vice President for the Harvard Library and University Librarian, Harvard University.

“At the Boston Public Library, we’re seeing AI become embedded in the platforms we and our patrons use every day. The launch of IDI represents a critical opportunity for libraries to ensure that, as these AI technologies advance, development is grounded in the depth, diversity, and complexity of human knowledge and cultural heritage. We don't want to simply react to what could be a fundamental shift in how people interact with information while model builders and vendors make all of the hard decisions. Partnering with IDI gives us the chance to be active participants by contributing to an open data pipeline while exploring the potential and limitations of AI for furthering the discovery, use, and preservation of library collections.”

Jessica Chapel — Chief of Digital & Online Services, Boston Public Library

“Libraries have stewarded data for centuries and our community continues to look to new ways to support society's trust in knowledge sources. We are looking forward to participating in the Institutional Data Initiative as we develop our approaches to working with AI.”

Richard Ovenden — Bodley’s Librarian, The Bodleian Libraries at the University of Oxford

“Microsoft is proud to support the establishment of the Institutional Data Initiative, which will work to increase access to knowledge and high-quality data for all builders of AI. We are committed to enabling broad access to data and empowering a more inclusive AI ecosystem. Since 2020, we have worked to close the data divide, ensuring that every organization has access to data to innovate and achieve more, which is essential to growing a vibrant, competitive AI economy.”

Burton Davis – Vice President and Deputy General Counsel, Microsoft

“Academic institutions have long been key partners in artificial intelligence research and progress, and Harvard’s Institutional Data Initiative is a powerful example of this. The public domain plays a vital role in the spread of knowledge and creativity, and OpenAI is delighted to support this effort. We are inspired by Prof. Zittrain’s leadership throughout this important project and are eager to see its impact.”

Tom Rubin — Chief of Intellectual Property and Content, OpenAI

“Google is proud to support the Institutional Data Initiative’s work sharing humanity’s knowledge with the world — a goal that complements our mission of organizing the world’s information and making it universally accessible and useful. New technologies expand the incredible value of the public domain, including for artificial intelligence research and development. And private sector organizations can play a vital role alongside universities, libraries, archives, and museums in serving the public interest, much like the Google Books project has done for more than twenty years.”

Kent Walker — President of Global Affairs, Google & Alphabet