file_id | ocr | title | date | author | page_count | word_count | character_count | complete_text |
---|---|---|---|---|---|---|---|---|
bpt6k55771383 | 100 | Des idées napoléoniennes | 1860 | Napoléon III (1808-1873 ; empereur des Français) | 676 | 43,267 | 266,537 | "\n \nDES IDEES \nNAPOLÉONIENNES. \n \nPARIS. TYPOGRAPHIE DE HENRI PLON, \nIMPRIMEUR DE L'EMPEREUR,(...TRUNCATED) |
bpt6k6421291j | 100 | Catalogue du Musée Rath à Genève | 1870 | None | 110 | 22,067 | 131,113 | "\nio \n \n \n \n \n \n \n \n \nCATALOGUE \nDU \nMUSÉE RATH A GENÈVE \nGENÈVE \nIMPRIMERIE JULES-(...TRUNCATED) |
bpt6k6138136q | 96 | Rêves et devoirs | 1873 | Froment, Théodore (1839-1901) | 676 | 27,722 | 159,481 | "\n \nTHÉODORE FVlOdMEVsÇT \nRÊVES ET DEVOIRS \nO mes amis, l'enfance aux riantes couleurs \nBonn(...TRUNCATED) |
bpt6k57852716 | 92 | "Adresse de l'assemblée provinciale de la partie du Nord de Saint-Domingue à l'Assemblée National(...TRUNCATED) | 1790 | None | 88 | 4,390 | 27,171 | "\n \nt >ï \nADRESSE \nDe VAssemblée provinciale de la 'partie du Nord de Saint Dominguë ? à (...TRUNCATED) |
bpt6k98000779 | 77 | "Premiere feuille du catalogue des livres qui sont à vendre chez Née de La Rochelle, libraire, rue(...TRUNCATED) | 1787 | None | 10 | 2,858 | 14,732 | "\nP RE M 1ER E -~ F EU IL LE \n•J? qui font à vendre cher. NÉss DE LA R O CH^^^^ùbmi re, rue d(...TRUNCATED) |
bpt6k5768417g | 100 | Encyclopédie de banque et de bourse. Tome 5 | 1929-1931 | None | 2,047 | 269,970 | 1,651,178 | "\nENCYCLOPÉDIE \nDE \nBANQUE \nET DE \nBOURSE \n \n \nENCYCLOPEDIE \nDE \nBANQUE \nET DE \nBOURSE (...TRUNCATED) |
bpt6k9767525w | 95 | Histoire de Lucie Wellers. Tome 2 | 1766 | None | 1,306 | 50,667 | 269,815 | "\n \n \n \n \nHISTOIRE \nDE \n[texte_manquant] \nÉCRITE PAR UNE DAME, TRADUCTION NOUVELLE DE L'ANG(...TRUNCATED) |
bpt6k6560984v | 100 | Mémoires de Saint-Hilaire. III. 1697-1704 | 1903-1916 | Saint Hilaire, Armand de (1651-1740) | 1,096 | 95,607 | 563,629 | "\n \n \n \n \n \n \n \n \n \n \nMÉMOIRES \nDE \nSAINTHILAIRE \"\n) \nPUBLIÉS \nP 0, U R LA SOCIÉ(...TRUNCATED) |
bpt6k11853236 | 89 | "Des Calculs salivaires du conduit de Warthon et des accidents qu'ils déterminent, par le Dr Le Roy(...TRUNCATED) | 1876 | Leroy de Langevinière, Dr | 226 | 17,068 | 103,087 | "\nmmmmrn \n \n \nf \n \n \nTjK \nCalvados ~ s > ! tr: <r .f a i ma \nDES CALCULS S4LWÀIRES \(...TRUNCATED) |
bpt6k6204339t | 100 | L'Abyssinie et sa grande mission | 1900 | None | 166 | 17,053 | 102,287 | "\n \n \n \n \ni L'AHYSSINIIi i. S, \n> I:T SA (iMAM)i; MISSION \nPAR \nUN CATHOLIQUE FRANÇAIS \(...TRUNCATED) |
🇫🇷 French Public Domain Books 🇫🇷
French-Public Domain-Book or French-PD-Books is a large collection aiming to aggregate all the French monographs in the public domain.
The collection was originally compiled by Pierre-Carl Langlais from a large corpus curated by Benoît de Courson and Benjamin Azoulay for Gallicagram, in cooperation with OpenLLMFrance. Gallicagram is a leading cultural analytics project that provides word and ngram search over very large cultural heritage datasets in French and other languages.
Content
As of January 2024, the collection contains 289,000 books (16,407,292,362 words) from the French National Library (Gallica). Each parquet file holds the full text of 2,000 books selected at random, along with a few core metadata fields (Gallica id, title, author, word counts…). The metadata can easily be expanded through the BNF API.
This initial aggregation was made possible by the open data program of the French National Library and by the consolidation of public domain status for cultural heritage works in the EU under the 2019 Copyright Directive (art. 14).
The composition of the dataset adheres to the criteria for public domain works in the EU and, consequently, in all Berne Convention countries for EU authors: any publication whose author has been dead for more than 70 years.
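As a minimal sketch of how the Gallica id could serve as an entry point for richer metadata: the `file_id` values appear to be Gallica ark names, so they can be turned into document URLs. The helper names below are hypothetical, and the `OAIRecord` service URL is an assumption about the Gallica API that should be checked against the BNF documentation before use.

```python
from urllib.parse import urlencode

# Public URL prefix for Gallica documents identified by an ark name.
GALLICA_ARK_PREFIX = "https://gallica.bnf.fr/ark:/12148/"
# Assumed metadata endpoint (not stated in this card); verify before relying on it.
GALLICA_OAI_RECORD = "https://gallica.bnf.fr/services/OAIRecord"


def gallica_url(file_id: str) -> str:
    """Build the document URL for a `file_id` taken from the dataset."""
    return GALLICA_ARK_PREFIX + file_id.strip()


def oai_record_url(file_id: str) -> str:
    """Build the (assumed) metadata record URL for the same `file_id`."""
    return GALLICA_OAI_RECORD + "?" + urlencode({"ark": file_id.strip()})


print(gallica_url("bpt6k55771383"))
print(oai_record_url("bpt6k55771383"))
```

The functions only build URLs and make no network calls, so fetching and parsing the records is left to whichever HTTP client the project already uses.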
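The 70-year criterion can be illustrated with a small check on the `author` field. This is only a sketch, not the procedure actually used to build the collection: it assumes that life dates appear as `(birth-death)` in the author string, as in the preview rows, and it approximates "dead for more than 70 years" on whole calendar years. The function names are hypothetical.

```python
import re
from typing import Optional

# Matches life dates such as "(1839-1901)" or "(1808-1873 ; empereur ...)".
LIFE_DATES = re.compile(r"\((\d{4})-(\d{4})")


def death_year(author: Optional[str]) -> Optional[int]:
    """Return the death year found in an author string, or None if absent."""
    if not author:
        return None
    match = LIFE_DATES.search(author)
    return int(match.group(2)) if match else None


def author_in_public_domain(author: Optional[str], current_year: int) -> Optional[bool]:
    """True if more than 70 calendar years have passed since the author's death.

    Returns None when no death year can be read from the string.
    """
    year = death_year(author)
    if year is None:
        return None
    return current_year > year + 70


print(author_in_public_domain("Froment, Théodore (1839-1901)", 2024))
print(author_in_public_domain("Leroy de Langevinière, Dr", 2024))
```

Returning `None` rather than `False` for authors without dates keeps "unknown" separate from "still protected", which matters because many rows have no author at all.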
Uses
The primary use of the collection is large-scale cultural analytics. It is already used by the Gallicagram project, an open and significantly enhanced version of the ngram viewer.
The collection also aims to expand the availability of open works for the training of Large Language Models. The text can be used for model training and republished without restriction for reproducibility purposes.
License
The entire collection is in the public domain everywhere. This means that the patrimonial rights of every individual or collective rightsholder have expired.
The French National Library claims additional rights in its terms of use and restricts commercial use: "La réutilisation commerciale de ces contenus est payante et fait l'objet d'une licence. Est entendue par réutilisation commerciale la revente de contenus sous forme de produits élaborés ou de fourniture de service ou toute autre réutilisation des contenus générant directement des revenus."
There has been a debate for years in Europe over the definition of the public domain and the possibility of restricting its use. Since 2019, the EU Copyright Directive states that "Member States shall provide that, when the term of protection of a work of visual art has expired, any material resulting from an act of reproduction of that work is not subject to copyright or related rights, unless the material resulting from that act of reproduction is original in the sense that it is the author's own intellectual creation." (art. 14)
Future developments
This dataset is not a one-time effort; it will continue to evolve significantly in three directions:
- Correction of computer generated errors in the text. All the texts have been transcribed automatically through the use of Optical Character Recognition (OCR) software. The original files have been digitized over a long time period (since the mid-2000s) and some documents should be. Future versions will strive either to re-OCRize the original text or use experimental LLM models for partial OCR correction.
- Enhancement of the structure/editorial presentation of the original text. Some parts of the original documents are likely unwanted for large-scale analysis or model training (header, page count…). Additionally, some advanced document structures like tables or multi-column layouts are unlikely to be well formatted. Major enhancements could be expected from applying new SOTA layout recognition models (like COLAF) to the original PDF files.
- Expansion of the collection to other cultural heritage holdings, especially coming from Hathi Trust, Internet Archive and Google Books.
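Until corrected transcriptions are available, one possible stopgap is to filter on the `ocr` column. The sketch below assumes that `ocr` is a recognition-quality score out of 100 stored as a string; the card does not define the column, so that reading is an assumption, and both the function name and the threshold of 90 are arbitrary illustrations.

```python
from typing import Iterable, Iterator, Mapping


def well_recognized(rows: Iterable[Mapping[str, str]], min_ocr: int = 90) -> Iterator[Mapping[str, str]]:
    """Yield only the rows whose `ocr` value parses as an integer >= min_ocr."""
    for row in rows:
        try:
            score = int(row["ocr"])
        except (KeyError, TypeError, ValueError):
            # Missing or non-numeric score: skip the row rather than guess.
            continue
        if score >= min_ocr:
            yield row


# Three rows taken from the preview table, reduced to the two relevant columns.
sample = [
    {"file_id": "bpt6k55771383", "ocr": "100"},
    {"file_id": "bpt6k98000779", "ocr": "77"},
    {"file_id": "bpt6k11853236", "ocr": "89"},
]
kept = [row["file_id"] for row in well_recognized(sample)]
print(kept)
```

Working on plain mappings keeps the filter independent of how the parquet files are loaded; the same function can be fed rows from any reader that exposes them as dictionaries.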
Acknowledgements
The corpus was stored and processed with the generous support of Scaleway. It was built up with the support and concerted efforts of the state start-up LANGU:IA (start-up d’Etat), supported by the French Ministry of Culture and DINUM, as part of the prefiguration of the service offering of the Alliance for Language technologies EDIC (ALT-EDIC).
Corpus collection was greatly facilitated by the insights and cooperation of the open science LLM community (Occiglot, Eleuther AI, Allen AI).