
🇪🇸 Spanish Public Domain Books 🇪🇸

Spanish-Public Domain-Books or Spanish-PD-Books is a large collection aiming to aggregate all Spanish monographs in the public domain. As of March 2024, together with Spanish-PD-Newspapers, it is the largest open Spanish corpus.

Dataset summary

The collection contains 302,640 individual texts totaling 13.9 billion words, recovered from multiple sources including Spain's leading cultural heritage institution, the Biblioteca Digital Hispánica (BDH), and the Internet Archive. Each Parquet file contains the full text of 2,000 books selected at random.
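As a rough sanity check on these figures, the short sketch below derives the approximate number of Parquet files and the mean length of a text. It assumes every file holds exactly 2,000 books except possibly the last one, which the card does not state explicitly; the variable names are illustrative only.

```python
import math

# Figures quoted in the dataset summary.
TOTAL_TEXTS = 302_640
TOTAL_WORDS = 13_900_000_000
BOOKS_PER_FILE = 2_000

# Assumption: every Parquet file is full except possibly the last one.
estimated_files = math.ceil(TOTAL_TEXTS / BOOKS_PER_FILE)

# Mean length of a text, in words.
mean_words_per_text = TOTAL_WORDS / TOTAL_TEXTS

print(f"estimated Parquet files: {estimated_files}")
print(f"mean words per text: {mean_words_per_text:,.0f}")
```

Under that assumption the collection spans about 152 files, with an average of roughly 46,000 words per text.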

Curation method

The composition of the dataset adheres to the criteria for public domain works in the EU and, consequently, in all Berne Convention countries for EU authors: any publication whose author has been dead for more than 70 years. Additionally, the initial consolidation of public domain status for cultural heritage operates in the EU under the 2019 Copyright Directive (art. 14).
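The 70-year criterion can be illustrated with a minimal sketch. It is not the curation pipeline actually used for this collection: the function names are hypothetical, and the code assumes catalogue-style creator strings of the form "Surname, Name , 1849-1908 //", with "//" separating several creators. A record is flagged only when every listed creator has a readable death year.

```python
import re
from typing import List, Optional

# Assumption: life dates appear as "YYYY-YYYY" inside each creator entry.
LIFESPAN = re.compile(r"(\d{4})\s*-\s*(\d{4})")


def death_years(creator: str) -> List[Optional[int]]:
    """Return one death year (or None if unreadable) per '//'-separated entry."""
    years: List[Optional[int]] = []
    for entry in creator.split("//"):
        if not entry.strip():
            continue  # skip the empty piece after a trailing "//"
        match = LIFESPAN.search(entry)
        years.append(int(match.group(2)) if match else None)
    return years


def is_public_domain(creator: str, reference_year: int, term: int = 70) -> Optional[bool]:
    """True if all creators died more than `term` years before `reference_year`.

    Returns None when any death year is missing, so the record can be
    reviewed by hand instead of being silently accepted or rejected.
    """
    years = death_years(creator)
    if not years or any(year is None for year in years):
        return None
    return all(reference_year - year > term for year in years)


print(is_public_domain("Gil y Robles, Enrique , 1849-1908 //", 2024))
```

Returning `None` for undated or partially dated creators is a deliberate choice in this sketch: anonymous works and entries with only a birth year need a different rule than the simple death-year test.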

Uses

The collection aims to expand the availability of open works for the training of Large Language Models. The text can be used for model training and republished without restriction for reproducibility purposes.

The rationale for creating this collection is manifold:

  • Scientific: We observe that the closure of training corpora represents a major barrier to AI research. Large language models face a real crisis of reproducibility.
  • Legal: With the adoption of the AI Act with its obligations in terms of copyright law compliance for the pretraining corpora, the European AI ecosystem will have to change its provenance practices.
  • Cultural: The linguistic diversity of the European Union is currently underrepresented. Unlike web archives, open, heritage, administrative, or scientific texts are often of high quality: they are long, multilingual, and editorialized publications.
  • Economic: Today, value capture is concentrated on players whose financial resources are already considerable, allowing them to collect or purchase data at a high price. Making a royalty-free corpus available to as many people as possible frees innovation in uses and minimizes economic dependencies on dominant actors.

License

The entire collection is in the public domain in all regions. This means that the patrimonial rights of each individual or collective right holder have expired.

There has been a debate for years in Europe over the definition of public domain and the possibility to restrict its use. Since 2019, the EU Copyright Directive states that "Member States shall provide that, when the term of protection of a work of visual art has expired, any material resulting from an act of reproduction of that work is not subject to copyright or related rights, unless the material resulting from that act of reproduction is original in the sense that it is the author's own intellectual creation." (art. 14)

Future work

This dataset is not a one-time work but will continue to evolve significantly in three directions:

  • Expansion of the dataset to the late 19th and early 20th century works and its further enhancement with currently unexploited collections coming from European patrimonial data repositories.
  • Correction of computer-generated errors in the text. All the texts have been transcribed automatically using Optical Character Recognition (OCR) software. The original files have been digitized over a long time period (since the mid-2000s) and some documents should be reprocessed. Future versions will strive either to re-OCRize the original text or to use experimental LLMs for partial OCR correction.
  • Enhancement of the structure/editorial presentation of the original text. Some parts of the original documents are likely unwanted for large scale analysis or model training (header, page count…). Additionally, some advanced document structures like tables or multi-column layout are unlikely to be well-formatted.
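As an illustration of the kind of cleanup the last point describes, the sketch below drops lines that recur many times in a document (running headers, footers, or digitization watermarks) as well as lines consisting only of a short number (likely page counts). It is a heuristic under stated assumptions, not the method planned for future versions: the function name and the `min_repeats` threshold are hypothetical, and a line such as a standalone year would also be removed.

```python
import re
from collections import Counter

# Assumption: a line holding only 1 to 4 digits is a page number.
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")


def strip_page_furniture(text: str, min_repeats: int = 3) -> str:
    """Remove frequently repeated lines and bare page numbers from `text`."""
    lines = text.split("\n")
    # Count how often each non-empty line occurs, ignoring surrounding spaces.
    counts = Counter(line.strip() for line in lines if line.strip())
    kept = []
    for line in lines:
        stripped = line.strip()
        if stripped and counts[stripped] >= min_repeats:
            continue  # repeated header, footer, or watermark
        if PAGE_NUMBER.match(line):
            continue  # bare page number
        kept.append(line)
    return "\n".join(kept)
```

The threshold trades recall for safety: a low `min_repeats` removes more page furniture but can also delete legitimately repeated short lines, for example refrains in verse.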

Acknowledgements

The corpus was stored and processed with the generous support of Scaleway. It was built up with the support and concerted efforts of the state start-up LANGU:IA (start-up d’Etat), supported by the French Ministry of Culture and DINUM, as part of the prefiguration of the service offering of the Alliance for Language technologies EDIC (ALT-EDIC).

Corpus collection has been largely facilitated thanks to the open science LLM community insights, cooperation and support (Occiglot, Eleuther AI, OpenLLM France, Allen AI).

Collection including PleIAs/Spanish-PD-Books