
Fail to load "stas/c4-en-10k" dataset since 2.16 version #6908

Closed
guch8017 opened this issue May 20, 2024 · 2 comments

guch8017 commented May 20, 2024

Describe the bug

After updating the datasets library to version 2.16+ (I tested it on 2.16, 2.19.0 and 2.19.1), loading the stas/c4-en-10k dataset with the following code

from datasets import load_dataset
dataset = load_dataset('stas/c4-en-10k')

raises a UnicodeDecodeError:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/*/conda3/envs/watermark/lib/python3.10/site-packages/datasets/load.py", line 2523, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/*/conda3/envs/watermark/lib/python3.10/site-packages/datasets/load.py", line 2195, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/*/conda3/envs/watermark/lib/python3.10/site-packages/datasets/load.py", line 1846, in dataset_module_factory
    raise e1 from None
  File "/home/*/conda3/envs/watermark/lib/python3.10/site-packages/datasets/load.py", line 1798, in dataset_module_factory
    can_load_config_from_parquet_export = "DEFAULT_CONFIG_NAME" not in f.read()
  File "/home/*/conda3/envs/watermark/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
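The failing byte 0x8b is the second byte of the gzip magic number b'\x1f\x8b', which is why the UTF-8 decoder gives up at position 1; a minimal demonstration:

```python
# Gzip streams start with the magic bytes 0x1f 0x8b. 0x1f is valid
# ASCII, but 0x8b is not a valid UTF-8 start byte, so decoding raw
# gzip data as text fails immediately at position 1.
gzip_magic = b"\x1f\x8b\x08\x00"

try:
    gzip_magic.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
```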

I found that fs.open returns the raw gzip bytes of the file, which are then decoded as if they were plain UTF-8 text:

from huggingface_hub import HfFileSystem

fs = HfFileSystem(endpoint='https://huggingface.co')
f = fs.open("datasets/stas/c4-en-10k/c4-en-10k.py", "rb")
data = f.read()    # data is gzip bytes beginning with b'\x1f\x8b\x08\x00\x00\tn\x88\x00...'
data2 = unzip_gzip_bytes(data)    # data2 is what we want: '# coding=utf-8\n# Copyright 2020 The HuggingFace Datasets...'
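(unzip_gzip_bytes above is not a huggingface_hub or datasets function; a minimal stand-in using the standard library's gzip module might look like this:)

```python
import gzip

def unzip_gzip_bytes(data: bytes) -> str:
    """Decompress gzip bytes and decode the result as UTF-8 text."""
    return gzip.decompress(data).decode("utf-8")
```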

Steps to reproduce the bug

  1. Install a datasets version between 2.16 and 2.19.
  2. Use the datasets.load_dataset method to load the stas/c4-en-10k dataset.

Expected behavior

Load dataset normally.

Environment info

Platform = Linux-5.4.0-159-generic-x86_64-with-glibc2.35
Python = 3.10.14
Datasets = 2.19

@albertvillanova albertvillanova self-assigned this May 22, 2024
@albertvillanova (Member) commented:

I am not able to reproduce the error with datasets 2.19.1:

In [1]: from datasets import load_dataset; ds = load_dataset("stas/c4-en-10k", streaming=True); item = next(iter(ds["train"])); item
Out[1]: {'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.'}

In [2]: from datasets import load_dataset; ds = load_dataset("stas/c4-en-10k", download_mode="force_redownload"); ds
Downloading data: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13.3M/13.3M [00:00<00:00, 18.7MB/s]
Generating train split: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 78548.55 examples/s]
Out[2]: 
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 10000
    })
})

Looking at your error traceback, I notice that the code line numbers do not correspond to those of datasets 2.19.1.

Additionally, I can't reproduce the issue with HfFileSystem:

In [1]: from huggingface_hub import HfFileSystem

In [2]: fs = HfFileSystem()

In [3]: with fs.open("datasets/stas/c4-en-10k/c4-en-10k.py", "rb") as f:
   ...:     data = f.read()
   ...: 

In [4]: data[:20]
Out[4]: b'# coding=utf-8\n# Cop'

Could you please verify the datasets and huggingface_hub versions you are indeed using?

import datasets; print(datasets.__version__)

import huggingface_hub; print(huggingface_hub.__version__)

@guch8017 (Author) commented:

Thanks for your reply! After updating the datasets version from 2.15.0 to 2.19.1 again, everything seems to work well. Sorry for bothering you!
