Fail to load "stas/c4-en-10k" dataset since 2.16 version #6908
Comments
I am not able to reproduce the error with datasets 2.19.1:
In [1]: from datasets import load_dataset; ds = load_dataset("stas/c4-en-10k", streaming=True); item = next(iter(ds["train"])); item
Out[1]: {'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.'}
In [2]: from datasets import load_dataset; ds = load_dataset("stas/c4-en-10k", download_mode="force_redownload"); ds
Downloading data: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13.3M/13.3M [00:00<00:00, 18.7MB/s]
Generating train split: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 78548.55 examples/s]
Out[2]:
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 10000
    })
})
Looking at your error traceback, I notice that the code line numbers do not correspond to those of datasets 2.19.1. Additionally, I can't reproduce the issue with:
In [1]: from huggingface_hub import HfFileSystem
In [2]: fs = HfFileSystem()
In [3]: with fs.open("datasets/stas/c4-en-10k/c4-en-10k.py", "rb") as f:
   ...:     data = f.read()
   ...:
In [4]: data[:20]
Out[4]: b'# coding=utf-8\n# Cop'
Could you please verify the versions you are using?
import datasets; print(datasets.__version__)
import huggingface_hub; print(huggingface_hub.__version__)
Thanks for your reply! After I updated datasets from 2.15.0 back to 2.19.1 again, everything seems to work well. Sorry for bothering you!
Describe the bug
After updating the datasets library to version 2.16+ (I tested 2.16, 2.19.0 and 2.19.1), I use the following code to load the stas/c4-en-10k dataset
and then it raises a UnicodeDecodeError like
I found that fs.open loads a gzip file and parses it as plain text using the UTF-8 codec.
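The failure mode described above can be illustrated without touching the Hub. The following is a minimal sketch, not the code path used by datasets or HfFileSystem: it only shows why gzip-compressed bytes fail to decode as UTF-8 and how a magic-byte check avoids that. The helper names `looks_like_gzip` and `read_text` are hypothetical and exist only for this illustration.

```python
import gzip

# Every gzip stream starts with these two bytes. 0x8b is not a valid
# first byte of a UTF-8 sequence, so decoding the raw stream fails.
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the payload starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def read_text(data: bytes) -> str:
    """Decode bytes as UTF-8, decompressing first if they are gzip data.

    Hypothetical helper for illustration only.
    """
    if looks_like_gzip(data):
        data = gzip.decompress(data)
    return data.decode("utf-8")


plain = b"# coding=utf-8\n# Copyright"
compressed = gzip.compress(plain)

# Decoding the compressed bytes directly reproduces the kind of error
# reported in this issue.
try:
    compressed.decode("utf-8")
    naive_decode_failed = False
except UnicodeDecodeError:
    naive_decode_failed = True

print("naive decode failed:", naive_decode_failed)
print(read_text(compressed))
```

If a file opened in binary mode starts with `b'\x1f\x8b'` rather than readable text, it is compressed, which is one way to check whether the bytes being parsed are what the loader expects.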
Steps to reproduce the bug
Use the datasets.load_dataset method to load the stas/c4-en-10k dataset.
Expected behavior
Load dataset normally.
Environment info
Platform = Linux-5.4.0-159-generic-x86_64-with-glibc2.35
Python = 3.10.14
Datasets = 2.19
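Since the problem in this thread came down to which library version was actually installed, it can help to report versions from the interpreter that runs the failing code. This is a small sketch using only the standard library; the function name `installed_versions` is hypothetical, and the package names are the two mentioned in this thread.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing.

    Hypothetical helper for illustration only.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Show which interpreter is in use, since a different environment may
# hold a different version than the one that was upgraded.
print("Python executable:", sys.executable)
for name, ver in installed_versions(["datasets", "huggingface_hub"]).items():
    print(f"{name} = {ver if ver is not None else 'not installed'}")
```

Running this in the same environment as the failing script shows whether an upgrade actually took effect there.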