Handling Character Encoding Based on Meta Tags in ScrapeWebsite #2650

Open
1 task done
AiraNadih opened this issue May 17, 2024 · 0 comments

Description:

In rare cases, a website's response body is decoded with the wrong character encoding and comes out garbled. Browsers still render these pages correctly because they pick up the encoding from a meta tag in the content, such as:

<meta http-equiv='content-type' content='text/html; charset=gb2312' />

I would like to request that ScrapeWebsite handle the charset declared in such a meta tag. This logic should run before the following code:

htmlDocumentReader, err := charset.NewReader(
    responseHandler.Body(config.Opts.HTTPClientMaxBodySize()),
    responseHandler.ContentType(),
)

This will ensure that the correct character encoding is applied, similar to how browsers handle it.
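
As a rough illustration of the request, a minimal sketch of the meta-tag lookup might look like the following. The helper names `sniffMetaCharset` and `contentTypeWithMetaCharset` are hypothetical and not part of the existing code. The 1024-byte scan window and the regular expression are assumptions, and a real implementation would more likely use a proper HTML prescan than a regex.

```go
package main

import (
	"fmt"
	"mime"
	"regexp"
	"strings"
)

// metaCharsetRe matches a charset declaration inside a <meta> tag. It covers
// both <meta charset="x"> and
// <meta http-equiv="content-type" content="text/html; charset=x">.
var metaCharsetRe = regexp.MustCompile(`(?i)<meta[^>]*?charset\s*=\s*["']?\s*([a-z0-9_:.-]+)`)

// sniffMetaCharset (hypothetical helper) returns the lowercased charset label
// declared in a <meta> tag near the start of body, or "" if none is found.
// The 1024-byte window is an assumption made for this sketch.
func sniffMetaCharset(body []byte) string {
	const scanLimit = 1024
	if len(body) > scanLimit {
		body = body[:scanLimit]
	}
	m := metaCharsetRe.FindSubmatch(body)
	if m == nil {
		return ""
	}
	return strings.ToLower(string(m[1]))
}

// contentTypeWithMetaCharset (hypothetical helper) rebuilds the Content-Type
// value so that it carries the charset found in the document, if any.
// When no <meta> charset is present, the header value is returned unchanged.
func contentTypeWithMetaCharset(headerContentType string, body []byte) string {
	label := sniffMetaCharset(body)
	if label == "" {
		return headerContentType
	}
	mediaType, _, err := mime.ParseMediaType(headerContentType)
	if err != nil {
		// Assumption: fall back to text/html when the header is missing or malformed.
		mediaType = "text/html"
	}
	return mime.FormatMediaType(mediaType, map[string]string{"charset": label})
}

func main() {
	body := []byte(`<html><head><meta http-equiv='content-type' content='text/html; charset=gb2312' /></head></html>`)
	fmt.Println(contentTypeWithMetaCharset("text/html; charset=utf-8", body))
}
```

The resulting value could then be passed as the second argument to `charset.NewReader` in place of `responseHandler.ContentType()`, provided the first bytes of the body are peeked without consuming the reader. Letting the meta tag override the header is a design choice made for this sketch only; the HTML standard's encoding sniffing gives the transport-layer charset higher precedence, so the maintainers may prefer a different order.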

Thank you for considering this request.
