blockScraper implementation #88

Open
lurenss opened this issue Apr 27, 2024 · 3 comments
@lurenss
Collaborator

lurenss commented Apr 27, 2024

Is your feature request related to a problem? Please describe.
A scraper pipeline capable of retrieving all the similar blocks on a page, for example on e-commerce, weather, or flight-booking websites.

Describe the solution you'd like
I found this paper: https://www.researchgate.net/publication/261360247_A_Web_Page_Segmentation_Approach_Using_Visual_Semantics
It deals specifically with this issue.

Describe alternatives you've considered
nope

Additional context
[Image: Screenshot 2024-04-27 at 15 04 05]

lurenss self-assigned this on Apr 27, 2024
@epage480
Contributor

epage480 commented May 6, 2024

Neat idea, but would it be simpler to just group web elements with the same CSS tags? A computer vision approach seems a bit over-engineered.
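
For reference, a minimal sketch of that simpler grouping, assuming "same CSS tags" means the same tag name plus the same set of class names. `ClassSignatureGrouper` and `repeated_signatures` are hypothetical names, not part of this project; only the standard-library `html.parser` is used.

```python
from collections import defaultdict
from html.parser import HTMLParser


class ClassSignatureGrouper(HTMLParser):
    """Counts start tags by (tag name, sorted class names). Hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.groups = defaultdict(int)

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = dict(attrs).get("class") or ""
        # Sort the class names so that "product card" and "card product" match.
        signature = (tag, tuple(sorted(classes.split())))
        self.groups[signature] += 1


def repeated_signatures(html, min_count=2):
    """Return the signatures that occur at least `min_count` times."""
    parser = ClassSignatureGrouper()
    parser.feed(html)
    parser.close()
    return {sig: n for sig, n in parser.groups.items() if n >= min_count}


if __name__ == "__main__":
    page = (
        '<h1 class="title">Shop</h1>'
        '<div class="product card">A</div>'
        '<div class="card product">B</div>'
        '<div class="card product">C</div>'
    )
    print(repeated_signatures(page))
```

This only works when the site uses consistent class names for repeated items; pages with generated or missing classes would probably need a structural comparison instead.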

@lurenss
Collaborator Author

lurenss commented May 7, 2024

@epage480 This isn't a CV approach; it groups similar objects from the HTML. If you want to help us implement this paper, let me know. Here is the reference: A Web Page Segmentation Approach Using Visual Semantics.
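
To make "grouping similar objects from HTML" a bit more concrete, here is a rough sketch that groups sibling elements whose subtrees have the same tag structure. This is not the algorithm from the paper, only an illustration of the general idea; `Node`, `TreeBuilder`, `signature`, and `repeated_sibling_blocks` are hypothetical names, and the exact-match comparison is an assumption that real pages would likely need to relax.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have children or an end tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a bare element tree (tags only, no text or attributes)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def signature(node):
    """Serialize the tag structure of a subtree, e.g. 'li(a(),span())'."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def repeated_sibling_blocks(html, min_count=2):
    """Return (parent tag, shape, count) for siblings sharing a non-trivial shape."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    found = []
    stack = [builder.root]
    while stack:
        parent = stack.pop()
        by_shape = defaultdict(list)
        for child in parent.children:
            by_shape[signature(child)].append(child)
        for shape, nodes in by_shape.items():
            # Skip childless elements so repeated <br> or <p> do not count as blocks.
            if len(nodes) >= min_count and nodes[0].children:
                found.append((parent.tag, shape, len(nodes)))
        stack.extend(parent.children)
    return found


if __name__ == "__main__":
    page = ("<ul><li><a>A</a><span>1</span></li>"
            "<li><a>B</a><span>2</span></li>"
            "<li><a>C</a><span>3</span></li></ul><p>footer</p>")
    print(repeated_sibling_blocks(page))
```

Unlike grouping by class names, this compares structure only, so it can still find repeated product or forecast cards when classes are missing or auto-generated.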

@DiTo97
Contributor

DiTo97 commented May 13, 2024

@lurenss, I would much rather focus on this paper: An Empirical Comparison of Web Page Segmentation Algorithms (2021).

It is a much more recent and detailed comparison of all the major web page segmentation algorithms.

tl;dr: Microsoft's VIPS algorithm is still the best one out there; you can find implementations in Java, JS, or Python.
