HTML main textual content extraction
26 Jul 2021
boilerpipe 1
boilerpipe provides algorithms to detect and remove the surplus “clutter” (boilerplate, templates) around the main textual content of a web page.
Demo and source code.
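To make the idea of "clutter" detection concrete, here is a minimal sketch in Python using only the standard library. It is not boilerpipe's actual algorithm: it is a simplified heuristic that splits a page into blocks and keeps the ones that are long enough and not dominated by link text. The names `BlockCollector` and `main_text`, the tag sets, and the thresholds `min_words` and `max_link_density` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags start or end a text block; the list is illustrative.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3",
              "nav", "header", "footer", "section", "article"}
SKIP_TAGS = {"script", "style"}  # their content is never visible text


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count the words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as (words, link_word_count)
        self._words = []       # words of the block being built
        self._link_words = 0   # how many of those words sit inside <a>
        self._in_link = 0
        self._skip = 0

    def flush_block(self):
        """Close the current block, if it holds any words."""
        if self._words:
            self.blocks.append((self._words, self._link_words))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush_block()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush_block()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not mostly link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush_block()
    kept = []
    for words, link_words in parser.blocks:
        if len(words) >= min_words and link_words / len(words) <= max_link_density:
            kept.append(" ".join(words))
    return "\n\n".join(kept)
```

Calling `main_text(html)` on a page string returns the surviving blocks separated by blank lines. Real pages need far more care (nested markup, short but meaningful paragraphs, context from neighbouring blocks), which is what dedicated tools handle.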
For python3 2
jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages.
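A short usage sketch, assuming the third-party `justext` package is installed (`pip install justext`). It relies on the package's `justext.justext()` and `justext.get_stoplist()` functions and on the `text` and `is_boilerplate` attributes of the returned paragraphs. The helper names `keep_main_text` and `extract_main_text` are made up for this example.

```python
def keep_main_text(paragraphs):
    """Join the text of every paragraph not flagged as boilerplate."""
    return "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)


def extract_main_text(html, language="English"):
    """Run jusText on an HTML string (or bytes) and return the main content."""
    import justext  # third-party dependency, imported lazily

    # jusText uses a per-language stoplist to classify paragraphs.
    paragraphs = justext.justext(html, justext.get_stoplist(language))
    return keep_main_text(paragraphs)
```

Pass the raw HTML of a downloaded page to `extract_main_text`. The stoplist language should match the page's language, since the classification depends on it.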
Some other alternatives are listed here.