HTML main textual content extraction

boilerpipe

boilerpipe provides algorithms to detect and remove the surplus “clutter” (boilerplate, templates) around the main textual content of a web page.

A demo and the source code are available.

For Python 3

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages.
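To give a feel for what "removing boilerplate" means in practice, here is a minimal sketch that uses only the Python standard library. It is not the algorithm used by boilerpipe or jusText; it is a simplified heuristic that assumes main content sits in blocks with many words and few link characters. The names `BlockCollector` and `extract_main_text`, and the thresholds `min_words` and `max_link_density`, are made up for this illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption of this sketch, not a standard list).
BLOCK_TAGS = {
    "p", "div", "li", "td", "tr", "table", "ul", "ol", "body",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "article", "section", "header", "footer", "nav",
}
# Tags whose text content is never shown to the reader.
SKIP_TAGS = {"script", "style"}


def _visible_len(text):
    """Count the non-whitespace characters in text."""
    return len("".join(text.split()))


class BlockCollector(HTMLParser):
    """Split a page into text blocks and record how much of each block is link text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []       # finished (text, link_chars) tuples
        self._parts = []       # text fragments of the block being built
        self._link_chars = 0   # non-whitespace characters seen inside <a> tags
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block and normalise its whitespace.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += _visible_len(data)

    def close(self):
        super().close()
        self._flush()  # keep any trailing text that was not closed by a block tag


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()

    kept = []
    for text, link_chars in parser.blocks:
        word_count = len(text.split())
        link_density = link_chars / max(1, _visible_len(text))
        if word_count >= min_words and link_density <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)
```

Calling `extract_main_text(html)` on a page string returns the surviving blocks joined by blank lines. Real pages are much messier than this heuristic allows for, so for actual use a dedicated tool such as the ones named above is likely the better choice.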

Some other alternatives are listed here.

More reading about HTML full text extraction