Question

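I'm trying to extract only the text from a web page, but I'm running into some problems. Text that is not displayed on the page but is written in the source code also gets extracted, for example "include footer" and "sidebar.php end". Other content that I really don't want comes through as well. These are the links I'm using as test cases: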

1) http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html

2) http://www.tutorialspoint.com/cplusplus/index.htm

3) http://www.cplusplus.com/doc/tutorial/program_structure/

(This way I can make sure my code extracts text from any page.)

This is the code I'm having trouble with:

    import urllib
    from bs4 import BeautifulSoup

    url = "http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/"
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)

    for script in soup(["script", "style","a","p","li","<!-->","small","<div id=\"footer\">","<div id=\"footer\">","<div id=\"bottom\">"]):
        script.extract()

    text = soup.findAll(text=True)
    for p in text:
        print unicode(p)
        fo = open('file.txt', 'w')
        fo.seek(0, 2)
        fo.writelines( unicode(p) )
        fo.close()

In this code I used link 1. When I used "Inspect Element" on that page, I found a lot of comments in its source, and this code extracts them as well. Please help.

Answer 1

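One approach is to use a regular expression to remove or skip the comments: whenever your code encounters text that the regex matches as a comment, drop it.

Alternatively, you can use an HTML parser. Python has one built into its standard library.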
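As a rough illustration of the regex idea, here is a minimal sketch. It assumes Python 3 (the question uses Python 2), and the names `COMMENT_RE`, `strip_comments`, and the sample markup are made up for this example. A regex like this can misfire on unusual markup, for instance comment-like text inside a script, so treat it as a starting point rather than a robust solution.

```python
import re

# Matches an HTML comment; re.DOTALL lets "." span newlines so that
# multi-line comments are removed as well. The pattern is non-greedy,
# so each comment is matched separately.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html):
    """Return html with every <!-- ... --> block removed (hypothetical helper)."""
    return COMMENT_RE.sub("", html)


# Hypothetical sample input, modelled on the comments mentioned in the question.
sample = "<p>Hello</p><!-- include footer --><p>World</p><!-- sidebar.php\nend -->"
cleaned = strip_comments(sample)
print(cleaned)
```

The cleaned string can then be handed to whatever text extraction step follows, so the comments never reach it.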

https://docs.python.org/2/library/htmlparser.html
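A minimal sketch of the parser approach could look like the following. It assumes Python 3, where the module linked above is available as `html.parser`; the class name `TextExtractor`, the helper `extract_text`, and the sample markup are hypothetical. The parser reports comments through `handle_comment`, which does nothing by default, so they never show up in the collected text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; comments and script/style bodies are ignored."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments are routed to handle_comment (left at its default no-op),
        # so only real text nodes arrive here.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-empty text nodes of html as a list of strings."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical sample input, modelled on the comments mentioned in the question.
sample = (
    "<div><!-- include footer --><p>Hello</p>"
    "<script>var x = 1;</script><p>World</p></div><!-- sidebar.php end -->"
)
print(extract_text(sample))
```

To run this against a live page, the downloaded HTML would need to be decoded to a string before it is passed to `extract_text`.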

How to avoid comments when extracting text from a web page in Python
