lxml - Scraping text from messy PHP

Time: 2012-12-02 07:25:40

Tags: web-scraping lxml

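I am rewriting a poorly written website that was originally built in PHP.

I am trying to isolate the text inside the p tags and would like to know how to extract only the text portion. Any ideas?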

<p>
<span lang="EN-IE" xml:lang="EN-IE">

<br>
TEXT SAMPLE 1
<br>
<br>
TEXT SAMPLE 2

<span lang="EN-IE" xml:lang="EN-IE">TEXT SAMPLE 3
</span>,

<span lang="EN-IE" xml:lang="EN-IE">&nbsp;TEXT SAMPLE 4
</span>&nbsp;TEXT SAMPLE 5

<span lang="EN-IE" xml:lang="EN-IE">.&nbsp;</span>

</span><span lang="EN-IE" xml:lang="EN-IE">

<br>
<br>

TEXT SAMPLE 6
</span>

<span lang="EN-IE" xml:lang="EN-IE">&nbsp;</span>

TEXT SAMPLE 7

1 answer:

Answer 0 (score: 0)

BeautifulSoup is a good place to start, in particular its get_text function.

This will output all the text in the snippet above:

from bs4 import BeautifulSoup

CONTENT = """
<p>
<span lang="EN-IE" xml:lang="EN-IE">

<br>
TEXT SAMPLE 1
<br>
<br>
TEXT SAMPLE 2

<span lang="EN-IE" xml:lang="EN-IE">TEXT SAMPLE 3
</span>,

<span lang="EN-IE" xml:lang="EN-IE">&nbsp;TEXT SAMPLE 4
</span>&nbsp;TEXT SAMPLE 5

<span lang="EN-IE" xml:lang="EN-IE">.&nbsp;</span>

</span><span lang="EN-IE" xml:lang="EN-IE">

<br>
<br>

TEXT SAMPLE 6
</span>

<span lang="EN-IE" xml:lang="EN-IE">&nbsp;</span>

TEXT SAMPLE 7
"""

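if __name__ == '__main__':
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BeautifulSoup(CONTENT, 'html.parser')
    # get_text() returns the text of every node concatenated, with the tags removed.
    print(soup.get_text())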

The output may need some string manipulation, since it contains many newlines, but this will strip the HTML.
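
One possible way to do that string manipulation is sketched below. It is only a sketch and makes a few assumptions: Python 3 with bs4 installed, the built-in html.parser backend, and a shortened copy of the markup from the question. The names SAMPLE and extract_lines are made up for illustration. It iterates over stripped_strings instead of calling get_text, so whitespace-only fragments are skipped, and it collapses the remaining whitespace (including the non-breaking spaces that come from &nbsp;) into single spaces.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (illustrative only).
SAMPLE = """
<p>
<span lang="EN-IE" xml:lang="EN-IE">
<br>
TEXT SAMPLE 1
<br>
<br>
TEXT SAMPLE 2
<span lang="EN-IE" xml:lang="EN-IE">&nbsp;TEXT SAMPLE 4
</span>&nbsp;TEXT SAMPLE 5
</span>
"""


def extract_lines(html):
    """Return the non-empty text fragments of html, whitespace-normalized.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    # stripped_strings yields each text node with leading and trailing
    # whitespace removed and skips nodes that are whitespace only.
    for fragment in soup.stripped_strings:
        # str.split() with no arguments also splits on non-breaking spaces
        # (the decoded &nbsp;), so this collapses every inner whitespace run.
        cleaned = " ".join(fragment.split())
        if cleaned:
            lines.append(cleaned)
    return lines


if __name__ == "__main__":
    for line in extract_lines(SAMPLE):
        print(line)
```

Returning a list rather than one joined string leaves it to the caller to decide how the fragments are recombined, for example with "\n".join(...) or " ".join(...).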