Trying to scrape all text under a specific <div> while ignoring the HTML tags

Asked: 2017-01-04 23:11:36

Tags: python web-scraping lxml

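I'm trying to scrape a website with lxml in Python 3.5, but I can't get a satisfactory result from one section of the page.

This is the basic structure of that section: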

<div class="field-clearfix">
    <div class="field-label">Heading</div>
    <div class="field-items">
        <div class="field-item even">
        <p>
        Text script <a href="URL" target=\"_blank\>[ABCD]</a>.
        Another text script <a href="URL" target=\"_blank\>[BCDE]</a>, text. 
        Another text text script <a href="URL" target=\"_blank\>[FGHI]</a>, text.
        </p>
        </div>
    </div>
</div>

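Right now I'm using this:

import requests
from lxml import html

page = requests.get(URL_TO_SCRAPE)  # URL_TO_SCRAPE: placeholder for the page URL
tree = html.fromstring(page.content)
output = tree.xpath('//div[contains(@class,"field-clearfix")]/div[2]/div/p/text()')

But of course this only returns "Text script". What I'd really like is output that contains all of the text, with the HTML tags left out: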

Text script [ABCD] Another text script [BCDE], text. Another text text script [FGHI], text.

I'm fairly new to Python and scraping, so I suspect there is a very simple lxml solution that I'm just not arriving at. Any help is greatly appreciated!

2 answers:

Answer 0 (score: 3)

Select all the text nodes under the element, at any depth, and join them:

"".join(tree.xpath('//div[contains(@class,"field-clearfix")]/div[2]/div/p//text()'))
                                                   # NOTE THIS EXTRA SLASH^
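
To make the effect of the extra slash concrete, here is a minimal sketch on a made-up one-line fragment (the fragment is an assumption for illustration, not the asker's page). `/text()` selects only text nodes that are direct children of the element, while `//text()` also reaches text nested inside child tags such as <a>:

```python
from lxml import html

# Hypothetical one-paragraph fragment, used only to compare the two XPath forms.
p = html.fromstring('<p>Text script <a href="URL">[ABCD]</a>, text.</p>')

direct = p.xpath('./text()')        # text nodes that are direct children of <p>
descendants = p.xpath('.//text()')  # text nodes at any depth under <p>

print(direct)       # the link text [ABCD] is missing here
print(descendants)  # the link text [ABCD] is included here
```

Joining `descendants` with an empty string gives the paragraph text with the markup removed.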

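Note that the HTML you posted is malformed (the target attributes are never closed properly), and that needs to be fixed for this to work. With my corrected version of the HTML, it works for me: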

<div class="field-clearfix">
    <div class="field-label">Heading</div>
    <div class="field-items">
        <div class="field-item even">
        <p>
        Text script <a href="URL" target="_blank">[ABCD]</a>.
        Another text script <a href="URL" target="_blank">[BCDE]</a>, text.
        Another text text script <a href="URL" target="_blank">[FGHI]</a>, text.
        </p>
        </div>
    </div>
</div>
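
As a self-contained sketch of the approach above, the following parses the corrected HTML from a string instead of fetching it (the `SAMPLE` and `XPATH` names and the `extract_text` helper are assumptions added for illustration). It also collapses the newlines and indentation that the joined text nodes still contain, which the one-liner above does not do:

```python
from lxml import html

# The corrected markup from above; stands in for page.content.
SAMPLE = """<div class="field-clearfix">
    <div class="field-label">Heading</div>
    <div class="field-items">
        <div class="field-item even">
        <p>
        Text script <a href="URL" target="_blank">[ABCD]</a>.
        Another text script <a href="URL" target="_blank">[BCDE]</a>, text.
        Another text text script <a href="URL" target="_blank">[FGHI]</a>, text.
        </p>
        </div>
    </div>
</div>"""

XPATH = '//div[contains(@class,"field-clearfix")]/div[2]/div/p'


def extract_text(markup):
    """Return the paragraph text with tags removed and whitespace collapsed."""
    tree = html.fromstring(markup)
    # //text() picks up every descendant text node, including link text.
    raw = "".join(tree.xpath(XPATH + "//text()"))
    # split() with no arguments splits on any whitespace run; joining with
    # a single space removes the newlines and indentation.
    return " ".join(raw.split())


text = extract_text(SAMPLE)
print(text)
```

lxml's HTML elements also provide a `text_content()` method, so `tree.xpath(XPATH)[0].text_content()` should yield the same raw text before whitespace is collapsed.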

Answer 1 (score: 1)

Using @alexcxe's corrected HTML, this can also be solved with BeautifulSoup:

from bs4 import BeautifulSoup

string = '''<div class="field-clearfix">
    <div class="field-label">Heading</div>
    <div class="field-items">
        <div class="field-item even">
        <p>
        Text script <a href="URL" target="_blank">[ABCD]</a>.
        Another text script <a href="URL" target="_blank">[BCDE]</a>, text.
        Another text text script <a href="URL" target="_blank">[FGHI]</a>, text.
        </p>
        </div>
    </div>
</div>'''

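soup = BeautifulSoup(string, 'html.parser')

# Collect the text of every <p> element, with the tags stripped
paragraphs = soup.find_all('p')
result = [x.text for x in paragraphs]

# Collapse runs of whitespace (newlines, indentation) into single spaces
result = " ".join(result[0].split())

Check result: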

>>> result
'Text script [ABCD]. Another text script [BCDE], text. Another text text script [FGHI], text.'
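
On a real page, `find_all('p')` would return every paragraph in the document, not just the one in this section. A possible variation, sketched here under the assumption that the class names in the sample markup are stable on the real site, uses a CSS selector to limit the match to the target container (`SAMPLE` and `texts` are illustrative names):

```python
from bs4 import BeautifulSoup

# The corrected markup from the answers above.
SAMPLE = """<div class="field-clearfix">
    <div class="field-label">Heading</div>
    <div class="field-items">
        <div class="field-item even">
        <p>
        Text script <a href="URL" target="_blank">[ABCD]</a>.
        Another text script <a href="URL" target="_blank">[BCDE]</a>, text.
        Another text text script <a href="URL" target="_blank">[FGHI]</a>, text.
        </p>
        </div>
    </div>
</div>"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Only <p> elements inside div.field-items within div.field-clearfix match.
paragraphs = soup.select("div.field-clearfix div.field-items p")

# get_text() drops the tags; split()/join collapses the whitespace.
texts = [" ".join(p.get_text().split()) for p in paragraphs]
print(texts)
```

This produces one cleaned string per matching paragraph, so it keeps working if the section contains more than one <p>.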