I'm trying to scrape a website with lxml in Python 3.5, but I'm having trouble getting satisfactory results from one section of the site.
Here is the basic format of that section:
<div class="field-clearfix">
<div class="field-label">Heading</div>
<div class="field-items">
<div class="field-item even">
<p>
Text script <a href="URL" target=\"_blank\>[ABCD]</a>.
Another text script <a href="URL" target=\"_blank\>[BCDE]</a>, text.
Another text text script <a href="URL" target=\"_blank\>[FGHI]</a>, text.
</p>
</div>
</div>
</div>
Right now I'm using this:
import requests
from lxml import html

page = requests.get(URL_TO_SCRAPE)
tree = html.fromstring(page.content)
output = tree.xpath('//div[contains(@class,"field-clearfix")]/div[2]/div/p/text()')
But of course, that only returns "Text script". What I'd really like is for the output to contain all of the text with the HTML markup stripped out:
Text script [ABCD] Another text script [BCDE], text. Another text text script [FGHI], text.
I'm reasonably comfortable with Python and scraping, so I suspect there's a very simple lxml solution here that I'm just not seeing. Any help is greatly appreciated!
Answer 0 (score: 3)
Get all the text nodes under the element and join them:
"".join(tree.xpath('//div[contains(@class,"field-clearfix")]/div[2]/div/p//text()'))
# NOTE THIS EXTRA SLASH^
Note that your HTML is malformed (the quoting on the `target` attributes is broken) and should be fixed for this to work. With my fixed version of the HTML, it works for me:
<div class="field-clearfix">
<div class="field-label">Heading</div>
<div class="field-items">
<div class="field-item even">
<p>
Text script <a href="URL" target="_blank">[ABCD]</a>.
Another text script <a href="URL" target="_blank">[BCDE]</a>, text.
Another text text script <a href="URL" target="_blank">[FGHI]</a>, text.
</p>
</div>
</div>
</div>
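As a sanity check, the whole pipeline can be run against the fixed HTML directly. This is a minimal sketch: `html_source` stands in for `page.content` from the question, and the trailing split/join is just an assumed extra step to collapse the newlines and indentation left over from the source markup.

```python
from lxml import html

# Stand-in for page.content, using the fixed HTML from above
html_source = '''<div class="field-clearfix">
<div class="field-label">Heading</div>
<div class="field-items">
<div class="field-item even">
<p>
Text script <a href="URL" target="_blank">[ABCD]</a>.
Another text script <a href="URL" target="_blank">[BCDE]</a>, text.
Another text text script <a href="URL" target="_blank">[FGHI]</a>, text.
</p>
</div>
</div>
</div>'''

tree = html.fromstring(html_source)
# '//text()' (descendant axis) also visits the text inside the <a> children,
# whereas '/text()' only returns the <p>'s direct text nodes
raw = "".join(tree.xpath('//div[contains(@class,"field-clearfix")]/div[2]/div/p//text()'))
# Collapse runs of whitespace (newlines, indentation) into single spaces
text = " ".join(raw.split())
print(text)
# → Text script [ABCD]. Another text script [BCDE], text. Another text text script [FGHI], text.
```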
Answer 1 (score: 1)
Using the HTML as fixed by @alexcxe, this would solve it:
from bs4 import BeautifulSoup
string = '''<div class="field-clearfix">
<div class="field-label">Heading</div>
<div class="field-items">
<div class="field-item even">
<p>
Text script <a href="URL" target="_blank">[ABCD]</a>.
Another text script <a href="URL" target="_blank">[BCDE]</a>, text.
Another text text script <a href="URL" target="_blank">[FGHI]</a>, text.
</p>
</div>
</div>
</div>'''
soup = BeautifulSoup(string, 'html.parser')
paragraphs = soup.find_all('p')
result = [x.text for x in paragraphs]
result = " ".join(result[0].split())
Checking `result`:
>>> result
'Text script [ABCD]. Another text script [BCDE], text. Another text text script [FGHI], text.'
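As an aside, BeautifulSoup's `get_text` pulls all the nested text (including the link labels) in one call, so the find/extract steps above can be condensed. A minimal sketch under the same assumptions (the fixed HTML is in `string`):

```python
from bs4 import BeautifulSoup

# Same fixed HTML as in the answer above
string = '''<div class="field-clearfix">
<div class="field-label">Heading</div>
<div class="field-items">
<div class="field-item even">
<p>
Text script <a href="URL" target="_blank">[ABCD]</a>.
Another text script <a href="URL" target="_blank">[BCDE]</a>, text.
Another text text script <a href="URL" target="_blank">[FGHI]</a>, text.
</p>
</div>
</div>
</div>'''

soup = BeautifulSoup(string, 'html.parser')
# get_text() concatenates every text node under <p>, links included;
# split()/join() then normalizes the leftover newlines and indentation
result = " ".join(soup.find('p').get_text().split())
print(result)
# → Text script [ABCD]. Another text script [BCDE], text. Another text text script [FGHI], text.
```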