How do I include text inside <strong> and <em> tags when scraping HTML with lxml and requests?

Time: 2019-04-20 23:08:13

Tags: python xpath web-scraping python-requests lxml

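I am using lxml and requests to scrape text from a web page. All of the text I want is inside <p> tags. When I use contents = tree.xpath('//*[@id="storytext"]/p/text()'), contents only includes the text that is not inside <em> or <strong> tags. But when I use contents = tree.xpath('//*[@id="storytext"]/p/text() | //*[@id="storytext"]/p/strong/text() | //*[@id="storytext"]/p/em/text()'), the text inside the <em> and <strong> tags comes back as separate items, split off from the rest of the text in its <p> tag.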
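A minimal reproduction of the behaviour described above, assuming a shortened version of the sample HTML and using lxml.html.fromstring in place of a real request (the variable names direct_only and with_inline are only illustrative):

```python
import lxml.html

# Shortened stand-in for the real page (an assumption for illustration).
html = '<div id="storytext"><p>"Go <em>away!</em>" She didn\'t even <em>hear</em> him.</p></div>'
tree = lxml.html.fromstring(html)

# Only the text nodes that are direct children of <p>.
direct_only = tree.xpath('//*[@id="storytext"]/p/text()')

# Adding the <em> text nodes returns them as separate list items.
with_inline = tree.xpath(
    '//*[@id="storytext"]/p/text() | //*[@id="storytext"]/p/em/text()'
)

print(direct_only)
print(with_inline)
```

In both cases the result is a flat list of text fragments, so the <em> markup itself is lost and the paragraph is no longer a single unit.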

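I would like to:

  1. Scrape each <p> as a single unit, including all of its text (whether plain or inside <em> or <strong>), and

  2. Keep the <em> and <strong> tags so that I can use them later to format the scraped text.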

Sample HTML: <div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>

Desired output: "Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.

1 Answer:

Answer 0: (score: 0)

If those are the only tags in between, you could use bs4 and replace to remove the opening and closing p tags:

from bs4 import BeautifulSoup as bs

html = '''
<div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>
'''

soup = bs(html,'lxml')

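# str(item) serializes the whole <p> element, inline tags included.
# The replace calls then strip the outer tags; note that this only
# matches a bare <p> with no attributes.
for item in soup.select('p'):
    print(str(item).replace('<p>','').replace('</p>',''))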
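As a possible alternative to the string replacement, here is a minimal sketch that uses bs4's Tag.decode_contents() to return the inner markup of each <p>. It assumes a shortened sample in which the <p> carries a class attribute (added here only to show a case the replace approach would miss); the helper name inner_html_per_paragraph is hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened sample; the class attribute is an assumption for illustration.
html = (
    '<div id="storytext"><p class="intro">"Go <em>away!</em>" '
    'She didn\'t even <em>hear</em> him.</p></div>'
)

soup = BeautifulSoup(html, 'lxml')


def inner_html_per_paragraph(soup):
    """Return the inner markup of every <p> under #storytext, one string per <p>."""
    # decode_contents() renders a tag's children without the tag itself,
    # so the inline <em>/<strong> tags are kept and the outer <p> is dropped.
    return [p.decode_contents() for p in soup.select('#storytext p')]


paragraphs = inner_html_per_paragraph(soup)
for text in paragraphs:
    print(text)
```

Because the outer tag is never serialized, this does not depend on the exact spelling of the opening <p> tag.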

Using requests to fetch the HTML:

import requests
from bs4 import BeautifulSoup as bs

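r = requests.get('url')  # replace 'url' with the address of the page to scrape
soup = bs(r.content, 'lxml')
# Print each <p> with its inline tags kept and the outer <p>...</p> removed.
for item in soup.select('p'):
    print(str(item).replace('<p>','').replace('</p>',''))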