I'm scraping text from a webpage with lxml and requests. All of the text I want is inside <p> tags. When I use contents = tree.xpath('//*[@id="storytext"]/p/text()'), contents only contains the text that is not inside <em> or <strong> tags. But when I use contents = tree.xpath('//*[@id="storytext"]/p/text() | //*[@id="storytext"]/p/strong/text() | //*[@id="storytext"]/p/em/text()'), the text inside the <em> and <strong> tags comes back separated from the rest of the text of that <p> tag.
I want to:

1. scrape each <p> as a single unit, including all of its text (whether plain or inside <em> or <strong>), and
2. keep the <em> and <strong> tags so I can use them later to format the scraped text.
Example html:
<div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>
Desired output:
"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.
Answer 0 (score: 0)
If everything you want is only ever between the <p> tags, you could use bs4 and replace to remove the opening and closing p tags:
from bs4 import BeautifulSoup as bs

html = '''
<div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>
'''
soup = bs(html, 'lxml')
for item in soup.select('p'):
    # str(item) keeps the inner <em>/<strong> markup; strip only the <p> wrapper
    print(str(item).replace('<p>', '').replace('</p>', ''))
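A slightly cleaner variant, if you would rather avoid the string replacement, is Tag.decode_contents(), which returns a tag's inner HTML as a string (a minor alternative, not part of the answer above):

for item in soup.select('p'):
    # inner HTML of the <p>: its text plus the <em>/<strong> markup, without the <p> wrapper
    print(item.decode_contents())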
Using requests to fetch the html:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('url')  # replace 'url' with the actual page address
soup = bs(r.content, 'lxml')
for item in soup.select('p'):
    print(str(item).replace('<p>', '').replace('</p>', ''))
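Since the question already uses lxml, here is a rough sketch of how the same output could be produced with lxml alone by re-serializing each <p>'s children; the serialization approach is my assumption, not part of the original answer:

from lxml import html

sample = '''<div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>'''

tree = html.fromstring(sample)
for p in tree.xpath('//*[@id="storytext"]/p'):
    # p.text is the text before the first child element; tostring() on each child
    # re-serializes the tag itself and, by default, the tail text that follows it.
    inner = (p.text or '') + ''.join(html.tostring(child, encoding='unicode') for child in p)
    print(inner)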