I'm scraping text from a webpage with lxml and requests. All of the text I want is inside <p> tags. When I use contents = tree.xpath('//*[@id="storytext"]/p/text()'), contents only contains the text that is not inside <em> or <strong> tags. But when I use contents = tree.xpath('//*[@id="storytext"]/p/text() | //*[@id="storytext"]/p/strong/text() | //*[@id="storytext"]/p/em/text()'), the text inside the <em> and <strong> tags comes back separated from the rest of the text of that <p> tag.
I want to:

1. scrape each <p> as a single unit, including all of its text (whether plain or inside <em> or <strong>), and
2. keep the <em> and <strong> tags so I can use them later to format the scraped text.
Example html:
<div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>
Desired output:
"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.
Answer 0 (score: 0)
If everything you want is only ever between the <p> tags, you could use bs4 and replace to remove the opening and closing p tags:
from bs4 import BeautifulSoup as bs

html = '''
<div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>
'''
soup = bs(html, 'lxml')
for item in soup.select('p'):
    # str(item) keeps the inner <em>/<strong> markup; strip only the <p> wrapper
    print(str(item).replace('<p>', '').replace('</p>', ''))
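A slightly cleaner variant, if you would rather avoid the string replacement, is Tag.decode_contents(), which returns a tag's inner HTML as a string (a minor alternative, not part of the answer above):

for item in soup.select('p'):
    # inner HTML of the <p>: its text plus the <em>/<strong> markup, without the <p> wrapper
    print(item.decode_contents())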
Using requests to fetch the html:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('url')  # replace 'url' with the actual page address
soup = bs(r.content, 'lxml')
for item in soup.select('p'):
    print(str(item).replace('<p>', '').replace('</p>', ''))
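Since the question already uses lxml, here is a rough sketch of how the same output could be produced with lxml alone by re-serializing each <p>'s children; the serialization approach is my assumption, not part of the original answer:

from lxml import html

sample = '''<div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>'''

tree = html.fromstring(sample)
for p in tree.xpath('//*[@id="storytext"]/p'):
    # p.text is the text before the first child element; tostring() on each child
    # re-serializes the tag itself and, by default, the tail text that follows it.
    inner = (p.text or '') + ''.join(html.tostring(child, encoding='unicode') for child in p)
    print(inner)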