Question

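I wrote a scraper that pulls data from a website, but unfortunately the site's markup is inconsistent. Sometimes the paragraphs are wrapped in <p> tags and sometimes they are not (code snippet below). Is there a dynamic way to handle both cases?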

The part of the code that produces the error:

main_content = soup.findAll("div", {"class": "story-detail"})
content = ""
for div in main_content:
    links = div.findAll('p')
    for a in links:
        a = str(a).strip('<p>')
        a = str(a).strip('/>')
        a = str(a).strip('<')
        a = str(a).strip('<br>')
        content = content + a

Answer 1

You can get all of the text through the text attribute. That way you don't have to worry about the underlying structure.

Example:

>>> from bs4 import BeautifulSoup as Soup
>>> first = '<div class="story-detail">test</div>'
>>> soup = Soup(first, 'html.parser')
>>> soup
<div class="story-detail">test</div>

>>> soup.find('div').text
'test'
>>> second = '<div class="story-detail">another <p>test</p></div>'
>>> soup = Soup(second, 'html.parser')
>>> soup
<div class="story-detail">another <p>test</p></div>

>>> soup.find('div').text
'another test'
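
Applied to the loop in the question, the same idea might look like the sketch below. The helper name extract_story_text and the two sample strings are made up for illustration; only the "story-detail" class comes from the question. It uses get_text with a separator so that text from adjacent <p> or <br> elements does not run together.

```python
from bs4 import BeautifulSoup


def extract_story_text(html):
    """Return the text of every div.story-detail, with or without <p> tags."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    for div in soup.find_all("div", class_="story-detail"):
        # strip=True trims each text fragment and drops empty ones;
        # the separator keeps neighbouring fragments apart.
        parts.append(div.get_text(separator=" ", strip=True))
    return "\n".join(parts)


# Hypothetical inputs: one page uses <p>, the other only <br>.
with_p = '<div class="story-detail"><p>first</p><p>second</p></div>'
without_p = '<div class="story-detail">first<br>second</div>'

print(extract_story_text(with_p))
print(extract_story_text(without_p))
```

Both calls should give the same text, so the scraper no longer needs to know which markup style a page uses.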

Answer 2

If what you are trying to do is remove all <p> and </p> tags from the text, I would use a regular expression, like this:

main_content = ["a div without p tags", "<p>a div with p tags</p>"]

import re

for i in range(0,len(main_content)):
   main_content[i] = re.sub("<p>|</p>","",main_content[i])
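
The pattern above only matches the literal strings <p> and </p>. A slightly more tolerant variant, sketched here under the assumption that the input may also contain <br> tags (as the question's code suggests) or <p> tags with attributes, could look like this. The names TAG_RE, strip_p_and_br and samples are illustrative only.

```python
import re

# Matches opening or closing <p> and <br> tags, with optional attributes
# or a trailing slash. The \b keeps tags such as <pre> from matching.
TAG_RE = re.compile(r"</?(?:p|br)\b[^>]*>", re.IGNORECASE)


def strip_p_and_br(fragment):
    """Remove <p> and <br> tags from a fragment, keeping the text."""
    return TAG_RE.sub("", fragment)


samples = [
    "a div without p tags",
    '<p class="lead">a div with p tags</p>',
    "line one<br/>line two",
    "<pre>kept as is</pre>",
]
cleaned = [strip_p_and_br(s) for s in samples]
print(cleaned)
```

Note that removing <br> this way joins the surrounding text with no space; replace the tag with " " instead of "" if the lines should stay separated. For anything beyond simple fragments, an HTML parser is usually the safer choice.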

Extracting paragraph data without any tags, BeautifulSoup
