Question

I have some HTML like this:

<p><span class="map-sub-title">abc</span>123</p>

I parsed it with BeautifulSoup and got the result 'abc123'.

But I want the result '123', not 'abc123'.

Answer 1

You can use decompose() to remove the span tag and then get the text you want.

from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "lxml")

for span in soup.find_all("span", {'class':'map-sub-title'}):
    span.decompose()

print(soup.text)
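
The same idea can be wrapped in a small helper. This is only a sketch: the function name text_without is hypothetical, and it uses the built-in "html.parser" instead of "lxml" so that it runs without an extra install.

```python
from bs4 import BeautifulSoup


def text_without(html, name, css_class):
    """Return the text of html after removing every <name class=css_class> tag.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser
    for tag in soup.find_all(name, class_=css_class):
        tag.decompose()  # removes the tag and everything inside it
    return soup.get_text()


result = text_without(
    '<p><span class="map-sub-title">abc</span>123</p>',
    "span",
    "map-sub-title",
)
print(result)
```

Note that decompose() modifies the parsed tree, so the helper parses a fresh copy on every call.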

Answer 2

You can also use extract() to remove the unwanted tag before getting the text, as in the code below.

from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html,"lxml")
soup1.p.span.extract()

print(soup1.text)
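
One difference from decompose() is worth seeing in code: extract() returns the tag it removed, so the removed content stays available. A minimal sketch, assuming the built-in "html.parser" in place of "lxml"; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser

# extract() detaches the span from the tree and returns it
removed = soup.p.span.extract()

removed_text = removed.get_text()  # text of the detached span
remaining_text = soup.get_text()   # text left in the document

print(removed_text)
print(remaining_text)
```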

Answer 3

One approach is to use contents on the parent tag (in this case <p>).

If you know the position of the string, you can index it directly:

>>> from bs4 import BeautifulSoup, NavigableString
>>> soup = BeautifulSoup('<p><span class="map-sub-title">abc</span>123</p>', 'lxml')
>>> # check the contents
... soup.find('p').contents
[<span class="map-sub-title">abc</span>, '123']
>>> soup.find('p').contents[1]
'123'

If you want a general solution for when you don't know the position, you can check whether each item in contents is a NavigableString, like this:

>>> final_text = [x for x in soup.find('p').contents if isinstance(x, NavigableString)]
>>> final_text
['123']

With the second approach, you get all the text that is a direct child of the <p> tag. For completeness, here is one more example:

>>> html = '''
... <p>
...     I want
...     <span class="map-sub-title">abc</span>
...     foo
...     <span class="map-sub-title">abc2</span>
...     text
...     <span class="map-sub-title">abc3</span>
...     only
... </p>
... '''
>>> soup = BeautifulSoup(html, 'lxml')
>>> ' '.join([x.strip() for x in soup.find('p').contents if isinstance(x, NavigableString)])
'I want foo text only'
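
The NavigableString filter could be packaged as a reusable function. The sketch below is one possible shape, not an established API: direct_text is a hypothetical name, empty pieces are dropped after stripping, and "html.parser" is used instead of "lxml" to avoid the extra dependency.

```python
from bs4 import BeautifulSoup, NavigableString


def direct_text(tag, sep=" "):
    """Join the strings that are direct children of tag, skipping nested tags.

    Hypothetical helper for illustration.
    """
    parts = (
        child.strip()
        for child in tag.contents
        if isinstance(child, NavigableString)
    )
    # Drop pieces that were only whitespace
    return sep.join(part for part in parts if part)


soup = BeautifulSoup(
    '<p><span class="map-sub-title">abc</span>123</p>', "html.parser"
)
simple = direct_text(soup.find("p"))

soup2 = BeautifulSoup("<p>I want <span>abc</span> text only</p>", "html.parser")
mixed = direct_text(soup2.find("p"))

print(simple)
print(mixed)
```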

Answer 4

Although every answer in this thread is acceptable, I'll point out another approach:

soup.find("span", {'class':'map-sub-title'}).next_sibling

You can use next_sibling to navigate between elements that share the same parent, in this case the p tag.
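
A complete, runnable version of the one-liner might look like the sketch below. The guards are an addition for illustration: find() returns None when nothing matches, and next_sibling can be None or another tag rather than a string. The built-in "html.parser" is assumed here.

```python
from bs4 import BeautifulSoup, NavigableString

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser

span = soup.find("span", {"class": "map-sub-title"})

# next_sibling is whatever node follows the span under the same parent
sibling = span.next_sibling if span is not None else None

# Keep the value only if it is a text node, not a tag or None
text = sibling.strip() if isinstance(sibling, NavigableString) else ""

print(text)
```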

Answer 5

If a tag contains more than one thing, you can still look at just the strings. Use the .strings generator:

>>> from bs4 import BeautifulSoup
>>> html = '<p><span class="map-sub-title">abc</span>123</p>'
>>> soup1 = BeautifulSoup(html,"lxml")
>>> soup1.p.strings
<generator object _all_strings at 0x00000008768C50>
>>> list(soup1.strings)
['abc', '123']
>>> list(soup1.strings)[1]
'123'
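
Because .strings also yields the text inside nested tags, indexing depends on knowing the position. One way to avoid that, shown here as a sketch rather than the answerer's method, is to keep only strings whose parent is the <p> tag itself. The built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser

p = soup.p

# Every string under <p>, including the one inside the span
all_strings = list(p.strings)

# Only the strings that sit directly under <p>
own_strings = [s for s in p.strings if s.parent is p]

print(all_strings)
print(own_strings)
```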

BeautifulSoup: get the content without the next tag

5 answers: