Question

I am trying to scrape some data for my application, and I need to extract the text from it. This is the HTML code:

<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>

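I would like the output to look like this:

This is a first sentence.
This is a second sentence.
This is a third sentence.

Is that possible?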

Answer 1

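It is certainly possible. I will answer somewhat generally, because I doubt you only want to process that one chunk of HTML.

First, get a reference to the td element: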

td = soup.find('td')

Now, notice that you can get a list of this element's children:

>>> td_kids = list(td.children)
>>> td_kids
['\n    This\n    ', <a class="tip info" href="blablablablabla">is a first</a>, '\n    sentence.\n    ', <br/>, '\n    This\n    ', <a class="tip info" href="blablablablabla">is a second</a>, '\n    sentence.\n    ', <br/>, 'This\n    ', <a class="tip info" href="blablablablabla">is a third</a>, '\n    sentence.\n    ', <br/>, '\n']

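Some of the items in this list are strings and some are HTML elements. Crucially, some of them are br elements.

You can first split the list into one or more sublists by checking

isinstance(td_kids[<some k>], bs4.element.Tag)

for each item in the list and breaking wherever the tag is a br.

Then you can repeatedly replace the tags in each sublist by treating each tag as a soup and getting its children. In the end you will have several sublists containing only what BeautifulSoup calls 'navigable strings', which you can manipulate as usual.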
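The splitting and flattening steps described above can be sketched roughly as follows. This is only one possible reading of the approach: the helper name `split_at_br` is hypothetical, the sketch assumes the built-in `html.parser` backend, and it uses `Tag.strings` to flatten nested tags rather than re-parsing them.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>
"""

def split_at_br(element):
    """Return one list of plain strings per <br>-delimited chunk of `element`.

    Hypothetical helper, not part of bs4.
    """
    groups, current = [], []

    def flush():
        # Keep a chunk only if it contains something other than whitespace.
        if any(s.strip() for s in current):
            groups.append(list(current))
        current.clear()

    for child in element.children:
        if isinstance(child, Tag) and child.name == "br":
            flush()                          # a <br> closes the current chunk
        elif isinstance(child, Tag):
            current.extend(child.strings)    # flatten a nested tag into its strings
        else:
            current.append(str(child))       # already a navigable string
    flush()
    return groups

soup = BeautifulSoup(html, "html.parser")
groups = split_at_br(soup.find("td"))
for group in groups:
    print(group)
```

Each resulting sublist holds only strings, so it can be joined and cleaned up with ordinary string tools.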

Join the elements together, and then I suggest you use a regular expression sub to collapse the whitespace:

result = re.sub(r'\s+', ' ', <joined list>).strip()
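As a small illustration of the join-and-clean step, here is a sketch applied to a single chunk of strings. The `chunk` list is copied from the first three children of the td shown earlier. The sketch replaces each run of whitespace with a single space, because substituting an empty string would glue together the words on either side of a line break.

```python
import re

# One <br>-delimited chunk, copied from the first three td children.
chunk = ['\n    This\n    ', 'is a first', '\n    sentence.\n    ']

joined = ''.join(chunk)

# Collapse every run of whitespace to one space, then trim both ends.
result = re.sub(r'\s+', ' ', joined).strip()
print(result)  # This is a first sentence.

# For contrast: deleting the runs outright merges neighbouring words.
glued = re.sub(r'\s{2,}', '', joined)
print(glued)   # Thisis a firstsentence.
```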

Answer 2

Try this. It should give you the desired output. Just treat the content variable used in the script below as the holder of the HTML elements pasted above.

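from bs4 import BeautifulSoup

# content is assumed to hold the HTML pasted in the question
soup = BeautifulSoup(content, "lxml")
# For each link, glue together the text before it, its own text, and the text after it,
# then collapse the whitespace so each sentence ends up on one clean line.
lines = [' '.join(''.join([item.previous_sibling, item.text, item.next_sibling]).split())
         for item in soup.select(".tip.info")]
data = '\n'.join(lines)
print(data)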

Output:

This is a first sentence.
This is a second sentence.
This is a third sentence.
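The sibling-based approach assumes that every matched link has plain text directly before and after it. When a link is the first child, or is followed directly by another tag such as br, the sibling is `None` or a tag and cannot be joined as a string. A defensive variant might look like the sketch below; the helper names `sibling_text` and `sentences_around_links` and the `sample` markup are hypothetical, and `html.parser` is used so that no extra parser is required.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

def sibling_text(node):
    """Return the node's text if it is a plain string, otherwise an empty string."""
    return str(node) if isinstance(node, NavigableString) else ""

def sentences_around_links(html, selector=".tip.info"):
    """Build one whitespace-normalized line per element matched by `selector`."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for item in soup.select(selector):
        parts = [
            sibling_text(item.previous_sibling),  # text before the link, if any
            item.get_text(),                      # the link's own text
            sibling_text(item.next_sibling),      # text after the link, if any
        ]
        lines.append(" ".join("".join(parts).split()))
    return lines

# Made-up markup: the first link has no preceding text, the second is followed by a <br>.
sample = (
    '<td><a class="tip info" href="#">A link first</a> and then text.<br>'
    'Text, with a comma, <a class="tip info" href="#">then a link</a><br></td>'
)
for line in sentences_around_links(sample):
    print(line)
```

Because the lines are collected in a list rather than joined on a comma, commas inside the text are left alone.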

Answer 3

You can do this easily using bs4 and basic string manipulation, like so:

from bs4 import BeautifulSoup

data = '''
<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>
'''

soup = BeautifulSoup(data, 'html.parser')
for i in soup.find_all('td'):
    # Collapse all whitespace, then start a new line after each sentence-ending period.
    print(' '.join(i.text.split()).replace('. ', '.\n'))

This gives the following output:

This is a first sentence.
This is a second sentence.
This is a third sentence.

Answer 4

htmlText = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""
from bs4 import BeautifulSoup
# these two steps are to put everything into one line. may not be necessary for you
htmlText = htmlText.replace("\n", " ")
while "  " in htmlText:
    htmlText = htmlText.replace("  ", " ")

# import into bs4
soup = BeautifulSoup(htmlText, "lxml")

# using https://stackoverflow.com/a/34640357/5702157
for br in soup.find_all("br"):
    br.replace_with("\n")

parsedText = soup.get_text()
while "\n " in parsedText:
    parsedText = parsedText.replace("\n ", "\n") # remove spaces at the start of new lines
print(parsedText.strip())

How to get all the text before a tag using BeautifulSoup4
