How to get all the text before a tag using BeautifulSoup4

Time: 2018-02-10 15:50:41

Tags: python html beautifulsoup scrapy

I am trying to scrape some data for my application. My problem is that I need some... Here is the HTML code:

<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>

I want the output to look like

This is a first sentence.   This is a second sentence.   This is a third sentence.

Is that possible?

4 answers:

Answer 0 (score: 2)

This is certainly possible. I will answer a little more generally, because I doubt you want to process only that one chunk of HTML.

First, get a reference to the td element:

td = soup.find('td')

Now, note that you can get a list of this element's children:

>>> td_kids = list(td.children)
>>> td_kids
['\n    This\n    ', <a class="tip info" href="blablablablabla">is a first</a>, '\n    sentence.\n    ', <br/>, '\n    This\n    ', <a class="tip info" href="blablablablabla">is a second</a>, '\n    sentence.\n    ', <br/>, 'This\n    ', <a class="tip info" href="blablablablabla">is a third</a>, '\n    sentence.\n    ', <br/>, '\n']

Some items in this list are strings and some are HTML elements. Crucially, some of them are br elements.

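You can first split the list into one or more sublists by checking

isinstance(td_kids[<some k>], bs4.element.Tag)

for each item in the list and breaking wherever the tag is a br element.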

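You can then repeatedly replace each sublist by turning its tags into soups and getting their children. In the end you will have several sublists that contain only what BeautifulSoup calls 'navigable strings', which you can manipulate as usual.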
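The splitting and flattening steps could look roughly like the sketch below. It is not the answerer's code: the helper names `split_at_br` and `to_strings` are hypothetical, `html.parser` is an assumed parser choice, and `get_text()` is used as a shortcut instead of descending into each tag's children by hand.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

HTML = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""


def split_at_br(parent):
    """Group the children of `parent` into sublists, starting a new one at each <br>."""
    groups, current = [], []
    for child in parent.children:
        if isinstance(child, Tag) and child.name == "br":
            groups.append(current)
            current = []
        else:
            current.append(child)
    if current:
        # Whatever follows the last <br> (here only whitespace) forms a final group.
        groups.append(current)
    return groups


def to_strings(nodes):
    """Replace every Tag in `nodes` by its text so that only plain strings remain."""
    return [node.get_text() if isinstance(node, Tag) else str(node) for node in nodes]


soup = BeautifulSoup(HTML, "html.parser")
td = soup.find("td")
sublists = [to_strings(group) for group in split_at_br(td)]
for sub in sublists:
    print(sub)
```

Each sublist now holds only strings; the last one contains just the whitespace after the final br and can be discarded once the text is cleaned up.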

Join the elements of each sublist together, and then I suggest using a regular-expression sub to clean up the whitespace:

result = re.sub(r'\s+', ' ', <joined list>).strip()
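A small sketch of the join-and-substitute step on a single sublist. The three strings below are an assumption: they correspond to the first sentence of the question's HTML after its tags have been reduced to text. The sketch also shows why the replacement string matters, since substituting a space keeps the words apart while substituting an empty string fuses them.

```python
import re

# One sublist of plain strings (assumed input, taken from the first sentence).
parts = ["\n    This\n    ", "is a first", "\n    sentence.\n    "]

joined = "".join(parts)

# Collapse every run of whitespace to a single space, then trim the ends.
clean = re.sub(r"\s+", " ", joined).strip()
print(clean)  # This is a first sentence.

# Replacing runs of two or more whitespace characters with nothing
# removes the only separator between neighbouring words.
fused = re.sub(r"\s{2,}", "", joined)
print(fused)  # Thisis a firstsentence.
```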

Answer 1 (score: 2)

Try this. It should give you the output you want. Just treat the content variable used in the script below as the holder of the HTML elements pasted above.

from bs4 import BeautifulSoup

soup = BeautifulSoup(content,"lxml")
items = ','.join([''.join([item.previous_sibling,item.text,item.next_sibling]) for item in soup.select(".tip.info")])
data = ' '.join(items.split()).replace(",","\n")
print(data)

Output:

This is a first sentence. 
This is a second sentence. 
This is a third sentence.
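Because the script above uses a comma as a temporary delimiter, a sentence that itself contains a comma would also be broken across lines. A hedged variant that normalizes each sentence separately is sketched below; the helper `text_of` is hypothetical, and `html.parser` is an assumed parser choice so that the example runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

content = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""


def text_of(node):
    """Return the text of a string node, or '' for a tag or a missing sibling."""
    return str(node) if isinstance(node, NavigableString) else ""


soup = BeautifulSoup(content, "html.parser")
sentences = []
for link in soup.select(".tip.info"):
    # Text before the link, the link text itself, and the text after the link.
    pieces = [text_of(link.previous_sibling), link.get_text(), text_of(link.next_sibling)]
    # split() followed by join() collapses all whitespace runs to single spaces.
    sentences.append(" ".join("".join(pieces).split()))

print("\n".join(sentences))
```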

Answer 2 (score: 2)

You can do this easily with bs4 and basic string manipulation, like this:

from bs4 import BeautifulSoup

data = '''
<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>
'''

soup = BeautifulSoup(data, 'html.parser')
for i in soup.find_all('td'):
    print(' '.join(i.text.split()).replace('. ', '.\n'))

This gives the following output:

This is a first sentence.
This is a second sentence.
This is a third sentence.

Answer 3 (score: 1)

htmlText = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""
from bs4 import BeautifulSoup
# these two steps are to put everything into one line. may not be necessary for you
htmlText = htmlText.replace("\n", " ")
while "  " in htmlText:
    htmlText = htmlText.replace("  ", " ")

# import into bs4
soup = BeautifulSoup(htmlText, "lxml")

# using https://stackoverflow.com/a/34640357/5702157
for br in soup.find_all("br"):
    br.replace_with("\n")

parsedText = soup.get_text()
while "\n " in parsedText:
    parsedText = parsedText.replace("\n ", "\n") # remove spaces at the start of new lines
print(parsedText.strip())