I am trying to scrape some data for my application. My problem is that I need some of the text from it. Here is the HTML code:
<tr>
<td>
This
<a class="tip info" href="blablablablabla">is a first</a>
sentence.
<br>
This
<a class="tip info" href="blablablablabla">is a second</a>
sentence.
<br>This
<a class="tip info" href="blablablablabla">is a third</a>
sentence.
<br>
</td>
</tr>
I want the output to look like:
This is a first sentence. This is a second sentence. This is a third sentence.
Is this possible?
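Answer 0 (score: 2)
This is certainly possible. I'll answer a bit more generally, because I doubt you only want to process that one chunk of HTML.
First, get a pointer to the td element: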
td = soup.find('td')
Now, note that you can get a list of this element's children:
>>> td_kids = list(td.children)
>>> td_kids
['\n This\n ', <a class="tip info" href="blablablablabla">is a first</a>, '\n sentence.\n ', <br/>, '\n This\n ', <a class="tip info" href="blablablablabla">is a second</a>, '\n sentence.\n ', <br/>, 'This\n ', <a class="tip info" href="blablablablabla">is a third</a>, '\n sentence.\n ', <br/>, '\n']
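Some items in this list are strings and some are HTML elements. Crucially, some are br elements.
You can first split the list into one or more sublists by checking isinstance(td_kids[<some k>], bs4.element.Tag) for each item in the list and breaking wherever you find a br element.
Then you can repeatedly refine each sublist by treating each remaining tag as a soup of its own and taking its children. Eventually you will have several sublists that contain only what BeautifulSoup calls 'navigable strings', which you can manipulate as usual.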
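Join the elements together, and then I suggest using the regex function re.sub to collapse the whitespace. Replace each run with a single space rather than an empty string, otherwise neighbouring words get glued together:
result = re.sub(r'\s+', ' ', <joined list>).strip()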
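The steps above can be sketched end to end roughly as follows. This is only one way to read the description, not necessarily what the answerer had in mind: the helper name sentences_from_td is made up for illustration, the built-in "html.parser" is assumed as the parser, and nested tags are flattened with get_text() instead of being expanded step by step.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<tr>
<td>
This
<a class="tip info" href="blablablablabla">is a first</a>
sentence.
<br>
This
<a class="tip info" href="blablablablabla">is a second</a>
sentence.
<br>This
<a class="tip info" href="blablablablabla">is a third</a>
sentence.
<br>
</td>
</tr>"""


def sentences_from_td(td):
    """Hypothetical helper: split td's children at <br> and clean each group."""
    groups, current = [], []
    for child in td.children:
        if isinstance(child, Tag) and child.name == "br":
            # A <br> closes the current group of children.
            groups.append(current)
            current = []
        elif isinstance(child, Tag):
            # Any other tag (here the <a> links) contributes its text.
            current.append(child.get_text())
        else:
            # Navigable strings are kept as plain text.
            current.append(str(child))
    groups.append(current)
    # Join each group and collapse every whitespace run to a single space.
    cleaned = (re.sub(r"\s+", " ", "".join(group)).strip() for group in groups)
    return [text for text in cleaned if text]


soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")
result = sentences_from_td(td)
print("\n".join(result))
```

Splitting on br rather than on full stops means a sentence that contains its own ". " still comes out as one line.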
Answer 1 (score: 2)
Try this. It should give you the desired output. Just treat the content variable used in the script below as the holder of the HTML elements pasted above.
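from bs4 import BeautifulSoup

# `content` holds the HTML pasted in the question
soup = BeautifulSoup(content, "lxml")

# For each link, glue together the text before it, its own text and the text after it,
# then collapse the whitespace so that every sentence is a single clean line.
sentences = [' '.join(''.join([item.previous_sibling, item.text, item.next_sibling]).split())
             for item in soup.select(".tip.info")]
data = '\n'.join(sentences)
print(data)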
Output:
This is a first sentence.
This is a second sentence.
This is a third sentence.
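To see why this works, it may help to look at what previous_sibling and next_sibling actually hold for a single link. The snippet below is a small illustration, assuming the built-in "html.parser" instead of lxml and a shortened copy of the question's HTML in content:

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (assumption: one sentence is enough here).
content = """<tr>
<td>
This
<a class="tip info" href="blablablablabla">is a first</a>
sentence.
<br>
</td>
</tr>"""

soup = BeautifulSoup(content, "html.parser")
first_link = soup.select_one(".tip.info")

# The text nodes directly before and after the link, including raw whitespace.
before = first_link.previous_sibling
after = first_link.next_sibling
print(repr(before))
print(repr(first_link.text))
print(repr(after))

# Concatenate the three pieces and collapse the whitespace between words.
sentence = " ".join((before + first_link.text + after).split())
print(sentence)
```

Note that this relies on every link having a text node on both sides. If a link is immediately followed by another tag, or is the last child, next_sibling will be a tag or None and the string join will fail.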
Answer 2 (score: 2)
You can do this easily with bs4 and basic string manipulation, as follows:
from bs4 import BeautifulSoup
data = '''
<tr>
<td>
This
<a class="tip info" href="blablablablabla">is a first</a>
sentence.
<br>
This
<a class="tip info" href="blablablablabla">is a second</a>
sentence.
<br>This
<a class="tip info" href="blablablablabla">is a third</a>
sentence.
<br>
</td>
</tr>
'''
soup = BeautifulSoup(data, 'html.parser')
for i in soup.find_all('td'):
    # collapse all whitespace, then start a new line after each full stop
    print(' '.join(i.text.split()).replace('. ', '.\n'))
This produces the output:
This is a first sentence.
This is a second sentence.
This is a third sentence.
Answer 3 (score: 1)
htmlText = """<tr>
<td>
This
<a class="tip info" href="blablablablabla">is a first</a>
sentence.
<br>
This
<a class="tip info" href="blablablablabla">is a second</a>
sentence.
<br>This
<a class="tip info" href="blablablablabla">is a third</a>
sentence.
<br>
</td>
</tr>"""
from bs4 import BeautifulSoup
# these two steps put everything on one line; they may not be necessary for you
htmlText = htmlText.replace("\n", " ")
while "  " in htmlText:  # collapse runs of spaces down to a single space
    htmlText = htmlText.replace("  ", " ")
# import into bs4
soup = BeautifulSoup(htmlText, "lxml")
# using https://stackoverflow.com/a/34640357/5702157
for br in soup.find_all("br"):
    br.replace_with("\n")
parsedText = soup.get_text()
while "\n " in parsedText:
    parsedText = parsedText.replace("\n ", "\n")  # remove spaces at the start of new lines
print(parsedText.strip())
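If you would rather avoid the while loops, the same clean-up can probably be done with two regular expressions. This is a sketch under the assumption that only plain spaces and newlines need handling. The helper name tidy is made up for illustration, and unlike the loops above it also trims spaces that sit just before a newline.

```python
import re


def tidy(text):
    """Hypothetical helper: normalise spaces in text extracted with get_text()."""
    # Collapse runs of two or more spaces into one.
    text = re.sub(r" {2,}", " ", text)
    # Drop spaces on either side of a newline.
    text = re.sub(r" *\n *", "\n", text)
    return text.strip()


sample = " This  is a first   sentence. \n   This is a second sentence. \n"
cleaned = tidy(sample)
print(cleaned)
```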