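I want to parse all the text blocks (TITLE CONTENT, BODY CONTENT and EXTRA CONTENT) in the example below. As you may notice, these text blocks sit in a different position inside each "p" tag.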
<p class="plans">
<strong>
TITLE CONTENT #1
</strong>
<br/>
BODY CONTENT #1
<br/>
EXTRA CONTENT #1
</p>
<p class="plans">
<strong>
TITLE CONTENT #2
<br/>
</strong>
BODY CONTENT #2
<br/>
EXTRA CONTENT #2
</p>
<p class="plans">
<strong>
TITLE CONTENT #3
</strong>
<br/>
BODY CONTENT #3
<br/>
EXTRA CONTENT #3
</p>
I would like to display the final result in table format:
Col1              Col2             Col3
TITLE CONTENT #1  BODY CONTENT #1  EXTRA CONTENT #1
TITLE CONTENT #2  BODY CONTENT #2  EXTRA CONTENT #2
TITLE CONTENT #3  BODY CONTENT #3  EXTRA CONTENT #3
I tried:
from bs4 import NavigableString
for i in soup.find_all('p'):
    title = i.find('strong')
    if not isinstance(title.nextSibling, NavigableString):
        body = title.nextSibling.nextSibling
        extra = body.nextSibling.nextSibling
    else:
        if len(title.nextSibling) > 3:
            body = title.nextSibling
            extra = body.nextSibling.nextSibling
        else:
            body = title.nextSibling.nextSibling.nextSibling
            extra = body.nextSibling.nextSibling
But this doesn't look efficient. Does anyone have a better solution?
Any help would be much appreciated!
Thanks!
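Answer 0 (score: 1)
It is worth noting that .next_sibling can work as well, but because you may need to gather several text nodes, you would need some logic to know how many times to call it. In this example I find it easier to just walk the descendants and watch for the features that signal I should do something different.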
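You just have to break down the characteristics of what you are scraping. In this simple case, we know:
- When we see a strong element, we want to capture the "title".
- When we see the first br element, we want to start capturing the "content".
- When we see the second br element, we want to start capturing the "extra content".
We can:
- Target the plans class to get all the plans.
- Iterate through all the descendant nodes of each plan.
from bs4 import BeautifulSoup as bs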
from bs4 import Tag, NavigableString
html = """
<p class="plans">
<strong>
TITLE CONTENT #1
</strong>
<br/>
BODY CONTENT #1
<br/>
EXTRA CONTENT #1
</p>
<p class="plans">
<strong>
TITLE CONTENT #2
<br/>
</strong>
BODY CONTENT #2
<br/>
EXTRA CONTENT #2
</p>
<p class="plans">
<strong>
TITLE CONTENT #3
</strong>
<br/>
BODY CONTENT #3
<br/>
EXTRA CONTENT #3
</p>
"""
soup = bs(html, 'html.parser')
content = []
# Iterate through all the plans
for plans in soup.select('.plans'):
    # Lists that will hold the text nodes of interest
    title = []
    body = []
    extra = []
    current = None  # Reference to one of the above lists to store data
    br = 0  # Count number of br tags
    # Iterate through all the descendant nodes of a plan
    for node in plans.descendants:
        # See if the node is a Tag/Element
        if isinstance(node, Tag):
            if node.name == 'strong':
                # Strong tags/elements contain our title
                # So set the current container for text to the title list
                current = title
            elif node.name == 'br':
                # We've found a br Tag/Element
                br += 1
                if br == 1:
                    # If this is the first, we need to set the current
                    # container for text to the body list
                    current = body
                elif br == 2:
                    # If this is the second, we need to set the current
                    # container for text to the extra list
                    current = extra
        elif isinstance(node, NavigableString) and current is not None:
            # We've found a navigable string (not a tag/element), so let's
            # store the text node in the current list container.
            # NOTE: You may have to filter out things like HTML comments in a real world example.
            current.append(node)
    # Store the captured title, body, and extra text for the current plan.
    # For each list, join the text into one string and strip leading and trailing whitespace
    # from each entry in the row.
    content.append([''.join(entry).strip() for entry in (title, body, extra)])
print(content)
You can then print the data however you like, but you should now have it captured in a nice, logical structure like this:
[['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'], ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'], ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3']]
There are multiple ways to do this; this is just one of them.
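Since the question asks for a table, the captured rows could be printed along these lines. This is only a minimal sketch: format_table is a hypothetical helper (not part of the answer above or of BeautifulSoup), and it assumes content has exactly the list-of-lists shape shown in the output.

```python
# Hypothetical helper, not from the answer above: pads every cell to the
# width of the longest cell in its column.
def format_table(rows, headers):
    table = [list(headers)] + [list(row) for row in rows]
    # zip(*table) transposes rows into columns
    widths = [max(len(cell) for cell in column) for column in zip(*table)]
    return [
        '  '.join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    ]

# Assumed input: the same shape as the `content` list captured above
content = [
    ['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'],
    ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'],
    ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3'],
]

lines = format_table(content, ('Col1', 'Col2', 'Col3'))
print('\n'.join(lines))
```

Column widths are computed from the data, so longer titles would not break the alignment.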
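Answer 1 (score: 0)
Another way is to use slicing, assuming the structure never varies, i.e. every plan always yields exactly three text strings: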
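from bs4 import BeautifulSoup
# Use a context manager so the file is closed after parsing
with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")
def slicing(l):
    # Split the flat list of strings into consecutive groups of three
    new_list = []
    for i in range(0, len(l), 3):
        new_list.append(l[i:i+3])
    return new_list
result = slicing(list(soup.stripped_strings))
print(result)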
Output:
[['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'], ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'], ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3']]
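The slicing above depends on the whole document yielding exactly three strings per plan, so any stray text elsewhere in the file would shift every group. One possible variation, sketched here and not taken from the answer, is to read stripped_strings from each p separately. The inline html string is an assumption standing in for test.html.

```python
from bs4 import BeautifulSoup

# Assumed sample input: the first two plans from the question
html = """
<p class="plans">
<strong>
TITLE CONTENT #1
</strong>
<br/>
BODY CONTENT #1
<br/>
EXTRA CONTENT #1
</p>
<p class="plans">
<strong>
TITLE CONTENT #2
<br/>
</strong>
BODY CONTENT #2
<br/>
EXTRA CONTENT #2
</p>
"""

soup = BeautifulSoup(html, 'html.parser')
# One row per plan; stripped_strings skips whitespace-only text nodes
rows = [list(p.stripped_strings) for p in soup.select('p.plans')]
print(rows)
```

Each row is built from its own p, so a plan with a missing or extra string only affects that row rather than every row after it.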
Answer 2 (score: 0)
In this case you can use BeautifulSoup's get_text() method together with the separator= parameter:
data = '''<p class="plans">
<strong>
TITLE CONTENT #1
</strong>
<br/>
BODY CONTENT #1
<br/>
EXTRA CONTENT #1
</p>
<p class="plans">
<strong>
TITLE CONTENT #2
<br/>
</strong>
BODY CONTENT #2
<br/>
EXTRA CONTENT #2
</p>
<p class="plans">
<strong>
TITLE CONTENT #3
</strong>
<br/>
BODY CONTENT #3
<br/>
EXTRA CONTENT #3
</p>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print('{: ^25}{: ^25}{: ^25}'.format('Col1', 'Col2', 'Col3'))
for p in [[i.strip() for i in p.get_text(separator='|').split('|') if i.strip()] for p in soup.select('p.plans')]:
    print(''.join('{: ^25}'.format(i) for i in p))
Prints:
          Col1                     Col2                     Col3
    TITLE CONTENT #1          BODY CONTENT #1         EXTRA CONTENT #1
    TITLE CONTENT #2          BODY CONTENT #2         EXTRA CONTENT #2
    TITLE CONTENT #3          BODY CONTENT #3         EXTRA CONTENT #3