I am using BeautifulSoup 4 with Python to parse some HTML. Here is the code:
from bs4 import BeautifulSoup as bs
html_doc = '<p class="line-spacing-double" align="center">IN <i>THE </i><b>DISTRICT</b> COURT OF {county} COUNTY\nSTATE OF OKLAHOMA</p>'
soup = bs(html_doc, 'html.parser')
para = soup.p
for child in soup.p.children:
    print(child)
The result is:
IN
<i>THE </i>
<b>DISTRICT</b>
COURT OF {county} COUNTY
STATE OF OKLAHOMA
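That all makes sense. What I want to do is loop through the results and, whenever I find an <i> or <b> tag, handle it differently. When I try the following, it doesn't work:
for child in soup.p.children:
    if child.findChildren('i'):
        print('italics found')
The error occurs because the first child returned is a string, and I am trying to search it for child tags, which BS4 already knows cannot exist.
So I changed the code to check whether the child is a string and, if it is, to just print it without doing anything else to it.
for child in soup.p.children:
    if isinstance(child, str):
        print(child)
    elif child.findAll('i'):
        for tag in child.findAll('i'):
            print(tag)
The result of this latest code: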
IN
COURT OF {county} COUNTY
STATE OF OKLAHOMA
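As I loop through the results I need to be able to check for tags, but I can't figure out how. I thought it would be simple, but I'm stumped.
Edit:
In response to jacalvo:
If I run
for child in soup.p.children:
    if child.find('i'):
        print(child)
it still fails to print the second and third lines of the expected output (the <i> and <b> tags) from the HTML.
Edit:
for child in soup.p.children:
    if isinstance(child, str):
        print(child)
    else:
        print(child.findChildren('i', recursive=False))
The result is: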
IN
[]
[]
COURT OF {county} COUNTY
STATE OF OKLAHOMA
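Answer 0: (score: 1)
Is this what you are trying to do, as an example of "doing something different" with the tags? Including a full sample of the desired output in the question would help: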
from bs4 import BeautifulSoup as bs
html_doc = '<p class="line-spacing-double" align="center">IN <i>THE</i> <b>DISTRICT</b> COURT OF {county} COUNTY\nSTATE OF OKLAHOMA</p>'
soup = bs(html_doc, 'html.parser')
para = soup.p
for child in para.children:
    # child.name is None for plain text, so strings fall through to the else branch
    if child.name == 'i':
        print(f'*{child.text}*', end='')
    elif child.name == 'b':
        print(f'**{child.text}**', end='')
    else:
        print(child, end='')
Output:
IN *THE* **DISTRICT** COURT OF {county} COUNTY
STATE OF OKLAHOMA
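For context on why the attempts in the question printed nothing for the tags: find_all() (and its older alias findAll()) searches inside the element it is called on, so calling it on the <i> tag itself looks for an <i> nested within it. A minimal sketch, using a shortened version of the markup; the variable names `names` and `nested` are only for illustration:

```python
from bs4 import BeautifulSoup

# Shortened version of the paragraph from the question
html_doc = '<p>IN <i>THE </i><b>DISTRICT</b> COURT</p>'
soup = BeautifulSoup(html_doc, 'html.parser')

# .name is None for text nodes and the tag name for Tag objects
names = [child.name for child in soup.p.children]
print(names)

# Searching each tag child for a nested <i> finds nothing,
# because <i> and <b> here contain only text
nested = [len(child.find_all('i')) for child in soup.p.children if child.name]
print(nested)
```

This is why comparing child.name directly, as in the loop above, is enough to tell the tags apart.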
Answer 1: (score: 0)
from bs4 import BeautifulSoup as bs
html_doc = '<p class="line-spacing-double" align="center">IN <i>THE </i><b>DISTRICT</b> COURT OF {county} ' \
'COUNTY\nSTATE OF OKLAHOMA</p> '
soup = bs(html_doc, 'html.parser')
paragraph = soup.p
# all tags dynamically gotten
tags = [tag.name for tag in soup.find_all()]
for child in paragraph.children:
    # Text nodes have name None, which is never in the list of tag names
    if child.name in tags:
        print('{0}'.format(child))  # or child.text
    else:
        print(child)
Output:
IN
<i>THE </i>
<b>DISTRICT</b>
COURT OF {county} COUNTY
STATE OF OKLAHOMA
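An alternative to building the list of tag names is to test the type of each child directly. The sketch below assumes the same kind of markup as above; `describe_children` is a hypothetical helper name, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def describe_children(parent):
    """Return (kind, text) pairs for the direct children of parent."""
    result = []
    for child in parent.children:
        if isinstance(child, Tag):
            # A real element: record its tag name and its text content
            result.append((child.name, child.get_text()))
        elif isinstance(child, NavigableString):
            # A bare text node between tags
            result.append(('text', str(child)))
    return result


soup = BeautifulSoup('<p>IN <i>THE </i><b>DISTRICT</b> COURT</p>', 'html.parser')
print(describe_children(soup.p))
```

Checking isinstance(child, Tag) avoids a second pass over the document and makes the intent explicit, at the cost of one extra import.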
Answer 2: (score: 0)
Use findChildren() and then check each child's name with a condition.
from bs4 import BeautifulSoup as bs
html_doc = '<p class="line-spacing-double" align="center">IN <i>THE </i><b>DISTRICT</b> COURT OF {county} COUNTY\nSTATE OF OKLAHOMA</p>'
soup = bs(html_doc, 'html.parser')
# findChildren(recursive=False) yields only the direct child tags, not the text nodes
for child in soup.find('p').findChildren(recursive=False):
    if child.name == 'i':
        print(child)
    if child.name == 'b':
        print(child)
Output:
<i>THE </i>
<b>DISTRICT</b>
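The name check can also be folded into the search itself, since find_all() accepts a list of tag names. This is only a sketch on a shortened version of the markup, and `direct_tags` is an illustrative variable name. Note that, like the loop above, it skips the text nodes entirely, so it suits cases where only the tags matter:

```python
from bs4 import BeautifulSoup

html_doc = '<p>IN <i>THE </i><b>DISTRICT</b> COURT</p>'
soup = BeautifulSoup(html_doc, 'html.parser')

# Only direct children of <p> that are <i> or <b>; text nodes are not returned
direct_tags = soup.p.find_all(['i', 'b'], recursive=False)
for tag in direct_tags:
    print(tag.name, repr(tag.get_text()))
```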