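I have several inline divs (they act as 'headers') followed by paragraph tags (not inside the divs) that are, in theory, their 'children'. I would like to convert this into a dictionary, but I can't work out the best way to do it. This is roughly what the page looks like: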
<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
The Python code I'm working with is as follows:
import bs4 as bs

soup = bs.BeautifulSoup(source, 'lxml')
full_discussion = soup.find(attrs={'class': 'field field-type-text field-field-discussion'})
ava_discussion = full_discussion.find(attrs={'class': 'field-item odd'})
ava_dict = {}  # assumed to start empty; not shown in the original snippet
for div in ava_discussion.find_all("div"):
    discussion = []
    # findNextSibling returns only the first matching sibling after div
    if div.findNextSibling('p'):
        discussion.append(div.findNextSibling('p').get_text())
    location = div.get_text()
    ava_dict.update({location: {"discussion": discussion}})
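The problem, however, is that this code only adds the FIRST <p> tag and then moves on to the next div. Ultimately, I want every <p> that follows a div to be added to that div's discussion list. Help!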
Update
Adding a while loop produces copies of the first <p> tag, one copy for each <p> tag that follows the div. Here is the code:
for div in ava_discussion.find_all("div"):
    ns = div.nextSibling
    discussion = []
    while ns is not None and ns.name != "div":
        if ns.name == "p":
            discussion.append(div.findNextSibling('p').get_text())
        ns = ns.nextSibling
    location = div.get_text()
    ava_dict.update({location : {"discussion": discussion}})
print(json.dumps(ava_dict, indent=2))
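Answer 0 (score: 1)
I wasn't appending the right text: inside the loop I was still calling div.findNextSibling('p'), which always returns the first <p> after the div, instead of reading the text from ns. This code works: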
for div in ava_discussion.find_all("div"):
    ns = div.nextSibling
    discussion = []
    while ns is not None and ns.name != "div":
        if ns.name == "p":
            discussion.append(ns.get_text())
        ns = ns.nextSibling
    location = div.get_text()
    ava_dict.update({location : {"discussion": discussion}})
print(json.dumps(ava_dict, indent=2))
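For anyone who wants to try this without the original page, here is a minimal sketch of the same sibling walk run against the sample markup from the question. The wrapper element with id "wrapper", the html.parser backend, and the getattr guard around .name are my own assumptions for the sake of a self-contained snippet; they are not part of the original code.

```python
from bs4 import BeautifulSoup

# Sample markup from the question; the outer wrapper div is hypothetical.
html = """<div id="wrapper">
<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find(id="wrapper")

ava_dict = {}
for div in container.find_all("div"):
    discussion = []
    ns = div.next_sibling
    # Walk forward until the next heading div or the end of the siblings.
    # getattr is a defensive guard for text nodes between the tags.
    while ns is not None and getattr(ns, "name", None) != "div":
        if getattr(ns, "name", None) == "p":
            # Read the text from the sibling we are standing on, not from div.
            discussion.append(ns.get_text())
        ns = ns.next_sibling
    ava_dict[div.get_text()] = {"discussion": discussion}
```

The key point is that ns moves forward on every iteration, so each <p> is read exactly once.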
Answer 1 (score: 0)
How about this?
paragraphs = div.findNextSiblings('p')
for sibling in div.findNextSiblings():
    if sibling in paragraphs:
        discussion.append(sibling.get_text())
    else:
        break
Now, can anyone tell me how to make this more elegant? :)
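One possible tidier variant, offered as a sketch rather than the definitive way: make a single pass over the container's direct children and remember which heading is current. It assumes the heading divs and their paragraphs are all direct children of one container; the function name group_paragraphs, the "wrapper" id, and the html.parser backend are made up for this example.

```python
from bs4 import BeautifulSoup

# Sample markup from the question; the outer wrapper div is hypothetical.
html = """<div id="wrapper">
<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
</div>"""


def group_paragraphs(container):
    """Map each heading div's text to the <p> texts that follow it."""
    result = {}
    current = None  # list of paragraphs for the most recent heading
    # recursive=False limits the search to direct children, in document order.
    for tag in container.find_all(["div", "p"], recursive=False):
        if tag.name == "div":
            current = []
            result[tag.get_text(strip=True)] = {"discussion": current}
        elif current is not None:
            # Paragraphs that appear before any heading are skipped.
            current.append(tag.get_text(strip=True))
    return result


container = BeautifulSoup(html, "html.parser").find(id="wrapper")
grouped = group_paragraphs(container)
```

This avoids restarting the sibling search for every div, at the cost of assuming a flat structure under the container.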