I want to scrape a FAQ page with BeautifulSoup, but I ran into a problem when printing the data.
For example:
Q: question1111
A: answer1111
Q: question2222
A: answer2222
for q in question:
    print(q)
    for a in answer:
        print(a)
The output is:
question1111
answer1111
answer2222
question2222
answer1111
answer2222
What I want is this:
question1111
answer1111
question2222
answer2222
Then I tried using break:
for q in question:
    print(q)
    for a in answer:
        print(a)
        break
The output became:
question1111
answer1111
question2222
answer1111
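I also tried continue and pass, but neither worked.
Is there a way to run the inner loop once, then go back to the outer loop and repeat?
Edit (added below):
The HTML looks like this: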
<div>
<h4 class="mod-wysiwyg__small-heading">Question1</h4>
</div>
<div>
<p class="mod-wysiwyg__text">Answer1... paragraph1</p>
</div>
<div>
<p class="mod-wysiwyg__text">Answer1...paragraph2</p>
</div>
<div>
<h4 class="mod-wysiwyg__small-heading">Question2</h4>
</div>
<div>
<p class="mod-wysiwyg__text">Answer2</p>
</div>
<div>
<h4 class="mod-wysiwyg__small-heading">Question3</h4>
</div>
The code that scrapes the HTML:
if r.status_code == requests.codes.ok:
    soup = BeautifulSoup(r.text, 'html.parser')
    question = soup.find_all('h4', class_='mod-wysiwyg__small-heading')
    answer = soup.find_all('p', class_='mod-wysiwyg__text')
    # Open the output file once instead of once per print call
    with open("output.txt", 'a') as f:
        for q, a in zip(question, answer):
            print("- - " + q.text[3:], file=f)
            print(" - " + a.text, file=f)
The output is:
Question1
Answer1... paragraph1
Question2
Answer1...paragraph2
Question3
Answer2
Answer 0 (score: 0)
Iterate over each question, then walk through the following siblings of its parent div to collect the paragraphs of the answer, stopping as soon as a new question is encountered (we don't want to collect the next question's answer):
result = []
for question in soup.select("h4.mod-wysiwyg__small-heading"):
    paragraphs = []
    for sibling in question.parent.find_next_siblings("div"):
        if sibling.h4:  # new question, exit
            break
        answer = sibling.find('p', class_='mod-wysiwyg__text')
        if answer:
            paragraphs.append(answer.text)
    result.append((question.text, " ".join(paragraphs)))
Output for the sample HTML:
[(u'Question1', u'Answer1... paragraph1 Answer1...paragraph2'),
(u'Question2', u'Answer2'),
(u'Question3', '')]
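To try this end to end, here is a minimal self-contained sketch that runs the same sibling walk on the sample HTML from the question. The helper name `collect_faq` and the `HTML` constant are illustrative assumptions, not part of the original code, and the print format only mimics the prefixes used in the question.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """
<div>
<h4 class="mod-wysiwyg__small-heading">Question1</h4>
</div>
<div>
<p class="mod-wysiwyg__text">Answer1... paragraph1</p>
</div>
<div>
<p class="mod-wysiwyg__text">Answer1...paragraph2</p>
</div>
<div>
<h4 class="mod-wysiwyg__small-heading">Question2</h4>
</div>
<div>
<p class="mod-wysiwyg__text">Answer2</p>
</div>
<div>
<h4 class="mod-wysiwyg__small-heading">Question3</h4>
</div>
"""


def collect_faq(html):
    """Hypothetical helper: return a list of (question, answer) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for question in soup.select("h4.mod-wysiwyg__small-heading"):
        paragraphs = []
        # Walk the divs that follow the question's wrapper div.
        for sibling in question.parent.find_next_siblings("div"):
            if sibling.h4:  # reached the next question, stop collecting
                break
            answer = sibling.find("p", class_="mod-wysiwyg__text")
            if answer:
                paragraphs.append(answer.text)
        result.append((question.text, " ".join(paragraphs)))
    return result


pairs = collect_faq(HTML)
for q, a in pairs:
    print("- - " + q)
    print(" - " + a)
```

Because the answers are grouped per question before anything is printed, a question with several paragraphs or with no answer at all no longer shifts the pairing, which is what went wrong with zip over two flat lists.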
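Answer 1 (score: 0)
If a question and its answer are not wrapped together in one block div, go up to .parent and use .find_next_sibling():
from bs4 import BeautifulSoup

# html is assumed to hold the sample markup from the question
soup = BeautifulSoup(html, 'html.parser')
question = soup.find_all('h4', class_='mod-wysiwyg__small-heading')
for q in question:
    next_div = q.parent.find_next_sibling('div')
    # The last question may have no following div, so guard against None
    firstAnswer = next_div.find('p') if next_div else None
    # or
    # .find('p', class_="mod-wysiwyg__text")
    print(q.text)
    print(firstAnswer.text if firstAnswer else '')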
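On the loop question itself: break restarts the inner loop from the first answer on every pass, which is why answer1111 was printed twice. For the simplified case where every question has exactly one answer, pairing the two lists by position with zip gives the desired order. A minimal sketch, assuming plain strings stand in for the parsed tags (the `questions`, `answers` and `lines` names are placeholders):

```python
# Placeholder data standing in for the parsed tags from the question.
questions = ["question1111", "question2222"]
answers = ["answer1111", "answer2222"]

lines = []
# zip pairs items by position, so each question is followed by its own answer.
for q, a in zip(questions, answers):
    lines.append(q)
    lines.append(a)

print("\n".join(lines))
```

Note that zip stops at the shorter list and assumes a strict one-to-one correspondence, so it does not fit the real page, where one answer can span several paragraphs.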