I want to extract question (type='q') and answer (type='a') pairs from the following XML format, with each pair as a single data point:
<?xml version="1.0" encoding="us-ascii"?>
<transcript id="001" >
<body>
<section name="Q&A">
<speaker id="0">
<plist>
<p>Thank you. We'll now be conducting the question-and-answer session. <mark type="Operator Instructions" /> Thank you. Please go ahead with your question.</p>
</plist>
</speaker>
<speaker id="3" type="q">
<plist>
<p>Good morning. First of all, Happy New Year.</p>
</plist>
</speaker>
<speaker id="2" type="a">
<plist>
<p>Happy New Year, sir.</p>
</plist>
</speaker>
<speaker id="3" type="q">
<plist>
<p>Thank you. How is your pain now?.</p>
</plist>
</speaker>
<speaker id="2" type="a">
<plist>
<p>Oh, it's better now. I think i am healing.</p>
</plist>
</speaker>
</section>
</body>
</transcript>
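That is, the output should be: ['Good morning. First of all, Happy New Year. Happy New Year, sir.', "Thank you. How is your pain now?. Oh, it's better now. I think i am healing."]
Can someone help me do this with Beautiful Soup? My current code extracts all the <p> tags in the document, but the problem is that there are other sections besides "Q&A", and their <p> tags get extracted as well.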
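from bs4 import BeautifulSoup

# handler is the open file (or string) holding the transcript XML
soup = BeautifulSoup(handler, "html.parser")
texts = []
for node in soup.findAll('p'):
    # Join every text fragment inside this <p>
    text = " ".join(node.findAll(text=True))
    #text = clean_text(text)
    texts.append(text)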
Answer 0 (score: 1)
You can use findAll('speaker', {"type": "q"}) to find the questions, and findNext("speaker") on each question to find the corresponding answer.
For example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(handler, "html.parser")
for node in soup.findAll('speaker', {"type": "q"}):
    print(node.find("p").text)
    print(node.findNext("speaker").find("p").text)
    print("--")
Output:
Good morning. First of all, Happy New Year.
Happy New Year, sir.
--
Thank you. How is your pain now?.
Oh, it's better now. I think i am healing.
--
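The question also mentions that other sections contribute unwanted <p> tags. A minimal sketch of one way to handle that, assuming the sections are distinguished by their name attribute as in the sample: look up the "Q&A" section first, then search only inside it. The "Presentation" section in the sample string is hypothetical, added only to show the scoping, and the type check on the next speaker is an extra guard that is not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the "Presentation" section is invented for illustration.
xml = """
<transcript id="001">
<body>
<section name="Presentation">
<speaker id="1"><plist><p>Welcome to the call.</p></plist></speaker>
</section>
<section name="Q&A">
<speaker id="0"><plist><p>Please go ahead with your question.</p></plist></speaker>
<speaker id="3" type="q"><plist><p>Good morning. First of all, Happy New Year.</p></plist></speaker>
<speaker id="2" type="a"><plist><p>Happy New Year, sir.</p></plist></speaker>
<speaker id="3" type="q"><plist><p>Thank you. How is your pain now?.</p></plist></speaker>
<speaker id="2" type="a"><plist><p>Oh, it's better now. I think i am healing.</p></plist></speaker>
</section>
</body>
</transcript>
"""

soup = BeautifulSoup(xml, "html.parser")

# "name" is also the first parameter of find(), so the attribute goes in attrs.
# This sketch assumes the section exists; find() returns None otherwise.
qa_section = soup.find("section", attrs={"name": "Q&A"})

pairs = []
for question in qa_section.find_all("speaker", {"type": "q"}):
    answer = question.find_next_sibling("speaker")
    # Skip questions that are not directly followed by an answer.
    if answer is None or answer.get("type") != "a":
        continue
    pairs.append(" ".join((question.p.text, answer.p.text)))

print(pairs)
```

Because the search starts from qa_section rather than soup, the "Welcome to the call." paragraph never reaches the result.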
Answer 1 (score: 1)
You can use find_all('speaker', type='q') and find_all('speaker', type='a') to find all the questions and all the answers, respectively. Then use zip to pair each question with its answer.
Code:
questions = soup.find_all('speaker', type='q')
answers = soup.find_all('speaker', type='a')
for q, a in zip(questions, answers):
    print(' '.join((q.p.text, a.p.text)))
Output:
Good morning. First of all, Happy New Year. Happy New Year, sir.
Thank you. How is your pain now?. Oh, it's better now. I think i am healing.
If you want the result in a list, you can use a list comprehension:
q_and_a = [' '.join((q.p.text, a.p.text)) for q, a in zip(questions, answers)]
print(q_and_a)
# ['Good morning. First of all, Happy New Year. Happy New Year, sir.',
# "Thank you. How is your pain now?. Oh, it's better now. I think i am healing."]
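Two caveats may be worth handling, depending on the real transcripts: q.p.text reads only the first <p> of a speaker, and zip pairs purely by position, stopping at the shorter list. The sketch below is one possible variation, not part of the original answer. The helper speaker_text and the two-paragraph sample are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the answer spans two <p> elements.
xml = """
<section name="Q&A">
<speaker id="3" type="q"><plist><p>How is your pain now?</p></plist></speaker>
<speaker id="2" type="a"><plist><p>Oh, it's better now.</p><p>I think I am healing.</p></plist></speaker>
</section>
"""


def speaker_text(speaker):
    """Join the text of every <p> inside one <speaker> element."""
    return " ".join(p.get_text(strip=True) for p in speaker.find_all("p"))


soup = BeautifulSoup(xml, "html.parser")
questions = soup.find_all("speaker", type="q")
answers = soup.find_all("speaker", type="a")

# zip() stops at the shorter list, so a mismatch would otherwise go unnoticed.
if len(questions) != len(answers):
    raise ValueError("question and answer counts differ")

q_and_a = [
    " ".join((speaker_text(q), speaker_text(a)))
    for q, a in zip(questions, answers)
]
print(q_and_a)
```

If the counts can legitimately differ (for example, a question followed by two answers), positional pairing is no longer safe and walking the speakers in document order is likely the better choice.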