I have some XML in the following format:
<Paragraph Type="Character">
<Text>
TED
</Text>
</Paragraph>
<Paragraph Type="Dialogue">
<Text>
I thought we had a rule against that.
</Text>
</Paragraph>
<Paragraph Type="Character">
<Text>
ANNIE
</Text>
</Paragraph>
<Paragraph Type="Dialogue">
<Text>
...oh.
</Text>
</Paragraph>
I am trying to extract the data so that it looks like this:
Character Dialogue
TED I thought we had a rule against that.
ANNIE ...oh.
I have been trying:
soup.find(Type = "Character").get_text()
soup.find(Type = "Dialogue").get_text()
Each of these calls returns only one line. When I try to get several at once with soup.find_all, i.e.:
soup.find_all(Type = "Character").get_text()
I get the error:
AttributeError: ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
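I know that find_all() returns an array of elements (thanks to an earlier answer: https://stackoverflow.com/a/21997788/8742237) and that I am supposed to select a single element from that array, but I want to get all of the elements in the array into the format shown above.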
Answer 0 (score: 2)
To get Character and Dialogue in pairs, you can use the built-in zip() function:
html_data = ''' <Paragraph Type="Character">
<Text>
TED
</Text>
</Paragraph>
<Paragraph Type="Dialogue">
<Text>
I thought we had a rule against that.
</Text>
</Paragraph>
<Paragraph Type="Character">
<Text>
ANNIE
</Text>
</Paragraph>
<Paragraph Type="Dialogue">
<Text>
...oh.
</Text>
</Paragraph>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_data, 'html.parser')
print('{: <10} {}'.format('Character', 'Dialogue'))
print()
# Pair every Character paragraph with the Dialogue paragraph that directly follows it.
for character, dialogue in zip(soup.select('[Type="Character"]'), soup.select('[Type="Character"] + [Type="Dialogue"]')):
    print('{: <10} {}'.format(character.get_text(strip=True), dialogue.get_text(strip=True)))
Prints:
Character  Dialogue

TED        I thought we had a rule against that.
ANNIE      ...oh.
The CSS selector [Type="Character"] + [Type="Dialogue"] selects every tag with Type=Dialogue that comes immediately after a sibling tag with Type=Character.
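To make the effect of the + (adjacent sibling) combinator concrete, here is a small sketch. The markup is a hypothetical variation of the question's data, with an extra Dialogue paragraph that follows another Dialogue rather than a Character; that extra line is an assumption added only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second Dialogue follows another Dialogue,
# not a Character paragraph.
sample = '''
<Paragraph Type="Character"><Text>TED</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>I thought we had a rule against that.</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>Did we not?</Text></Paragraph>
<Paragraph Type="Character"><Text>ANNIE</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>...oh.</Text></Paragraph>
'''

soup = BeautifulSoup(sample, 'html.parser')

# Every Dialogue paragraph, regardless of what precedes it.
all_dialogue = [p.get_text(strip=True) for p in soup.select('[Type="Dialogue"]')]

# Only Dialogue paragraphs whose previous sibling tag is a Character paragraph.
paired_dialogue = [
    p.get_text(strip=True)
    for p in soup.select('[Type="Character"] + [Type="Dialogue"]')
]

print(all_dialogue)
print(paired_dialogue)
```

With the plain [Type="Dialogue"] selector the two lists passed to zip() could drift out of step on data like this, which is presumably why the combinator is used in the answer.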
Answer 1 (score: 1)
Have you tried iterating over the array and getting the text of each element, like this?
[x.get_text() for x in soup.find_all(Type = "Character")]
The array itself (a ResultSet) has no get_text() method, but each of its elements should.
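As a rough, self-contained sketch of the list comprehension above: the sample string copies the question's data, and the choice of html.parser is an assumption. That parser lowercases attribute names, so the sketch filters on attrs={"type": ...}; the Type= keyword from the question presumably relied on a case-preserving XML parser.

```python
from bs4 import BeautifulSoup

xml_data = '''
<Paragraph Type="Character"><Text>TED</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>I thought we had a rule against that.</Text></Paragraph>
<Paragraph Type="Character"><Text>ANNIE</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>...oh.</Text></Paragraph>
'''

soup = BeautifulSoup(xml_data, 'html.parser')

# find_all() returns a ResultSet (a list of tags), so get_text() has to be
# called on each tag, not on the ResultSet itself.
# html.parser stores the attribute as "type", hence the lowercase key.
characters = [x.get_text(strip=True) for x in soup.find_all(attrs={"type": "Character"})]
dialogues = [x.get_text(strip=True) for x in soup.find_all(attrs={"type": "Dialogue"})]

print(characters)
print(dialogues)
```

Passing strip=True removes the newlines that surround the text inside each Text tag.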
Answer 2 (score: 0)
I have been studying Andrej Kesely's answer: https://stackoverflow.com/a/57484760/8742237
In case someone reading this question in the future is a beginner, here is my attempt at breaking it down:
list1 = [x.get_text(strip = True) for x in soup.select('[Type="Character"]')]
print(list1)
list2 = [x.get_text(strip = True) for x in soup.select('[Type="Dialogue"]')]
print(list2)
zip1 = zip(list1, list2)
print(list(zip1))
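To go from the zipped pairs to the table asked for in the question, something like the following might work. It is only a sketch: format_rows is a hypothetical helper, and the two hard-coded lists stand in for list1 and list2 above.

```python
def format_rows(characters, dialogues, width=10):
    """Return a header line followed by one aligned line per (character, dialogue) pair."""
    lines = [f"{'Character': <{width}} Dialogue"]
    # zip() stops at the shorter list, so unmatched items are silently dropped.
    for name, line in zip(characters, dialogues):
        lines.append(f"{name: <{width}} {line}")
    return lines


# Stand-ins for list1 and list2 (assumed values, taken from the question's data).
list1 = ["TED", "ANNIE"]
list2 = ["I thought we had a rule against that.", "...oh."]

table = format_rows(list1, list2)
for row in table:
    print(row)
```

Because zip() truncates to the shorter input, it may be worth comparing len(list1) and len(list2) first if the script is not guaranteed to alternate Character and Dialogue paragraphs.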