BeautifulSoup: "find_all" and "get_text"

Time: 2019-08-13 19:56:53

Tags: python xml beautifulsoup

I have some XML formatted like this:

  <Paragraph Type="Character">
   <Text>
    TED
   </Text>
  </Paragraph>
  <Paragraph Type="Dialogue">
   <Text>
    I thought we had a rule against that.
   </Text>
  </Paragraph>
  <Paragraph Type="Character">
   <Text>
    ANNIE
   </Text>
  </Paragraph>
  <Paragraph Type="Dialogue">
   <Text>
    ...oh.
   </Text>
  </Paragraph>

I am trying to extract the data so that it looks like this:

Character   Dialogue

TED         I thought we had a rule against that.
ANNIE       ...oh. 

I have been trying:

soup.find(Type = "Character").get_text()
soup.find(Type = "Dialogue").get_text()

Each of these returns one line at a time. When I try to do this for multiple elements with soup.find_all, i.e.:

soup.find_all(Type = "Character").get_text()

I get the error:

AttributeError: ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

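I understand that find_all() returns an array of elements (thanks to an earlier answer: https://stackoverflow.com/a/21997788/8742237) and that I should pick one element from the array, but I want to get all of the elements in the array into the format I showed above.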

3 Answers:

Answer 0: (score: 2)

To get Character/Dialogue pairs, you can use the built-in zip() function:

html_data = '''  <Paragraph Type="Character">
   <Text>
    TED
   </Text>
  </Paragraph>
  <Paragraph Type="Dialogue">
   <Text>
    I thought we had a rule against that.
   </Text>
  </Paragraph>
  <Paragraph Type="Character">
   <Text>
    ANNIE
   </Text>
  </Paragraph>
  <Paragraph Type="Dialogue">
   <Text>
    ...oh.
   </Text>
  </Paragraph>
  '''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_data, 'html.parser')

print('{: <10} {}'.format('Character', 'Dialogue'))
print()
for character, dialogue in zip(soup.select('[Type="Character"]'), soup.select('[Type="Character"] + [Type="Dialogue"]')):
    print('{: <10} {}'.format( character.get_text(strip=True), dialogue.get_text(strip=True)) )

Prints:

Character  Dialogue

TED        I thought we had a rule against that.
ANNIE      ...oh.

The CSS selector [Type="Character"] + [Type="Dialogue"] selects a tag with Type=Dialogue that immediately follows a tag with Type=Character.

Further reading: CSS Selectors Reference
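To see what the adjacent-sibling combinator changes, here is a minimal sketch that compares the plain [Type="Dialogue"] selector with the combined one. The sample data is hypothetical: it has the same shape as the question's XML, but the "Action" paragraph is an assumption added only to show a Dialogue that does not directly follow a Character.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the "Action" paragraph is NOT in the original data.
xml_data = '''
<Paragraph Type="Character"><Text>TED</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>I thought we had a rule against that.</Text></Paragraph>
<Paragraph Type="Action"><Text>Annie shrugs.</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>...oh.</Text></Paragraph>
'''

soup = BeautifulSoup(xml_data, 'html.parser')

# Every Dialogue paragraph, wherever it sits.
all_dialogue = [p.get_text(strip=True)
                for p in soup.select('[Type="Dialogue"]')]

# Only Dialogue paragraphs whose previous sibling element is a Character.
adjacent_only = [p.get_text(strip=True)
                 for p in soup.select('[Type="Character"] + [Type="Dialogue"]')]

print(all_dialogue)
print(adjacent_only)
```

Under these assumptions the second list should contain only the first line of dialogue, because "...oh." follows the Action paragraph rather than a Character.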

Answer 1: (score: 1)

Have you tried iterating over the array and getting the text like this?

[x.get_text() for x in soup.find_all(Type = "Character")]

The array has no get_text() method, but its elements should.
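As a rough sketch of that idea, the same loop can be run once per Type value and the two lists paired by position. One assumption to note: this sketch uses 'html.parser', which lowercases attribute names, so the filter is written as type=...; with a case-preserving XML parser the keyword would be Type=..., as in the question.

```python
from bs4 import BeautifulSoup

xml_data = '''
<Paragraph Type="Character"><Text>TED</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>I thought we had a rule against that.</Text></Paragraph>
<Paragraph Type="Character"><Text>ANNIE</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>...oh.</Text></Paragraph>
'''

# Assumption: html.parser stores the attribute as lowercase "type".
soup = BeautifulSoup(xml_data, 'html.parser')

# find_all() returns a list-like ResultSet; call get_text() on each element.
characters = [tag.get_text(strip=True) for tag in soup.find_all(type="Character")]
dialogues = [tag.get_text(strip=True) for tag in soup.find_all(type="Dialogue")]

# Pair the first character with the first dialogue, and so on.
pairs = list(zip(characters, dialogues))
print(pairs)
```

Pairing by position only lines up correctly when every Character is followed by exactly one Dialogue, as in the question's data.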

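Answer 2: (score: 0)

Andrej Kesely's answer is what I was looking for: https://stackoverflow.com/a/57484760/8742237

In case anyone looking at this question in the future is a beginner, here is my attempt at breaking it down: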

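# Assumes `soup` is the BeautifulSoup object built from the XML above.
# Text of every element with Type="Character"
list1 = [x.get_text(strip=True) for x in soup.select('[Type="Character"]')]
print(list1)

# Text of every element with Type="Dialogue"
list2 = [x.get_text(strip=True) for x in soup.select('[Type="Dialogue"]')]
print(list2)

# Pair the two lists by position: first character with first dialogue, and so on
zip1 = zip(list1, list2)
print(list(zip1))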
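To get from the zipped lists to the two-column layout the question asks for, something like the following sketch could be used. The helper name format_table and its gap parameter are hypothetical, and list1 and list2 are hard-coded here to the values the breakdown above would produce for the sample data.

```python
# Values the breakdown above would produce for the question's sample data.
list1 = ['TED', 'ANNIE']
list2 = ['I thought we had a rule against that.', '...oh.']


def format_table(characters, dialogues, gap=3):
    """Return a two-column text table (hypothetical helper, not part of bs4)."""
    # The column is as wide as the longest name (or the header), plus a gap.
    width = max([len('Character')] + [len(c) for c in characters]) + gap
    lines = ['Character'.ljust(width) + 'Dialogue', '']
    for character, dialogue in zip(characters, dialogues):
        lines.append(character.ljust(width) + dialogue)
    return '\n'.join(lines)


table = format_table(list1, list2)
print(table)
```

Computing the column width from the data keeps the columns aligned if a longer character name shows up later.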