Question

I have some XML formatted like this:

  <Paragraph Type="Character">
   <Text>
    TED
   </Text>
  </Paragraph>
  <Paragraph Type="Dialogue">
   <Text>
    I thought we had a rule against that.
   </Text>
  </Paragraph>
  <Paragraph Type="Character">
   <Text>
    ANNIE
   </Text>
  </Paragraph>
  <Paragraph Type="Dialogue">
   <Text>
    ...oh.

I am trying to extract the data so that it looks like this:

Character   Dialogue

TED         I thought we had a rule against that.
ANNIE       ...oh.

I have been trying:

soup.find(Type = "Character").get_text()
soup.find(Type = "Dialogue").get_text()

Each of these returns only one line. When I try to use soup.find_all to get multiple results, i.e.:

soup.find_all(Type = "Character").get_text()

I get the error:

AttributeError: ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

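I understand that find_all() returns an array of elements (thanks to an earlier answer: https://stackoverflow.com/a/21997788/8742237) and that I should select one element from the array, but I want to get all of the elements in the array into the format shown above.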

Answer 1

To get the Character and Dialogue values in pairs, you can use the built-in zip() function:

html_data = '''  <Paragraph Type="Character">
   <Text>
    TED
   </Text>
  </Paragraph>
  <Paragraph Type="Dialogue">
   <Text>
    I thought we had a rule against that.
   </Text>
  </Paragraph>
  <Paragraph Type="Character">
   <Text>
    ANNIE
   </Text>
  </Paragraph>
  <Paragraph Type="Dialogue">
   <Text>
    ...oh.
   </Text>
  </Paragraph>
  '''

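from bs4 import BeautifulSoup

soup = BeautifulSoup(html_data, 'html.parser')

# All Character paragraphs, and the Dialogue paragraphs that directly follow one
characters = soup.select('[Type="Character"]')
dialogues = soup.select('[Type="Character"] + [Type="Dialogue"]')

# Header row, then a blank line
print('{: <10} {}'.format('Character', 'Dialogue'))
print()

# zip() pairs the n-th character with the n-th dialogue
for character, dialogue in zip(characters, dialogues):
    print('{: <10} {}'.format(character.get_text(strip=True), dialogue.get_text(strip=True)))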

Prints:

Character  Dialogue

TED        I thought we had a rule against that.
ANNIE      ...oh.

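The CSS selector [Type="Character"] + [Type="Dialogue"] uses the adjacent sibling combinator (+): it selects a tag with Type=Dialogue only when it comes immediately after a tag with Type=Character.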

Further reading: CSS Selectors Reference
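
To illustrate what the adjacent sibling combinator does and does not match, here is a minimal sketch. The "Parenthetical" paragraph type is an assumption made up for this example (it does not appear in the question's data), and html.parser is used only because the answer above uses it.

```python
from bs4 import BeautifulSoup

# Sample data; the "Parenthetical" paragraph is a hypothetical addition
# that separates ANNIE from her dialogue.
xml_data = """
<Paragraph Type="Character"><Text>TED</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>I thought we had a rule against that.</Text></Paragraph>
<Paragraph Type="Character"><Text>ANNIE</Text></Paragraph>
<Paragraph Type="Parenthetical"><Text>(quietly)</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>...oh.</Text></Paragraph>
"""

soup = BeautifulSoup(xml_data, "html.parser")

# Every Dialogue paragraph, whatever precedes it.
all_dialogue = soup.select('[Type="Dialogue"]')

# Only Dialogue paragraphs whose immediately preceding sibling is a Character.
adjacent_dialogue = soup.select('[Type="Character"] + [Type="Dialogue"]')
adjacent_text = [d.get_text(strip=True) for d in adjacent_dialogue]

print(len(all_dialogue), len(adjacent_dialogue))
print(adjacent_text)
```

Because zip() pairs items purely by position and stops at the shorter input, irregular data like this would leave ANNIE without a row. It may be worth checking that both selections have the same length before zipping them.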

Answer 2

Have you tried iterating over the array and getting the text from each element, like this?

[x.get_text() for x in soup.find_all(Type = "Character")]

The array itself (a ResultSet) has no get_text() method, but each of its elements should.
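
A minimal sketch of that difference, under stated assumptions: it parses with html.parser, which lowercases attribute names, so the lookup below uses the lowercase key "type". With an XML parser that preserves case, the Type= keyword form from the question would apply instead. The sample data is a shortened copy of the question's XML.

```python
from bs4 import BeautifulSoup

xml_data = """
<Paragraph Type="Character"><Text> TED </Text></Paragraph>
<Paragraph Type="Dialogue"><Text> I thought we had a rule against that. </Text></Paragraph>
<Paragraph Type="Character"><Text> ANNIE </Text></Paragraph>
<Paragraph Type="Dialogue"><Text> ...oh. </Text></Paragraph>
"""

# Assumption: html.parser stores the attribute as "type" (lowercase).
soup = BeautifulSoup(xml_data, "html.parser")
characters = soup.find_all(attrs={"type": "Character"})

# Calling get_text() on the whole result set raises AttributeError...
try:
    characters.get_text()
    raised = False
except AttributeError:
    raised = True

# ...but each element in it is a Tag, which does have get_text().
names = [tag.get_text(strip=True) for tag in characters]
print(raised, names)
```

Passing strip=True removes the whitespace and newlines that surround the text inside each Text element.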

Answer 3

Andrej Kesely's answer is what I was looking for: https://stackoverflow.com/a/57484760/8742237

In case someone reading this question in the future is a beginner, here is my attempt at breaking it down:

list1 = [x.get_text(strip = True) for x in soup.select('[Type="Character"]')]
print(list1)

list2 = [x.get_text(strip = True) for x in soup.select('[Type="Dialogue"]')]
print(list2)

zip1 = zip(list1, list2)
print(list(zip1))
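
To go from the zipped pairs to the two-column layout shown in the question, something like the following sketch could work. The lists are hard-coded stand-ins for list1 and list2, and format_table is a hypothetical helper name, not part of BeautifulSoup.

```python
# Stand-ins for list1 and list2 from the breakdown above.
characters = ["TED", "ANNIE"]
dialogues = ["I thought we had a rule against that.", "...oh."]


def format_table(names, lines):
    """Return the aligned two-column table as a list of strings."""
    # Column width: the longest entry in the first column plus two spaces.
    width = max([len("Character")] + [len(n) for n in names]) + 2
    rows = [f"{'Character':<{width}}Dialogue", ""]
    # zip() pairs items by position and stops at the shorter list.
    for name, line in zip(names, lines):
        rows.append(f"{name:<{width}}{line}")
    return rows


table = format_table(characters, dialogues)
print("\n".join(table))
```

Computing the width from the data, rather than hard-coding 10, keeps the columns aligned if a character name is longer than the header.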

BeautifulSoup: "find_all" and "get_text"
