Question

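I have this sample XML file:

<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
  <author>John Smith</author>
</page>
<page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
  <author>John Doe</author>
</page>

The XML may be nested more than two levels deep and may contain other tags. I want to extract all of the text except the text inside "content" tags, so that I end up with a list of strings like this:

['Chapter 1', 'John Smith', 'Chapter 2', 'John Doe']

I am using ElementTree for this task. Is there an elegant, clean solution?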

Answer 1

import bs4

xml = '''<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
  <author>John Smith</author>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
 <author>John Doe</author>
</page>'''

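# 'lxml' selects lxml's lenient HTML parser (requires the lxml package),
# which accepts the two top-level <page> elements in this snippet.
soup = bs4.BeautifulSoup(xml, 'lxml')

# soup('page') is shorthand for soup.find_all('page').
print([(page.title.text, page.author.text) for page in soup('page')])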

Out:

[('Chapter 1', 'John Smith'), ('Chapter 2', 'John Doe')]
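The question asks for a flat list of strings rather than a list of tuples. A minimal sketch of flattening the result above, assuming the pairs have already been collected (the names `pairs` and `flat` are only illustrative):

```python
# Result of the list comprehension shown above.
pairs = [('Chapter 1', 'John Smith'), ('Chapter 2', 'John Doe')]

# Flatten the (title, author) tuples into a single list of strings.
flat = [text for pair in pairs for text in pair]
print(flat)  # ['Chapter 1', 'John Smith', 'Chapter 2', 'John Doe']
```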

BeautifulSoup is used here to parse the XML; see its documentation for reference.
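Since the question mentions ElementTree and arbitrary nesting, here is a rough sketch of one possible approach using only the standard library. It is an illustration, not part of the answer above: the helper name `texts_except` and the wrapping `<book>` root are assumptions. The wrapper is needed because ElementTree requires a single root element, which the snippet with two top-level `<page>` elements does not have.

```python
import xml.etree.ElementTree as ET

# Hypothetical <book> root added so the sample is well-formed XML.
xml = '''<book>
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
  <author>John Smith</author>
</page>
<page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
  <author>John Doe</author>
</page>
</book>'''


def texts_except(element, skip=("content",)):
    """Yield non-blank text in document order, skipping subtrees whose tag is in skip."""
    if element.tag in skip:
        return
    text = (element.text or "").strip()
    if text:
        yield text
    for child in element:
        # Recurse so that any nesting depth is handled.
        yield from texts_except(child, skip)
        # Text after a child's closing tag belongs to the parent element.
        tail = (child.tail or "").strip()
        if tail:
            yield tail


root = ET.fromstring(xml)
result = list(texts_except(root))
print(result)  # ['Chapter 1', 'John Smith', 'Chapter 2', 'John Doe']
```

Because the walk is recursive, a skipped tag hides its entire subtree, and other tag names can be excluded by passing a different `skip` tuple.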

Python: extract XML text except for certain tags
