I have this example XML file:
<page>
<title>Chapter 1</title>
<content>Welcome to Chapter 1</content>
</page>
<page>
<title>Chapter 2</title>
<content>Welcome to Chapter 2</content>
</page>
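I want to extract the contents of the title and content tags.
Which approach should I use to extract the data: pattern matching or an XML module? Or is there a better way?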
Answer 0 (score: 17)
Python already ships with built-in XML libraries, notably ElementTree. For example:
>>> from xml.etree import ElementTree as ET
>>> xmlstr = """
... <root>
... <page>
... <title>Chapter 1</title>
... <content>Welcome to Chapter 1</content>
... </page>
... <page>
... <title>Chapter 2</title>
... <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in root:
...     title = page.find('title').text
...     content = page.find('content').text
...     print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2
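Note that the file in the question has several top-level <page> elements and no single root element, so ET.fromstring would reject it as written. A minimal sketch of one workaround, assuming the fragment fits in memory and carries no XML declaration; the helper name parse_pages and the synthetic <root> wrapper are hypothetical, not part of the original answer:

```python
from xml.etree import ElementTree as ET

# The question's data: several <page> elements with no enclosing root.
fragment = """
<page>
<title>Chapter 1</title>
<content>Welcome to Chapter 1</content>
</page>
<page>
<title>Chapter 2</title>
<content>Welcome to Chapter 2</content>
</page>
"""

def parse_pages(text):
    """Wrap a rootless fragment in a synthetic <root>, return (title, content) pairs."""
    root = ET.fromstring("<root>" + text + "</root>")
    return [(page.findtext("title"), page.findtext("content"))
            for page in root.findall("page")]

pages = parse_pages(fragment)
for title, content in pages:
    print("title: %s; content: %s" % (title, content))
```

Wrapping the text keeps the parser strict about everything else, which is usually safer than falling back to regular expressions.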
Answer 1 (score: 1)
I personally prefer xml.dom.minidom for parsing, like this:
In [18]: import xml.dom.minidom
In [19]: x = """\
<root><page>
<title>Chapter 1</title>
<content>Welcome to Chapter 1</content>
</page>
<page>
<title>Chapter 2</title>
<content>Welcome to Chapter 2</content>
</page></root>"""
In [28]: doc = xml.dom.minidom.parseString(x)
In [29]: doc.getElementsByTagName("page")
Out[30]: [<DOM Element: page at 0x94d5acc>, <DOM Element: page at 0x94d5c8c>]
In [32]: [p.firstChild.wholeText for p in doc.getElementsByTagName("title") if p.firstChild.nodeType == p.TEXT_NODE]
Out[33]: ['Chapter 1', 'Chapter 2']
In [34]: [p.firstChild.wholeText for p in doc.getElementsByTagName("content") if p.firstChild.nodeType == p.TEXT_NODE]
Out[35]: ['Welcome to Chapter 1', 'Welcome to Chapter 2']
In [36]: for node in doc.childNodes:  # the <root> element
    if node.hasChildNodes():  # hasChildNodes is a method and must be called
        for cn in node.childNodes:  # <page> elements and whitespace text
            if cn.hasChildNodes():
                for cn2 in cn.childNodes:  # <title>/<content> and whitespace text
                    for cn3 in cn2.childNodes:  # text inside <title>/<content>
                        if cn3.nodeType == cn3.TEXT_NODE:
                            print(cn3.wholeText)
Chapter 1
Welcome to Chapter 1
Chapter 2
Welcome to Chapter 2
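The separate title and content lists above lose the link between a title and its page. As a rough sketch of keeping them paired with minidom, assuming every <page> holds at most one <title> and one <content>; the helper child_text is a hypothetical name introduced here:

```python
import xml.dom.minidom

xml_text = """<root><page>
<title>Chapter 1</title>
<content>Welcome to Chapter 1</content>
</page>
<page>
<title>Chapter 2</title>
<content>Welcome to Chapter 2</content>
</page></root>"""

def child_text(parent, tag):
    """Return the text of the first <tag> under parent, or None if absent or empty."""
    elements = parent.getElementsByTagName(tag)
    if not elements or elements[0].firstChild is None:
        return None
    return elements[0].firstChild.data

doc = xml.dom.minidom.parseString(xml_text)
# Build one (title, content) tuple per <page> element.
records = [(child_text(page, "title"), child_text(page, "content"))
           for page in doc.getElementsByTagName("page")]
print(records)
```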
Answer 2 (score: 0)
You can also try the following code to extract the text:
from bs4 import BeautifulSoup
data ="""<page>
<title>Chapter 1</title>
<content>Welcome to Chapter 1</content>
</page>
<page>
<title>Chapter 2</title>
<content>Welcome to Chapter 2</content>
</page>"""
soup = BeautifulSoup(data, "html.parser")
# Collect the text of every <title> element
titles = [tag.get_text() for tag in soup.find_all("title")]
# Collect the text of every <content> element
contents = [tag.get_text() for tag in soup.find_all("content")]
# Pair each title with its content, in document order
for pair in zip(titles, contents):
    print(pair)
Output:
('Chapter 1', 'Welcome to Chapter 1')
('Chapter 2', 'Welcome to Chapter 2')
Answer 3 (score: 0)
Code:
from xml.etree import ElementTree as ET
# test.xml must wrap the <page> elements in a single root element
tree = ET.parse("test.xml")
root = tree.getroot()
for page in root.findall('page'):
    print("Title:", page.find('title').text)
    print("Content:", page.find('content').text)
Output:
Title: Chapter 1
Content: Welcome to Chapter 1
Title: Chapter 2
Content: Welcome to Chapter 2
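One caveat with page.find('title').text: if a page lacks the element, find returns None and the attribute access raises AttributeError. A small sketch using findtext with a default instead, assuming an empty string is an acceptable placeholder; the sample input with a missing <content> is made up for illustration:

```python
from xml.etree import ElementTree as ET

# Hypothetical input: the second page has no <content> element.
xml_text = """<root>
<page><title>Chapter 1</title><content>Welcome to Chapter 1</content></page>
<page><title>Chapter 2</title></page>
</root>"""

root = ET.fromstring(xml_text)
rows = []
for page in root.findall("page"):
    # findtext returns the default when the child element is missing.
    title = page.findtext("title", default="")
    content = page.findtext("content", default="")
    rows.append((title, content))
    print("Title:", title)
    print("Content:", content)
```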
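Answer 4 (score: 0)
For processing (navigating, searching, and modifying) XML or HTML data, I have found the Beautiful Soup library very useful. For installation instructions and further details, see the Beautiful Soup documentation.
To find attributes (tags) or multi-attribute values: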
from bs4 import BeautifulSoup
data = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.48.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="246" left="135" width="178" height="16" font="1">PALS SOCIETY OF CANADA</text>
<text top="261" width="86" height="16" font="1">13479 77 AVE</text>
</page>
</pdf2xml>"""
soup = BeautifulSoup(data, "lxml")  # requires the lxml package
page_tags = soup.find_all('page')
# Iterate over the <text> elements of the first page
for text_tag in page_tags[0].find_all('text'):
    print("Text :", text_tag.text)
    # .get() returns None when the attribute is missing
    print("Left tag :", text_tag.get("left"))
Output:
Text : PALS SOCIETY OF CANADA
Left tag : 135
Text : 13479 77 AVE
Left tag : None
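For comparison, the same attribute lookup can be sketched with the standard library alone, which avoids the bs4 and lxml dependencies. This is an illustrative equivalent rather than part of the original answer; the XML declaration and DOCTYPE are left out of the sample to keep it short:

```python
from xml.etree import ElementTree as ET

data = """<pdf2xml producer="poppler" version="0.48.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="246" left="135" width="178" height="16" font="1">PALS SOCIETY OF CANADA</text>
<text top="261" width="86" height="16" font="1">13479 77 AVE</text>
</page>
</pdf2xml>"""

root = ET.fromstring(data)
first_page = root.find("page")
# Element.get returns None when the attribute is absent.
results = [(t.text, t.get("left")) for t in first_page.findall("text")]
for text, left in results:
    print("Text :", text)
    print("Left tag :", left)
```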
Answer 5 (score: 0)
I'd like to recommend a simple library. Here is an example: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
Code:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<page>
<title>Chapter 1</title>
<content>Welcome to Chapter 1</content>
</page>
<page>
<title>Chapter 2</title>
<content>Welcome to Chapter 2</content>
</page>'''
doc = SimplifiedDoc(html)
pages = doc.pages
print([(page.title.text, page.content.text) for page in pages])