Question

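I have a dataset of Evernote data (see my earlier question). The tags for each note include (title, created, updated, latitude, longitude, mime, timestamp, file name). I am able to extract these specific elements as lists, and that is where my problem lies. First, I use BeautifulSoup to declare each tag as a variable: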

from bs4 import BeautifulSoup  # assumes the bs4 package; the 'xml' parser needs lxml installed

soup = BeautifulSoup(open('myNotes.xml', 'r'), 'xml')
title = soup.findAll('title')
created = soup.findAll('created')
updated = soup.findAll('updated')
latitude = soup.findAll('latitude')
longitude = soup.findAll('longitude')
mime = soup.findAll('mime')
timestamp = soup.findAll('timestamp')

all = title + created  # note: this name shadows the built-in all()
print(all)

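This prints all the results for one tag and then moves on to the next. Every note element contains all of these tags, and I would like each row to be printed with all of the preceding tags together, so that each note's list stays intact.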

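The idea is for it to display as:
note: (title, created, updated, latitude, longitude, mime, timestamp, file name)
note: (title, created, updated, latitude, longitude, mime, timestamp, file name)
note: (title, created, updated, latitude, longitude, mime, timestamp, file name)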

Instead of: title title title, created created created, latitude latitude latitude, longitude... you get the picture.

When I print all, this is some of the output I get - <title> UX observation </title> , <title> UI framework. </title> , <title> Attachment:AudioNote-2011-04-04_092442.amr </title> , <title> Snapshot </title> , <title> Tableau </title> , <title> Jquery plugins. </title> , <title> Sacred geometry </title> , <title> Audio from 625 Hyde St in San Francisco </title> , <title> Potential coding resources </title>

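It prints all of the title tags first, then moves on to the created tags and does the same. The problem is that I lose the row that ties each note's tags together. I want each title displayed with its corresponding created value, updated value and so on in one row (as a single note), before moving on to the next set. I hope that clarifies things.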

Answer 1

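Rather than searching for every match of several tags across the whole document, you should look for the main tag you want and then parse its children. Consider the following Evernote XML export: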

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export3.dtd">
<en-export export-date="20131123T093001Z" application="Evernote" version="Evernote Mac 5.4.3 (402231)">
<note><title>Untitled Note</title><content><![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">
<en-note style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">
Test Entry
<div>Another Entry</div>
<div>Yet Another Entry<span style="-evernote-last-insertion-point:true;"/></div>
</en-note>
]]></content><created>20131123T092930Z</created><updated>20131123T092953Z</updated><note-attributes><author>Steven Parker</author><reminder-order>0</reminder-order></note-attributes></note>
</en-export>

You can parse a note like this:

from lxml import etree
with open(u'/path/to/my_evernote_file.enex', 'rb') as src_file:
    my_xml_file = src_file.read()
root = etree.fromstring(my_xml_file)

Now that you have access to the root node, you can find the note elements, which are what you are after:

for note in root.xpath('//note'):  # Locate all tags under the root named 'note'. There's one.
    my_values = dict(
        title = note.xpath('title'),
        created = note.xpath('created'),
        updated = note.xpath('updated'),
        latitude = note.xpath('latitude'),
        longitude = note.xpath('longitude'),
        mime = note.xpath('mime'),
        timestamp = note.xpath('timestamp'),
    )

For my example file, my_values now looks like this:

{'created': [<Element created at 0x10b53a410>],  # notice, it's a list of matching children tags.
 'latitude': [],  # My example didn't contain these keys!
 'longitude': [],
 'mime': [],
 'timestamp': [],
 'title': [<Element title at 0x10b53a3c0>],
 'updated': [<Element updated at 0x10b53a460>]}
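To get from those lists of elements to one row per note, which is what the question asks for, something like the following sketch may help. It is not part of the original answer. The two-note sample and its values are hypothetical, the `COLUMNS` mapping and `note_rows` helper are names introduced only for illustration, and placing latitude and longitude under `note-attributes` is an assumption about how the export nests them. The sketch prefers lxml and falls back to the standard library, since it only uses `fromstring`, `iter` and `findtext`, which both provide.

```python
import csv
import io

try:
    from lxml import etree  # the library used in this answer
except ImportError:
    import xml.etree.ElementTree as etree  # same calls are available here

# Hypothetical two-note export, trimmed to the fields of interest.
SAMPLE = b"""<en-export>
<note><title>UX observation</title><created>20110404T092442Z</created><updated>20110405T101500Z</updated><note-attributes><latitude>37.78</latitude><longitude>-122.41</longitude></note-attributes></note>
<note><title>Snapshot</title><created>20110406T120000Z</created></note>
</en-export>"""

# Column name -> path relative to each <note>. The nested paths are an assumption.
COLUMNS = {
    "title": "title",
    "created": "created",
    "updated": "updated",
    "latitude": "note-attributes/latitude",
    "longitude": "note-attributes/longitude",
}


def note_rows(root, columns=COLUMNS):
    """Return one tuple per <note>; a missing field becomes None."""
    return [
        tuple(note.findtext(path) for path in columns.values())
        for note in root.iter("note")
    ]


rows = note_rows(etree.fromstring(SAMPLE))

# Write the rows as CSV so each note stays on its own line.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(COLUMNS.keys())
writer.writerows(rows)  # None is written as an empty cell
csv_text = buffer.getvalue()
print(csv_text)
```

Because `findtext` returns None for a missing element, every row has the same number of columns even when a note lacks some fields.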

Besides looking up specific items, you can also iterate over all of a note's child tags, like this:

for note in root.xpath('//note'):  # Locate all tags under the root named 'note'. There's one.
    for child in note:  # iterating an element yields its direct children
        print(child.tag, repr(child.text))

This outputs something like the following:

title 'Untitled Note'
content '<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">\n<en-note style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">\nTest Entry\n<div>Another Entry</div>\n<div>Yet Another Entry<span style="-evernote-last-insertion-point:true;"/></div>\n</en-note>\n'
created '20131123T092930Z'
updated '20131123T092953Z'
note-attributes None
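The `note-attributes None` line shows that a container element has no text of its own; its values live in its children. A minimal sketch of one way to reach them, not taken from the original answer, is to walk all descendants and keep only the leaf elements. The `flatten_note` helper is a hypothetical name, and the sample is the note from the export above with its content element left out for brevity.

```python
try:
    from lxml import etree  # the library used in this answer
except ImportError:
    import xml.etree.ElementTree as etree  # same calls are available here

# The example note from above, without its <content> element.
NOTE_XML = (
    b"<note><title>Untitled Note</title>"
    b"<created>20131123T092930Z</created>"
    b"<updated>20131123T092953Z</updated>"
    b"<note-attributes><author>Steven Parker</author>"
    b"<reminder-order>0</reminder-order></note-attributes></note>"
)


def flatten_note(note):
    """Map each leaf descendant's tag to its text, skipping containers."""
    values = {}
    for elem in note.iter():
        # Skip comments/processing instructions and any element that has children.
        if not isinstance(elem.tag, str) or len(elem) > 0:
            continue
        values[elem.tag] = elem.text
    return values


flat = flatten_note(etree.fromstring(NOTE_XML))
for tag, text in flat.items():
    print(tag, repr(text))
```

One limitation of this design: if the same tag appears more than once inside a note, the later value overwrites the earlier one, so repeated elements would need a list per tag instead.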

Hopefully this helps point you in the right direction!

Extract every instance of a set of tags from an XML file and parse them into columns
