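I have a dataset of Evernote data (see my previous question). The list of tags for each note includes (title, created, updated, latitude, longitude, mime, timestamp, filename). I am able to extract these particular elements as lists, and therein lies my problem. First, I declare each of the tags as a variable using BeautifulSoup:
soup = BeautifulSoup(open('myNotes.xml','r'))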
title = soup.findAll('title')
created = soup.findAll('created')
updated = soup.findAll('updated')
latitude = soup.findAll('latitude')
longitude = soup.findAll('longitude')
mime = soup.findAll('mime')
timestamp = soup.findAll('timestamp')
all = title + created
print(all)
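This prints all the results for each tag and then moves on to the next tag. Each note element contains all of these tags, and I would like it to print one row per note with all of the preceding tags, to keep the integrity of each note's list.
The idea is to get it to display as:
Note: (title, created, updated, latitude, longitude, mime, timestamp, filename)
Note: (title, created, updated, latitude, longitude, mime, timestamp, filename)
Note: (title, created, updated, latitude, longitude, mime, timestamp, filename)
Instead of:
title title title, created created created, latitude latitude latitude, longitude... you get the picture.
Here is some of my data when I print all: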
<title>
UX observation
</title>
,
<title>
UI framework.
</title>
,
<title>
Attachment:AudioNote-2011-04-04_092442.amr
</title>
,
<title>
Snapshot
</title>
,
<title>
Tableau
</title>
,
<title>
Jquery plugins.
</title>
,
<title>
Sacred geometry
</title>
,
<title>
Audio from 625 Hyde St in San Francisco
</title>
,
<title>
Potential coding resources
</title>
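It prints all of the title tags first, then moves on to the created tags and does the same. The problem is that I lose the row that contains each tag. I'd like each title to be displayed with its corresponding created value on one line (as a single note), and then move on to the next set. I hope that clarifies things.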
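Answer 0 (score: 0)
Rather than searching for every match of several tags, you should find the main tag you want and then parse its children. Consider the following Evernote XML export: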
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export3.dtd">
<en-export export-date="20131123T093001Z" application="Evernote" version="Evernote Mac 5.4.3 (402231)">
<note><title>Untitled Note</title><content><![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">
<en-note style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">
Test Entry
<div>Another Entry</div>
<div>Yet Another Entry<span style="-evernote-last-insertion-point:true;"/></div>
</en-note>
]]></content><created>20131123T092930Z</created><updated>20131123T092953Z</updated><note-attributes><author>Steven Parker</author><reminder-order>0</reminder-order></note-attributes></note>
</en-export>
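You can parse the notes like this:
from lxml import etree

# Read the export as bytes so the parser can honor the XML encoding declaration.
with open(u'/path/to/my_evernote_file.enex', 'rb') as src_file:
    my_xml_file = src_file.read()

root = etree.fromstring(my_xml_file)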
Now that you have access to the root node, you can locate the note tag elements, which are what you're after:
for note in root.xpath('//note'):  # Locate all tags under the root named 'note'. There's one.
    my_values = dict(
        title=note.xpath('title'),
        created=note.xpath('created'),
        updated=note.xpath('updated'),
        latitude=note.xpath('latitude'),
        longitude=note.xpath('longitude'),
        mime=note.xpath('mime'),
        timestamp=note.xpath('timestamp'),
    )
my_values now looks like this for my example file:
{'created': [<Element created at 0x10b53a410>], # notice, it's a list of matching children tags.
'latitude': [], # My example didn't contain these keys!
'longitude': [],
'mime': [],
'timestamp': [],
'title': [<Element title at 0x10b53a3c0>],
'updated': [<Element updated at 0x10b53a460>]}
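Because each value in my_values is a list of elements rather than text, getting one row per note takes one more step. Here is a minimal sketch of that step, assuming Python 3 and using the standard library's xml.etree.ElementTree in place of lxml so it runs without extra packages (lxml elements offer findtext as well). The SAMPLE string with its dates, the FIELDS tuple, and the note_rows helper are all made up for illustration, and findtext('tag') only looks at direct children of the note.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-note export, trimmed down for illustration only.
SAMPLE = """<en-export>
<note><title>UX observation</title><created>20110404T092442Z</created><updated>20110405T101500Z</updated></note>
<note><title>Snapshot</title><created>20110406T120000Z</created></note>
</en-export>"""

# Assumed subset of fields; extend it with whatever your export contains.
FIELDS = ('title', 'created', 'updated')


def note_rows(root, fields=FIELDS):
    """Return one tuple per <note>, with None for any missing field."""
    rows = []
    for note in root.iter('note'):
        # findtext() gives the text of the first matching direct child,
        # or None when the note has no such child.
        rows.append(tuple(note.findtext(field) for field in fields))
    return rows


rows = note_rows(ET.fromstring(SAMPLE))
for row in rows:
    print('Note:', row)
```

Each tuple keeps the values of a single note together, which is the row-per-note layout asked for in the question; a missing tag shows up as None instead of shifting the other columns.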
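Rather than looking for specific items, you can also iterate over all the child tags of a note like so:
for note in root.xpath('//note'):  # Locate all tags under the root named 'note'. There's one.
    for child in note:  # Iterating over an element yields its direct children.
        print(child.tag, repr(child.text))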
This outputs something like:
title 'Untitled Note'
content '<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">\n<en-note style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">\nTest Entry\n<div>Another Entry</div>\n<div>Yet Another Entry<span style="-evernote-last-insertion-point:true;"/></div>\n</en-note>\n'
created '20131123T092930Z'
updated '20131123T092953Z'
note-attributes None
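note-attributes shows None because its data lives in child elements rather than in its own text; in the sample export, author and reminder-order sit inside it. Fields such as latitude and longitude may be nested the same way (that is an assumption, so check your own export), in which case a direct-child lookup comes back empty. Below is a minimal sketch, assuming Python 3 and using the standard library's xml.etree.ElementTree as a stand-in for lxml, that searches all descendants with './/'. The SAMPLE string, its coordinates, and the helper name find_nested are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical note: the latitude/longitude placement is an assumption modelled
# on how <author> is nested inside <note-attributes> in the sample export.
SAMPLE = """<en-export>
<note><title>Untitled Note</title><created>20131123T092930Z</created>
<note-attributes><latitude>37.78</latitude><longitude>-122.41</longitude><author>Steven Parker</author></note-attributes>
</note>
</en-export>"""


def find_nested(note, tag):
    """Return the text of the first descendant named `tag`, or None if absent."""
    return note.findtext('.//' + tag)


note = ET.fromstring(SAMPLE).find('note')
direct = note.findtext('latitude')      # looks at direct children only
nested = find_nested(note, 'latitude')  # looks at any depth below the note
print(direct, nested)
```

With lxml, the equivalent descendant query on a note element would be note.xpath('.//latitude').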
Hope this helps point you in the right direction!