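I am trying to parse a Lotus Notes document link (taken from the clipboard) in order to convert it to a notes:// URL/URI. Of the clipboard formats available, getting the data in text format seems to be the easiest one to convert. However, the link looks like very badly formed XML, and lxml loses information when parsing it.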
data = """Name - Enc: Injeção
<NDL>
<REPLICA 83257B7B:00608A81>
<VIEW OFDCBCE5C7:007D345D-ON882572F4:00650240>
<NOTE OFD18FCA06:36A9EDA2-ON83257F6A:004E31C1>
<HINT>CN=SERV101/OU=RJ/OU=C/O=Company</HINT>
<REM>Database 'Name', View 'Inbox', Document 'Enc: Injeção'</REM>
</NDL>
"""
from lxml import html, etree
title, ndl = html.fragments_fromstring(data)
replica = ndl[0]
view = replica[0]
print replica.attrib
print view.attrib
print html.tostring(ndl)
This prints:
{}
{'ofdcbce5c7:007d345d-on882572f4:00650240': ''}
<ndl>
<replica>
<view ofdcbce5c7:007d345d-on882572f4:00650240>
<note ofd18fca06:36a9eda2-on83257f6a:004e31c1>
<hint>CN=SERV101/OU=RJ/OU=C/O=Company</hint>
<rem>Database 'Name', View 'Inbox', Document 'Enc: Injeção'</rem>
</note></view></replica></ndl>
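So I lose the information in the REPLICA tag, even though I still get some of the information from the VIEW tag (I suspect the hyphen makes the difference here).
So, is there a way to get all of the data with lxml, or do I have to fall back to regular expressions?
Environment info: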
Answer 0 (score: 0)
You may have better luck with bs4:
data = """Name - Enc: Injeção
<NDL>
<REPLICA 83257B7B:00608A81>
<VIEW OFDCBCE5C7:007D345D-ON882572F4:00650240>
<NOTE OFD18FCA06:36A9EDA2-ON83257F6A:004E31C1>
<HINT>CN=SERV101/OU=RJ/OU=C/O=Company</HINT>
<REM>Database 'Name', View 'Inbox', Document 'Enc: Injeção'</REM>
</NDL>
"""
from lxml.etree import fromstring, HTMLParser
xml = fromstring(data, HTMLParser())
r = xml.xpath("//replica")
from bs4 import BeautifulSoup
soup = BeautifulSoup(data,"html.parser")
title = next(soup.find("ndl").previous_elements)
print(title)
print(soup.find("replica").attrs)
print(soup.find("view"))
This gives you:
Name - Enc: Injeção
{u'83257b7b:00608a81': ''}
<view ofdcbce5c7:007d345d-on882572f4:00650240="">
<note ofd18fca06:36a9eda2-on83257f6a:004e31c1="">
<hint>CN=SERV101/OU=RJ/OU=C/O=Company</hint>
<rem>Database 'Name', View 'Inbox', Document 'Enc: Injeção'</rem>
</note></view>
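Note that both parsers treat the IDs as attribute names and lowercase them, so if you need the values exactly as they appear in the clipboard text, a regular expression may be the simpler route after all. Below is a minimal sketch of that approach. The helper names (`parse_ndl`, `to_hex_id`, `to_notes_url`) are made up for illustration, and the URL layout `notes://server/replica/view/note`, the stripping of the `OF`/`ON` prefixes, and taking the server from the `CN=` part of the HINT are assumptions on my part, so check the result against a link that Notes itself produces.

```python
import re

data = """Name - Enc: Injeção
<NDL>
<REPLICA 83257B7B:00608A81>
<VIEW OFDCBCE5C7:007D345D-ON882572F4:00650240>
<NOTE OFD18FCA06:36A9EDA2-ON83257F6A:004E31C1>
<HINT>CN=SERV101/OU=RJ/OU=C/O=Company</HINT>
<REM>Database 'Name', View 'Inbox', Document 'Enc: Injeção'</REM>
</NDL>
"""


def parse_ndl(text):
    """Return the NDL fields as a dict, keeping the original case."""
    fields = {}
    # <REPLICA ...>, <VIEW ...>, <NOTE ...>: the value sits inside the tag.
    for name, value in re.findall(r"<(REPLICA|VIEW|NOTE)\s+([^>]+)>", text):
        fields[name.lower()] = value.strip()
    # <HINT>...</HINT> and <REM>...</REM>: the value is the element text.
    for name, value in re.findall(r"<(HINT|REM)>(.*?)</\1>", text, re.S):
        fields[name.lower()] = value.strip()
    return fields


def to_hex_id(value):
    """Assumption: dropping the OF/ON prefixes, ':' and '-' yields the hex ID."""
    return re.sub(r"O[FN]|[:-]", "", value)


def to_notes_url(fields):
    """Assumption: the URL layout is notes://server/replica/view/note."""
    match = re.match(r"CN=([^/]+)", fields.get("hint", ""))
    server = match.group(1) if match else ""
    ids = [to_hex_id(fields[key]) for key in ("replica", "view", "note")]
    return "notes://{}/{}".format(server, "/".join(ids))


fields = parse_ndl(data)
print(fields["replica"])
print(to_notes_url(fields))
```

The letter `O` never occurs in a hexadecimal ID, so removing `OF` and `ON` only touches the prefixes.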