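I've run into an annoying problem with the lxml library and can't figure out how to get around it.
I have a list of lxml.etree._ElementTree trees and a list of lxml.html.HtmlElement elements that belong to those trees. The corresponding paths are stored in a list called paths: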
element_found = [True if len(tree.xpath(path)) > 0 else False for tree,path in zip(trees,paths)]
print(element_found.count(False)) # == 0
The problem appears when I try to save the paths and trees so that I can restore this state later:
trees_to_save = [{'tree': lxml.etree.tostring(tree, pretty_print=True)} for tree in trees]
t2sdf = pd.DataFrame(trees_to_save)
t2sdf.to_csv('trees.csv')
EncodeForamt = lxml.html.HTMLParser(encoding='utf-8')
trees_from_file = pd.read_csv('trees.csv')
trees_from_file['tree'] = trees_from_file['tree'].apply(lambda x: etree.HTML(literal_eval(x),EncodeForamt).getroottree())
Then I run the same test:
element_found = [True if len(tree.xpath(path)) > 0 else False for tree,path in zip(trees_from_file['tree'],paths)]
print(element_found.count(False)) # == 6 (out of 12k)
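Almost all of the paths are still found, so the problem evidently lies either in the tostring/fromstring round trip or in how I am saving the trees. I have tried various approaches within lxml: tree.write instead of tostring, plain .encode('utf-8') instead of literal_eval (which did not work), with and without pretty_print, and etree.fromstring(). The result is always the same.
Worryingly, this also raises an XML syntax error: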
trees = [etree.fromstring(etree.tostring(t)) for t in trees]
I'm somewhat lost as to how to save these trees properly.
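For context on the literal_eval step used above, here is a minimal sketch of what happens to a bytes value when it is written to a CSV cell. It uses the standard-library csv module as a stand-in for pandas (an assumption: the question only shows pandas, and the sample markup is made up for illustration). The point is that the cell ends up holding the text of the bytes literal, which is why a plain .encode('utf-8') does not give back the original markup.

```python
import csv
import io
from ast import literal_eval

# Hypothetical sample markup; the real trees come from the scraped pages.
markup = b"<p>caf\xc3\xa9</p>\n"

# Write the bytes object into a CSV cell. The csv module stores str(markup),
# i.e. the text of the bytes literal, not the raw bytes.
buffer = io.StringIO()
csv.writer(buffer).writerow([markup])
buffer.seek(0)
cell = next(csv.reader(buffer))[0]

# Encoding the cell keeps the leading b' and the escape sequences as text.
naive = cell.encode("utf-8")

# literal_eval evaluates the literal and returns the original bytes.
restored = literal_eval(cell)

print(cell.startswith("b'"), naive == markup, restored == markup)
```

If the markup is serialized to str instead of bytes before saving, the literal_eval step should not be needed, though that variant is not what the question uses.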
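Answer 0 (score: 1)
OK, after trying everything I could find, I worked out how to make this work: serialize with method='html' and load the trees with parse rather than fromstring: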
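from ast import literal_eval
from io import BytesIO
import lxml.etree
import pandas as pd
# Serialize each tree as HTML (not the default XML) into UTF-8 bytes.
trees_to_save = [{'tree': lxml.etree.tostring(tree,encoding='utf-8',method='html')} for tree in trees]
t2sdf = pd.DataFrame(trees_to_save)
t2sdf.to_csv('location_trees.csv')
trees_from_file = pd.read_csv('location_trees.csv')
EncodeForamt = lxml.etree.HTMLParser(encoding='utf-8')
# Each cell holds the text of a bytes literal (b'...'); turn it back into bytes and give parse() a file-like object.
trees_from_file['tree'] = trees_from_file['tree'].apply(lambda x: lxml.etree.parse(BytesIO(literal_eval(x)),parser=EncodeForamt))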
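The round trip can be checked on a single small document before running it over all the trees. The sketch below is only an illustration under stated assumptions: the sample HTML, the id "x", and the variable names are made up, and the CSV step is left out so that only the serialize/parse pair is exercised. It also prints nothing about why the six paths went missing in the question, since the source does not say which construct caused them; it only shows that the default XML serialization and method='html' produce different bytes, and that the HTML form parses back to a tree in which the stored path still resolves.

```python
from io import BytesIO

from lxml import etree, html

# Hypothetical sample document standing in for one scraped page.
doc = html.fromstring(
    "<html><body><div><span></span><p>first</p>"
    "<p id='x'>second</p></div></body></html>"
)
tree = doc.getroottree()

# Record the path of one element, as the question does for its 'paths' list.
path = tree.getpath(doc.get_element_by_id("x"))

# Default serialization is XML: the empty span is written as <span/>.
as_xml = etree.tostring(tree)
# method='html' writes the empty span with separate open and close tags.
as_html = etree.tostring(tree, encoding="utf-8", method="html")

# Parse the HTML bytes back through a file-like object with an HTML parser.
parser = etree.HTMLParser(encoding="utf-8")
restored = etree.parse(BytesIO(as_html), parser=parser)

# The stored path should still select the same element in the restored tree.
found = restored.xpath(path)
print(path, len(found))
```

Running the same check with each real tree and its path, before writing anything to disk, is one way to separate serialization problems from CSV problems.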