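I generate a list of common IDs by comparing two sets of IDs (the ID sets come from a dictionary of {ID: XML "RECORD" element}). Once I have the common list, I want to iterate over it and retrieve the value for each ID from the dictionary (which I then write to disk).
When I compute the common ID list with my diff_comm_checker function, I cannot retrieve the dictionary values for those IDs. It does not fail with a KeyError, though, and I can print the IDs.
When I hard-code the ID as the common_ids value, I can retrieve the dict values.
That is: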
common_ids = diff_comm_checker( list_1, list_2, "text")
# does nothing - no failures
common_ids = ['0603599998140032MB']
#gives me:
0603599998140032MB {'R': '0603599998140032MB'} <Element 'RECORD' at 0x04ACE788>
0603599998140032MB {'R': '0603599998140032MB'} <Element 'RECORD' at 0x04ACE3E0>
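So I suspected there was some difference between the strings. I checked the output of the function and compared it with the hard-coded value using the following: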
print [(_id, type(_id), repr(_id)) for _id in common_ids][0]
I get exactly the same result for both:
>>> ('0603599998140032MB', <type 'str'>, "'0603599998140032MB'")
I also followed a suggestion from another question and used difflib.ndiff:
common_ids1 = diff_comm_checker( [x.keys() for x in to_write[0]][0], [x.keys() for x in to_write[1]][0], "text")
common_ids = ['0603599998140032MB']
print "\n".join(difflib.ndiff(common_ids1, common_ids))
>>> 0603599998140032MB
Again, there appears to be no difference between the two.
Here is a complete code example:
from StringIO import StringIO
import xml.etree.cElementTree as ET
from itertools import chain, islice
def diff_comm_checker(list_1, list_2, text):
    """Checks 2 lists. If no difference, pass. Else return common set between two lists"""
    symm_diff = set(list_1).symmetric_difference(list_2)
    if not symm_diff:
        pass
    else:
        mismatches_in1_not2 = set(list_1).difference( set(list_2) )
        mismatches_in2_not1 = set(list_2).difference( set(list_1) )
        if mismatches_in1_not2:
            mismatch_logger(
                mismatches_in1_not2,"{}\n1: {}\n2: {}".format(text, list_1, list_2), 1, 2)
        if mismatches_in2_not1:
            mismatch_logger(
                mismatches_in2_not1,"{}\n2: {}\n1: {}".format(text, list_1, list_2), 2, 1)
    set_common = set(list_1).intersection( set(list_2) )
    if set_common:
        return sorted(set_common)
    else:
        return "no common set: {}\n".format(text)
def chunks(iterable, size=10):
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))
def get_elements_iteratively(file):
    """Create unique ID out of image number and case number, return it along with corresponding xml element"""
    tag = "RECORD"
    tree = ET.iterparse(StringIO(file), events=("start","end"))
    context = iter(tree)
    _, root = next(context)
    for event, record in context:
        if event == 'end' and record.tag == tag:
            xml_element_2 = ''
            xml_element_1 = ''
            for child in record.getchildren():
                if child.tag == "IMAGE_NUMBER":
                    xml_element_1 = child.text
                if child.tag == "CASE_NUM":
                    xml_element_2 = child.text
            r_id = "{}{}".format(xml_element_1, xml_element_2)
            record.set("R", r_id)
            yield (r_id, record)
            root.clear()
def get_chunks(file, chunk_size):
    """Breaks XML into chunks, yields dict containing unique IDs and corresponding xml elements"""
    iterable = get_elements_iteratively(file)
    for chunk in chunks(iterable, chunk_size):
        ids_records = {}
        for k in chunk:
            ids_records[k[0]]=k[1]
        yield ids_records
def create_new_xml(xml_list):
    chunk = 5000
    chunk_rec_ids_1 = get_chunks(xml_list[0], chunk)
    chunk_rec_ids_2 = get_chunks(xml_list[1], chunk)
    to_write = [chunk_rec_ids_1, chunk_rec_ids_2]
    ######################################################################################
    ### WHAT'S GOING HERE ??? WHAT'S THE DIFFERENCE BETWEEN THE OUTPUTS OF THESE TWO ? ###
    common_ids = diff_comm_checker( [x.keys() for x in to_write[0]][0], [x.keys() for x in to_write[1]][0], "create_new_xml - large - common_ids")
    #common_ids = ['0603599998140032MB']
    ######################################################################################
    for _id in common_ids:
        print _id
        for gen_obj in to_write:
            for kv_pair in gen_obj:
                if kv_pair[_id]:
                    print _id, kv_pair[_id].attrib, kv_pair[_id]
if __name__ == '__main__':
    xml_1 = """<?xml version="1.0"?><RECORDSET><RECORD><CASE_NUM>140032MB</CASE_NUM><IMAGE_NUMBER>0603599998</IMAGE_NUMBER></RECORD></RECORDSET>"""
    xml_2 = """<?xml version="1.0"?><RECORDSET><RECORD><CASE_NUM>140032MB</CASE_NUM><IMAGE_NUMBER>0603599998</IMAGE_NUMBER></RECORD></RECORDSET>"""
    create_new_xml([xml_1, xml_2])
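Answer 0 (score: 0)
The problem is not the type or value of the common_ids returned by diff_comm_checker. The problem is that the function diff_comm_checker, or the construction of its arguments, destroys the value of to_write.
If you try this, you will see what I mean: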
common_ids = ['0603599998140032MB']
diff_comm_checker( [x.keys() for x in to_write[0]][0], [x.keys() for x in to_write[1]][0], "create_new_xml - large - common_ids")
This produces the erroneous behaviour even though the return value of diff_comm_checker() is not used.
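This happens because to_write holds generators, and the call to diff_comm_checker (more precisely, the list comprehensions that build its arguments) exhausts them. By the time they are used in the if statement inside the loop, the generators are already finished/empty. You can use list to create lists from the generators:
chunk_rec_ids_1 = list(get_chunks(xml_list[0], chunk))
chunk_rec_ids_2 = list(get_chunks(xml_list[1], chunk))
But this may have other implications (memory usage...).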
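The exhaustion described above can be reproduced without any XML. The following is a minimal sketch, not the question's code: `get_chunks` here is a simplified, hypothetical stand-in that chunks a plain list of (id, record) pairs, and `make_chunks` is an illustrative helper. It shows the generator being emptied by a list comprehension, the list() fix, and one possible alternative that avoids holding every chunk in memory by building a fresh generator for each pass. It is written to run on Python 3 as well as Python 2.

```python
def get_chunks(records, chunk_size):
    """Hypothetical stand-in for the question's get_chunks: yields {id: record} dicts."""
    for start in range(0, len(records), chunk_size):
        yield dict(records[start:start + chunk_size])

records = [("id1", "rec1"), ("id2", "rec2"), ("id3", "rec3")]

# 1. A generator can be walked only once.
gen = get_chunks(records, 2)
first_keys = [sorted(d.keys()) for d in gen][0]  # the comprehension consumes every chunk
leftover = list(gen)                             # nothing is left for a second pass

# 2. Materialising with list() allows repeated passes, at the cost of memory.
chunks_list = list(get_chunks(records, 2))
pass_1 = [sorted(d) for d in chunks_list]
pass_2 = [sorted(d) for d in chunks_list]

# 3. Alternative: build a fresh generator per pass. The input is processed again,
#    but only one chunk is alive at a time.
def make_chunks():
    return get_chunks(records, 2)

keys_pass = [sorted(d) for d in make_chunks()]
lookup_pass = [d["id1"] for d in make_chunks() if "id1" in d]
```

Which option fits depends on the data: list() is simplest when the chunks fit in memory, while re-creating the generator trades a second parse for a smaller footprint.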
Also, what is the purpose of this construct in diff_comm_checker?
if not symm_diff:
    pass
As far as I can tell, that branch does nothing whether or not symm_diff is empty.
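One way the check could be written without the empty branch is sketched below. This is an illustration under assumptions, not the asker's function: `compare_ids` is a hypothetical name, and it returns the mismatches to the caller instead of calling the question's mismatch_logger (which is not shown in the question). It also returns an empty list when nothing is shared, since a string return value would be iterated character by character by a later `for _id in common_ids` loop.

```python
def compare_ids(list_1, list_2):
    """Hypothetical rewrite: return (common, only_in_1, only_in_2) as sorted lists."""
    set_1, set_2 = set(list_1), set(list_2)
    only_in_1 = sorted(set_1 - set_2)  # IDs present in the first list only
    only_in_2 = sorted(set_2 - set_1)  # IDs present in the second list only
    common = sorted(set_1 & set_2)     # empty list (not a string) if nothing is shared
    return common, only_in_1, only_in_2

common, only_1, only_2 = compare_ids(["a", "b", "c"], ["b", "c", "d"])
if only_1 or only_2:
    # Assumption: this is where the caller would log the mismatches.
    mismatch_report = "only in 1: {}, only in 2: {}".format(only_1, only_2)
```

Passing materialised key lists (rather than live generators) to such a function keeps the comparison from consuming data that is needed later.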