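I have an iterator (millions of rows) that yields dicts, and each dict has to be compared against a dict of conditions to find the matches.
Here is my code: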
import json

conditions = {"port": "0-20", "ip": "1.2.3.4", "protocol": "1,7",
              "timestamp": ">143990000", "server": "mario"}
for rec in imiterator():  # very large number of rows
    # rec example: {"ip": "1.7.1.1", "timestamp": 1434000,
    #               "port": 19, "server": ("mario", "bruno"),
    #               "protocol": "1"}
    if check_conditions(rec, conditions):
        print(json.dumps(rec))
Note that the columns in rec can be int, long, string or tuple.
I need a really high-performance way to do the matching. Any ideas?
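I have considered using map and converting the conditions into lambda functions, each of which tests one field, and then AND-ing the results of all conditions. Would that be faster?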
Answer 0: (score: 0)
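Here is what I did: I converted my conditions into a dict of lambda functions, and I use the first record to decide which kind of function each field needs: a plain string match, a numeric match, or a range. A range can be closed (0-100) or open on one side (-100 means less than or equal to 100, and 100- means greater than or equal to 100).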
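To make the range syntax described here concrete, the following is a minimal sketch of how the three forms could be parsed. `parse_range` and `in_range` are hypothetical helpers introduced only for illustration, not part of the answer's code, and the sketch assumes non-negative integers, since a leading dash is read as an open lower bound.

```python
def parse_range(spec):
    """Turn "0-100", "-100" or "100-" into inclusive (low, high) bounds.

    None means the range is open on that side. Hypothetical helper.
    """
    low, sep, high = spec.partition("-")
    if not sep:
        # No dash at all: a single number matches only itself.
        return int(low), int(low)
    return (int(low) if low else None, int(high) if high else None)


def in_range(value, bounds):
    """Check value against the (low, high) bounds from parse_range."""
    low, high = bounds
    return (low is None or low <= value) and (high is None or value <= high)


print(parse_range("0-100"))   # (0, 100)
print(parse_range("-100"))    # (None, 100)
print(parse_range("100-"))    # (100, None)
print(in_range(19, parse_range("0-20")))   # True
print(in_range(129, parse_range("0-20")))  # False
```

Parsing each condition once, before the loop over the records, keeps the per-record work down to one or two integer comparisons.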
def check_conditions(rec, val):
    # AND all conditions together; an empty condition dict matches every record
    for cond in val:
        # rec is a dict, so look the field up by key
        if not val[cond](rec[cond]):
            return False  # one failed condition rejects the record
    return True
conditions={"port":"0-20","ip":"1.2.3.4","protocol":"1,7","timestamp":"143990000-","server":"!mario"}
def number_logic_to_lambda(invar, switcharoo):
    """
    Build a predicate from a range query ("0-20", "100-", "-100"),
    a comma-delimited list ("1,7") or a single number ("5").
    Condition values are parsed as integers; switcharoo=True negates the result.
    """
    z = invar.split('-')  # range queries
    if len(z) == 1:
        y = [int(v) for v in invar.split(',')]  # a list, so len() also works on Python 3
        if len(y) == 1:  # a single match
            return lambda x: (x == y[0]) ^ switcharoo
        else:  # comma-delimited list: match any of the values
            return lambda x: (x in y) ^ switcharoo
    elif len(z) == 2:  # a query with "-"
        if z[1] == '':  # greater-than-or-equal query, e.g. "100-"
            low = int(z[0])  # parse once, not on every call
            return lambda x: (low <= x) ^ switcharoo
        elif z[0] == '':  # less-than-or-equal query, e.g. "-100"
            high = int(z[1])
            return lambda x: (x <= high) ^ switcharoo
        else:  # closed range, e.g. "0-20"
            low, high = int(z[0]), int(z[1])
            return lambda x: (low <= x <= high) ^ switcharoo
    raise ValueError("unsupported numeric condition: %r" % invar)
import itertools
import json

records = imiterator()  # renamed from iter, which shadows the builtin
first_rec = next(records)
nvars = {}  # conditions changed into functions
for svar in conditions:
    qvar = conditions[svar]
    switcharoo = False
    if qvar.startswith("!"):  # a leading bang marks a negative condition
        qvar = qvar[1:]
        switcharoo = True
    cattr = first_rec[svar]  # sample value that decides the kind of match
    # Default arguments bind the current qvar/switcharoo to each lambda;
    # a plain closure would see the values from the last loop iteration.
    mapf = lambda x, q=qvar, s=switcharoo: (x == q) ^ s  # default: full string match
    if isinstance(cattr, (int, float)):  # numeric; on Python 2 add long to this tuple
        mapf = number_logic_to_lambda(qvar, switcharoo)
    elif isinstance(cattr, tuple):  # tuples: every wanted value must be present
        wanted = set(qvar.split(","))
        mapf = lambda x, w=wanted, s=switcharoo: w.issubset(x) ^ s
    nvars[svar] = mapf  # update the dictionary of mapped functions
# Put the first record back so that it is checked as well
for rec in itertools.chain([first_rec], records):  # very large number of rows
    # rec example: {"ip": "1.7.1.1", "timestamp": 1434000, "port": 19,
    #               "server": ("mario", "bruno"), "protocol": "1"}
    if check_conditions(rec, nvars):
        print(json.dumps(rec))
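One thing to watch when lambdas are built inside a loop, as in the setup loop above: a lambda looks up the loop variables when it is called, not when it is created, so unless the values are bound at creation time every lambda ends up using the values from the last iteration. A minimal sketch of the pitfall and of the default-argument fix, using made-up field names and values:

```python
pairs = [("ip", "1.2.3.4"), ("server", "mario")]

# Late binding: each lambda reads `wanted` only when it is called,
# by which time the loop has finished and `wanted` holds the last value.
late = {}
for field, wanted in pairs:
    late[field] = lambda x: x == wanted

# Early binding: a default argument is evaluated when the lambda is created,
# so each lambda keeps its own copy of the value.
bound = {}
for field, wanted in pairs:
    bound[field] = lambda x, wanted=wanted: x == wanted

print(late["ip"]("1.2.3.4"))   # False: actually compares against "mario"
print(bound["ip"]("1.2.3.4"))  # True
```

Returning the lambda from a helper function, as number_logic_to_lambda does, avoids the problem as well, because every call to the helper gets its own local variables.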
Answer 1: (score: -1)
If the goal is to check for a 1:1 correspondence, why not convert all the entries of rec to strings and simply do:
return conditions == rec
If the goal is to process and interpret ranges in the data, it may need different handling, but for a simple task like this, using map only adds overhead.
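As a rough illustration of this suggestion, here is a sketch of the string-equality comparison. `exact_match` is a hypothetical name, and the sample conditions are simplified to plain values, because a range such as "0-20" or a list such as "1,7" can never equal the string form of a single field.

```python
def exact_match(rec, conditions):
    """True only when rec has exactly the same keys and string values."""
    return {key: str(value) for key, value in rec.items()} == conditions


conditions = {"ip": "1.2.3.4", "port": "19"}

print(exact_match({"ip": "1.2.3.4", "port": 19}, conditions))  # True
print(exact_match({"ip": "1.2.3.4", "port": 20}, conditions))  # False
# An extra field in the record also breaks dict equality:
print(exact_match({"ip": "1.2.3.4", "port": 19, "protocol": "1"}, conditions))  # False
```

Dict equality requires the same set of keys on both sides, so this only works when the conditions name every field of the record.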