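I have a fairly large text file (~16k lines). I loop over it and, for each line, check whether a client IP:port, a server IP:port, and a keyword are all present, using two for loops and nested if x in line statements to test whether the line contains the information I'm looking for.
Once I've identified the lines containing the values I'm after, I update a sqlite database. Initially this took a very long time to run, because I hadn't wrapped the SQL UPDATE statements in a manual transaction. After making that change the execution time improved significantly, but the code below still takes several minutes to complete, and I suspect my horrible loop structure is the cause.
If anyone has any performance tips to help speed up the code below, I'd be very grateful: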
    c.execute("SELECT client_tuple, origin_tuple FROM connections")
    # returns ~ 8k rows each with two items, clientIP:port and serverIP:port
    tuples = c.fetchall()
    with open('connection_details.txt', 'r') as f:
        c.execute('BEGIN TRANSACTION')
        # for each line in ~16k lines
        for line in f:
            # for each row returned from sql query
            for tuple in tuples:
                # if the client tuple (IP:Port) is in the line
                if tuple[0] in line:
                    # if the origin tuple (IP:Port) is in the line
                    if tuple[1] in line:
                        # if 'foo' is in the line
                        if 'foo' in line:
                            # lookup some value and update SQL with the value found
                            bar_value = re.findall(r'(?<=bar\s).+?(?=\,)', line)
                            c.execute("UPDATE connections "
                                      "   SET bar = ? "
                                      "WHERE client_tuple = ? AND origin_tuple = ?",
                                      (bar_value[0], tuple[0], tuple[1]))
    conn.commit()
Answer 0 (score: 7)
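The if 'foo' in line: check should come before the for tuple in tuples: loop, so that lines which don't need processing are skipped without ever scanning the tuples.
A second, smaller improvement: compile the regexp once, outside the loops, and use the compiled matcher inside them.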
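Both suggestions can be sketched together as below. This is only an illustration under a few assumptions: update_bar_values is a hypothetical helper name, the table and column names are taken from the question, and conn is an open sqlite3 connection. It also differs from the question's code in two small ways: the regexp is searched once per line rather than once per matching tuple, and lines with no bar value are skipped instead of raising an IndexError.

```python
import re
import sqlite3

# Compiled once, outside every loop; same pattern as in the question.
BAR_RE = re.compile(r'(?<=bar\s).+?(?=\,)')


def update_bar_values(conn, lines):
    """Hypothetical helper: run the question's UPDATE for each matching line."""
    c = conn.cursor()
    c.execute("SELECT client_tuple, origin_tuple FROM connections")
    pairs = c.fetchall()
    updated = 0
    for line in lines:
        # Cheap test first: lines without 'foo' never reach the pair scan.
        if 'foo' not in line:
            continue
        match = BAR_RE.search(line)
        if match is None:
            continue  # assumption: a line without a bar value is ignored
        bar_value = match.group(0)
        for client, origin in pairs:
            if client in line and origin in line:
                c.execute(
                    "UPDATE connections SET bar = ? "
                    "WHERE client_tuple = ? AND origin_tuple = ?",
                    (bar_value, client, origin))
                updated += c.rowcount
    # One commit at the end, so all updates share a single transaction.
    conn.commit()
    return updated
```

With the file from the question this could be called as update_bar_values(conn, f) inside the with open(...) block, since a file object iterates over its lines.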
Answer 1 (score: 5)
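Unfortunately, you can't tighten the for loops themselves, because you need to go through all of the tuples for every line in the file. You can, however, tighten the code slightly by merging the if statements. You should also check for 'foo' before iterating over all the tuples.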
    with open('connection_details.txt', 'r') as f:
        c.execute('BEGIN TRANSACTION')
        # for each line in ~16k lines
        for line in f:
            # only scan the tuples when the keyword is present
            if 'foo' in line:
                # for each row returned from sql query
                for tup in tuples:
                    if tup[0] in line and tup[1] in line:
                        # look up bar_value and run the UPDATE as in the question
                        pass
Answer 2 (score: 1)
For the for loops you can use itertools, and you can then collapse the if statements into a single statement, like this:
    import itertools

    for line, tuple in itertools.product(f, tuples):
        if tuple[0] in line and tuple[1] in line and 'foo' in line:
            # look up bar_value and run the UPDATE as in the question
            pass
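As a small check of what itertools.product does here, the sketch below uses made-up stand-in data (lines and pairs are illustrative names, not from the question). It suggests that product only flattens the nesting: it yields the same (line, tuple) pairs, in the same order, as the two nested loops, so the number of substring tests does not go down. Note also that product reads its input iterables fully into memory before yielding anything, so the whole file would be loaded first.

```python
import itertools

lines = ["a foo", "b", "c foo"]      # stand-in for the file's lines
pairs = [("x", "y"), ("p", "q")]     # stand-in for the rows from SQL

# The pairs visited by the question's two nested loops.
nested = [(line, pair) for line in lines for pair in pairs]

# The pairs visited by the single flattened loop.
flat = list(itertools.product(lines, pairs))

print(flat == nested)   # True
print(len(flat))        # 6, i.e. len(lines) * len(pairs)
```

The flat loop is shorter to read, but any speedup would have to come from skipping work, for example by testing for 'foo' before the tuples are scanned at all.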