Question

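I'm looking for an improvement to some code I'm working on.

The task is very simple: get data from a .txt file into a SQL database.

The txt file contains a bunch of lines that look like this: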

200101 35.922 2.127 1.182 1.182 1.418 1.654

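Explanation:

200111: this is the identifier, made up of 20 (channel number), 01 (page number) and the last two digits (code).

The remaining double-precision numbers are just values: I1, I2 ... up to I6.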

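So the SQL file will have the columns [channel, page, code, I1, I2, I3, I4, I5, I6, passed].

The catch is that in the txt file, code can be 00, 11, 10, 01 or 22, and depending on the code I need to perform one operation or another on the I values to decide whether passed=1 or passed=0. For example, in this case, if code=11:

passed=1 if I1>I3 and I6<1

The lines in the txt file are sorted by code.

So, with that explained, what I'm basically doing is this:

with open(txtFile, 'r') as txt:
    for line in txt:
        currentLine = line.split(' ')[0]
        if currentLine.endswith('00'):
            pass  # do some actions here
        if currentLine.endswith('01'):
            pass  # do some actions here
        # ...
        # and so on
        # and of course write to SQL file

So, is it better or more time-efficient to check every line with if clauses like this?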

Answer 1

You might get a very small improvement by splitting only once:

currentLine = line.split(' ', 1)[0]

Or, if the first field you're interested in always has the same length (6 in your example), you could try grabbing just those characters:

currentLine = line[:6]

If the first field has a variable length, you could try this:

currentLine = line[:line.index(' ')]

Here are some timings to see which is faster...

Your current approach:

# python3 -m timeit -s "l = '200101   35.922    2.127    1.182    1.182    1.418    1.654'" "lineCode = l.split(' ')[0]"
1000000 loops, best of 3: 0.61 usec per loop

First suggestion (limit the split to one):

# python3 -m timeit -s "l = '200101   35.922    2.127    1.182    1.182    1.418    1.654'" "lineCode = l.split(' ', 1)[0]"
1000000 loops, best of 3: 0.237 usec per loop

Second suggestion (use a slice to get a fixed-length field):

# python3 -m timeit -s "l = '200101   35.922    2.127    1.182    1.182    1.418    1.654'" "currentLine = l[:6]"                                                                                             
10000000 loops, best of 3: 0.0708 usec per loop

Third suggestion (use slice + index to get a variable-length field):

# python3 -m timeit -s "l = '200101   35.922    2.127    1.182    1.182    1.418    1.654'" "currentLine = l[:l.index(' ')]"
1000000 loops, best of 3: 0.208 usec per loop

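In my rudimentary tests, suggestion 2 appears to be the fastest, if you can manage it. The other two suggestions perform very similarly to each other, but both are much better than your current approach.

Obviously, these timings will vary depending on the platform you run them on, but the relative performance improvement should hold everywhere.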
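If you would rather reproduce the comparison from inside Python than from the shell, something like the following should work. It is only a sketch: the names `LINE`, `CANDIDATES` and `compare` are mine, and because each call goes through a lambda the absolute numbers will be higher than the command-line figures above. Only the relative ordering is meaningful.

```python
import timeit

# Sample line taken from the command-line timings above
LINE = '200101   35.922    2.127    1.182    1.182    1.418    1.654'

# Candidate ways of extracting the first field (labels are illustrative)
CANDIDATES = {
    "split(' ')": lambda l: l.split(' ')[0],
    "split(' ', 1)": lambda l: l.split(' ', 1)[0],
    "slice [:6]": lambda l: l[:6],
    "slice + index": lambda l: l[:l.index(' ')],
}


def compare(line=LINE, number=100_000):
    """Return {label: seconds per call} after checking all candidates agree."""
    expected = line[:6]
    results = {}
    for label, func in CANDIDATES.items():
        # Every approach must yield the same first field
        assert func(line) == expected
        total = timeit.timeit(lambda: func(line), number=number)
        results[label] = total / number
    return results


if __name__ == '__main__':
    for label, seconds in compare().items():
        print(f'{label:15s} {seconds * 1e6:.3f} usec per call')
```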

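Now, with all that said, I agree with your other commenters that your slowness is probably coming from somewhere else. If I had to guess, it would be your SQL INSERTs. The only thing I can suggest is to batch multiple INSERTs together if the database and driver allow it, or to write the SQL statements to a properly formatted file and let another tool do the bulk import (which could even be invoked with Python's subprocess module).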
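As a rough illustration of the batching idea, here is a minimal sketch. It assumes SQLite through the standard-library `sqlite3` module, which the question does not actually specify, and the table name `measurements` and the helpers `create_table` and `insert_rows` are hypothetical. Other drivers usually offer a similar `executemany`, but check yours.

```python
import sqlite3

# Hypothetical table name; the column list is the one given in the question
INSERT_SQL = (
    'INSERT INTO measurements '
    '(channel, page, code, I1, I2, I3, I4, I5, I6, passed) '
    'VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)'
)


def create_table(conn):
    """Create the illustrative table if it does not exist yet."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS measurements ('
        'channel TEXT, page TEXT, code TEXT, '
        'I1 REAL, I2 REAL, I3 REAL, I4 REAL, I5 REAL, I6 REAL, '
        'passed INTEGER)'
    )


def insert_rows(conn, rows, batch_size=1000):
    """Insert rows in batches, committing once per batch instead of per row."""
    batch = []
    inserted = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(INSERT_SQL, batch)
            conn.commit()
            inserted += len(batch)
            batch.clear()
    if batch:
        # Flush whatever is left over after the last full batch
        conn.executemany(INSERT_SQL, batch)
        conn.commit()
        inserted += len(batch)
    return inserted
```

The point is simply to avoid one round trip and one commit per line of the txt file. The batch size of 1000 is an arbitrary starting value, not a measured optimum.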

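Additional thoughts

If you only need to test those two characters (the 5th and 6th), then this is the most efficient way I've found. It eliminates both the inefficient split you were using and the slower endswith.

Yours: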

# python3 -m timeit -s "l = '200101   35.922    2.127    1.182    1.182    1.418    1.654'" "currentLine = l.split(' ')[0]; currentLine.endswith('00')"                                                       
1000000 loops, best of 3: 0.72 usec per loop

Better:

# python3 -m timeit -s "l = '200101   35.922    2.127    1.182    1.182    1.418    1.654'" "currentLine = l[:6]; lineCode = currentLine[4:]; lineCode == '00'"
10000000 loops, best of 3: 0.161 usec per loop

Best:

# python3 -m timeit -s "l = '200101   35.922    2.127    1.182    1.182    1.418    1.654'" "currentLine = l[4:6]; currentLine == '00'"                                                                       
10000000 loops, best of 3: 0.102 usec per loop

So, you could do something like this:

with open(txtFile, 'r') as txt:
    for line in txt:
        # The code is the 5th and 6th characters of the fixed-width first field
        currentLine = line[4:6]
        if currentLine == '00':
            pass  # do some actions here
        elif currentLine == '01':
            pass  # do some actions here
        # ...
        # and so on
        # and of course write to SQL file
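If the if/elif chain gets long, one option is to put the per-code logic in a dictionary and look it up with the sliced code. This is only a sketch: the rule for code 11 is the one quoted in the question, while `rule_placeholder`, `RULES` and `parse_line` are hypothetical names, and the placeholder stands in for the rules the question does not spell out.

```python
def rule_11(values):
    # Rule quoted in the question for code 11: passed=1 if I1>I3 and I6<1
    i1, i2, i3, i4, i5, i6 = values
    return 1 if i1 > i3 and i6 < 1 else 0


def rule_placeholder(values):
    # Hypothetical stand-in: the question gives no rule for the other codes
    return 0


# Map each code listed in the question to the function that decides "passed"
RULES = {
    '00': rule_placeholder,
    '01': rule_placeholder,
    '10': rule_placeholder,
    '11': rule_11,
    '22': rule_placeholder,
}


def parse_line(line):
    """Split one line into (channel, page, code, I1..I6, passed)."""
    # Fixed-width identifier: 2 chars channel, 2 chars page, 2 chars code
    channel, page, code = line[0:2], line[2:4], line[4:6]
    values = [float(x) for x in line[6:].split()]
    passed = RULES[code](values)
    return (channel, page, code, *values, passed)
```

Each tuple returned by `parse_line` has the same column order as the list in the question, so it can be handed straight to whatever writes the rows to the database.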

Performance improvement when reading a txt file
