Combine rows when importing a txt file if a row does not meet a condition

Time: 2018-11-27 18:51:41

Tags: python

ROW_ID|Quote Number|Status|Status Reason ADT|Name|Account|Alias ADT|......etc 418 Columns

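I have a large pipe-delimited file with about 2 million rows (2 GB). Every row should contain 418 columns (|), but many rows have been split unnecessarily, which causes problems when the data is imported.

While importing, I want to combine rows until the pipe count reaches 418 before moving on to the next row.

Most of the problems occur at column 90: a row with 90 pipes is followed by a row with 328 (90 + 328 = 418). Others are split at 90, followed by a few rows with 0 pipes, and then 328. Ideally, all of these rows should be merged into one.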

For example, the highlighted rows should be combined into one (rows with 0 pipes still contain information).

I thought about appending the incorrect rows to a list, and then combining them, but at 1 second per row, this would take approximately 26 days to complete.

I also tried combining the two rows before appending, but I am afraid I would run into the same efficiency problem.

%%time

correct = []
incorrect = []

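with open('C:/Users/jschlajo/Desktop/export_all_quotes_compass.txt', 'r') as fh:
    for index, line in enumerate(fh):
        # Only inspect the first 20 lines while testing.
        if index < 20:
            # Any line without exactly 418 pipes is a fragment of a split row.
            if line.count('|') != 418:
                incorrect.append(line)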
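Before merging anything, it can help to confirm the split pattern described above (90 then 328, sometimes with rows of 0 in between) by tallying how many lines have each pipe count. The following is a minimal sketch, not part of the original post: the helper name `tally_pipe_counts` is hypothetical, and a small in-memory sample with 4 pipes per record stands in for the real file with 418.

```python
from collections import Counter
import io


def tally_pipe_counts(lines, delimiter='|'):
    """Return a Counter mapping 'pipes in a line' to 'number of such lines'."""
    return Counter(line.count(delimiter) for line in lines)


# Hypothetical sample: one complete record, then one record split into
# three physical lines with 1, 0 and 3 pipes (1 + 0 + 3 = 4).
sample = io.StringIO("a|b|c|d|\na|b\n\n|c|d|\n")
counts = tally_pipe_counts(sample)

for pipes, number_of_lines in sorted(counts.items()):
    print(pipes, number_of_lines)
```

On the real file, the same function could be called on the open file handle, since iterating over a file object yields one line at a time without loading the whole file into memory.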

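1 Answer:

Answer 0 (score: 1)

It turns out that using enumerate made it take much longer. I removed that part of the code and appended all of the problem rows to a list. That took 34 seconds, compared with the 26 days I had expected. I then joined the whole list into one string and split it every 418 pipes.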

correct = []
incorrect = []

with open('C:/Users/jschlajo/Desktop/export_all_quotes_compass.txt', 'r') as fh:
    for line in fh:
        # Rows with exactly 418 pipes are complete; anything else is a fragment.
        if line.count('|') == 418:
            correct.append(line)
        else:
            incorrect.append(line)

# Join every fragment into one long string.
test_1 = ' '.join(incorrect)

# Split that string on the pipes and regroup the fields 418 at a time.
span = 418
words = test_1.split('|')
combined = ['|'.join(words[i:i + span]) for i in range(0, len(words), span)]
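An alternative that follows the question's original wording (combine rows until the pipe count reaches 418) can be sketched as a streaming generator. This is only a sketch under stated assumptions: the name `merge_split_rows` and its parameters are hypothetical, fragments of one record are assumed to be consecutive in the file, and a record is treated as complete as soon as its running pipe count reaches the expected value, so text that spills over after the final pipe would not be handled.

```python
import io


def merge_split_rows(lines, expected=418, delimiter='|', glue=' '):
    """Yield complete records, merging consecutive fragments by pipe count."""
    parts = []  # fragments of the record currently being rebuilt
    pipes = 0   # delimiters seen so far in those fragments
    for line in lines:
        text = line.rstrip('\r\n')  # drop the stray line break
        parts.append(text)
        pipes += text.count(delimiter)
        if pipes == expected:
            # Enough delimiters collected: emit the record and start over.
            yield glue.join(parts)
            parts, pipes = [], 0
        elif pipes > expected:
            raise ValueError('fragments add up to more than %d pipes' % expected)
    if parts:
        raise ValueError('input ended in the middle of a record')


# Hypothetical sample with 4 pipes per record instead of 418.
sample = io.StringIO("a|b|c|d|\na|b\n\n|c|d|\nx|y|z|w|\n")
merged = list(merge_split_rows(sample, expected=4))
```

Because the generator keeps only the fragments of the current record in memory, it could be applied directly to the open file handle and its output written line by line to a cleaned file. The `glue` value decides what replaces each removed line break; a space mirrors the `' '.join` used in the answer.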