Question

ROW_ID|Quote Number|Status|Status Reason ADT|Name|Account|Alias ADT|......etc 418 Columns

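I have a large pipe-delimited file with about 2 million rows (roughly 2 GB). Every row should contain 418 pipes (|), but many rows were split unnecessarily, which causes problems when I import the data.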

On import, I want to keep merging rows until the pipe count of the merged row == 418.

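In most of the broken rows, the row breaks off after 90 pipes and the next row holds the remaining 328. In other cases the row breaks at 90, is followed by a few rows with 0 pipes, and then by a row with 328. Ideally, all of these rows should be merged into one.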

For example, the highlighted rows should be combined into one (rows with 0 pipes still contain information).
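One quick way to check this pattern before merging anything is to tally how many pipes each line contains. The sketch below is only illustrative: `pipe_count_histogram` is a hypothetical helper name, and it assumes the file can be read line by line as text.

```python
from collections import Counter


def pipe_count_histogram(lines):
    """Count how many lines have each number of pipes (hypothetical helper)."""
    return Counter(line.count("|") for line in lines)


# Small in-memory demo; with the real file, pass the open file handle instead.
sample = ["a|b|c\n", "d|e\n", "free text\n", "f|g|h\n"]
histogram = pipe_count_histogram(sample)
for pipes, n_lines in sorted(histogram.items()):
    print(pipes, n_lines)
```

If the description above holds, the real file should show large counts at 418, 90, 328 and 0, and little else.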

I thought about appending the incorrect rows to a list, and then combining them, but at 1 second per row, this would take approximately 26 days to complete.

I also tried merging the two rows before appending them, but I am afraid I will run into the same efficiency problem.

%%time

correct = []
incorrect = []

with open('C:/Users/jschlajo/Desktop/export_all_quotes_compass.txt', 'r') as fh:
    for index, line in enumerate(fh):
        if index<20:
            if line.count('|')!=418:
                incorrect.append(line)

Answer 1

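With enumerate in the loop, it took far longer. I removed that part of the code and appended all the problem rows to a list; that took 34 seconds instead of the 26 days I had expected. Then I joined the whole list and split it again every 418 pipes.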

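correct = []
incorrect = []

with open('C:/Users/jschlajo/Desktop/export_all_quotes_compass.txt', 'r') as fh:
    for line in fh:
        # A complete row has exactly 418 pipes; anything else is a fragment.
        if line.count('|') == 418:
            correct.append(line)
        else:
            incorrect.append(line)

# Join all fragments into one string, then cut it into chunks of 418 fields.
# Note: each fragment keeps its trailing newline, and the chunks only line up
# with the records if every record ends with a trailing '|'.
test_1 = ' '.join(incorrect)

span = 418
words = test_1.split("|")
combined = ["|".join(words[i:i+span]) for i in range(0, len(words), span)]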
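A streaming variant is also possible, closer to the "merge until the pipe count reaches 418" idea from the question. This is only a sketch under assumptions, not a tested solution for the real file: `merge_split_rows` is a hypothetical name, the line break between fragments is replaced by a single space, and a row is treated as complete as soon as the accumulated pipe count reaches the expected value. If the last field of a row was itself split, its leftover text would end up attached to the following row.

```python
EXPECTED_PIPES = 418  # pipe count of a complete row, taken from the question


def merge_split_rows(lines, expected=EXPECTED_PIPES):
    """Yield rows, gluing consecutive fragments until `expected` pipes are seen.

    Assumption: fragments are joined with a single space in place of the
    line break that split them.
    """
    buffer = []
    pipes = 0
    for line in lines:
        piece = line.rstrip("\r\n")
        buffer.append(piece)
        pipes += piece.count("|")
        if pipes >= expected:
            yield " ".join(buffer)
            buffer, pipes = [], 0
    if buffer:
        # Leftover fragments that never reached the expected count.
        yield " ".join(buffer)


# In-memory demo with a small expected count; with the real file, pass the
# open file handle as `lines` and keep the default of 418.
sample = ["1|a|b\n", "c|d|e\n", "2|f|g|h|i\n"]
for row in merge_split_rows(sample, expected=4):
    print(row)
```

Because this is a generator, it never holds more than one row's fragments in memory, so it can feed a writer or an importer directly. Rows that come out with more or fewer pipes than expected are still yielded, so it is worth checking `row.count('|')` on the output.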

Combine rows when importing a txt file if a row does not meet the condition
