Splitting a large text file based on certain conditions

Date: 2019-12-23 15:47:01

Tags: python python-3.x large-files

I have a very large text file (~80 GB) with about a billion lines. A sample of the file's contents (the first column represents the line number and is not part of the file) would be:

Note: the file's contents are sorted first by the first column and then by the second column.

1 60400 60420 12 14 123 144
2 60400 60520 11 14 123 144
...
i 60420 60400 10 11 233 341
i+1 60420 60410 14 20 244 268
...

Filter condition: I want to split the file based on unique (id1, id2) [or (id2, id1)] pairs. If I take (60400, 60420) as the ID pair, then lines with (60420, 60400) also belong to that pair. Each split file should therefore contain all lines belonging to one such unique ID pair, so every split file corresponds to exactly one pair (see the sketch below).
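
To illustrate such a canonical pair key (a minimal sketch; pair_key is a hypothetical helper, not part of the original post):

def pair_key(id1, id2):
    # Canonical form: the smaller id always comes first, so (60400, 60420)
    # and (60420, 60400) map to the same key.
    return f"{min(id1, id2)},{max(id1, id2)}"

assert pair_key(60420, 60400) == pair_key(60400, 60420) == "60400,60420"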

The approach I have applied so far:

1) Partitioned all the unique ID pairs into three files, where the first two files have 200 million unique IDs each and the third has around 157 million. The reason for creating these ID-pair partitions is that id1 …

2) For each ID partition, I go through the original file again like this:

partition_ids = []
# read the partition id and populate the partition_ids

# read the original file(87G file)
for line in original_file:
    # parse the line
    toks = line.split()
    id1 = int(toks[0])
    id2 = int(toks[1])

    # create the unique id pair key
    if id1 < id2:
        key = str(id1)+','+str(id2)
    else:
        key = str(id2)+','+str(id1)

    if key in partition_ids[:40mil]: # (shorthand for the first 40 million unique ids, just for explanation)
        # write line to the file

This process is still taking me a very long time (> 20 hours) and I would really like to speed it up. This is the solution I came up with for processing the large file; any other (faster) approaches or suggestions would be greatly appreciated.

2 answers:

Answer 0 (score: 1):

Try changing the list partition_ids into a dict (to reduce the cost of membership tests on a list):

partition_ids = {}
# read the partition id and populate the partition_ids

# read the original file(87G file)
for line in original_file:
    # parse the line
    toks = line.split()
    id1 = int(toks[0])
    id2 = int(toks[1])

    # create the unique id pair key
    if id1 < id2:
        key = str(id1)+','+str(id2)
    else:
        key = str(id2)+','+str(id1)

    # YOUR OLD CODE
    """
    if key in partition_ids[:40mil]: # (shorthand for the first 40 million unique ids, just for explanation)
    # write line to the file
    """

    # MY proposal
    if key in partition_ids:
        pass  # do your stuff (e.g. write the line) if the key exists

    # to assign keys where needed (this part was missing from your code)
    partition_ids[key] = True
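
The populate step is only sketched as a comment above; a minimal example of filling partition_ids, assuming a partition file that stores one "id1,id2" key per line (the file name is hypothetical):

partition_ids = {}
with open("partition0.txt") as f:  # hypothetical partition file, one key per line
    for line in f:
        partition_ids[line.strip()] = True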

Answer 1 (score: 0):

Benchmarks:

Fast python (sets, rb, wb) with partitions:  3.75 s
Fast python (sets) with an internal loop:    1.39 s
Original python with partitions:            19.4 s
Original python with an internal loop:      23.4 s

Cython:                                     512 ms
Python with sets and binary read and write: 820 ms
Python with dicts (Wonka's variant):       1.31 s
Original Python:                           12.1 s

Using sets for the partitioned parts of the list also helps with speed.
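
The reason is that a membership test on a list scans its elements one by one (O(n)), while a set uses hashing (O(1) on average). A quick way to see the difference, using the same key layout as the benchmark setup below (a sketch, not part of the original answer):

import timeit

ids = [str(x) + "," + str(x + 20) for x in range(0, 500000, 200)]  # 2500 keys
ids_set = set(ids)

# Worst case for the list: a missing key forces a full scan.
print(timeit.timeit(lambda: "1,2" in ids, number=1000))      # linear scan
print(timeit.timeit(lambda: "1,2" in ids_set, number=1000))  # hash lookup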

Fast python with partitions (sets, rb, wb):

for i,partition_ids in enumerate(l_partition_ids):
    partition_ids_s = set(partition_ids)
    with open("in.txt", "rb") as in_file:
        with open(f"out{i}.txt", "wb") as out_file:
            for line in in_file:
                # parse the line
                toks = line.split()
                id1 = int(toks[1])
                id2 = int(toks[2])

                # create the unique id pair key
                if id1 < id2:
                    key = b"%d,%d" % (id1,id2)
                else:
                    key = b"%d,%d" % (id2,id1)

                if key in partition_ids_s:
                    out_file.write(line)

Fast python (sets) with an internal loop, reading the input file only once:

out_files = []
l_partition_ids_sets = [set(x) for x in l_partition_ids]
with open("in.txt", "rb") as in_file:
    for i in range(len(l_partition_ids)):
        out_files.append(open(f"out{i}.txt", "wb"))
    for line in in_file:
        # parse the line
        toks = line.split()
        id1 = int(toks[1])
        id2 = int(toks[2])

        # create the unique id pair key
        if id1 < id2:
            key = b"%d,%d" % (id1,id2)
        else:
            key = b"%d,%d" % (id2,id1)

        for i,partition_ids in enumerate(l_partition_ids_sets):
            if key in partition_ids:
                out_files[i].write(line)
for out_file in out_files:
    out_file.close()

Original python with partitions:

for i,partition_ids in enumerate(l_partition_ids):
    with open("in.txt", "r") as in_file:
        with open("out.txt", "w") as out_file:
            for line in in_file:
                # parse the line
                toks = line.split()
                id1 = int(toks[1])
                id2 = int(toks[2])

                # create the unique id pair key
                if id1 < id2:
                    key = str(id1)+','+str(id2)
                else:
                    key = str(id2)+','+str(id1)

                if key in partition_ids:
                    out_file.write(line)

In the line_profiler output below we can see that splitting the line and converting to integers take about 45% of the time. Reading takes only 11% of the time. Faster integer conversion is implemented in cython (fast_atoi here), but I have not implemented that here. I tried to improve the speed of line.split() in cython, but without success.
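
One way to avoid the int() conversion entirely (my sketch, not part of the original answer): for non-negative decimal tokens without leading zeros, numeric order agrees with comparing (length, bytes) tuples, so the canonical key can be built directly from the raw byte tokens:

line = b"1 60420 60400 10 11 233 341\n"
toks = line.split()
t1, t2 = toks[1], toks[2]

# Shorter decimal strings are numerically smaller; equal lengths compare
# correctly byte by byte, so int() can be skipped when building the key.
if (len(t1), t1) < (len(t2), t2):
    key = t1 + b"," + t2
else:
    key = t2 + b"," + t1

assert key == b"60400,60420"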

Cython (the fastest variant):

%%cython

from libc.stdint cimport (uint8_t, uint16_t, uint32_t, uint64_t,
                          int8_t, int16_t, int32_t, int64_t)
import numpy as np

def f_set_cy(partition_ids):
    cdef int64_t id1, id2
    partition_ids_s = set(x.encode() for x in partition_ids)
    with open("in.txt", "rb") as in_file:
        with open("out.txt", "wb") as out_file:
            for line in in_file:
                # parse the line
                toks = line.split()
                id1 = int(toks[1])
                id2 = int(toks[2])

                # create the unique id pair key
                if id1 < id2:
                    key = b"%d,%d" % (id1,id2)
                else:
                    key = b"%d,%d" % (id2,id1)


                if key in partition_ids_s:
                    out_file.write(line)
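
A usage note, assuming the snippet is run in a Jupyter notebook: the %%cython cell magic becomes available only after loading the Cython extension (Cython and a C compiler must be installed):

%load_ext Cython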

Python with sets and binary read and write:

partition_ids_s = set(x.encode() for x in partition_ids)
with open("in.txt", "rb") as in_file:
    with open("out.txt", "wb") as out_file:
        for line in in_file:
            # parse the line
            toks = line.split()
            id1 = int(toks[1])
            id2 = int(toks[2])

            # create the unique id pair key
            if id1 < id2:
                key = b"%d,%d" % (id1,id2)
            else:
                key = b"%d,%d" % (id2,id1)


            if key in partition_ids_s:
                out_file.write(line)

Line profiler:

Timer unit: 1e-07 s

Total time: 2.67841 s
File: <ipython-input-157-900077df3ca6>
Function: f_py at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def f_py(partition_ids):
     2         1      10037.0  10037.0      0.0      partition_ids_s = set(x.encode() for x in partition_ids)
     3         1       2877.0   2877.0      0.0      with open("in.txt", "rb") as in_file:
     4         1       9213.0   9213.0      0.0          with open("out.txt", "wb") as out_file:
     5    500001    2914824.0      5.8     10.9              for line in in_file:
     6                                                           # parse the line
     7    500000    4207575.0      8.4     15.7                  toks = line.split()
     8    500000    3891864.0      7.8     14.5                  id1 = int(toks[1])
     9    500000    3768049.0      7.5     14.1                  id2 = int(toks[2])
    10                                           
    11                                                           # create the unique id pair key
    12    500000    2798327.0      5.6     10.4                  if id1 < id2:
    13    300000    2768751.0      9.2     10.3                      key = b"%d,%d" % (id1,id2)
    14                                                           else:
    15    200000    1844449.0      9.2      6.9                      key = b"%d,%d" % (id2,id1)
    16                                           
    17                                           
    18    500000    3008688.0      6.0     11.2                  if key in partition_ids_s:
    19    200000    1559435.0      7.8      5.8                      out_file.write(line)

Data initialization:

import pandas as pd
import io
from random import shuffle

s= """60300 60420 12 14 123 144
60400 60420 12 14 123 144
60400 60520 11 14 123 144
60420 60400 10 11 233 341
60420 60410 14 20 244 268
"""
s = s * 100000  
df = pd.read_csv(io.StringIO(s), sep=" ", names=["id1", "id2", "a1", "a2", "a3", "a4"])
df = df.reset_index()[["index"] + list(df.columns[:-1])] 
df.to_csv("in.txt", sep=" ", index=False, header=False) #500000 lines 14MB
partition_ids = [str(x)+","+str(x+20) for x in range(0, 500000,200)] #2500 elements

For multiple partitions:

partition_ids = [str(x)+","+str(x+20) for x in range(0, 500000,200)] #2500 elements
shuffle(partition_ids)
l_partition_ids = l_split(partition_ids, 5)
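
l_split is not defined in the answer as quoted; a minimal sketch, assuming it splits a list into n consecutive chunks of nearly equal size:

def l_split(a, n):
    # Split list a into n consecutive chunks whose lengths differ by at most one.
    k, m = divmod(len(a), n)
    return [a[i*k + min(i, m):(i+1)*k + min(i+1, m)] for i in range(n)]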

With binary strings:

partition_ids = [b"%d,%d" % (x,x+20) for x in range(0, 500000,200)] #2500 elements
shuffle(partition_ids)
l_partition_ids = l_split(partition_ids, 5)