Python: record-level comparison of 2 large delimited files

Time: 2014-09-02 06:29:49

Tags: python csv python-3.x

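I have 2 large delimited files.

I need help with the following:

a) Get the row count of both files based on the key column

b) Find duplicates in both files based on the key column

c) Get the duplicate count from both files

d) Write the duplicates to a separate file

e) Get the records common to both files

f) Sort both files (the common records)

g) Compare the files after sorting and get the mismatch count

h) Write the mismatched records to a separate file.

Any help is greatly appreciated.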

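1 answer:

Answer 0: (score: 0)

Since you have to sort the files, the simplest approach is to load them into memory (assuming they fit). You can do something like this: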

# See http://www.grantjenks.com/docs/sortedcontainers/ for information about sorted containers.
# They are efficient for large data sets.
from sortedcontainers import SortedList, SortedDict

file1=SortedList()
file2=SortedList()

delimiter = ";"
commentSign="#"
path1="./data1"
path2="./data2"

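def get_values_column(delimited_lines, column_number, delimiter, commentSign):
    # Collect the distinct values found in the given column.
    values = set()
    for line in delimited_lines:
        # Skip blank lines and comment lines (line[0] would fail on an empty string).
        if not line.strip() or line.startswith(commentSign):
            continue
        fields = line.split(delimiter)
        values.add(fields[column_number].strip())
    return values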

def count_not_in_other(collection1, collection2):
    # Return the elements found only in collection1 and only in collection2.
    uniq1 = [elem for elem in collection1 if elem not in collection2]
    uniq2 = [elem for elem in collection2 if elem not in collection1]
    return (uniq1, uniq2)

def from_line_list_to_line_count(line_list):
    # Map each distinct line to the number of times it occurs.
    lines = SortedDict()

    for line in line_list:
        lines[line] = lines.get(line, 0) + 1

    return lines

def duplicated_lines(line_list):
    # Lines that occur more than once.
    lines_count = from_line_list_to_line_count(line_list)
    return [line for line, count in lines_count.items() if count > 1]

if __name__ == "__main__":

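    with open(path1, "r") as io1, open(path2, "r") as io2:
        # Copy the files into memory; SortedList keeps the lines sorted.
        # Normalise the line ending so that a last line without "\n"
        # cannot merge with its neighbour once the lines are reordered.
        for line in io1:
            file1.add(line.rstrip("\n") + "\n")

        for line in io2:
            file2.add(line.rstrip("\n") + "\n")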

    with open(path1, "w") as io1, open(path2, "w") as io2:
        # Rewrite the files in sorted order.
        for line in file1:
            io1.write(line)

        for line in file2:
            io2.write(line)

    # Report once both files have been written and closed.
    print("There are {0} distinct key values in {1}".format(len(get_values_column(file1, 0, delimiter, commentSign)), path1))
    print("There are {0} distinct key values in {1}".format(len(get_values_column(file2, 0, delimiter, commentSign)), path2))

    uniques = count_not_in_other(file1, file2)
    print("There are {0} lines present in file1 that are not present in file2".format(len(uniques[0])))
    print("There are {0} lines present in file2 that are not present in file1".format(len(uniques[1])))

    print("file1 duplicated lines are: {0}".format(duplicated_lines(file1)))
    print("file2 duplicated lines are: {0}".format(duplicated_lines(file2)))

I used this data:

DATA1

#id; name; value
10; foo; 100
10; foo; 100
10; foo; 101
11; foo; 50
13; bar; 500

DATA2

#id; name; value
10; foo; 100
11; foo; 50
13; bar; 500
18; bar foo; 46
10; foo; 100
10; foo; 101
18; bar foo; 46
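
For items a) to d) of the question, which work on the key column and not on whole lines, a minimal sketch using only the standard library might look like the following. The helper names (key_of, key_counts, rows_with_duplicated_keys) are hypothetical and not part of the code above; the sketch assumes the key is the first column, the delimiter is ";" and comment lines start with "#", as in DATA1.

```python
from collections import Counter

def key_of(line, key_column=0, delimiter=";", comment_sign="#"):
    """Return the key of a data row, or None for blank and comment lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith(comment_sign):
        return None
    return stripped.split(delimiter)[key_column].strip()

def key_counts(lines):
    """Count how many data rows share each key value."""
    return Counter(k for k in map(key_of, lines) if k is not None)

def rows_with_duplicated_keys(lines, counts):
    """Rows whose key occurs more than once (lines must be a list, not a file)."""
    return [line for line in lines if counts.get(key_of(line), 0) > 1]

data1 = [
    "#id; name; value",
    "10; foo; 100",
    "10; foo; 100",
    "10; foo; 101",
    "11; foo; 50",
    "13; bar; 500",
]

counts1 = key_counts(data1)
dups1 = {key: n for key, n in counts1.items() if n > 1}
dup_rows1 = rows_with_duplicated_keys(data1, counts1)

print(sum(counts1.values()), "rows,", len(counts1), "distinct keys")  # 5 rows, 3 distinct keys
print("duplicated keys:", dups1)  # duplicated keys: {'10': 3}
```

For a real file, the list could come from readlines(), and dup_rows1 could then be written to a separate file of your choosing.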

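Since you asked for a very complete piece of work without giving any clue about what you had already tried, this is all I will give you. Try to understand the code and complete it. For now, it sorts the files and gets the number of distinct keys in each file.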
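
As a starting point for the remaining items (common records, mismatch count, mismatches as a separate file), one possible sketch treats each file as a multiset of whole lines using collections.Counter. This swaps in a different technique from the SortedList code above, and the names load_records, compare_records and write_records are hypothetical. It assumes two records match only when the whole line is identical after stripping surrounding whitespace.

```python
from collections import Counter

def load_records(lines, comment_sign="#"):
    """Multiset of data rows, ignoring blank lines and comment lines."""
    return Counter(
        line.strip() for line in lines
        if line.strip() and not line.strip().startswith(comment_sign)
    )

def compare_records(lines1, lines2):
    """Return sorted (common, only_in_1, only_in_2) lists of records."""
    records1, records2 = load_records(lines1), load_records(lines2)
    common = sorted((records1 & records2).elements())     # minimum of both counts
    only_in_1 = sorted((records1 - records2).elements())  # surplus in file 1
    only_in_2 = sorted((records2 - records1).elements())  # surplus in file 2
    return common, only_in_1, only_in_2

def write_records(path, records):
    """Write one record per line; path is whatever output file you choose."""
    with open(path, "w") as out:
        out.writelines(record + "\n" for record in records)

data1 = ["#id; name; value", "10; foo; 100", "10; foo; 100",
         "10; foo; 101", "11; foo; 50", "13; bar; 500"]
data2 = ["#id; name; value", "10; foo; 100", "11; foo; 50", "13; bar; 500",
         "18; bar foo; 46", "10; foo; 100", "10; foo; 101", "18; bar foo; 46"]

common, only_in_1, only_in_2 = compare_records(data1, data2)
print(len(common), "common records")                     # 5 common records
print(len(only_in_1) + len(only_in_2), "mismatched records")  # 2 mismatched records
```

Because the functions accept any iterable of lines, an open file object can be passed in place of the lists, and write_records can then save only_in_1 and only_in_2 to separate files.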

Note: I am not affiliated with the sortedcontainers library in any way.