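I have two large delimited files.
I need help with the following:
a) Get the row count of both files based on their key columns
b) Find duplicates based on the key columns in both files
c) Get the duplicate count from both files
d) Write the duplicates to a separate file
e) Get the records common to both files
f) Sort both files (common records)
g) After sorting, compare the files and get the mismatch count
h) Write the mismatched records to a separate file
Any help is greatly appreciated.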
Answer 0 (score: 0)
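Since you have to sort the files, the simplest approach is to load them into memory (assuming they fit). You could do something like this: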
# See http://www.grantjenks.com/docs/sortedcontainers/ for information about sorted containers.
# They are efficient for huge data.
from sortedcontainers import SortedList, SortedDict

file1 = SortedList()
file2 = SortedList()
delimiter = ";"
commentSign = "#"
path1 = "./data1"
path2 = "./data2"


def get_values_column(delimited_lines, column_number, delimiter, commentSign):
    # Collect the distinct values of one column, skipping comments and blank lines.
    values = set()
    for line in delimited_lines:
        if line.strip() and not line.startswith(commentSign):
            fields = line.split(delimiter)
            values.add(fields[column_number].strip())
    return values


def count_not_in_other(collection1, collection2):
    # Return the elements found only in collection1 and only in collection2.
    uniq1 = []
    uniq2 = []
    for elem in collection1:
        if elem not in collection2:
            uniq1.append(elem)
    for elem in collection2:
        if elem not in collection1:
            uniq2.append(elem)
    return (uniq1, uniq2)


def from_line_list_to_line_count(line_list):
    # Map each distinct line to the number of times it occurs.
    lines = SortedDict()
    for line in line_list:
        if line not in lines:
            lines[line] = 0
        lines[line] += 1
    return lines


def duplicated_lines(line_list):
    # Return the lines that occur more than once.
    lines_count = from_line_list_to_line_count(line_list)
    return [line for line, count in lines_count.items() if count > 1]


if __name__ == "__main__":
    with open(path1, "r") as io1, open(path2, "r") as io2:
        # Copy the files into memory; the SortedList keeps the lines sorted.
        # A missing final newline is added so that lines do not merge when rewritten.
        for line in io1:
            file1.add(line if line.endswith("\n") else line + "\n")
        for line in io2:
            file2.add(line if line.endswith("\n") else line + "\n")
    with open(path1, "w") as io1, open(path2, "w") as io2:
        # Rewrite the files in sorted order (this overwrites the input files).
        for line in file1:
            io1.write(line)
        for line in file2:
            io2.write(line)
    print("There are {0} distinct key values in {1}".format(len(get_values_column(file1, 0, delimiter, commentSign)), path1))
    print("There are {0} distinct key values in {1}".format(len(get_values_column(file2, 0, delimiter, commentSign)), path2))
    uniques = count_not_in_other(file1, file2)
    print("There are {0} lines present in file1 that are not present in file2".format(len(uniques[0])))
    print("There are {0} lines present in file2 that are not present in file1".format(len(uniques[1])))
    print("file1 duplicated lines are: {0}".format(duplicated_lines(file1)))
    print("file2 duplicated lines are: {0}".format(duplicated_lines(file2)))
I used this data:
DATA1
#id; name; value
10; foo; 100
10; foo; 100
10; foo; 101
11; foo; 50
13; bar; 500
DATA2
#id; name; value
10; foo; 100
11; foo; 50
13; bar; 500
18; bar foo; 46
10; foo; 100
10; foo; 101
18; bar foo; 46
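For items b) to d) of the question (duplicates by key column rather than by whole line), a minimal sketch could look like the following. It assumes the key is the first column, that `;` is the delimiter and `#` marks a comment, as in the sample data; the function names `iter_records`, `duplicated_keys`, `lines_with_duplicated_keys` and `write_lines` are made up for illustration and are not part of any library.

```python
from collections import Counter


def iter_records(lines, column=0, delimiter=";", comment_sign="#"):
    """Yield (key, line) pairs, skipping blank lines and comment lines."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(comment_sign):
            continue
        yield stripped.split(delimiter)[column].strip(), stripped


def duplicated_keys(lines, column=0, delimiter=";", comment_sign="#"):
    """Return {key: count} for every key that occurs more than once."""
    counts = Counter(key for key, _ in iter_records(lines, column, delimiter, comment_sign))
    return {key: n for key, n in counts.items() if n > 1}


def lines_with_duplicated_keys(lines, column=0, delimiter=";", comment_sign="#"):
    """Return every data line whose key is duplicated.

    `lines` must be a list (or another re-iterable), not an open file handle,
    because it is traversed twice.
    """
    duplicates = duplicated_keys(lines, column, delimiter, comment_sign)
    return [line for key, line in iter_records(lines, column, delimiter, comment_sign)
            if key in duplicates]


def write_lines(path, lines):
    """Write the given lines to `path`, one per line (hypothetical helper)."""
    with open(path, "w") as out:
        for line in lines:
            out.write(line + "\n")
```

The duplicate count for a file is then `sum(duplicated_keys(lines).values())`, and `write_lines("./data1.duplicates", lines_with_duplicated_keys(lines))` would put the duplicates in a separate file (the output path here is only an example).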
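Since you asked for a very complete piece of work without giving any hint of what you have already tried, I am only giving you this much. Try to understand the code and complete it. For now, it sorts the files and gets the number of distinct keys in each file.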
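As a possible starting point for the remaining items (common records and mismatched records), here is a hedged sketch based on plain Python sets. It compares whole lines, ignores comment and blank lines, and discards duplicates within a file, so it is only one interpretation of "common" and "mismatch". The names `data_lines` and `compare_records` are hypothetical.

```python
def data_lines(lines, comment_sign="#"):
    """Return the set of stripped data lines, without comments or blank lines."""
    stripped_lines = (line.strip() for line in lines)
    return {s for s in stripped_lines if s and not s.startswith(comment_sign)}


def compare_records(lines1, lines2, comment_sign="#"):
    """Return (common, only_in_1, only_in_2) as sorted lists of whole lines.

    Sorting is lexicographic on the line text, and duplicates inside
    one file are collapsed because sets are used.
    """
    records1 = data_lines(lines1, comment_sign)
    records2 = data_lines(lines2, comment_sign)
    common = sorted(records1 & records2)
    only_in_1 = sorted(records1 - records2)
    only_in_2 = sorted(records2 - records1)
    return common, only_in_1, only_in_2
```

The mismatch count is then `len(only_in_1) + len(only_in_2)`, and each of the three lists can be written to its own file with an ordinary `open(path, "w")` loop.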
Note: I am not affiliated with the sortedcontainers library in any way.