I have two CSV files, each containing two columns.
file1.csv
C2-C1 1.5183
C3-C2 1.49
C3-C1 1.4991
O4-C3 1.4104
C1-C2-C3 59.78
file2.csv
C2-C1 1.5052
C3-C2 1.505
C3-C1 1.5037
S4-C3 1.7976
C1-C2-C3 59.95
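I want to print three columns in the output file. Column 1 holds the row labels: first the rows that appear in both files, then the rows that appear in only one of them.
Columns 2 and 3 hold the second-column values from file1.csv and file2.csv, respectively.
Desired output.csv: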
C2-C1 1.5183 1.5052
C3-C2 1.49 1.505
C3-C1 1.4991 1.5037
C1-C2-C3 59.78 59.95
O4-C3 1.4104 -
S4-C3 - 1.7976
I tried `itertools`, but I could not find a suitable format for the rows that differ.
import itertools

files = ['1.csv', '2.csv']
d = {}
for fi, f in enumerate(files):
    fh = open(f)
    for line in fh:
        sl = line.split()
        name = sl[0]
        val = float(sl[1])
        if name not in d:
            d[name] = {}
        if fi not in d[name]:
            d[name][fi] = []
        d[name][fi].append(val)
    fh.close()
for name, vals in d.items():
    if len(vals) == len(files):
        for var in itertools.product(*vals.values()):
            if max(var) - min(var) <= 20:
                out1 = '{}\t{}'.format(name, "\t".join(map(str, var)))
                print(out1)
                break
for name, vals in d.items():
    if len(vals) != len(files):
        for var in itertools.product(*vals.values()):
            if max(var) - min(var) <= 20:
                out2 = '{}\t{}'.format(name, "\t".join(map(str, var)))
                print(out2)
                break
My output:
C2-C1 1.5183 1.5052
C3-C2 1.49 1.505
C3-C1 1.4991 1.5037
C1-C2-C3 59.78 59.95
O4-C3 1.4104
S4-C3 1.7976
Answer 0 (score: 3)
The following awk may help you; it should also cope with duplicate items in the Input_file.
awk '
FNR==NR{
  a[$1]=$2;
  next}
NF{
  printf("%s %s %s\n",$1,$1 in a?a[$1]:"-",$2);
  b[$1]=$1 in a?$1:""
}
END{
  for(i in a){
    if(!b[i] || b[i]==""){ print i,a[i],"-" }}
}' file1.csv file2.csv | column -t
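For readers more comfortable with Python, the two-pass logic of the awk script above can be sketched roughly as follows. This is only an illustration and not part of the original answer: the function name `merge_two` and the in-memory lists standing in for the two files are assumptions.

```python
def merge_two(rows1, rows2, placeholder="-"):
    """Merge two lists of (key, value) pairs, mirroring the awk script.

    rows1 plays the role of file1.csv (the FNR==NR pass) and rows2 the
    role of file2.csv. Hypothetical helper, for illustration only.
    """
    first = dict(rows1)   # like a[$1]=$2: a repeated key keeps its last value
    seen = set()          # keys of the first file that also occur in the second
    merged = []
    for key, value in rows2:
        # one output row per line of the second file
        merged.append((key, first.get(key, placeholder), value))
        if key in first:
            seen.add(key)
    for key, value in first.items():
        # like the END block: keys that only the first file contains
        if key not in seen:
            merged.append((key, value, placeholder))
    return merged


# Sample data copied from the question; values stay strings so they print unchanged.
file1 = [("C2-C1", "1.5183"), ("C3-C2", "1.49"), ("C3-C1", "1.4991"),
         ("O4-C3", "1.4104"), ("C1-C2-C3", "59.78")]
file2 = [("C2-C1", "1.5052"), ("C3-C2", "1.505"), ("C3-C1", "1.5037"),
         ("S4-C3", "1.7976"), ("C1-C2-C3", "59.95")]

for row in merge_two(file1, file2):
    print(*row)
```

Like the awk script, this emits rows in the order of the second file, so the S4-C3 row appears before C1-C2-C3 rather than at the end; a final sort would be needed to match the requested layout exactly.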
Answer 1 (score: 2)
A pure Python solution that works with as many files as you need (it adds a new column for each file and sorts the rows by how many files share the same first-column value). As a bonus, it uses proper CSV parsing, so it can handle a variety of CSV formats with almost no changes:
import csv

files = ["1.csv", "2.csv"]  # as many files as you want
results = []  # a store for our final result
line_map = {}  # store a map for a quick update lookup
for i, f in enumerate(files):  # enumerate the file list and iterate over it
    with open(f, newline="") as f_in:  # open(f, "rb") on Python 2.x
        reader = csv.reader(f_in, delimiter=" ")  # proper CSV reader, assumed space delimiter
        for row in reader:  # iterate over the current CSV line by line
            row_id = row[0]  # extract the first column for easier access
            if row_id not in line_map:  # a column value encountered for the first time...
                line_map[row_id] = [row_id] + ["-"] * len(files)  # create a placeholder list
                results.append(line_map[row_id])  # register the new row in the results
            line_map[row_id][i+1] = row[1]  # save the value in its place in the results list
# now we need to bracket the results in order of number of values before writing
# the easiest way is to just sort based on the amount of blank spaces
results = sorted(results, key=lambda x: x.count("-"))
Now, if you just want to print it:
for r in results:
    print("\t".join(r))
# C2-C1 1.5183 1.5052
# C3-C2 1.49 1.505
# C3-C1 1.4991 1.5037
# C1-C2-C3 59.78 59.95
# O4-C3 1.4104 -
# S4-C3 - 1.7976
Or, if you want to actually save it to a properly formatted CSV file:
with open("output.csv", "w", newline="") as f:  # open("output.csv", "wb") on Python 2.x
    writer = csv.writer(f, delimiter="\t")  # a proper CSV writer, tab used as a delimiter
    writer.writerows(results)
Answer 2 (score: 2)
A GNU awk solution using 2D arrays, ARGIND, and column -t for pretty printing. It supports more than two files:
$ awk '
{ a[$1][ARGIND]=$2 }                                  # hash to 2d array
END {
    for(i in a) {                                     # iterate all a
        printf "%s",i                                 # output key
        for(j=1;j<=ARGIND;j++)                        # iterate all data in a
            printf "%s%s", OFS, (a[i][j]==""?"-":a[i][j]) # output
        print ""                                      # finish with a newline
    }
}' file1 file2 file1 file2 | column -t                # pretty print
C1-C2-C3 59.78 59.95 59.78 59.95
O4-C3 1.4104 - 1.4104 -
S4-C3 - 1.7976 - 1.7976
C3-C1 1.4991 1.5037 1.4991 1.5037
C3-C2 1.49 1.505 1.49 1.505
C2-C1 1.5183 1.5052 1.5183 1.5052
Answer 3 (score: 2)
$ cat tst.awk
NR==FNR {
    file2[$1] = $2
    next
}
{
    print $0, ($1 in file2 ? file2[$1] : "-")
    delete file2[$1]
}
END {
    for (key in file2) {
        print key, "-", file2[key]
    }
}
$ awk -f tst.awk file2.csv file1.csv | column -t
C2-C1 1.5183 1.5052
C3-C2 1.49 1.505
C3-C1 1.4991 1.5037
O4-C3 1.4104 -
C1-C2-C3 59.78 59.95
S4-C3 - 1.7976
Answer 4 (score: 1)
Awk solution:
awk 'NR == FNR{ a[$1] = $2; next }
{
    if ($1 in a) { print $1, $2, a[$1]; delete a[$1] }
    else a[$1] = $2 OFS "-"
}
END{
    for (i in a) print i, (a[i] ~ /-$/ ? a[i] : "-" OFS a[i])
}' file2.csv file1.csv | column -t
Output:
C2-C1 1.5183 1.5052
C3-C2 1.49 1.505
C3-C1 1.4991 1.5037
C1-C2-C3 59.78 59.95
O4-C3 1.4104 -
S4-C3 - 1.7976
Answer 5 (score: 0)
If you don't mind using pandas, it makes life much easier:
import pandas as pd

df1 = pd.DataFrame({'num01': [1.5183, 1.49, 1.4991, 1.4104, 59.78]},
                   index=['C2-C1', 'C3-C2', 'C3-C1', 'O4-C3', 'C1-C2-C3'])
df2 = pd.DataFrame({'num02': [1.5052, 1.505, 1.5037, 1.7976, 59.95]},
                   index=['C2-C1', 'C3-C2', 'C3-C1', 'S4-C3', 'C1-C2-C3'])
# outer join on the index; missing entries become NaN, which fillna turns into '-'
df = pd.concat([df1, df2], axis=1, sort=True).fillna('-')
You can easily read your CSV files into pandas and do not have to deal with awk code.
index     num01   num02
C1-C2-C3  59.78   59.95
C2-C1     1.5183  1.5052
C3-C1     1.4991  1.5037
C3-C2     1.49    1.505
O4-C3     1.4104  -
S4-C3     -       1.7976
Answer 6 (score: 0)
Python's defaultdict can solve this, provided the default value is a list of n placeholder values:
import collections

files = ['1.csv', '2.csv']
# the default factory takes no arguments and returns one '-' per file
d = collections.defaultdict(lambda: ['-'] * len(files))
for fi, f in enumerate(files):
    with open(f) as fd:
        for line in fd:
            sl = line.split()
            name = sl[0]
            val = float(sl[1])
            d[name][fi] = val
fmt = "{:<12}" + "{:<12}" * len(files)
for k, val in d.items():
    print(fmt.format(k, *val))
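A defaultdict keeps its keys in first-seen order (Python 3.7+), so this approach on its own would list O4-C3 before C1-C2-C3 instead of putting the common rows first as the question asks. One possible way to get the requested order is a stable sort on the number of placeholders. The sketch below works on in-memory lines rather than files and keeps the values as strings; the names `merge_lines` and `sources` are illustrative assumptions, not part of the answer above.

```python
import collections


def merge_lines(sources, placeholder="-"):
    """Merge whitespace-separated 'key value' lines from several sources.

    `sources` is a list of iterables of text lines (open file objects
    would work as well). Hypothetical helper; returns rows shaped as
    [key, value_1, ..., value_n].
    """
    n = len(sources)
    table = collections.defaultdict(lambda: [placeholder] * n)
    for index, lines in enumerate(sources):
        for line in lines:
            fields = line.split()
            if len(fields) < 2:   # skip blank or malformed lines
                continue
            table[fields[0]][index] = fields[1]
    rows = [[key] + values for key, values in table.items()]
    # list.sort() is stable: rows with the same number of placeholders
    # keep their first-seen order, and complete rows move to the front
    rows.sort(key=lambda row: row[1:].count(placeholder))
    return rows


# Sample data copied from the question.
file1 = ["C2-C1 1.5183", "C3-C2 1.49", "C3-C1 1.4991",
         "O4-C3 1.4104", "C1-C2-C3 59.78"]
file2 = ["C2-C1 1.5052", "C3-C2 1.505", "C3-C1 1.5037",
         "S4-C3 1.7976", "C1-C2-C3 59.95"]

for row in merge_lines([file1, file2]):
    print("".join("{:<12}".format(cell) for cell in row))
```

Keeping the values as strings avoids any change in how numbers such as 1.49 are printed; converting with `float` is only needed if the values are used in calculations.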