I have two datasets.
The first dataset looks like this:
Storm_ID,Cell_ID,Wind_speed
2,10236258,27
2,10236300,58
2,10236301,25
3,10240400,51
The second dataset looks like this:
Storm_ID,Cell_ID,Storm_surge
2,10236299,0.27
2,10236300,0.27
2,10236301,0.35
2,10240400,0.35
2,10240401,0.81
4,10240402,0.11
Now I want output that looks like this:
Storm_ID,Cell_ID,Wind_speed,Storm_surge
2,10236258,27,0
2,10236299,0,0.27
2,10236300,58,0.27
2,10236301,25,0.35
2,10240400,0,0.35
2,10240401,0,0.81
3,10240400,51,0
4,10240402,0,0.11
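I tried the Linux join command for this task and it failed: join skipped the rows that had no match in the other file. I could use Matlab, but the data is larger than 100 GB, which makes the task very difficult. Could someone please guide me? I can use SQL or Python for this. Thanks for your help.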
Answer 0 (score: 1)
I think you want a full outer join:
select storm_id, cell_id,
coalesce(d1.wind_speed, 0) as wind_speed,
coalesce(d2.storm_surge, 0) as storm_surge
from dataset1 d1 full join
dataset2 d2
using (storm_id, cell_id);
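To make the semantics of that full outer join concrete, here is a minimal Python sketch of the same logic applied to the sample rows from the question. The function name `full_outer_join` and the in-memory dict are illustrative assumptions, not part of the answer; at 100 GB the join would have to run in the database or as a streaming merge rather than in a dict.

```python
import csv
import io

def full_outer_join(wind_csv, surge_csv):
    """Join two CSV texts on (Storm_ID, Cell_ID); missing values become "0"."""
    merged = {}  # (storm_id, cell_id) -> [wind_speed, storm_surge]
    for column, text in enumerate((wind_csv, surge_csv)):
        reader = csv.reader(io.StringIO(text))
        next(reader)  # skip the header row
        for storm_id, cell_id, value in reader:
            row = merged.setdefault((int(storm_id), int(cell_id)), ["0", "0"])
            row[column] = value  # column 0 is wind speed, column 1 is storm surge
    return [(s, c, w, g) for (s, c), (w, g) in sorted(merged.items())]

# Sample data copied from the question.
wind = """Storm_ID,Cell_ID,Wind_speed
2,10236258,27
2,10236300,58
2,10236301,25
3,10240400,51
"""
surge = """Storm_ID,Cell_ID,Storm_surge
2,10236299,0.27
2,10236300,0.27
2,10236301,0.35
2,10240400,0.35
2,10240401,0.81
4,10240402,0.11
"""

rows = full_outer_join(wind, surge)
print("Storm_ID,Cell_ID,Wind_speed,Storm_surge")
for row in rows:
    print(",".join(map(str, row)))
```

On the sample data this should print the eight rows of the desired output, with 0 filled in wherever one side has no matching key, which is what `coalesce(..., 0)` does in the query.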
Answer 1 (score: 0)
Shell-only solution
Back up your files first.
Assuming your files are named wind1.txt and wind2.txt,
you can apply the following set of shell commands:
perl -pi -E "s/,/_/" wind*
perl -pi -E 's/(.$)/$1,0/' wind1.txt
perl -pi -E "s/,/,0,/" wind2.txt
join --header -a 1 -a 2 wind1.txt wind2.txt > outfile.txt
Intermediate result:
Storm_ID_Cell_ID,Wind_speed,0
2_10236258,27,0
2_10236299,0,0.27
2_10236300,0,0.27
2_10236300,58,0
2_10236301,0,0.35
2_10236301,25,0
2_10240400,0,0.35
2_10240401,0,0.81
3_10240400,51,0
4_10240402,0,0.11
Now rename the 0 column to "Storm_surge" and replace the first _ after the leading digits with ",":
perl -pi -E "s/Wind_speed,0/Wind_speed,Storm_surge/" outfile.txt
perl -pi -E 's/^(\d+)_/$1,/' outfile.txt
perl -pi -E "s/Storm_ID_Cell_ID/Storm_ID,Cell_ID/" outfile.txt
Intermediate result:
Storm_ID,Cell_ID,Wind_speed,Storm_surge
2,10236258,27,0
2,10236299,0,0.27
2,10236300,0,0.27
2,10236300,58,0
2,10236301,0,0.35
2,10236301,25,0
2,10240400,0,0.35
2,10240401,0,0.81
3,10240400,51,0
4,10240402,0,0.11
Finally, run:
awk 'BEGIN { FS=OFS=SUBSEP="," } NR==1 { print; fflush(); next } { wind[$1,$2]+=$3; surge[$1,$2]+=$4 } END { for (k in wind) print k, wind[k], surge[k] | "sort -t, -k1,1n -k2,2n" }' outfile.txt
(Sorry, the question was closed while I was answering.)
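The last step collapses the two rows that share a (Storm_ID, Cell_ID) key into one. As a hedged illustration of that step only, the Python sketch below takes the intermediate rows shown above and keeps wind speed and storm surge in separate accumulators so that they stay in separate output columns. The function name `collapse` is hypothetical, and the sketch holds every key in memory, so it is meant for understanding the logic rather than for 100 GB inputs.

```python
from collections import defaultdict

def collapse(lines):
    """Merge rows sharing (Storm_ID, Cell_ID), summing each value column separately."""
    totals = defaultdict(lambda: [0.0, 0.0])  # key -> [wind_speed, storm_surge]
    for line in lines:
        storm_id, cell_id, wind, surge = line.strip().split(",")
        key = (int(storm_id), int(cell_id))
        totals[key][0] += float(wind)
        totals[key][1] += float(surge)
    return dict(sorted(totals.items()))

# Intermediate rows from the answer, header omitted.
intermediate = """2,10236258,27,0
2,10236299,0,0.27
2,10236300,0,0.27
2,10236300,58,0
2,10236301,0,0.35
2,10236301,25,0
2,10240400,0,0.35
2,10240401,0,0.81
3,10240400,51,0
4,10240402,0,0.11
""".splitlines()

merged = collapse(intermediate)
for (storm_id, cell_id), (wind, surge) in merged.items():
    print(f"{storm_id},{cell_id},{wind:g},{surge:g}")
```

Because each key has at most one non-zero wind value and one non-zero surge value, summing per column is equivalent to picking the non-zero entry from each pair of rows.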
Answer 2 (score: 0)
awk -F, -v OFS=, '{x=$1","$2} FNR==NR {a[x]=$3; b[x]=0; next} {b[x]=$3} !a[x] {a[x]=0} END {for (i in a) print i, a[i], b[i]}' f1 f2 | sort -n
Because the END block iterates with for (i in a), awk emits the keys in an unspecified order, hence the sort at the end.
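The awk one-liner keeps every key in memory, which may not be practical for the 100 GB mentioned in the question. As an alternative sketch, not something the answers propose, the Python below does the same full outer join as a streaming merge that holds only one row from each file at a time. It assumes both files are already sorted numerically by (Storm_ID, Cell_ID) with unique keys, for example via `sort -t, -k1,1n -k2,2n` on the data rows; the function names `read_keyed` and `merge_sorted` are hypothetical.

```python
import csv
import io

def read_keyed(handle):
    """Yield ((storm_id, cell_id), value) pairs from a CSV with a header row."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    for storm_id, cell_id, value in reader:
        yield (int(storm_id), int(cell_id)), value

def merge_sorted(wind_handle, surge_handle):
    """Full outer join of two key-sorted streams (assumption: sorted, unique keys)."""
    winds, surges = read_keyed(wind_handle), read_keyed(surge_handle)
    w, s = next(winds, None), next(surges, None)
    while w is not None or s is not None:
        if s is None or (w is not None and w[0] < s[0]):
            yield (*w[0], w[1], "0")   # key only in the wind file
            w = next(winds, None)
        elif w is None or s[0] < w[0]:
            yield (*s[0], "0", s[1])   # key only in the surge file
            s = next(surges, None)
        else:
            yield (*w[0], w[1], s[1])  # key present in both files
            w, s = next(winds, None), next(surges, None)

# Sample data copied from the question; real use would pass open file objects.
wind = """Storm_ID,Cell_ID,Wind_speed
2,10236258,27
2,10236300,58
2,10236301,25
3,10240400,51
"""
surge = """Storm_ID,Cell_ID,Storm_surge
2,10236299,0.27
2,10236300,0.27
2,10236301,0.35
2,10240400,0.35
2,10240401,0.81
4,10240402,0.11
"""

rows = list(merge_sorted(io.StringIO(wind), io.StringIO(surge)))
for row in rows:
    print(",".join(map(str, row)))
```

Since the inputs are consumed in key order, the output is already sorted and no final sort is needed.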