I used to process CSV files with awk. Here is my first script:
tail -n +2 shifted_final.csv | awk -F, 'NR == 1 || $2 != old {print; old = $2}' | less
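This script looks for repeated values in column 2 (that is, the value on line n is the same as on lines n+1, n+2, ...) and prints only the first occurrence. For example, given the following input: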
ord,orig,pred,as,o-p
1,0,0,1.0,0
2,0,0,1.0,0
3,0,0,1.0,0
4,0,0,0.0,0
5,0,0,0.0,0
6,0,0,0.0,0
7,0,0,0.0,0
8,0,0,0.0,0
9,0,0,0.0,0
10,0,0,0.0,0
11,0,0,0.0,0
12,0,0,0.0,0
13,0,0,0.0,0
14,0,0,0.0,0
15,0,0,0.0,0
16,0,0,0.0,0
17,0,0,0.0,0
18,0,0,0.0,0
19,0,0,0.0,0
20,0,0,0.0,0
21,0,0,0.0,0
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0
the output will be:
1,0,0,1.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0
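Edit: I am adding a second script, which I found a bit more challenging.
The second script does the same thing but prints the last occurrence in each run of repeats: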
tail -n +2 shifted_final.csv | awk -F, 'NR > 1 && $2 != old {print line} {old = $2; line = $0} END {if (NR > 0) print line}' | less
Its output will be:
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0
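I would expect a language as powerful as R to handle this kind of task, but all I could find were questions about calling awk scripts from R and the like. How can I do this in R?
Answer 0 (score: 5)
Regarding the update to your question, here is a more general solution, thanks to @nicola: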
Idx.first <- c(TRUE, tbl$orig[-1] != tbl$orig[-nrow(tbl)])
##
R> tbl[Idx.first,]
#    ord orig pred as o.p
# 1    1    0    0  1   0
# 23  23    4    0  0   4
# 24  24  402    0  1 402
# 25  25    0    0  1   0
If you want the last occurrence of a value in a run rather than the first, just append TRUE to @nicola's indexing expression instead of prepending it:
Idx.last <- c(tbl$orig[-1] != tbl$orig[-nrow(tbl)], TRUE)
##
R> tbl[Idx.last,]
#    ord orig pred as o.p
# 22  22    0    0  0   0
# 23  23    4    0  0   4
# 24  24  402    0  1 402
# 25  25    0    0  1   0
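In either case, tbl$orig[-1] != tbl$orig[-nrow(tbl)] compares the 2nd through nth values of column 2 with the 1st through (n-1)th values of column 2. The result is a logical vector in which TRUE elements mark a change between consecutive values. Since the comparison has length n-1, prepending an extra TRUE (case 1) selects the first occurrence in each run, whereas appending an extra TRUE (case 2) selects the last occurrence in each run.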
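To see the shifted comparison on something smaller, here is a minimal sketch using a made-up vector. The names x, changed, first_idx and last_idx are illustrative only and do not appear in the question or the answers.

```r
# Hypothetical toy vector with three runs: 0 0 0, 4 4, 0
x <- c(0, 0, 0, 4, 4, 0)
n <- length(x)

# Compare elements 2..n with elements 1..(n-1); TRUE marks a change
changed <- x[-1] != x[-n]

# Prepending TRUE keeps the first element of every run
first_idx <- which(c(TRUE, changed))

# Appending TRUE keeps the last element of every run
last_idx <- which(c(changed, TRUE))

print(first_idx)  # [1] 1 4 6
print(last_idx)   # [1] 3 5 6
```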
Data:
tbl <- read.table(text = "ord,orig,pred,as,o-p
1,0,0,1.0,0
2,0,0,1.0,0
3,0,0,1.0,0
4,0,0,0.0,0
5,0,0,0.0,0
6,0,0,0.0,0
7,0,0,0.0,0
8,0,0,0.0,0
9,0,0,0.0,0
10,0,0,0.0,0
11,0,0,0.0,0
12,0,0,0.0,0
13,0,0,0.0,0
14,0,0,0.0,0
15,0,0,0.0,0
16,0,0,0.0,0
17,0,0,0.0,0
18,0,0,0.0,0
19,0,0,0.0,0
20,0,0,0.0,0
21,0,0,0.0,0
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0",
header = TRUE,
sep = ",")
Answer 1 (score: 4)
For the (updated) question, you could use, for example (thanks to @nrussell for the comment and suggestion):
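idx <- c(1, head(cumsum(rle(tbl[,2])[[1]]), -1) + 1)
tbl[idx,]
#   ord orig pred as o.p x
#1    1    0    0  1   0 1
#23  23    4    0  0   4 2
#24  24  402    0  1 402 3
#25  25    0    0  1   0 4
It returns the first row of each "block" of identical consecutive values in column orig.
rle(tbl[,2])[[1]] computes the run lengths of the consecutive values in column orig.
cumsum(...) computes the cumulative sum of those run lengths, which is the row index of the last row of each run.
head(..., -1) + 1 drops the final index and shifts the rest by one, giving the first row of every run after the first.
c(1, ...) prepends 1 so that the very first row of the data is always included.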
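The run-length approach can also be checked on a small made-up vector. In this sketch (the names x, runs, first_idx and last_idx are illustrative, not from the answers), the cumulative sum of the run lengths gives the position of the last element of each run, and subtracting the run length and adding 1 gives the position of the first.

```r
# Hypothetical toy vector with three runs: 0 0 0, 4 4, 0
x <- c(0, 0, 0, 4, 4, 0)

runs <- rle(x)                            # run lengths 3 2 1, values 0 4 0
last_idx  <- cumsum(runs$lengths)         # last position of each run
first_idx <- last_idx - runs$lengths + 1  # first position of each run

print(first_idx)  # [1] 1 4 6
print(last_idx)   # [1] 3 5 6
```

Applied to the data frame from the answers, the same indices computed from rle(tbl$orig) could be used as tbl[first_idx, ] or tbl[last_idx, ] to pick the first or last row of each run.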