Do I need to preprocess the file with awk, or can this be done directly in R?

Asked: 2015-11-16 19:09:35

Tags: r csv awk

I used to process CSV files with awk. Here is my first script:

tail -n +2 shifted_final.csv | awk -F, 'BEGIN {old=$2} {if($2!=old){print $0; old=$2;}}' | less

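This script looks for runs of repeated values in column 2 (row n has the same value as rows n+1, n+2, ...) and prints only the first row of each run. For example, given the following input: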

ord,orig,pred,as,o-p
1,0,0,1.0,0
2,0,0,1.0,0
3,0,0,1.0,0
4,0,0,0.0,0
5,0,0,0.0,0
6,0,0,0.0,0
7,0,0,0.0,0
8,0,0,0.0,0
9,0,0,0.0,0
10,0,0,0.0,0
11,0,0,0.0,0
12,0,0,0.0,0
13,0,0,0.0,0
14,0,0,0.0,0
15,0,0,0.0,0
16,0,0,0.0,0
17,0,0,0.0,0
18,0,0,0.0,0
19,0,0,0.0,0
20,0,0,0.0,0
21,0,0,0.0,0
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0

the output will be:

1,0,0,1.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0

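Edit: I am adding a second script to make this a bit more challenging.

The second script does the same thing but prints the last row of each run: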

tail -n +2 shifted_final.csv | awk -F, 'BEGIN {old=$2; line=$0} {if($2==old){line=$0}else{print line; old=$2; line=$0}} END {print $0}' | less

Its output will be:

22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0

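I think R is a powerful language that should be able to handle this kind of task, but I have only found questions about calling awk scripts from R and the like. How can I do this in R?

2 answers:

Answer 0 (score: 5)

Following the update to your question, here is a more general solution, thanks to @nicola: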

Idx.first <- c(TRUE, tbl$orig[-1] != tbl$orig[-nrow(tbl)])
##
R> tbl[Idx.first,]
#    ord orig pred as o.p
# 1    1    0    0  1   0
# 23  23    4    0  0   4
# 24  24  402    0  1 402
# 25  25    0    0  1   0

If you want the last occurrence in each run rather than the first, simply append TRUE to @nicola's indexing expression instead of prepending it:

Idx.last <- c(tbl$orig[-1] != tbl$orig[-nrow(tbl)], TRUE)
##
R> tbl[Idx.last,]
#    ord orig pred as o.p
# 22  22    0    0  0   0
# 23  23    4    0  0   4
# 24  24  402    0  1 402
# 25  25    0    0  1   0

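In either case, tbl$orig[-1] != tbl$orig[-nrow(tbl)] compares the 2nd through nth values in column 2 with the 1st through (n-1)th values in column 2. The result is a logical vector in which TRUE elements mark a change between consecutive values. Since the comparison has length n-1, prepending an extra TRUE (case 1) selects the first occurrence in each run, whereas appending an extra TRUE (case 2) selects the last occurrence in each run.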
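To see the shifted comparison at work, here is a minimal sketch on a toy vector. The names `x`, `changed`, `first.idx` and `last.idx` are illustrative only and do not appear in the answer above.

```r
# Toy vector with three runs: 0 0 0 | 4 4 | 0
x <- c(0, 0, 0, 4, 4, 0)
n <- length(x)

# Compare elements 2..n with elements 1..(n-1); TRUE marks a change.
changed <- x[-1] != x[-n]
# changed is FALSE FALSE TRUE FALSE TRUE (length n - 1)

# Prepending TRUE flags the first element of each run,
# appending TRUE flags the last element of each run.
first.idx <- c(TRUE, changed)
last.idx  <- c(changed, TRUE)

which(first.idx)  # 1 4 6
which(last.idx)   # 3 5 6
```

Note that if the column contains NA, the comparison yields NA at those positions, so such data would need separate handling before indexing.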

Data:

tbl <- read.table(text = "ord,orig,pred,as,o-p
1,0,0,1.0,0
2,0,0,1.0,0
3,0,0,1.0,0
4,0,0,0.0,0
5,0,0,0.0,0
6,0,0,0.0,0
7,0,0,0.0,0
8,0,0,0.0,0
9,0,0,0.0,0
10,0,0,0.0,0
11,0,0,0.0,0
12,0,0,0.0,0
13,0,0,0.0,0
14,0,0,0.0,0
15,0,0,0.0,0
16,0,0,0.0,0
17,0,0,0.0,0
18,0,0,0.0,0
19,0,0,0.0,0
20,0,0,0.0,0
21,0,0,0.0,0
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0",
header = TRUE,
sep = ",")
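If the data live in a file rather than an inline string, a sketch along the following lines would skip the awk step entirely. It assumes the file has the header row shown in the question; `first_of_runs` is a hypothetical helper name, and the temporary file merely stands in for shifted_final.csv so that the example runs on its own.

```r
# Hypothetical helper: keep the first row of every run of equal values in `col`.
first_of_runs <- function(df, col) {
  if (nrow(df) == 0) return(df)
  v <- df[[col]]
  df[c(TRUE, v[-1] != v[-length(v)]), , drop = FALSE]
}

# Stand-in for shifted_final.csv (a few rows in the same layout).
path <- tempfile(fileext = ".csv")
writeLines(c("ord,orig,pred,as,o-p",
             "1,0,0,1.0,0",
             "2,0,0,1.0,0",
             "3,4,0,0.0,4",
             "4,0,0,1.0,0"), path)

# read.csv consumes the header itself, so no `tail -n +2` is needed.
# With the default check.names = TRUE the column "o-p" becomes "o.p".
dat <- read.csv(path)
res <- first_of_runs(dat, "orig")
res$ord  # 1 3 4
```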

Answer 1 (score: 4)

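For the (updated) question, you could use, for example (thanks to @nrussell for his comment and suggestion):

idx <- c(1, head(cumsum(rle(tbl[,2])$lengths), -1) + 1)
tbl[idx,]
#   ord orig pred as o.p
#1    1    0    0  1   0
#23  23    4    0  0   4
#24  24  402    0  1 402
#25  25    0    0  1   0

This returns the first row of each "block" of identical values in the column orig.

  • rle(tbl[,2])$lengths computes the length of each run of identical consecutive values in the column orig
  • cumsum(...) computes the cumulative sum of those run lengths, which is the row number at which each run ends
  • head(..., -1) + 1 drops the last of these end positions and adds 1, giving the row at which each run after the first begins
  • finally, c(1, ...) puts 1 in front, so that the first row of the data is always included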
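As a sketch of how run lengths map to row positions, the toy example below derives both the first and the last position of every run from `rle`. The vector `x` and the names `r`, `ends` and `starts` are illustrative assumptions, not part of the answer.

```r
# Toy vector with three runs: 0 0 0 | 4 4 | 0
x <- c(0, 0, 0, 4, 4, 0)

r <- rle(x)
r$lengths   # 3 2 1  (length of each run)
r$values    # 0 4 0  (value of each run)

# The cumulative sum of the run lengths is the position where each run ends.
ends <- cumsum(r$lengths)        # 3 5 6

# Stepping back by the run length (plus one) gives where each run starts.
starts <- ends - r$lengths + 1   # 1 4 6
```

Indexing a data frame with `starts` would keep the first row of each run, and indexing with `ends` would keep the last, which corresponds to the two awk scripts in the question.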