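Within each group, I need to run a check on the rows that lie between rows meeting a specific condition, and store the result in a new column. Specifically, I want to compare the text of the in-between rows with the text of the condition row, and count how many rows separate the condition row from the first matching row. If there is no matching row between the condition row and either the end of the group or the next condition row, the value should simply be -1.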
# Determine rows where num = 1
num_ind<- which(df$num==1)
# Get up to 5 following rows between rows where num = 1
df<- df[unique(sort(num_ind + rep(0:5, each = length(num_ind)))),]
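To make the windowing step above concrete, here is a minimal sketch on a made-up `num` vector. The helper name `window_rows` and the sample values are illustrative assumptions, not part of the real data. The raw index arithmetic can run past the last row, which would make `df[idx, ]` return all-NA rows, so the sketch drops out-of-range indices.

```r
# Hypothetical helper: indices of every num == 1 row plus up to `k` following rows.
window_rows <- function(num, k = 5) {
  num_ind <- which(num == 1)
  # outer() adds every offset 0..k to every condition-row index
  idx <- sort(unique(as.vector(outer(num_ind, 0:k, `+`))))
  # Keep only indices that actually exist in the data
  idx[idx <= length(num)]
}

num <- c(1, 0, 0, 0, 0, 0, 0, 0, 1, 0)  # made-up data for illustration
rows <- window_rows(num)
print(rows)
```

Rows 7 and 8 are left out because they are more than 5 rows after the first condition row, and the window of the second condition row is cut off at the end of the vector.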
**Example data:**
id id2 num text text2
1 1 1 "a" ""
1 1 0 "" "b"
1 1 0 "" "a"
1 1 1 "" ""
1 2 1 "a" ""
1 2 0 "" "b"
1 2 0 "" "b"
1 2 0 "" "b"
2 1 0 "" "a"
2 1 0 "" "b"
2 1 0 "" "b"
3 1 1 "" ""
......
# for each group in grouped_by(id,id2)
# get rows in between rows where num = 1
# compare the text2 in each following row to text in the num=1 row
# create a column that shows how many following rows that takes
# if there isn't a match, that value would be -1
**Expected output:**
id id2 num text text2 check
1 1 1 "a" "" 2
1 1 0 "" "b" NA
1 1 0 "" "a" NA
1 1 1 "a" "" -1
1 2 1 "a" "" 4
1 2 0 "" "b" NA
1 2 0 "" "b" NA
1 2 0 "" "b" NA
1 2 0 "" "a" NA
2 1 1 "b" "" 2
2 1 0 "" "a" NA
2 1 0 "" "b" NA
Answer 0 (score: 1)
I would use the data.table library:
library(data.table)
df = data.table(df) # Make df a data table
df[, RowID := .I] # Add a row ID column
d1 = df[num == 1] # Second data table, containing only the rows with num = 1
# Join on id, id2 and d1's text = df's text2; columns coming from df get an i. prefix
d1 = d1[df, on = c("id", "id2", text = "text2")]
d1 = d1[i.num == 0 & i.RowID > RowID & i.RowID < RowID + 5] # Keep only candidate rows: num = 0 and at most 4 rows after the num = 1 row
dFinal = d1[, .(check = min(i.RowID - RowID)), by = 'RowID'] # Find which match came first
df = dFinal[df, on = "RowID"] # Join the result back onto the full table
df[num == 1 & is.na(check), check := -1L] # Fill empty checks where num = 1 with value -1
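There are several ways to condense this further and cut down on the intermediate tables, but I kept it spread out so it is easier to read and comment. I suggest stepping through it line by line to understand each piece.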
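For stepping through it, a self-contained version may help. The sketch below rebuilds the example data from the question by hand and applies the same join-and-filter idea. The variable names `cond`, `joined`, `hits` and `res` are illustrative, and `nomatch = NULL` assumes a reasonably recent data.table version. The posted example data does not line up exactly with the expected output in the question, so the result is not compared against it here.

```r
library(data.table)

# Example data from the question, typed in by hand
df <- data.table(
  id    = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3),
  id2   = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1),
  num   = c(1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1),
  text  = c("a", "", "", "", "a", "", "", "", "", "", "", ""),
  text2 = c("", "b", "a", "", "", "b", "b", "b", "a", "b", "b", "")
)
df[, RowID := .I]  # .I is the row number

# Condition rows only
cond <- df[num == 1]

# Inner join: same id and id2, and the condition row's text equals a row's text2.
# Columns that come from df (the i table) are prefixed with "i.".
joined <- cond[df, on = c(id = "id", id2 = "id2", text = "text2"), nomatch = NULL]

# Keep matches on num == 0 rows that come after the condition row and fall
# inside the same window as above, then take the nearest match per condition row
hits <- joined[i.num == 0 & i.RowID > RowID & i.RowID < RowID + 5,
               .(check = min(i.RowID - RowID)), by = RowID]

# Attach the distances to the full table; condition rows without a match get -1
res <- hits[df, on = "RowID"]
res[num == 1 & is.na(check), check := -1L]

print(res[, .(id, id2, num, text, text2, check)])
```

On this data only the first row finds a match (two rows further down); the other `num == 1` rows get -1 and all `num == 0` rows stay `NA`. Note that the `RowID + 5` window does not by itself stop at the next `num == 1` row, so if a match beyond the next condition row must be ignored, an extra filter would be needed.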