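I have several data frames that each contain a subset of interest. The problem is that this subset is not laid out consistently from one data frame to the next. At a more abstract level, though, it always follows the same general structure: a rectangular region inside the data frame.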
example1 <- data.frame(x = c("name", "129-2", NA, NA, "acc", 2, 3, 4, NA, NA),
y = c(NA, NA, NA, NA, "deb", 3, 2, 5, NA, NA),
z = c(NA, NA, NA, NA, "asset", 1, 1, 2, NA, NA))
print(example1)
x y z
1 name <NA> <NA>
2 129-2 <NA> <NA>
3 <NA> <NA> <NA>
4 <NA> <NA> <NA>
5 acc deb asset
6 2 3 1
7 3 2 1
8 4 5 2
9 <NA> <NA> <NA>
10 <NA> <NA> <NA>
example1 contains a clear rectangular region that holds the structured information:
5 acc deb asset
6 2 3 1
7 3 2 1
8 4 5 2
As mentioned, the region is not always laid out the same way. Here is another example, example2:
example2 <- data.frame(x = c("name", "129-2", "wallabe #23", NA, NA, "acc", 2, 3, 4, NA ),
y = c(NA, NA, NA, NA, "balance", "deb", 3, 2, 5, NA),
z = c(NA, NA, NA, NA, NA, "asset", 1, 1, 2, NA),
u = c(NA, NA, NA, "currency:", NA, NA, NA, NA, NA, NA),
i = c(NA, NA, NA, "USD", "result", "win", 2, 3, 1, NA),
o = c(NA, NA, NA, NA, NA, "lose", 2, 2, 1, NA))
print(example2)
x y z u i o
1 name <NA> <NA> <NA> <NA> <NA>
2 129-2 <NA> <NA> <NA> <NA> <NA>
3 wallabe #23 <NA> <NA> <NA> <NA> <NA>
4 <NA> <NA> <NA> currency: USD <NA>
5 <NA> balance <NA> <NA> result <NA>
6 acc deb asset <NA> win lose
7 2 3 1 <NA> 2 2
8 3 2 1 <NA> 3 2
9 4 5 2 <NA> 1 1
10 <NA> <NA> <NA> <NA> <NA> <NA>
example2 contains a rectangular region that is less clear-cut:
6 acc deb asset <NA> win lose
7 2 3 1 <NA> 2 2
8 3 2 1 <NA> 3 2
9 4 5 2 <NA> 1 1
What would be a way to scan such a data frame and find this kind of region in it?
Any ideas are appreciated.
Answer 0 (score: 3)
You might want to try taking the longest sequence of consecutive rows that have the same number of NAs:
findTable <- function(df){
  naSeq <- rowSums(is.na(df))            # number of NAs in each row
  myRle <- rle(naSeq)$lengths            # lengths of runs of rows with equal NA counts
  df[rep(myRle == max(myRle), myRle), ]  # keep the rows of the longest run (all of them on a tie)
}
findTable(example1)
x y z
5 acc deb asset
6 2 3 1
7 3 2 1
8 4 5 2
findTable(example2)
x y z u i o
6 acc deb asset <NA> win lose
7 2 3 1 <NA> 2 2
8 3 2 1 <NA> 3 2
9 4 5 2 <NA> 1 1
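To make the mechanics of this heuristic easier to follow, here is a minimal sketch that prints the intermediate values for example2 and then shows one possible variant. The helper name findTableTrimmed is hypothetical and not part of the answer above; it assumes you want only the first longest run and that columns which are entirely NA inside that run (such as u) can be dropped.

```r
# Rebuild example2 from the question so this block runs on its own
example2 <- data.frame(
  x = c("name", "129-2", "wallabe #23", NA, NA, "acc", 2, 3, 4, NA),
  y = c(NA, NA, NA, NA, "balance", "deb", 3, 2, 5, NA),
  z = c(NA, NA, NA, NA, NA, "asset", 1, 1, 2, NA),
  u = c(NA, NA, NA, "currency:", NA, NA, NA, NA, NA, NA),
  i = c(NA, NA, NA, "USD", "result", "win", 2, 3, 1, NA),
  o = c(NA, NA, NA, NA, NA, "lose", 2, 2, 1, NA)
)

# Step 1: count the NAs in every row
naSeq <- rowSums(is.na(example2))

# Step 2: run-length encode the counts; each run is a block of
# consecutive rows with the same number of NAs
runs <- rle(unname(naSeq))
print(runs$values)   # NA count shared by each run
print(runs$lengths)  # number of rows in each run

# Hypothetical variant (an assumption, not the answer's code):
# keep only the FIRST longest run, then drop columns that are
# entirely NA within that run
findTableTrimmed <- function(df) {
  runs  <- rle(unname(rowSums(is.na(df))))
  best  <- which.max(runs$lengths)         # index of the first longest run
  last  <- cumsum(runs$lengths)[best]      # last row of that run
  first <- last - runs$lengths[best] + 1   # first row of that run
  block <- df[first:last, , drop = FALSE]
  block[, colSums(!is.na(block)) > 0, drop = FALSE]
}

trimmed <- findTableTrimmed(example2)
print(trimmed)
```

Note that the heuristic only looks at run length, so a long stretch of blank padding rows (all with the same NA count) would win over a shorter table; if that can happen in your data, you may want to ignore runs whose rows are entirely NA.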