Say I already have this dataset, in this awkward layout:
originalDF <- data.frame(
Index = 1:14,
Field = c("Name", "Weight", "Age", "Name", "Weight", "Age", "Height", "Name", "Weight", "Age", "Height", "Name", "Age", "Height"),
Value = c("Sara", "115", "17", "Bob", "158", "22", "72", "Irv", "210", "42", "68", "Fred", "155", "65")
)
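I would like it to look like this:
Basically, I want to match the "Weight", "Age", and "Height" rows with the "Name" row above them. Splitting the data is easy with dplyr: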
library(dplyr)
# Rows that carry a person's name
namesDF <- originalDF %>%
  filter(Field == "Name")
# All remaining rows (Weight, Age, Height)
detailsDF <- originalDF %>%
  filter(Field != "Name")
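From here, using the index (row number) seems like the best approach: match each row in detailsDF to the entry in namesDF with the closest index, without going over it. I used the fuzzyjoin package and joined with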
fuzzy_left_join(detailsDF, namesDF, by = "Index", match_fun = list(`>`))
This sort of works, but it also joins each row in detailsDF to every row in namesDF that has a smaller index:
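I came up with a workaround that uses the distance to the next index and filters out the extra rows that way, but I would like to avoid it. The actual source file will have more than 200,000 rows, and the intermediate data frame with the extra rows would be too large to fit in memory. What can I do here? Thanks!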
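Answer 0: (score: 2)
I would suggest approaching it differently, by keeping track of the most recent "Name" value at each row. fill() from the tidyr package is useful for this.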
library(dplyr)
library(tidyr)
originalDF %>%
mutate(Name = ifelse(Field == "Name", as.character(Value), NA)) %>%
fill(Name) %>%
filter(Field != "Name")
Output:
Index Field Value Name
1 2 Weight 115 Sara
2 3 Age 17 Sara
3 5 Weight 158 Bob
4 6 Age 22 Bob
5 7 Height 72 Bob
6 9 Weight 210 Irv
7 10 Age 42 Irv
8 11 Height 68 Irv
9 13 Age 155 Fred
10 14 Height 65 Fred
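If the goal is one row per person, the filled result can be spread into columns with tidyr's pivot_wider(). The target layout is not shown in the question, so treat this as a sketch under that assumption. The Person column and the wideDF name are introduced here for illustration only; Person keeps two people who happen to share a name from being merged. The data frame is built with stringsAsFactors = FALSE so that Value is already character.

```r
library(dplyr)
library(tidyr)

originalDF <- data.frame(
  Index = 1:14,
  Field = c("Name", "Weight", "Age", "Name", "Weight", "Age", "Height",
            "Name", "Weight", "Age", "Height", "Name", "Age", "Height"),
  Value = c("Sara", "115", "17", "Bob", "158", "22", "72",
            "Irv", "210", "42", "68", "Fred", "155", "65"),
  stringsAsFactors = FALSE
)

wideDF <- originalDF %>%
  # Person: running count of "Name" rows, one id per person (hypothetical helper column)
  mutate(Person = cumsum(Field == "Name"),
         Name = ifelse(Field == "Name", Value, NA)) %>%
  fill(Name) %>%                       # carry each name down to its detail rows
  filter(Field != "Name") %>%
  select(Person, Name, Field, Value) %>%  # drop Index so rows collapse per person
  pivot_wider(names_from = Field, values_from = Value)

wideDF
```

Fields that a person lacks (Sara's Height, Fred's Weight) come out as NA, and the values stay character until converted.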
However, if you really want to use the fuzzyjoin approach, you can use group_by() and slice() on the result to keep only the last row for each value of Index.x:
fuzzy_left_join(detailsDF, namesDF, by = "Index", match_fun = list(`>`)) %>%
group_by(Index.x) %>%
slice(n()) %>%
ungroup()
Output:
# A tibble: 10 x 6
Index.x Field.x Value.x Index.y Field.y Value.y
<int> <fct> <fct> <int> <fct> <fct>
1 2 Weight 115 1 Name Sara
2 3 Age 17 1 Name Sara
3 5 Weight 158 4 Name Bob
4 6 Age 22 4 Name Bob
5 7 Height 72 4 Name Bob
6 9 Weight 210 8 Name Irv
7 10 Age 42 8 Name Irv
8 11 Height 68 8 Name Irv
9 13 Age 155 12 Name Fred
10 14 Height 65 12 Name Fred
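Note that the fuzzy join still builds every "smaller index" pair before slice() trims them, which is the memory cost the question wants to avoid. As a possible alternative (not part of the original answer), the "closest index without going over" lookup can be sketched with base R's findInterval(), which returns one position per detail row and never materializes the extra pairs. This assumes namesDF$Index is sorted in increasing order and that the first row of the data is a "Name" row; pos is a name chosen here for illustration.

```r
originalDF <- data.frame(
  Index = 1:14,
  Field = c("Name", "Weight", "Age", "Name", "Weight", "Age", "Height",
            "Name", "Weight", "Age", "Height", "Name", "Age", "Height"),
  Value = c("Sara", "115", "17", "Bob", "158", "22", "72",
            "Irv", "210", "42", "68", "Fred", "155", "65"),
  stringsAsFactors = FALSE
)

isName    <- originalDF$Field == "Name"
namesDF   <- originalDF[isName, ]
detailsDF <- originalDF[!isName, ]

# For each detail index, the position of the last name index that is <= it.
# A detail row placed before the first name would get position 0, which is
# why the "first row is a Name" assumption matters.
pos <- findInterval(detailsDF$Index, namesDF$Index)

detailsDF$Name <- namesDF$Value[pos]
detailsDF
```

The result has the same ten rows as the sliced fuzzy join, with a single Name column instead of the duplicated .x/.y columns.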
Answer 1: (score: 0)
You can use
x = which(originalDF$Field == "Name")
originalDF$Name = rep(originalDF$Value[x], times = diff(c(x, NROW(originalDF)+1)))
NewDF = originalDF[originalDF$Field != 'Name', c(4,2,3)]
# Name Field Value
# 2 Sara Weight 115
# 3 Sara Age 17
# 5 Bob Weight 158
# 6 Bob Age 22
# 7 Bob Height 72
# 9 Irv Weight 210
# 10 Irv Age 42
# 11 Irv Height 68
# 13 Fred Age 155
# 14 Fred Height 65
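To see why the rep()/diff() line works, it may help to look at the intermediate values. The sketch below repeats the same computation step by step; boundaries, runLengths, and nameForRow are names introduced here for illustration, and it assumes the first row is a "Name" row so that the run lengths add up to the number of rows.

```r
originalDF <- data.frame(
  Index = 1:14,
  Field = c("Name", "Weight", "Age", "Name", "Weight", "Age", "Height",
            "Name", "Weight", "Age", "Height", "Name", "Age", "Height"),
  Value = c("Sara", "115", "17", "Bob", "158", "22", "72",
            "Irv", "210", "42", "68", "Fred", "155", "65"),
  stringsAsFactors = FALSE
)

# Row numbers where a new person starts: 1 4 8 12
x <- which(originalDF$Field == "Name")

# Append one past the last row so the final person gets a length too: 1 4 8 12 15
boundaries <- c(x, NROW(originalDF) + 1)

# Rows belonging to each person, including the Name row itself: 3 4 4 3
runLengths <- diff(boundaries)

# Repeat each name once per row of its block, giving one name per original row
nameForRow <- rep(originalDF$Value[x], times = runLengths)
nameForRow
```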
Answer 2: (score: 0)
You can group by cumsum(Field == "Name"). With dplyr...
library(dplyr)
originalDF %>%
group_by(Name = Value[Field == "Name"][cumsum(Field == "Name")]) %>%
slice(-1) %>% select(c("Name", "Field", "Value"))
# A tibble: 10 x 3
# Groups: Name [4]
Name Field Value
<fct> <fct> <fct>
1 Bob Weight 158
2 Bob Age 22
3 Bob Height 72
4 Fred Age 155
5 Fred Height 65
6 Irv Weight 210
7 Irv Age 42
8 Irv Height 68
9 Sara Weight 115
10 Sara Age 17
With data.table...
library(data.table)
data.table(originalDF)[,
.SD[-1],
by=.(Name = Value[Field == "Name"][cumsum(Field == "Name")]), .SDcols=c("Field", "Value")]
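Both versions rely on the same grouping expression, so here is a small base R sketch of what it evaluates to. The names grp and nameForRow are introduced here for illustration; the sketch assumes the first row is a "Name" row, since otherwise cumsum() would start at 0 and the lookup would drop those rows.

```r
originalDF <- data.frame(
  Index = 1:14,
  Field = c("Name", "Weight", "Age", "Name", "Weight", "Age", "Height",
            "Name", "Weight", "Age", "Height", "Name", "Age", "Height"),
  Value = c("Sara", "115", "17", "Bob", "158", "22", "72",
            "Irv", "210", "42", "68", "Fred", "155", "65"),
  stringsAsFactors = FALSE
)

isName <- originalDF$Field == "Name"

# Running count of Name rows: 1 1 1 2 2 2 2 3 3 3 3 4 4 4
grp <- cumsum(isName)

# Look up the grp-th name for every row
nameForRow <- originalDF$Value[isName][grp]

# Dropping the Name rows afterwards is the base R equivalent of slice(-1) / .SD[-1]
result <- data.frame(Name = nameForRow, originalDF[, c("Field", "Value")])[!isName, ]
result
```

Unlike the grouped dplyr output above, this keeps the rows in their original order rather than sorting them by name.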