I have a CSV file in which each row contains a name followed by a series of empty values and clusters of actual values.
Robert,,,1:00-5:00,1:00-5:00,1:00-5:00,,,,,,2:00-4:00,2:00-4:00,2:00-4:00
John,,,1:00-5:00,1:00-5:00,,,,,,,,,,,,
Casey,,,1:00-5:00,1:00-5:00,1:00-5:00,,,,,,2:00-4:00,2:00-4:00,,,
Sarah,,,1:00-5:00,,,,,,,,2:00-4:00,2:00-4:00,2:00-4:00,,
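I want to write a script in R that counts the clusters. If a row contains three consecutive actual values, I want to count them as "one" cluster. If a run is shorter than that (i.e. one or two consecutive values), I want to count it as "one" individual cluster.
Desired output in CSV format: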
Robert,2,0
John,0,1
Casey,1,1
Sarah,1,1
Thanks in advance!
Answer 0 (score: 0)
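Here is a possible data.table solution to this old question. It uses fread() to read the input file, melt() / dcast() for reshaping, and the rleid() function to identify the gaps and islands. For the dataset posted in the question, this code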
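library(data.table)
library(magrittr)
# read rows of unequal length; empty fields become NA
fread("input.csv", header = FALSE, na.strings = c(""), fill = TRUE) %>%
.[, V1 := forcats::fct_inorder(V1)] %>% # to keep the original order in dcast() below
melt(id.vars = "V1") %>% # long format: one row per name and column
setorder(V1, variable) %>%
.[, cluster.id := rleid(V1, is.na(value))] %>% # new id whenever the name or the NA status changes
.[!is.na(value), .N, by = .(V1, cluster.id)] %>% # length of each run of actual values
dcast(V1 ~ N < 3, length, value.var = "N") %>% # count runs of 3 or more (FALSE) and shorter runs (TRUE)
fwrite("output.csv", col.names = FALSE)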
creates the CSV file as requested:
Robert,2,0
John,0,1
Casey,1,1
Sarah,1,1
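As an illustration of the gaps-and-islands step, the following minimal sketch applies rleid() to Robert's row only. The vector `value` and the names `run_id`, `run_lengths`, `full_clusters` and `short_runs` are made up for this example and do not appear in the solution above.

```r
library(data.table)

# Robert's row from the question, with empty cells represented as NA
value <- c(NA, NA, "1:00-5:00", "1:00-5:00", "1:00-5:00",
           NA, NA, NA, NA, NA, "2:00-4:00", "2:00-4:00", "2:00-4:00")

# rleid() starts a new id every time is.na(value) switches between TRUE and FALSE,
# so each gap and each island gets its own id
run_id <- rleid(is.na(value))

# keep only the islands (runs of actual values) and measure their lengths
run_lengths <- as.vector(table(run_id[!is.na(value)]))

full_clusters <- sum(run_lengths >= 3)  # runs of three or more entries
short_runs    <- sum(run_lengths < 3)   # runs of one or two entries

print(c(full_clusters, short_runs))
```

For this row the two counts should agree with the line Robert,2,0 in the output file. Note that a run longer than three entries is still counted once under this rule.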
In a comment, the OP has provided a link to another sample dataset hosted on GitHub.
With some modifications,
fread("https://raw.githubusercontent.com/agrobins/r_IslandCount/test_files/timeclock_report.csv"
, drop = "total hours", na.strings = c("")) %>%
.[, Employee := forcats::fct_inorder(Employee)] %>% # to keep the original order in dcast() below
melt(id.vars = "Employee") %>%
setorder(Employee, variable) %>%
.[, cluster.id := rleid(Employee, is.na(value))] %>%
.[!is.na(value), .N, by = .(Employee, cluster.id)] %>%
dcast(Employee ~ N < 3, length, value.var = "N")
we get
          Employee FALSE TRUE
1:      John Smith     1    1
2:     Emily Smith     0    1
3:  Robert Jenkins     0    2
4: Rachel Lipscomb     0    1
5:   Donald Driver     1    0
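The first numeric column, named FALSE, contains the number of clusters consisting of three or more consecutive entries, while the second numeric column, named TRUE, contains the number of clusters consisting of 1 or 2 consecutive entries.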
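Because FALSE and TRUE are not very descriptive column names, they could be renamed after reshaping. The sketch below is only an illustration: the table `runs` holds a few hand-typed run lengths, the helper column `short` stands in for the `N < 3` expression in the formula, and the new names `clusters` and `short_runs` are arbitrary choices.

```r
library(data.table)

# hand-typed run lengths for two employees (illustrative values only)
runs <- data.table(
  Employee = c("John Smith", "John Smith", "Emily Smith"),
  N        = c(3L, 1L, 1L)
)

# flag short runs explicitly, then count runs per employee and flag
runs[, short := N < 3]
res <- dcast(runs, Employee ~ short, length, value.var = "N")

# rename the logical column headers; skip_absent = TRUE avoids an error
# when one of the two columns does not occur in the data
setnames(res, c("FALSE", "TRUE"), c("clusters", "short_runs"), skip_absent = TRUE)

print(res)
```

The skip_absent argument matters for datasets in which every run is short (or every run is long), since dcast() then creates only one of the two columns.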
As links to external websites are rather fragile, here is a copy of the second dataset as retrieved from
https://raw.githubusercontent.com/agrobins/r_IslandCount/test_files/timeclock_report.csv
Employee,"Mar 23, 2015","Mar 24, 2015","Mar 25, 2015","Mar 26, 2015","Mar 27, 2015","Mar 28, 2015","Mar 29, 2015",total hours
"John Smith",16:35 - 21:17 / 4.7,16:35 - 21:17 / 4.7,16:35 - 21:17 / 4.7,,,,11:17 - 16:08 / 4.85,18.9569
"Emily Smith",,,,,,08:13 - 12:40 / 4.45,,4.4472222222222
"Robert Jenkins",16:54 - 21:11 / 4.29,16:54 - 21:11 / 4.29,,,16:22 - 22:59 / 6.61,,,15.18638
"Rachel Lipscomb",,,,,,13:18 - 19:04 / 5.76,,5.7638888888889
"Donald Driver",,,,,08:13 - 13:05 / 4.86,08:13 - 13:05 / 4.86,10:02 - 16:02 / 6,15.14694