Counting clustered values in a csv

Time: 2015-04-02 04:49:21

Tags: r csv count gaps-and-islands

I have a csv file in which each row contains a name followed by a series of empty values and clusters of actual values.

Robert,,,1:00-5:00,1:00-5:00,1:00-5:00,,,,,,2:00-4:00,2:00-4:00,2:00-4:00
John,,,1:00-5:00,1:00-5:00,,,,,,,,,,,,
Casey,,,1:00-5:00,1:00-5:00,1:00-5:00,,,,,,2:00-4:00,2:00-4:00,,,
Sarah,,,1:00-5:00,,,,,,,,2:00-4:00,2:00-4:00,2:00-4:00,,

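I want to write a script in R that counts the clusters. If a row contains three consecutive actual values, I want to count them as "one" cluster. If a run is shorter than three (i.e. one or two consecutive values), I want to count it as "one" separate cluster.

Desired output in csv format: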

Robert,2,0
John,0,1
Casey,1,1
Sarah,1,1

Thanks in advance!

1 answer:

Answer 0: (score: 0)

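Here is a possible data.table solution for an old question, which uses
  • fread() to read the input file,
  • melt() / dcast() for reshaping,
  • the rleid() function to identify the gaps and islands.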
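
To make the third point concrete, here is a minimal sketch of how rleid() separates gaps from islands on a single made-up row. The vector value and the names run_id and island_sizes are invented for this illustration and are not part of the answer's code.

```r
library(data.table)

# Hypothetical single row: NA marks an empty cell, "x" a filled one
value <- c(NA, NA, "x", "x", "x", NA, NA, "x", "x")

# rleid() starts a new id every time is.na(value) changes,
# so each run of gaps and each run of values gets its own id
run_id <- rleid(is.na(value))
run_id
# 1 1 2 2 2 3 3 4 4

# Size of each island (run of non-missing values)
island_sizes <- as.vector(table(run_id[!is.na(value)]))
island_sizes
# 3 2
```

Here one island has three entries and one has two, so this row would contribute one to each of the two counts.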

For the dataset posted in the question, this code

library(data.table)
library(magrittr)

fread("input.csv", header = FALSE, na.strings = c(""), fill = TRUE) %>% 
  .[, V1 := forcats::fct_inorder(V1)] %>%  # to keep the original order in dcast() below
  melt(id.var = "V1") %>% 
  setorder(V1, variable) %>% 
  .[, cluster.id := rleid(V1, is.na(value))] %>%
  .[!is.na(value), .N, by = .(V1, cluster.id)] %>% 
  dcast(V1 ~ N < 3, length, value.var = "N") %>% 
  fwrite("output.csv", col.names = FALSE)

creates the csv file as requested:

Robert,2,0
John,0,1
Casey,1,1
Sarah,1,1
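
As a cross-check, the same counts can probably be reproduced without data.table by using base R's rle(). This is a different technique from the pipeline above, shown only as a sketch: the helper count_islands() and the column names long and short are hypothetical, and the sample rows are held inline instead of being read from input.csv.

```r
# Sample rows from the question, held inline instead of read from a file
lines <- c(
  "Robert,,,1:00-5:00,1:00-5:00,1:00-5:00,,,,,,2:00-4:00,2:00-4:00,2:00-4:00",
  "John,,,1:00-5:00,1:00-5:00,,,,,,,,,,,,",
  "Casey,,,1:00-5:00,1:00-5:00,1:00-5:00,,,,,,2:00-4:00,2:00-4:00,,,",
  "Sarah,,,1:00-5:00,,,,,,,,2:00-4:00,2:00-4:00,2:00-4:00,,"
)

# count_islands() is a hypothetical helper, not part of any package
count_islands <- function(line) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  filled <- nzchar(fields[-1])           # drop the name, flag non-empty cells
  runs   <- rle(filled)                  # run-length encode TRUE/FALSE
  sizes  <- runs$lengths[runs$values]    # keep only the runs of filled cells
  data.frame(name  = fields[1],
             long  = sum(sizes >= 3),    # islands of three or more values
             short = sum(sizes < 3))     # islands of one or two values
}

result <- do.call(rbind, lapply(lines, count_islands))
result
```

The simple comma split is safe here only because none of the sample fields contain quoted commas; a file with quoted fields would need a proper csv reader.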

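In a comment, the OP provided a link to another sample dataset hosted on GitHub.

With some modifications (the file has a header row, the total hours column is dropped, and the result is printed rather than written to a file),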

fread("https://raw.githubusercontent.com/agrobins/r_IslandCount/test_files/timeclock_report.csv"
      , drop = "total hours", na.strings = c("")) %>% 
  .[, Employee := forcats::fct_inorder(Employee)] %>%  # to keep the original order in dcast() below
  melt(id.var = "Employee") %>% 
  setorder(Employee, variable) %>% 
  .[, cluster.id := rleid(Employee, is.na(value))] %>% 
  .[!is.na(value), .N, .(Employee, cluster.id)] %>% 
  dcast(Employee ~ N < 3, length, value.var = "N")

we get

          Employee FALSE TRUE
1:      John Smith     1    1
2:     Emily Smith     0    1
3:  Robert Jenkins     0    2
4: Rachel Lipscomb     0    1
5:   Donald Driver     1    0

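The first numeric column, named FALSE, contains the number of clusters consisting of three or more consecutive entries, while the second numeric column, named TRUE, contains the number of clusters consisting of 1 or 2 consecutive entries.

Reproducible data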
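
The FALSE and TRUE headers come from the logical expression N < 3 on the right-hand side of the dcast() formula. The sketch below shows this on a tiny made-up table and renames the columns afterwards; the table islands, the employees "A" and "B", and the names long_islands and short_islands are assumptions for illustration only.

```r
library(data.table)

# Hypothetical island sizes per employee, standing in for the .N step
islands <- data.table(
  Employee = c("A", "A", "B"),
  N        = c(3L, 1L, 2L)
)

# The right-hand side of the formula is the expression N < 3,
# so dcast() names the new columns after its values: FALSE and TRUE
wide <- dcast(islands, Employee ~ N < 3, length, value.var = "N")

# Optional: replace the logical headers with descriptive names
setnames(wide, c("FALSE", "TRUE"), c("long_islands", "short_islands"))
wide
```

Renaming is a matter of taste; it mainly helps when the result is written to a file with a header row.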

Because links to external websites are fragile, here is a copy of the second dataset as retrieved from
https://raw.githubusercontent.com/agrobins/r_IslandCount/test_files/timeclock_report.csv

Employee,"Mar 23, 2015","Mar 24, 2015","Mar 25, 2015","Mar 26, 2015","Mar 27, 2015","Mar 28, 2015","Mar 29, 2015",total hours
"John Smith",16:35 - 21:17 / 4.7,16:35 - 21:17 / 4.7,16:35 - 21:17 / 4.7,,,,11:17 - 16:08 / 4.85,18.9569
"Emily Smith",,,,,,08:13 - 12:40 / 4.45,,4.4472222222222
"Robert Jenkins",16:54 - 21:11 / 4.29,16:54 - 21:11 / 4.29,,,16:22 - 22:59 / 6.61,,,15.18638
"Rachel Lipscomb",,,,,,13:18 - 19:04 / 5.76,,5.7638888888889
"Donald Driver",,,,,08:13 - 13:05 / 4.86,08:13 - 13:05 / 4.86,10:02 - 16:02 / 6,15.14694