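I have just joined a project where we have several huge text files (close to a gigabyte each) that we want to load into tables and analyze. Each text file holds one year of data, every data point belongs to one of three categories, and the end result we want is one list per category with the observations from every year as columns.
The current approach reads each file into a list, splits that list by category into three new lists per year, and then uses rbind to combine the lists for a given category from the different years into a final list. See below for the R file as it was presented to me (anonymized):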
Year1 <- read.table(YearOneFilePath)
table(Year1$category)
Year1A <- Year1[Year1$category == "A",]
Year1B <- Year1[Year1$category == "B",]
Year1C <- Year1[Year1$category == "C",]
rm(Year1)
Year2 <- read.table(YearTwoFilePath)
table(Year2$category)
Year2A <- Year2[Year2$category == "A",]
Year2B <- Year2[Year2$category == "B",]
Year2C <- Year2[Year2$category == "C",]
rm(Year2)
Year3 <- read.table(YearThreeFilePath)
table(Year3$category)
Year3A <- Year3[Year3$category == "A",]
Year3B <- Year3[Year3$category == "B",]
Year3C <- Year3[Year3$category == "C",]
rm(Year3)
A <- rbind(Year1A, Year2A, Year3A)
B <- rbind(Year1B, Year2B, Year3B)
C <- rbind(Year1C, Year2C, Year3C)
rm(Year1A)
rm(Year2A)
rm(Year3A)
rm(Year1B)
rm(Year2B)
rm(Year3B)
rm(Year1C)
rm(Year2C)
rm(Year3C)
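As far as I can tell, this reads all the data from the file and then copies it twice along the way, which takes a long time and a lot of memory with this much data. Obviously I could skip the YearXY lists by putting YearX[YearX$category == "Y",] directly into the rbind call, but that still means I hold two full copies of the data at some point during execution. Is there a way to build the final A, B and C lists in a single read-through of each file, without copying all the data again?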
Answer 0 (score: 0)
library(data.table)
# fread() is usually much faster than read.table() on large files
Year1 <- fread(YearOneFilePath)
# row count per category, like table(Year1$category)
Year1[, .N, by = category]
Year1A <- Year1[category == "A"]
Year1B <- Year1[category == "B"]
Year1C <- Year1[category == "C"]
rm(Year1)
gc()
# Yes, garbage collection may help ;)
# Repeat the steps above for Year2 and Year3 before combining.
A <- rbind(Year1A, Year2A, Year3A)
rm(Year1A)
rm(Year2A)
rm(Year3A)
gc()
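The per-year steps can also be written as a loop, so that only one year's raw table is in memory at a time. The following is a minimal sketch, not the code from the project: the temporary files, the `value` and `year` columns, and the names `year_files`, `pieces` and `final` are all made up for illustration. It does not avoid copying entirely, because splitting and stacking each still create one copy of the rows.

```r
library(data.table)

# Hypothetical stand-ins for the real yearly files: three tiny temp files.
year_files <- vapply(1:3, function(yr) {
  path <- tempfile(fileext = ".txt")
  fwrite(data.table(category = c("A", "B", "C", "A"),
                    value = yr * 10 + 1:4), path, sep = "\t")
  path
}, character(1))

# Read each file once and split it by category right away, so the full
# table for a year can be freed as soon as the function returns.
pieces <- lapply(seq_along(year_files), function(yr) {
  dt <- fread(year_files[yr])
  dt[, year := yr]             # record which file each row came from
  split(dt, by = "category")   # named list: one data.table per category
})

# Stack the per-year pieces of each category into the final tables.
categories <- c("A", "B", "C")
final <- lapply(setNames(categories, categories), function(k) {
  rbindlist(lapply(pieces, `[[`, k))
})

A <- final[["A"]]
B <- final[["B"]]
C <- final[["C"]]
```

If a category is missing from one year, the lookup for that piece returns NULL, and rbindlist() skips NULL entries, so the loop needs no special case for it.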
For the splitting, here is another approach:
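# split() returns a list named by category value
split_list1 <- split(Year1, Year1$category)
# index by name rather than position, so the result does not depend on level order
Year1A <- split_list1[["A"]]
Year1B <- split_list1[["B"]]
Year1C <- split_list1[["C"]]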
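To make the behaviour of split() concrete, here is a small base R sketch on toy data. The data frame below is invented for illustration and only stands in for one year's table. The list that split() returns is named after the category values, so each piece can be picked out by name.

```r
# Hypothetical toy data standing in for one year's table.
Year1 <- data.frame(category = c("B", "A", "C", "A"),
                    value = 1:4)

# One pass over the rows produces all three pieces at once.
split_list1 <- split(Year1, Year1$category)

# The list names are the category values, in sorted order.
piece_names <- names(split_list1)

# Pick each piece by name instead of by position.
Year1A <- split_list1[["A"]]
Year1B <- split_list1[["B"]]
Year1C <- split_list1[["C"]]
```

Indexing by name keeps the assignment correct even if a file happens to contain the categories in a different order or is missing one of them, in which case the lookup returns NULL instead of silently returning the wrong category.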