Question

Update

I managed to load the first 1,000,000 rows of the data using the following code:

newFile <- read.table("course_4_proj_1.txt", header=TRUE, sep=";", na.strings = "?", nrows= 1000000, stringsAsFactors=TRUE)

Here is what head() returns, FYI:

head(newFile)
        Date     Time Global_active_power Global_reactive_power Voltage Global_intensity
1 16/12/2006 17:24:00               4.216                 0.418  234.84             18.4
2 16/12/2006 17:25:00               5.360                 0.436  233.63             23.0
3 16/12/2006 17:26:00               5.374                 0.498  233.29             23.0
4 16/12/2006 17:27:00               5.388                 0.502  233.74             23.0
5 16/12/2006 17:28:00               3.666                 0.528  235.68             15.8
6 16/12/2006 17:29:00               3.520                 0.522  235.02             15.0
  Sub_metering_1 Sub_metering_2 Sub_metering_3
1              0              1             17
2              0              1             16
3              0              2             17
4              0              1             17
5              0              1             17
6              0              2             17

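Now I need to subset the data, because I only need the rows for the dates 2007-02-01 and 2007-02-02. I think I need to use the strptime() and as.Date() functions to convert the Date and Time variables to date/time classes in R, but I'm not clear on how to do that. What is the simplest/cleanest way?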

Answer 1

If size/memory is not an issue:

newFile <- read.table("course_4_proj_1.txt", header=TRUE, sep=";", na.strings = "?", nrows= 1000000, 
    stringsAsFactors=FALSE)
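# Combine the Date and Time strings into one column, then parse it.
newFile$DateTime <- paste(newFile$Date, newFile$Time)
# Note: as.Date() keeps only the calendar date; the time of day is dropped.
newFile$DateTime <- as.Date(newFile$DateTime, format = "%d/%m/%Y %H:%M:%S")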

If your computer is short on memory and processing power, but you can install packages, consider the data.table package:

library(data.table)
newFile <- fread("course_4_proj_1.txt", na.strings = "?")

newFile[,DateTime := as.Date(paste(Date, Time), format = "%d/%m/%Y %H:%M:%S")]

Further optimizations are possible. I found the answers here useful.

The data.frame can then be subset in the usual way. Here is one way using dplyr:

library(dplyr)
subsetted <- filter(newFile, DateTime >= as.Date("2007-02-01"), DateTime < as.Date("2007-02-03"))
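
Since as.Date() discards the time of day, here is a rough base-R sketch of the same subset with as.POSIXct(), in case you need the full timestamp later. The sample rows and the names `sample_rows`, `start` and `end` are made up for illustration, and `tz = "UTC"` is an assumption chosen only so the result does not depend on the machine's local time zone.

```r
# Small inline sample in the same layout as the question's file (hypothetical rows).
sample_rows <- data.frame(
  Date = c("31/01/2007", "01/02/2007", "02/02/2007", "03/02/2007"),
  Time = c("23:59:00", "00:00:00", "23:59:00", "00:00:00"),
  Global_active_power = c(1.2, 0.3, 0.5, 2.1),
  stringsAsFactors = FALSE
)

# Parse date and time together into a POSIXct date-time.
# tz = "UTC" is an assumption, not something stated in the question.
sample_rows$DateTime <- as.POSIXct(
  paste(sample_rows$Date, sample_rows$Time),
  format = "%d/%m/%Y %H:%M:%S", tz = "UTC"
)

# Half-open interval: from the start of 1 Feb up to, but excluding, 3 Feb.
start <- as.POSIXct("2007-02-01 00:00:00", tz = "UTC")
end   <- as.POSIXct("2007-02-03 00:00:00", tz = "UTC")

# Plain logical indexing; no extra packages needed.
subsetted <- sample_rows[sample_rows$DateTime >= start & sample_rows$DateTime < end, ]
```

Using a half-open interval avoids having to spell out the last timestamp of 2 Feb. If the real data has unparseable timestamps, wrapping the condition in which() drops the resulting NA rows.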

Answer 2

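The standard R read.table function always reads in the entire dataset first. You could consider filtering the file some other way before reading it into R, or using a package such as sqldf, whose read.csv.sql function can filter the data on import. I have not tested it with date classes.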
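
A minimal sketch of the read.csv.sql idea, untested against the real file. It writes a tiny stand-in file so it can run on its own; the rows, the `tmp` path and the name `feb` are hypothetical. The date literals assume the file stores February dates unpadded as d/m/yyyy, which should be checked against the actual data, and the "?" missing-value markers from the question are not handled here.

```r
library(sqldf)

# Tiny stand-in for the real file: semicolon-separated, with a header row.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Date;Time;Global_active_power",
  "31/1/2007;23:59:00;1.2",
  "1/2/2007;00:00:00;0.3",
  "2/2/2007;23:59:00;0.5",
  "3/2/2007;00:00:00;2.1"
), tmp)

# Inside the SQL statement the input is referred to as the table `file`.
# Date is compared as plain text here, so the literals must match the
# file's formatting exactly (assumed to be unpadded d/m/yyyy).
feb <- read.csv.sql(
  tmp,
  sql = "select * from file where Date in ('1/2/2007', '2/2/2007')",
  header = TRUE,
  sep = ";"
)

unlink(tmp)
```

The Date column comes back as character, so it would still need converting with as.Date() or as.POSIXct() afterwards.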

Subset data based on a date range in R