(A question about lists in R)
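I am working with a very large dataset in which the date column takes one of two forms: a single timestamp (type 1), or two timestamps separated by a dash (type 2).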
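I need to split this column depending on whether it contains a dash (type 2) and put the pieces into two columns ("Date1" and "Date2"). If a row holds a type 1 date, that date should occupy only "Date1", and "Date2" should simply be NA.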
This is what I want: to convert something like the following:
c(
rep("8/15/2018 9:18:45 AM", 15),
rep("8/20/2018 9:18:45 AM - 8/12/2018 9:18:45 AM", 15)
)
into this:
data.frame(
Date1 = c(rep("8/15/2018 9:18:45 AM", 15), rep("8/20/2018 9:18:45 AM", 15)),
Date2 = c(rep(NA, 15), rep("8/12/2018 9:18:45 AM", 15))
)
# output
# Date1 Date2
# 1 8/15/2018 9:18:45 AM <NA>
# 2 8/15/2018 9:18:45 AM <NA>
# 3 8/15/2018 9:18:45 AM <NA>
# 4 8/15/2018 9:18:45 AM <NA>
# 5 8/15/2018 9:18:45 AM <NA>
# 6 8/15/2018 9:18:45 AM <NA>
# 7 8/15/2018 9:18:45 AM <NA>
# 8 8/15/2018 9:18:45 AM <NA>
# 9 8/15/2018 9:18:45 AM <NA>
# 10 8/15/2018 9:18:45 AM <NA>
# 11 8/15/2018 9:18:45 AM <NA>
# 12 8/15/2018 9:18:45 AM <NA>
# 13 8/15/2018 9:18:45 AM <NA>
# 14 8/15/2018 9:18:45 AM <NA>
# 15 8/15/2018 9:18:45 AM <NA>
# 16 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 17 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 18 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 19 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 20 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 21 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 22 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 23 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 24 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 25 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 26 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 27 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 28 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 29 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 30 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
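I want the first sub-element of each list entry to go into the Date1 column and the second sub-element (if it exists) into the Date2 column. If there is no second element, I want Date2 to be NA for that row.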
My first attempt was to build a new list using a conditional: if a list entry has only one sub-element, create a second sub-element and set it to NA.
dates = c(
c(
rep("8/20/2018 9:18:45 AM", 15),
rep("8/20/2018 9:18:45 AM - 8/12/2018 9:18:45 AM", 15)
)
)
# create the date split. Split the text based on the dash
dates_split = strsplit(dates, " - ")
# note where the correct dates are. dates_split[[15]] has one sub element and dates_split[[16]] has two
dates_split[[15]];dates_split[[16]]
# so far so good
# create a conditional where if there is only one date (one sub element), set the second sub element to NA.
for(i in 1:length(dates_split)){
if(length(dates_split[i]) == 1){
dates_split[[i]][2] = NA
} else {}
}
# the above loop does not behave as expected. The dates_split[[16]][2] is now gone (it turned to NA)
# create a vector for Date1 and Date2
Date1 = unlist(lapply(dates_split, "[[", 1))
Date2 = unlist(lapply(dates_split, "[[", 2))
# put each date type in their appropriate column
date_df = data.frame(
Date1 = Date1,
Date2 = Date2
)
# second column is all NA's. Where did the second sub elements go?
My earlier script, written for a smaller dataset, did something similar:
dates = strsplit(dates, " - ")
# this takes forever to do. Is there a way to do this without using a loop??
# (date_df is assumed to already exist with one row per element of dates)
for(i in seq_along(dates)){
date_df$Date1[i] = dates[[i]][1]
date_df$Date2[i] = dates[[i]][2]
}
The approach above is not very efficient. The real dataset has more than a million rows, so it would take forever to run.
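Are there any suggestions for how to modify this step so that I create an NA for the second sub-element without inadvertently turning everything into NA?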
# create a conditional where if there is only one date (one sub element), set the second sub element to NA.
for(i in 1:length(dates_split)){
if(length(dates_split[i]) == 1){
dates_split[[i]][2] = NA
} else {}
}
# the above loop does not behave as expected. The dates_split[[16]][2] is now gone (it turned to NA)
Thanks!
Answer 0 (score: 1)
First, to answer the following question:
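Are there any suggestions for how to modify this step so that I create an NA for the second sub-element without inadvertently turning everything into NA?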
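Just replace [i] with [[i]] in the second line of the for loop. dates_split[i] is a list of length 1 no matter what it contains, so the original condition is always TRUE and the second sub-element gets overwritten in every entry.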
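A minimal sketch of the difference between the two indexing forms, using a made-up two-element list x (not taken from the question):

```r
# Hypothetical two-row input: one single date, one dash-separated pair
x = strsplit(
  c("8/20/2018 9:18:45 AM",
    "8/20/2018 9:18:45 AM - 8/12/2018 9:18:45 AM"),
  " - ", fixed = TRUE
)

# Single brackets return a sub-list, which has length 1 here
len_single = c(length(x[1]), length(x[2]))

# Double brackets return the character vector stored at that position
len_double = c(length(x[[1]]), length(x[[2]]))

print(len_single)  # 1 1
print(len_double)  # 1 2
```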
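Second, I modified your code a little and timed it. Ten million data points took about 15 seconds, so it is quite fast. I tried replacing the for loop with lapply, but that did not improve the speed. You could use the data.table package to speed it up (perhaps significantly), but it comes with a bit of a learning curve. Here is the full code I used for testing, so you can check that everything works.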
# how many times to repeat dates (five million for testing)
rep.num = 5000000
# create dummy dates
dates = c(
rep("8/20/2018 9:18:45 AM", rep.num),
rep("8/20/2018 9:18:45 AM - 8/12/2018 9:18:45 AM", rep.num)
)
# create the date split. Split the text based on the dash
# using fixed = TRUE here results in a significant speed increase
dates_split <- strsplit(dates, " - ", fixed = TRUE)
# note where the correct dates are. dates_split[[rep.num]] has one sub element and dates_split[[rep.num + 1]] has two
dates_split[[rep.num]]
dates_split[[rep.num + 1]]
dates_split[[rep.num + 1]][1]
dates_split[[rep.num + 1]][2]
# create a conditional where if there is only one date (one sub element), set the second sub element to NA.
for(i in 1:length(dates_split)){
if(length(dates_split[[i]]) == 1){
dates_split[[i]][2] = NA
}
}
# put each date type in their appropriate column
date_df = data.frame(
Date1 = sapply(dates_split, "[[", 1),
Date2 = sapply(dates_split, "[[", 2)
)
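As for doing this without a loop: one possible alternative, sketched here under the assumption that every entry contains at most one " - " separator, relies on the fact that indexing a character vector past its end returns NA. The names small_dates, small_split and small_df are made up for this example, and this variant was not part of the timing mentioned above.

```r
# Hypothetical small input standing in for the real column
small_dates = c(
  rep("8/20/2018 9:18:45 AM", 2),
  rep("8/20/2018 9:18:45 AM - 8/12/2018 9:18:45 AM", 2)
)

small_split = strsplit(small_dates, " - ", fixed = TRUE)

# x[2] on a length-1 character vector yields NA_character_,
# so no explicit padding loop is needed
small_df = data.frame(
  Date1 = vapply(small_split, function(x) x[1], character(1)),
  Date2 = vapply(small_split, function(x) x[2], character(1)),
  stringsAsFactors = FALSE
)

print(small_df)
```

Whether this is faster than the loop on the full dataset would need to be measured, for example with system.time.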