Getting the second element of an uneven list

Time: 2019-03-16 00:01:11

Tags: r string list split

(A question about lists in R)

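I am working with a very large dataset in which the date column takes one of two forms:

  • Date type 1: "MM/DD/YYYY HH:MM:SS AM"
  • Date type 2: "MM/DD/YYYY HH:MM:SS AM - MM/DD/YYYY HH:MM:SS AM"

I need to split this column depending on whether it contains a dash (type 2) and put the pieces into two columns ("Date 1" and "Date 2"). If a row holds a type 1 date, that date should occupy only "Date 1", and "Date 2" should simply be NA.

This is what I want: to convert something like the following: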

c(
    rep("8/20/2018 9:18:45 AM", 15),
    rep("8/20/2018 9:18:45 AM - 8/12/2018 9:18:45 AM", 15)
  )

into this:

data.frame(
  Date1 = c(rep("8/15/2018 9:18:45 AM", 15), rep("8/20/2018 9:18:45 AM", 15)),
  Date2 = c(rep(NA, 15), rep("8/12/2018 9:18:45 AM", 15))
)

# output
#                   Date1                Date2
# 1  8/15/2018 9:18:45 AM                 <NA>
# 2  8/15/2018 9:18:45 AM                 <NA>
# 3  8/15/2018 9:18:45 AM                 <NA>
# 4  8/15/2018 9:18:45 AM                 <NA>
# 5  8/15/2018 9:18:45 AM                 <NA>
# 6  8/15/2018 9:18:45 AM                 <NA>
# 7  8/15/2018 9:18:45 AM                 <NA>
# 8  8/15/2018 9:18:45 AM                 <NA>
# 9  8/15/2018 9:18:45 AM                 <NA>
# 10 8/15/2018 9:18:45 AM                 <NA>
# 11 8/15/2018 9:18:45 AM                 <NA>
# 12 8/15/2018 9:18:45 AM                 <NA>
# 13 8/15/2018 9:18:45 AM                 <NA>
# 14 8/15/2018 9:18:45 AM                 <NA>
# 15 8/15/2018 9:18:45 AM                 <NA>
# 16 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 17 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 18 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 19 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 20 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 21 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 22 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 23 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 24 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 25 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 26 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 27 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 28 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 29 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM
# 30 8/20/2018 9:18:45 AM 8/12/2018 9:18:45 AM

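I want the first sub-element of each list entry to go into the Date1 column and the second sub-element (if it exists) to go into the Date2 column. If there is no second element, I want that row of Date2 to be NA.

My first attempt was to build a new list using a conditional: if a sub-element has length 1, create a second sub-element and set it to NA.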

dates = c(
  c(
    rep("8/20/2018 9:18:45 AM", 15),
    rep("8/20/2018 9:18:45 AM - 8/12/2018 9:18:45 AM", 15)
  )
)


# create the date split. Split the text based on the dash 
dates_split = strsplit(dates, " - ")
# note where the correct dates are. dates_split[[15]] has one sub element and dates_split[[16]] has two
dates_split[[15]];dates_split[[16]]

# so far so good






# create a conditional where if there is only one date (one sub element), set the second sub element to NA.
for(i in 1:length(dates_split)){
  if(length(dates_split[i]) == 1){
    dates_split[[i]][2] = NA
  } else {}
}

# the above loop does not behave as expected. The dates_split[[16]][2] is now gone (it turned to NA)






# create a vector for Date1 and Date2
Date1 = unlist(lapply(dates_split, "[[", 1))
Date2 = unlist(lapply(dates_split, "[[", 2))

# put each date type in their appropriate column
date_df = data.frame(
  Date1 = Date1,
  Date2 = Date2
)

# second column is all NA's. Where did the second sub elements go?

My earlier script, written for a smaller dataset, did something similar:

dates = strsplit(dates, " - ")

# this takes forever to do. Is there a way to do this without using a loop??
for(i in 1:nrow(dates_split)){
  date_df$Date1 = dates[[i]][1]
  date_df$Date2 = dates[[i]][2]
}

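The approach above is not very efficient. The real dataset has more than a million rows, so it would take forever to run.

Are there any suggestions for how to modify this step so that I create an NA for the second sub-element without unintentionally turning everything into NA?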

# create a conditional where if there is only one date (one sub element), set the second sub element to NA.
for(i in 1:length(dates_split)){
  if(length(dates_split[i]) == 1){
    dates_split[[i]][2] = NA
  } else {}
}

# the above loop does not behave as expected. The dates_split[[16]][2] is now gone (it turned to NA)

Thanks!

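1 answer:

Answer 0: (score: 1)

First, to answer the following question:

  Are there any suggestions for how to modify this step so that I create an NA for the second sub-element without unintentionally turning everything into NA?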

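Just replace [i] with [[i]] in the second line of the for loop. With single brackets, dates_split[i] is a one-element list, so its length is always 1, the condition is always TRUE, and every entry has its second sub-element overwritten with NA.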
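To make the difference concrete, here is a minimal sketch on a two-element toy list (the toy input and the variable names len_single and len_double are illustrative assumptions, not part of the original code):

```r
# Toy list mirroring the question: one single-date entry, one two-date entry
dates_split <- strsplit(
  c("8/20/2018 9:18:45 AM",
    "8/20/2018 9:18:45 AM - 8/12/2018 9:18:45 AM"),
  " - ", fixed = TRUE
)

# Single brackets return a sub-list, so the length is 1 for every i
len_single <- length(dates_split[2])

# Double brackets return the character vector stored in that slot
len_double <- length(dates_split[[2]])

print(c(single = len_single, double = len_double))
```

Only the double-bracket form reports how many dates the entry actually holds, which is what the condition in the loop needs.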

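Second, I modified your code a little and tested its speed. 10 million data points took about 15 seconds, so it is quite fast. I tried replacing the for loop with lapply, but that did not improve the speed. You could use the data.table package to speed it up (perhaps significantly), but there is a bit of a learning curve. Here is the full code I used for testing, so you can check that everything works.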

# how many times to repeat dates (five million for testing)
rep.num = 5000000

# create dummy dates
dates = c(
    rep("8/20/2018 9:18:45 AM", rep.num),
    rep("8/20/2018 9:18:45 AM - 8/12/2018 9:18:45 AM", rep.num)
)

# create the date split. Split the text based on the dash
# using fixed = TRUE here results in a significant speed increase
dates_split <- strsplit(dates, " - ", fixed = TRUE)

# note where the correct dates are. dates_split[[rep.num]] has one sub element and dates_split[[rep.num + 1]] has two
dates_split[[rep.num]]
dates_split[[rep.num + 1]]
dates_split[[rep.num + 1]][1]
dates_split[[rep.num + 1]][2]

# if there is only one date (one sub element), set the second sub element to NA.
# note the double brackets in the condition: dates_split[[i]] is the character vector itself
for(i in seq_along(dates_split)){
  if(length(dates_split[[i]]) == 1){
    dates_split[[i]][2] = NA
  }
}

# put each date type in their appropriate column
date_df = data.frame(
  Date1 = sapply(dates_split, "[[", 1),
  Date2 = sapply(dates_split, "[[", 2)
)
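As a possible loop-free variant (a sketch only, not timed against the code above): indexing a length-1 character vector at position 2 already returns NA in R, so the padding loop can be skipped and the second element extracted directly with single-bracket indexing. The small six-row input below is an assumption chosen for illustration.

```r
# Small stand-in for the real data: three single dates, three date pairs
dates <- c(
  rep("8/20/2018 9:18:45 AM", 3),
  rep("8/20/2018 9:18:45 AM - 8/12/2018 9:18:45 AM", 3)
)

dates_split <- strsplit(dates, " - ", fixed = TRUE)

# x[2] on a length-1 character vector yields NA_character_,
# so entries without a second date become NA without an explicit loop.
# vapply checks that every result is a single character value.
date_df <- data.frame(
  Date1 = vapply(dates_split, `[`, character(1), 1),
  Date2 = vapply(dates_split, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

print(date_df)
```

Using `[` rather than `[[` is the design choice that matters here: `[[` raises a "subscript out of bounds" error on the single-date entries, whereas `[` quietly returns NA.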