Question

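I have a large text database, read in as a data frame, with a single text column. Each entry is a few sentences long and mentions a time in varying formats, as shown below:

Row 1. I tried to call you at xxx-xxx-xxxx but reached voicemail. I am planning the next follow up on 6/13/2018 between 12 PM and 2 PM PST.

Row 2. If I hear back from them I will call you again today; if not, I will call you tomorrow between 4 PM and 6 PM EST.

Row 3. We will wait for your reply; if we do not hear from you, we will call you tomorrow between 12:00 PM and 2:00 PM.

Row 4. As discussed on the call, we have scheduled a callback tomorrow between 12 PM and 02 PM EST.

Row 5. As you suggested, we will do the next follow up on 6/13/2018 between 12 PM and 2 PM.

I want to extract the time portion along with the time zone (EST / CST / PST).

Expected output:

6/13/2018 4 PM - 6 PM EST
  tomorrow 12 PM to 2 PM PST

I tried the following: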

x <- text$string

sc1 <- str_match(x, " follow up on (.*?) T.")

This returns something like:

follow up on 6/13/2018 by 1 PM    6/13/2018 by 1 PM

I then tried to cover the other format with the following code

sc2 <- str_match(x, " will call you tomorrow between (.*?) T.")

and used rbind to include both formats (follow up* and will call you*)

sc1rb <- rbind(sc1,sc2)

This did not work.

Is it possible to extract only the time portion and the time zone from the sample strings above?

Thanks in advance!

Answer 1

Here is something that works for the sample. As @MrFlick mentioned, please try to share your data in a reproducible way.

Data

> dput(txt)
c("Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
"will call you tomorrow between 4 - 6PM EST.",
"will call you tomorrow between 12:00PM to 2:00PM CST",
"will call you tomorrow between 11 AM to 12 PM EST",
"Next follow up on 6/13/2018 between 12 PM TO 2 PM PST.")

Code

> regmatches(txt, regexec('[[:space:]]([[:digit:]]{1,2}[[:space:]].*[[:upper:]]{3})', txt))
[[1]]
[1] " 12 PM and 2 PM PST" "12 PM and 2 PM PST"

[[2]]
[1] " 4 - 6PM EST" "4 - 6PM EST"

[[3]]
character(0)

[[4]]
[1] " 11 AM to 12 PM EST" "11 AM to 12 PM EST"

[[5]]
[1] " 12 PM TO 2 PM PST" "12 PM TO 2 PM PST"

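The output is a list in which each matching element is a character vector of two strings, the full match followed by the captured group (see the help page for regmatches). The third element is empty because the pattern requires whitespace after the hour, so "12:00PM" does not match. You can simplify this further to get only the captured part; the pattern below also accepts a colon after the hour, so the third string now matches: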

> unname(sapply(txt, function(z){
+   pattern <- '[[:space:]]([[:digit:]]{1,2}([[:space:]]|:).*[[:upper:]]{3})'
+   k <- unlist(regmatches(z, regexec(pattern = pattern, z)))
+   return(k[2])
+ }))
[1] "12 PM and 2 PM PST"    "4 - 6PM EST"           "12:00PM to 2:00PM CST" "11 AM to 12 PM EST"
[5] "12 PM TO 2 PM PST"

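This is based on the sample input. Of course, if the input is too irregular, a single regular expression will be hard to get right. In that case I suggest using several regex calls one after another, each applied only where the previous one returned NA. Hope this helps!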
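The "one regex after another" idea could be sketched roughly as follows. This is only an illustration under assumptions: `first_group` is a hypothetical helper (not part of base R), and the two patterns are simply the one from this answer plus a colon variant, not a complete set for real data.

```r
# Sample strings from this answer's dput() output.
txt <- c("Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
         "will call you tomorrow between 4 - 6PM EST.",
         "will call you tomorrow between 12:00PM to 2:00PM CST",
         "will call you tomorrow between 11 AM to 12 PM EST",
         "Next follow up on 6/13/2018 between 12 PM TO 2 PM PST.")

# Hypothetical helper: first capture group of `pattern`, or NA when no match.
first_group <- function(pattern, x) {
  vapply(regmatches(x, regexec(pattern, x)),
         function(m) if (length(m) >= 2) m[2] else NA_character_,
         character(1))
}

# Illustrative patterns, tried in order.
patterns <- c(
  "[[:space:]]([[:digit:]]{1,2}[[:space:]].*[[:upper:]]{3})",     # hour then space
  "[[:space:]]([[:digit:]]{1,2}:[[:digit:]]{2}.*[[:upper:]]{3})"  # hour:minutes
)

result <- rep(NA_character_, length(txt))
for (p in patterns) {
  todo <- is.na(result)                      # rows earlier patterns missed
  result[todo] <- first_group(p, txt[todo])  # fill only those rows
}
print(result)
```

Rows that no pattern matches stay NA, which makes it easy to inspect them and decide whether another pattern is worth adding.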

Answer 2

This code works for almost all of the formats, except for the substring "4 - 6PM EST". I hope it is useful for your whole data set.

  data=c(
  "Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
  "will call you tomorrow between 4 - 6PM EST.",
  "will call you tomorrow between 12:00PM to 2:00PM CST",
  "will call you tomorrow between 11 AM to 12 PM EST",
  "Next follow up on 6/13/2018 between 12 PM TO 2 PM PST.")

  # remove dates such as 6/13/2018
  data=gsub("\\d{1,2}/\\d{1,2}/\\d{4}", "", data)

  # parameters for exclusion and substitution
  excluded_texts=c("Next follow up on","between","will call you tomorrow",":00","\\.")
  replaced_input=c("  ","\'-","and","TO"," AM"," PM")
  replaced_output=c("","to","to","to","AM","PM")

  # drop the filler phrases, the ":00" minutes and the trailing period
  for (i in excluded_texts){
    data=gsub(i, "", data)}

  # normalise the separators to "to" and remove the space before AM/PM
  for (j in seq_along(replaced_input)){
    data=gsub(replaced_input[j],replaced_output[j],data)
  }

  print(data)
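For the "4 - 6PM EST" case that this answer leaves unhandled, one possible extra step is to rewrite a hyphen that sits between two digits as "to" before the other substitutions run. This is a sketch, not part of the original answer, and it assumes that digit-hyphen-digit only occurs in time ranges; real phone numbers written with digits would be rewritten too, so they would need to be removed first.

```r
x <- "will call you tomorrow between 4 - 6PM EST."

# Replace "<digit> - <digit>" with "<digit> to <digit>" (spaces optional).
normalized <- gsub("([[:digit:]])[[:space:]]*-[[:space:]]*([[:digit:]])",
                   "\\1 to \\2", x)
print(normalized)
```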

Answer 3

sub(".*?(\\d+\\s*[PA:-].*)","\\1",data)
[1] "12 PM and 2 PM PST."   "4 - 6PM EST."          "12:00PM to 2:00PM CST"
[4] "11 AM to 12 PM EST"    "12 PM TO 2 PM PST."
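Since the question asks for the time zone as well, the result of the `sub()` call above could be post-processed along these lines. This is a rough sketch: the names `times`, `zone` and `out` are made up for illustration, and the zone list is limited to the three abbreviations mentioned in the question.

```r
data <- c("Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
          "will call you tomorrow between 4 - 6PM EST.",
          "will call you tomorrow between 12:00PM to 2:00PM CST",
          "will call you tomorrow between 11 AM to 12 PM EST",
          "Next follow up on 6/13/2018 between 12 PM TO 2 PM PST.")

# Same extraction as above, then drop the trailing period where present.
times <- sub(".*?(\\d+\\s*[PA:-].*)", "\\1", data)
times <- sub("\\.$", "", times)

# Pull the trailing time zone into its own vector; NA when none is found.
zone <- ifelse(grepl("(EST|CST|PST)$", times),
               sub(".*(EST|CST|PST)$", "\\1", times),
               NA_character_)

out <- data.frame(time = times, zone = zone, stringsAsFactors = FALSE)
print(out)
```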

R - Extract the time along with the time zone that is part of a string
