Extracting dates, times and date ranges from text in PHP

Asked: 2013-12-30 10:45:29

Tags: php regex date

I am building a local events calendar that takes RSS feeds and website scrapes and extracts event dates from them.

I previously asked how to extract dates from text in PHP and received an excellent answer from MarcDefiant:

function parse_date_tokens($tokens) {
  # only try to extract a date if we have 2 or more tokens
  if(!is_array($tokens) || count($tokens) < 2) return false;
  return strtotime(implode(" ", $tokens));
}

function extract_dates($text) {
  static $patterns = Array(
    '/^[0-9]+(st|nd|rd|th|)?$/i', # day
    '/^(Jan(uary)?|Feb(ruary)?|Mar(ch)?|etc)$/i', # month
    '/^20[0-9]{2}$/', # year
    '/^of$/' #words
  );
  # defines which of the above patterns aren't actually part of a date
  static $drop_patterns = Array(
    false,
    false,
    false,
    true
  );
  $tokens = Array();
  $result = Array();
  $text = str_word_count($text, 1, '0123456789'); # get all words in text

  # iterate words and search for matching patterns
  foreach($text as $word) {
    $found = false;
    foreach($patterns as $key => $pattern) {
      if(preg_match($pattern, $word)) {
        if(!$drop_patterns[$key]) {
          $tokens[] = $word;
        }
        $found = true;
        break;
      }
    }

    if(!$found) {
      $result[] = parse_date_tokens($tokens);
      $tokens = Array();
    }
  }
  $result[] = parse_date_tokens($tokens);

  return array_filter($result);
}

# test
$texts = Array(
  "The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
  "Valentines Special @ The Radisson, Feb 14th",
  "On Friday the 15th of February, a special Hollywood themed [...]",
  "Symposium on Childhood Play on Friday, February 8th",
  "Hosting a craft workshop March 9th - 11th in the old [...]"
);

$dates = extract_dates(implode(" ", $texts));
echo "Dates: \n";
foreach($dates as $date) {
  echo "  " . date('d.m.Y H:i:s', $date) . "\n";
}

However, this solution has some shortcomings. First of all, it cannot match date ranges.

I am now looking for a more sophisticated solution that can extract dates, times and date ranges from sample text.

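What is the best approach for this? It looks like I am falling back on a series of regular expressions, run one after another, to catch each case. I cannot see a better way to capture date ranges in particular, but I know there must be a better way to do this. Are there any libraries dedicated to date parsing in PHP?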

Date / date range samples, as requested:

$dates = [
    " Saturday 28th December",
    "2013/2014",
    "Friday 10th of January",
    "Thursday 19th December",
    " on Sunday the 15th December at 1 p.m",
    "On Saturday December 14th ",
    "On Saturday December 21st at 7.30pm",
    "Saturday, March 21st, 9.30 a.m.",
    "Jan-April 2014",
    "January 21st - Jan 24th 2014",
    "Dec 30th - Jan 3rd, 2014",
    "February 14th-16th, 2014",
    "Mon 14 - Wed 16 April, 12 - 2pm",
    "Sun 13 April, 8pm",
    "Mon 21 - Wed 23 April",
    "Friday 25 April, 10 – 3pm",            
    "The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
    "Valentines Special @ The Radisson, Feb 14th",
    "On Friday the 15th of February, a special Hollywood themed [...]",
    "Symposium on Childhood Play on Friday, February 8th",
    "Hosting a craft workshop March 9th - 11th in the old [...]"
];

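The function I am currently using (not the one above) is about 90% accurate. It can capture date ranges, but has difficulty when a time is also specified. It uses a list of regular expressions and is very complicated.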
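To illustrate why times complicate matters, the time portion of samples such as "7.30pm", "1 p.m" and "9.30 a.m." can be isolated separately from the date. The following is only a minimal sketch, not the function referred to above: the name `extract_times` and the pattern are assumptions made for illustration, and only 12-hour times with an am/pm marker are handled.

```php
<?php
// Hypothetical helper: finds 12-hour clock times and returns them as "HH:MM" (24-hour).
function extract_times(string $text): array {
    // hour (1-12), optional ".mm" or ":mm", optional space, then am/pm with optional dots
    $pattern = '/\b(?<hour>1[0-2]|0?[1-9])(?:[.:](?<minute>\d{2}))?\s*(?<meridiem>[ap])\.?m\b\.?/i';
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $times = [];
    foreach ($matches as $m) {
        $hour = ((int) $m['hour']) % 12;           // 12am -> 0, 12pm -> 0 (+12 below)
        if (strtolower($m['meridiem']) === 'p') {
            $hour += 12;
        }
        $minute = ($m['minute'] ?? '') !== '' ? (int) $m['minute'] : 0;
        $times[] = sprintf('%02d:%02d', $hour, $minute);
    }
    return $times;
}

print_r(extract_times("On Saturday December 21st at 7.30pm"));
```

A time range such as "12 - 2pm" yields only the closing time with this pattern, because the opening "12" carries no am/pm marker of its own. That case would need an additional range pattern.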

Update: 6 January 2014

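I am working on code that does this, building on my original approach of running a series of regular expressions in sequence. I think I am close to a working solution that can extract almost any date/time range or format from a piece of text. When I have finished, I will post it here as an answer.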

1 Answer:

Answer 0 (score: 4)

I think the regular expressions in your question can be summarized as follows.

(?<date_format_1>(?<day>(?i)\b\s*[0-9]+(?:st|nd|rd|th|)?)(?<month>(?i)\b\s*(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|etc))(?<year>\b\s*20[0-9]{2}) ) |
(?<date_format_2>(?&month)(?&day)(?!\s+-)) |
(?<date_format_3>(?&day)\s+of\s+(?&month)) |
(?<range_type_1>(?&month)(?&day)\s+-\s+(?&day))

Flags: x

Demo

http://regex101.com/r/wP5fR4

Discussion

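Using recursive subpatterns (subroutine calls such as (?&day) and (?&month)) reduces the complexity of the final regex. I used a negative lookahead in date_format_2 because it would otherwise partially match range_type_1. You may need to add more range types depending on your data. In case of a partial match, don't forget to check the other groups.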
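As a rough sketch of how such a pattern might be used from PHP, the example below defines the day and month subpatterns once in a `(?(DEFINE)...)` block and reuses them through subroutine calls. This is a variation on the regex above, not the same pattern: the function name `classify_dates`, the full month list, and the added `\b` anchors are assumptions made for illustration, and date_format_1 (with the year) is left out for brevity.

```php
<?php
// Hypothetical helper: returns each match together with the name of the group that matched.
function classify_dates(string $text): array {
    $regex = '~
      (?(DEFINE)
        (?<day>   \d{1,2}(?:st|nd|rd|th)? )
        (?<month> Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?
                  |Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)? )
      )
      \b(?:
          (?<range_type_1>  (?&month)\s+(?&day)\s*-\s*(?&day) )
        | (?<date_format_3> (?&day)\s+of\s+(?&month) )
        | (?<date_format_2> (?&month)\s+(?&day)(?!\s*-) )
      )\b
    ~xi';

    preg_match_all($regex, $text, $matches, PREG_SET_ORDER);

    $found = [];
    foreach ($matches as $m) {
        // Exactly one of the three alternatives is non-empty for each match.
        foreach (['range_type_1', 'date_format_3', 'date_format_2'] as $type) {
            if (($m[$type] ?? '') !== '') {
                $found[] = ['type' => $type, 'text' => $m[$type]];
            }
        }
    }
    return $found;
}

print_r(classify_dates("On Friday the 15th of February, then March 9th - 11th in the old hall"));
```

Listing the range alternative first, together with the negative lookahead in date_format_2, keeps a range from being reported as a single date.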

Another solution is to build small regexes in separate string variables and then concatenate them in PHP to build the larger regex.
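A minimal sketch of that concatenation approach might look like the following. The variable names, the function name `extract_day_ranges`, and the single "month day - day" range shape it covers are all assumptions for illustration. Other shapes from the samples (for example "Dec 30th - Jan 3rd, 2014") would need further patterns assembled the same way.

```php
<?php
// Hypothetical helper: finds "Month 9th - 11th" style ranges using a regex built from parts.
function extract_day_ranges(string $text): array {
    // Small building blocks, kept as plain strings without delimiters.
    $day   = '\d{1,2}(?:st|nd|rd|th)?';
    $month = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?'
           . '|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)';

    // Larger pattern assembled by concatenation.
    $range = '(?<month>' . $month . ')\s+(?<from>' . $day . ')\s*-\s*(?<to>' . $day . ')';
    $regex = '/\b' . $range . '\b/i';

    preg_match_all($regex, $text, $matches, PREG_SET_ORDER);

    $ranges = [];
    foreach ($matches as $m) {
        $ranges[] = [
            'month' => $m['month'],
            'from'  => (int) $m['from'],  // the cast drops the ordinal suffix ("9th" -> 9)
            'to'    => (int) $m['to'],
        ];
    }
    return $ranges;
}

print_r(extract_day_ranges("Hosting a craft workshop March 9th - 11th in the old [...]"));
```

Keeping each part in its own variable makes it possible to test the pieces separately and reuse them across several date and range formats.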