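I have a series of documents that I'm trying to extract dates from. They are mostly plain text and HTML, but the variety of date formats they use is very large (although all of them are English dates). How can I find and parse dates like these in a long string of text?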
updated 2011-03-21T00:43:14
Sunday, March 20, 2011
Wednesday, March 16, 2011 | 11:25 AM
March 20, 2011 @ 12:21 pm
May 5, 2011
Published March 19, 2011
Some text here (March 19, 2011)
10/28/2011 21:16
<a href="#">Author Name</a> on Mar 17th 2011 ...
Location, ABBR., Jan. 8, 2008
01/07/2008 (6:00 pm)
By Author Name and Company 03/19/2011 09:59
Posted by Author Name on March 16, 2011 at 03:20 PM EDT
Answer 0 (score: 2)
Take a look at the strtotime function.
// Output: March 20th, 2011 12:00:00 AM
echo date( 'F jS, Y h:i:s A', strtotime( "Sunday, March 20, 2011"));
Edit: Here is a more complete example that shows how to parse several of the provided dates.
<?php
$dates = array( '03/19/2011 09:59', 'Wednesday, March 16, 2011 | 11:25 AM', 'Sunday, March 20, 2011', 'March 20, 2011 @ 12:21 pm', 'May 5, 2011');
foreach( $dates as $date)
{
echo $date . ' ---- ' . date( 'F jS, Y h:i:s A', strtotime( str_replace( array( '@', '|'), '', $date))) . "<br />\n";
}
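Of course, some of the dates won't parse as-is, because they aren't covered by the list of supported date formats. For those, you'll need some extra filtering/parsing to extract the date or to reshape it into a string that strtotime accepts.
Edit: Since there was interest in processing the input string further, below is an example of how to parse the text to get the dates without using regular expressions. Note that some of the dates cannot be extracted; for those you'll need more string processing, or a regular expression.
As a side note, I would look into regular expressions if the provided string is only one of many variations of lines that contain dates. However, if the provided string is the only format in which the dates can be found, string processing should be sufficient.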
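For strings that strtotime rejects, one alternative (not part of the original answer) is to try an explicit list of formats with DateTime::createFromFormat. The sketch below assumes you know the candidate formats in advance; the helper name parse_with_known_formats and the format list are hypothetical and only for illustration.

```php
<?php
date_default_timezone_set('UTC'); // assumption: pin the zone so results are reproducible

// Hypothetical helper: try a fixed list of formats until one fits the whole string.
function parse_with_known_formats(string $text, array $formats)
{
    foreach ($formats as $format) {
        // The leading "!" resets fields that the format does not set (e.g. the time).
        $parsed = DateTime::createFromFormat('!' . $format, trim($text));
        if ($parsed !== false) {
            return $parsed;
        }
    }
    return false; // no format matched
}

// Illustrative formats only; extend the list for the documents at hand.
$formats = array('m/d/Y H:i', 'M. j, Y', 'F j, Y');

$parsed = parse_with_known_formats('10/28/2011 21:16', $formats);
echo $parsed ? $parsed->format('F jS, Y h:i:s A') : 'no match', "\n";
```

Unlike strtotime, this only accepts strings that match one of the listed formats exactly, so the surrounding text still has to be stripped first.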
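To illustrate the regular expression route mentioned above, here is a minimal sketch that pulls the first numeric date out of a line and hands only that substring to strtotime. The helper name extract_numeric_date and the pattern are assumptions for illustration; the pattern covers only the ISO-style and mm/dd/yyyy shapes seen in the question.

```php
<?php
date_default_timezone_set('UTC'); // assumption: pin the zone so results are reproducible

// Hypothetical helper: return a timestamp for the first numeric date in $line,
// or false when the line has no such date.
function extract_numeric_date(string $line)
{
    // Either 2011-03-21 (optionally followed by T00:43:14)
    // or 10/28/2011 (optionally followed by 21:16).
    $pattern = '~\d{4}-\d{2}-\d{2}(?:T\d{2}:\d{2}:\d{2})?'
             . '|\d{1,2}/\d{1,2}/\d{4}(?:\s+\d{1,2}:\d{2})?~';

    if (!preg_match($pattern, $line, $match)) {
        return false;
    }
    return strtotime($match[0]);
}

$time = extract_numeric_date('By Author Name and Company 03/19/2011 09:59');
echo $time === false ? 'no date found' : date('F jS, Y h:i:s A', $time), "\n";
```

Because only the matched substring reaches strtotime, the surrounding words ("By Author Name and Company") no longer need to be stripped by hand.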
$str = 'updated 2011-03-21T00:43:14
Sunday, March 20, 2011
Wednesday, March 16, 2011 | 11:25 AM
March 20, 2011 @ 12:21 pm
May 5, 2011
Published March 19, 2011
Some text here (March 19, 2011)
10/28/2011 21:16
<a href="#">Author Name</a> on Mar 17th 2011 ...
Location, ABBR., Jan. 8, 2008
01/07/2008 (6:00 pm)
By Author Name and Company 03/19/2011 09:59
Posted by Author Name on March 16, 2011 at 03:20 PM EDT';
foreach( explode( "\n", $str) as $line)
{
$line = str_replace( array( '@', '|', '(', ')'), '', trim( $line));
$line = strip_tags( $line);
if( ($time = strtotime( $line)) === false)
{
echo "Could not parse line - '" . $line . "'\n"; // Need additional processing / regex here
continue;
}
echo "Converted '" . $line . "' to '" . date( 'F jS, Y h:i:s A', $time) . "'\n";
}
Final edit:
Finally, here is an example of doing some text processing to get more of the dates to parse.
foreach( explode( "\n", $str) as $line)
{
$line = str_replace( array( '@', '|', '(', ')', 'Published', '...'), '', trim( $line));
$line = strip_tags( trim( $line));
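if( ($time = strtotime( $line)) === false)
{
// Fall back to the text after " on " (e.g. "Author Name on Mar 17th 2011").
// Searching for ' on ' with spaces avoids matching "on" inside words such as "Location".
$on_position = stripos( $line, ' on ');
$rest = ($on_position === false) ? '' : trim( substr( $line, $on_position + 4));
if( $rest === '' || ($time = strtotime( $rest)) === false)
{
echo "Could not parse line - '" . $line . "'\n";
continue;
}
$line = $rest; // the fallback parsed, so report it as converted below
}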
echo "Converted '" . $line . "' to '" . date( 'F jS, Y h:i:s A', $time) . "'\n";
}
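Answer 1 (score: 2)
I had a little time tonight, so I played around with some regular expressions, knowing that I was looking for groups of digits. The following parses everything below just fine. Also, the foreach is only an example; the regex is built for preg_match_all(), so you should be able to extract all the dates from a string without any problem.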
$str = 'updated 2011-03-21T00:43:14
Sunday, March 20, 2011
Wednesday, March 16, 2011 | 11:25 AM
March 20, 2011 @ 12:21 pm
May 5, 2011
Published March 19, 2011
Some text here (March 19, 2011)
10/28/2011 21:16
<a href="#">Author Name</a> on Mar 17th 2011 ...
Location, ABBR., Jan. 8, 2008
01/07/2008 (6:00 pm)
Published under recent news one March 17, 2011. Now onto other things!
By Author Name and Company 03/19/2011 09:59
Posted by Author Name on March 16, 2011 at 03:20 PM EDT';
$months = array(
'jan', 'january',
'feb', 'february',
'mar', 'march',
'apr', 'april',
'may',
'jun', 'june',
'jul', 'july',
'aug', 'august',
'sep', 'sept', 'september',
'oct', 'october',
'nov', 'november',
'dec', 'december',
);
header('Content-Type: text/plain');
foreach(explode( "\n", $str) as $line)
{
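// Note: this strips 'at', 'on', 'am' and 'pm' even inside other words, and dropping
// 'am'/'pm' means 12-hour times lose their meridiem (e.g. "03:20 pm" is read as 03:20).
$line = str_replace(array('@', '|', '(', ')', 'at', 'on', 'am', 'pm'), '', mb_strtolower(trim($line)));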
if(preg_match('/([a-z]+[, .]+)?(\d.+?)\D*?$/m', $line, $match))
{
$date = '';
// Is that word a valid month?
if(in_array(trim($match[1], ',. '), $months))
{
$date = $match[1];
}
$date .= $match[2];
if( ($date = strtotime($date)) !== false)
{
echo "Converted '" . $line . "' to '" . date( 'F jS, Y h:i:s A', $date) . "'\n";
continue;
}
}
else
{
print "Failed to find anything\n";
}
echo "Could not parse line - '" . $line . "'\n"; // Need additional processing / regex here
}
This feels fairly hacky; perhaps someone can still answer with a better parser.
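As a rough sketch of the preg_match_all() idea, the snippet below scans a whole block of text for "Month day, year" dates in one pass. It uses a different, narrower pattern than the one in the answer; the helper name find_textual_dates and the pattern are assumptions for illustration, and numeric dates or times are deliberately left out.

```php
<?php
date_default_timezone_set('UTC'); // assumption: pin the zone so results are reproducible

// Hypothetical helper: map every "Month day, year" style date found in $text
// to a normalized Y-m-d string.
function find_textual_dates(string $text): array
{
    $month = '(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?'
           . '|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)';
    // Month name, optional dot, day with optional ordinal suffix, optional comma, 4-digit year.
    $pattern = '/\b' . $month . '\.?\s+\d{1,2}(?:st|nd|rd|th)?,?\s+\d{4}\b/i';

    $found = array();
    preg_match_all($pattern, $text, $matches);
    foreach ($matches[0] as $candidate) {
        $time = strtotime($candidate);
        if ($time !== false) {
            $found[$candidate] = date('Y-m-d', $time);
        }
    }
    return $found;
}

$text = "Published under recent news one March 17, 2011. Now onto other things!\n"
      . "Some text here (March 19, 2011)\n"
      . "May 5, 2011";
print_r(find_textual_dates($text));
```

Matching the month name explicitly avoids the separate in_array() check, at the cost of a longer pattern.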