Question

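I'm looking for a good example of using regular expressions in PHP to "reverse engineer" a form letter (with a known format, of course) that has been pasted into a multiline textbox and sent to a script for processing.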

So, for example, let's assume this is the original plain-text input (taken from a USDA press release):

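WASHINGTON, April 5, 2010 - North American Bison Co-Op, a New Rockford, N.D., establishment is recalling approximately 25,000 pounds of whole beef heads containing tongues that may not have had the tonsils completely removed, which is not compliant with regulations that require the removal of tonsils from cattle of all ages, the U.S. Department of Agriculture's Food Safety and Inspection Service (FSIS) announced today.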

For clarity, the variable fields are marked below:

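[pr_city=]WASHINGTON, [pr_date=]April 5, 2010 - [corp_name=]North American Bison Co-Op, a [corp_city=]New Rockford, [corp_state=]N.D., establishment is recalling approximately [amount=]25,000 pounds of [product=]whole beef heads containing tongues that may not have had the tonsils completely removed, which is not compliant with regulations that require [reason=]the removal of tonsils from cattle of all ages, the U.S. Department of Agriculture's Food Safety and Inspection Service (FSIS) announced today.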

How can I efficiently extract the contents of the following fields from my example?

pr_city
pr_date
corp_name
corp_city
corp_state
amount
product
reason

Any help would be appreciated, thanks.

Answer 1

Well, a regex that works on your example might look like this (line breaks introduced to keep this beast legible; they need to be removed before use):

/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a 
(?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is 
recalling approximately (?P<amount>.*?) of (?P<product>.*?), 
which is not compliant with regulations that require (?P<reason>.*?), 
the U\.S\. Department of Agriculture\'s Food Safety and Inspection 
Service \(FSIS\) announced today\.$/

So, in PHP you could do

if (preg_match('/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a (?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is recalling approximately (?P<amount>.*?) of (?P<product>.*?), which is not compliant with regulations that require (?P<reason>.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\.$/', $subject, $regs)) {
    $prcity = $regs['pr_city'];
    $prdate = $regs['pr_date'];
    // ... etc. for the remaining named groups
} else {
    $result = "";
}

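This assumes a few things, e.g. that there are no line breaks and that the input is the entire string (not a larger string from which this part has to be extracted). I've tried to make some sensible assumptions about the legal values, but it's quite possible that other inputs will break this, so more test cases are probably needed.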
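One way to relax the "no line breaks" assumption is to collapse whitespace before matching. The following is only a sketch: the function name extract_recall_fields is made up for illustration, and it still assumes the input is the whole letter and follows the exact wording of the example.

```php
<?php
// Hypothetical helper (not from the original answer): normalise whitespace,
// then extract the named fields with the same pattern as above.
function extract_recall_fields(string $input): ?array
{
    // Collapse line breaks and runs of whitespace from the textarea into single spaces.
    $subject = trim(preg_replace('/\s+/', ' ', $input));

    $pattern = '/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a '
        . '(?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is '
        . 'recalling approximately (?P<amount>.*?) of (?P<product>.*?), '
        . 'which is not compliant with regulations that require (?P<reason>.*?), '
        . 'the U\.S\. Department of Agriculture\'s Food Safety and Inspection '
        . 'Service \(FSIS\) announced today\.$/';

    if (preg_match($pattern, $subject, $regs) !== 1) {
        return null; // the letter did not follow the expected format
    }

    // preg_match returns both numeric and named keys; keep only the named ones.
    return array_filter($regs, 'is_string', ARRAY_FILTER_USE_KEY);
}
```

Returning null on a failed match lets the calling script tell "format not recognised" apart from a field that is legitimately empty.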

Answer 2

If the surrounding text is constant, then something like this partial regex would do the trick:

preg_match('/^(.*?), (.*?) - (.*?), a (.*?), (.*?), establishment is recalling approximately (.*?), which is not compliant with regulations that require (.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\./', $text, $matches);

// $matches[1] is 'WASHINGTON'
// $matches[2] is 'April 5, 2010'
// $matches[3] ... etc...

If the surrounding text changes, you'll end up with lots of bad matches, no matches, etc. Basically you'd need an AI to parse/understand the press releases.
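When the constant text really is fixed, escaping it by hand is error-prone (the dots, the parentheses and the apostrophe all need attention). A sketch of one alternative is to keep the letter as a template with {field} placeholders and generate the regex from it. The function name template_to_regex and the {field} placeholder syntax are assumptions made for this illustration, not anything PHP provides.

```php
<?php
// Hypothetical helper: turn a template with {field} placeholders into an anchored regex.
function template_to_regex(string $template): string
{
    // Escape every regex metacharacter in the constant text (and the "/" delimiter).
    $quoted = preg_quote($template, '/');
    // preg_quote turns "{name}" into "\{name\}"; swap each one for a lazy named group.
    $body = preg_replace('/\\\\\{(\w+)\\\\\}/', '(?P<$1>.*?)', $quoted);
    return '/^' . $body . '$/';
}

$template = '{pr_city}, {pr_date} - {corp_name}, a {corp_city}, {corp_state}, '
    . 'establishment is recalling approximately {amount} of {product}, '
    . 'which is not compliant with regulations that require {reason}, '
    . 'the U.S. Department of Agriculture\'s Food Safety and Inspection '
    . 'Service (FSIS) announced today.';

$regex = template_to_regex($template);
```

Every placeholder becomes a lazy .*? group here, which is more permissive than hand-picked character classes, so bad matches are still possible when a field itself contains the separator text.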

Answer 3

Edit: Please ignore this crazy answer, as the other two are better. I should delete it, but I'll leave it here for reference.

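I have a crazy idea that just might work: build an XML string from the input by adding tags, then parse it. It might look something like this (completely untested) code: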

$xml = preg_replace('/([^,]*), ([^-]*)- ...etc.../', '<pr_city>\1</pr_city><pr_date>\2</pr_date> ...etc...', $text);

Parsing the XML afterwards is an unnecessarily complicated process that is best left to the PHP documentation: http://www.php.net/manual/en/function.xml-parse.php.

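You could also consider using this approach to convert the text to JSON and then use json_decode() to parse it. In either case, you have to think about what happens when " and > characters appear in the input.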
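As a rough illustration of the escaping concern, a captured value can be passed through htmlspecialchars() before it is wrapped in a tag. The helper name xml_field is invented for this sketch, and the tag name is assumed to be a trusted, valid element name.

```php
<?php
// Hypothetical helper: wrap one captured value in an XML element.
function xml_field(string $tag, string $value): string
{
    // ENT_XML1 | ENT_QUOTES escapes &, <, > and both quote characters,
    // so the value cannot break out of the surrounding markup.
    return "<$tag>" . htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8') . "</$tag>";
}

// For the JSON variant, json_encode() takes care of escaping quotes in values.
$json = json_encode(['product' => 'heads > 5 "lb"']);
```

If the fields have already been captured into an array, encoding that array directly is probably simpler than assembling markup by string replacement.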

It may be easier to match and remove just one piece of the text at a time.
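A minimal sketch of that piece-by-piece idea, assuming the wording of the example letter: each step uses a small pattern anchored at the start of the remaining text, captures one field, and then cuts the matched part off. The function name consume_fields and the $steps array are made up for illustration.

```php
<?php
// Hypothetical helper: peel fields off the front of the text one at a time.
function consume_fields(string $text, array $steps): ?array
{
    $fields = [];
    foreach ($steps as $name => $pattern) {
        // Each pattern is anchored at the start and captures one field in group 1.
        if (preg_match($pattern, $text, $m) !== 1) {
            return null; // stop at the first piece that does not fit
        }
        $fields[$name] = $m[1];
        // Drop everything matched so far, including the literal text after the field.
        $text = substr($text, strlen($m[0]));
    }
    return $fields;
}

// One small pattern per field, in the order the fields appear in the letter.
$steps = [
    'pr_city'    => '/^([^,]+), /',
    'pr_date'    => '/^([^-]+) - /',
    'corp_name'  => '/^(.*?), a /',
    'corp_city'  => '/^([^,]+), /',
    'corp_state' => '/^([^,]+), establishment is recalling approximately /',
    'amount'     => '/^(.*?) of /',
    'product'    => '/^(.*?), which is not compliant with regulations that require /',
    'reason'     => '/^(.*?), the U\.S\. Department of Agriculture/',
];
```

The small patterns are easier to read and debug than one long expression, and the loop could be extended to report which field failed to match.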

Extracting variables from a plain-text form letter using regular expressions?
