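With the help of Casimir et Hippolyte, I have been trying to parse some text like the example shown below. (Note: my original question oversimplified the sample text, because I thought I could easily adapt whatever solution was offered to the real text. However, after banging my fingers on the keys and my head against the wall, I am still none the wiser.)
Here is what I have so far. I tried escaping the data (addslashes), but I thought I would post $subject in its raw(er) form: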
<?php
$subject = "
Ydqk‚_,¦#¦#À%¦#¦#¦#¦#¦#èeèe2%Ž¦#¦#¦#Cf¦#¦#¦#¦#qk¦#¦#¦#¦#¦#¦#¦#¦#¦#Ð ð:SOME COMPANY<br />
WITH A LONG NAME<br />
The Big Barn, 23 London Lane, Cheltenham, Glos. GL1 1GL<br />
Tel. 022234 567890 Fax. 02234 345678 Email. <a href= mailto:info@some.co.uk </a>info@some.co.uk<br />
Company: Another Company (AKA) – 22 London Lane, Cheltenham, GL1 2GL<br />
FAO: Mr D. Mistify/ A. Clarity/ Jo Bloggs<br />
PROJECT OMAHA <br />
<br />
<br />
CONTRACT No. 14 DATE 10/6/13 <br />
No. QUESTION ANSWER <br />
<br />
973 <br />
Hi, it's me again:<br />
I'm very, very confused. Why do regular expressions seem such a dark art?<br />
Surely it can't be as hard as I manage to make it seem?<br />
Please advise<br />
Thank you. <br />
Date Required – <br />
17/6/13 <br />
<br />
Signed for and on behalf of Some Company with a Long Name Limited<br />
Me Again – Senior Moment<br />
________________________________________________________<br />
<br />
<br />
<br />
<br />
<br />
<br />
QUESTION / ANSWER SHEET<br />
Some Company with a Long Name<br />
<br />
Question and Answer System<br />
AA414<br />
’“¸¹ÉÊËÌÔ...descends into gibberish...
";
$pattern = '~
Project\hNo\.\h\d++\hDATE\h
(?<date>\d{1,2}\/\d{1,2}\/\d{1,2})
\s++No\.\hQUESTION\hANSWER\s++
(?<No>\d++)\s++
# all characters but D or D not followed by "ate Required"
(?<desc>(?>[^D]++|D(?!ate\hRequired))+)
\D++
(?<date_required>\d{1,2}\/\d{1,2}\/\d{1,2})
~x';
preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
print_r($matches);
?>
I would like to extract the following:
Answer 0 (score: 1)
The idea is to avoid the newline problem by not using the dot ($subject is your string):
$pattern = '~
Project\hNo\.\h\d++\hDATE\h
(?<date>\d{1,2}\/\d{1,2}\/\d{1,2})
\s++No\.\hQUESTION\hANSWER\s++
(?<No>\d++)\s++
# all characters but D or D not followed by "ate Required"
(?<desc>(?>[^D]++|D(?!ate\hRequired))+)
\D++
(?<date_required>\d{1,2}\/\d{1,2}\/\d{1,2})
~x';
preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
print_r($matches);
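To see why avoiding the dot helps, here is a minimal sketch. The two-line sample string and the variable names are made up for illustration; they are not taken from the real data. By default the dot does not match a line feed (the s modifier would change that), while \s and a negated character class both do.

```php
<?php
// Hypothetical two-line sample; "\n" is a real line feed in a double-quoted string.
$text = "973\nDate Required";

// The dot does not match "\n" by default, so this fails.
$withDot = preg_match('~973.Date~', $text);           // 0

// \s matches any whitespace, including line feeds.
$withSpace = preg_match('~973\s++Date~', $text);      // 1

// A negated class matches anything except the listed characters, "\n" included.
$withNegClass = preg_match('~973[^D]++Date~', $text); // 1

var_dump($withDot, $withSpace, $withNegClass);
```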
Note that I use possessive quantifiers and atomic groups to avoid as much backtracking as possible.
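As a small illustration of what "no backtracking" means here, the sketch below compares a greedy quantifier with its possessive and atomic counterparts. The test string is arbitrary, and the patterns are deliberately built so that only the greedy form can succeed.

```php
<?php
$digits = '12345';

// Greedy: \d+ first takes all five digits, then gives one back
// so that the trailing \d can match.
$greedy = preg_match('~^\d+\d$~', $digits);      // 1

// Possessive: \d++ never gives back what it consumed,
// so nothing is left for the trailing \d.
$possessive = preg_match('~^\d++\d$~', $digits); // 0

// Atomic group: (?>\d+) behaves like the possessive form here.
$atomic = preg_match('~^(?>\d+)\d$~', $digits);  // 0

var_dump($greedy, $possessive, $atomic);
```

In the answer's patterns nothing needs to be given back after \d++ or \s++, so the possessive forms only remove useless retries when a match attempt fails.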
Based on your new example string, here is a new pattern (in a kind of lex style) that is more readable and easier to edit:
$pattern = <<<LOD
~
# Raw types
(?(DEFINE)(?<uint> \d++ ))
(?(DEFINE)(?<date> \d{1,2}\/\d{1,2}\/\d{1,2} ))
# Custom types
(?(DEFINE)(?<void> (?>\s++|<br\b[^>]*+>)* ))
(?(DEFINE)(?<desc> (?>[^D]++|D(?!ate\h++Required))+ ))
# Anchors
(?(DEFINE)(?<A_prj_date> PROJECT(?>[^D]++|D(?!ATE\b))+DATE\h*+ ))
(?(DEFINE)(?<A_prj_number> \g<void>No\.\h++QUESTION\h++ANSWER\b\D++ ))
(?(DEFINE)(?<A_prj_desc> \g<void> ))
(?(DEFINE)(?<A_prj_date_req> Date\h++Required\D++ ))
# Pattern
\g<A_prj_date> (?<prj_date> \g<date> )
\g<A_prj_number> (?<prj_number> \g<uint> )
\g<A_prj_desc> (?<prj_desc> \g<desc> )
\g<A_prj_date_req> (?<prj_date_req> \g<date> )
~xi
LOD;
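First come the definitions of each component you need.
Then comes the pattern itself, which is composed from those elements.
The result is highly editable: you can adjust any subpattern as needed, add new ones, and build new subpatterns out of existing ones.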
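The same define-then-compose idea, reduced to a minimal sketch. The group names number and when and the one-line sample are assumptions made for illustration; only uint and date mirror the definitions above. A nowdoc is used so that PHP leaves the backslashes untouched.

```php
<?php
$pattern = <<<'EOT'
~
# Building blocks, defined once and only used through \g<name>
(?(DEFINE)(?<uint> \d++ ))
(?(DEFINE)(?<date> \d{1,2}/\d{1,2}/\d{1,2} ))

# Pattern, composed from the blocks above
No\.\h++ (?<number> \g<uint> ) \h++ DATE \h++ (?<when> \g<date> )
~x
EOT;

// Hypothetical one-line sample, modelled on the CONTRACT line of the question.
$text = 'CONTRACT No. 14 DATE 10/6/13';

if (preg_match($pattern, $text, $m)) {
    echo $m['number'], ' ', $m['when'], PHP_EOL; // 14 10/6/13
}
```

The groups declared inside (?(DEFINE)...) never match on their own, so they show up in the result as empty entries; only the groups in the Pattern section carry captured text.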
For example, you can try replacing the A_prj_number subpattern with \D++, which seems to work well enough for your example string:
(?(DEFINE)(?<A_prj_number>\D++))
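Another advantage of this syntax is that you can easily debug the pattern by commenting out its elements one at a time, from the last to the first (in the Pattern section), until you get a match: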
# Pattern
\g<A_prj_date> (?<prj_date> \g<date> )
\g<A_prj_number> (?<prj_number> \g<uint> )
# \g<A_prj_desc> (?<prj_desc> \g<desc> )
# \g<A_prj_date_req> (?<prj_date_req> \g<date> )
Note: if you only have one item, use preg_match instead of preg_match_all.
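A short sketch of the difference in the result arrays, using a shortened sample string and a bare date pattern that are assumptions for illustration only:

```php
<?php
// Hypothetical shortened sample containing the two dates from the question.
$text = 'DATE 10/6/13 ... Date Required 17/6/13';
$pattern = '~\d{1,2}/\d{1,2}/\d{1,2}~';

// preg_match stops at the first match; $first is a flat array.
preg_match($pattern, $text, $first);
echo $first[0], PHP_EOL;       // 10/6/13

// preg_match_all with PREG_SET_ORDER returns one sub-array per match.
preg_match_all($pattern, $text, $all, PREG_SET_ORDER);
echo count($all), PHP_EOL;     // 2
echo $all[1][0], PHP_EOL;      // 17/6/13
```

With a single item, preg_match saves one level of nesting: the named groups are read directly as $first['name'] instead of $all[0]['name'].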