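I need to extract data from the plain text that is left after stripping the HTML tags from a web page. The tags are stripped because the page consists of tabular data, but with tables nested inside tables nested inside tables, and so on (very ugly HTML). After cleaning the code (using HTML Tidy) and removing the tags, the site returns something like this: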
Visitor ID : 123456789 HostName: 127.0.01 IP : 127.0.0.1 First Visit -> Entry Page : First Visit Entry Page Title Example First Visit -> Referrer: http://somepage.com First Visit : 302 Day(s) Last Visit : 09/23/2011 ISP: Initech Country: Some country Country: Some country Browser: Chrome Screen Res: Unknow 4 Billion colors (32 bit) Javascript: Enabled Page Views: 1 File Downloaded: 0 Daily Visits: 1 Visit Length: 0 minutes 0 seconds Entry Page: Entry page title Exit Page: Exit page title Referring URL: No
(As you can see, a very long and random mess.)
I want to turn it into this:
Visitor ID: 123456789
HostName: 127.0.01
IP: 127.0.01
First Visit: 302 Day(s)
First Visit -> Entry Page: First Visit Entry Page Title Example
First Visit -> Referrer: http://somepage.com
Last Visit: 09/23/2011
ISP: Initech
Country: Some country
Country: Some country
Browser: Chrome
Screen Res: Unknow 4 Billion colors (32 bit)
Javascript: Enabled
Page Views: 1
File Downloaded: 0
Daily Visits: 1
Visit Length: 1 minute(s) 26 second
Entry Page: Entry page title
Exit Page: Exit page title
Referring URL: No
I am currently using regular expressions to remove the extra whitespace and to try to put the data in order. So far it is almost working:
$patterns = array("/HostName\s*:/",
"/IP\s*:/",
"/First\s+Visit\s+->\s+Entry\s+Page\s*:/",
"/First\s+Visit\s+->\s+Referrer\s*:/",
"/First\s+Visit\s*:/",
"/\bLast\s+Visit\s*:/",
"/\bISP\s*:/",
"/\bCountry\s*:/",
"/\bBrowser\s*:/",
"/\bScreen\s*Res\s*:/",
"/\bJavascript\s*:/",
"/\bPage\s+Views\s*:/",
"/\bFile\s+Downloaded\s*:/",
"/\bDaily\s+Visits\s*:/",
"/\bVisit\s+Length\s*:/",
"/\bEntry\s+Page\s*:/",
"/\bExit\s+Page\s*:/",
"/\bReferring\s+URL\s*:/",
"/\bFrom\s+Campaign\s*:/" );
$replacements = array("\nHostName:",
"\nIP:",
"\nFirst Visit -> Entry Page:",
"\nFirst Visit -> Referrer:",
"\nFirst Visit:",
"\nLast Visit:",
"\nISP:",
"\nCountry:",
"\nBrowser:",
"\nScreen Res:",
"\nJavascript:",
"\nPage Views:",
"\nFile Downloaded:",
"\nDaily Visits:",
"\nVisit Length:",
"\nEntry Page:",
"\nExit Page:",
"\nReferring URL:",
"\nFrom Campaign:" );
ksort( $patterns );
ksort( $replacements );
$fixed_text = preg_replace ( $patterns, $replacements, $ugly_mess );
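However, this does not work as expected. Note that some field names are similar and the regular expressions do not handle them correctly, which produces output like this: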
Visitor ID: 123456789
HostName: 127.0.0.1
IP: 127.0.0.1
Last Visit: 302 Day(s)
First Visit: 10 June 2010
First Visit ->
Entry Page: First Visit Entry Page Title Example
First Visit -> Referrer: http://somepage
.com
ISP: Initech
Country: Some Country
Country: Some Country
Browser: Chrome
Screen Res: Unknow 4 Billion colors (32 bit)
Javascript: Enabled
Page Views: 1
File Downloaded: 0
Daily Visits: 1
Visit Length: 1 minute(s) 26 second
Entry Page: Entry page title
Exit Page: Exit page title
Referring URL: No
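I may be going about this the wrong way, which is why I am asking for suggestions or fixes to the current code. Any ideas?
Answer 0 (score: 0)
Use a match instead of a replace, with one pattern that captures the value of each field. I'm using JavaScript, but you can easily change it back to PHP.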
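// One alternative per field; each captures that field's value in its own group.
var pattern = "(?:Visitor\\s*ID\\s*:\\s*(\\d+)\\s*)";
pattern += "|(?:HostName\\s*:\\s*([^ ]+)\\s*)";
pattern += "|(?:IP\\s*:\\s*([^ ]+)\\s*)";
pattern += "|(?:First\\s*Visit\\s*->\\s*Entry Page\\s*:\\s*(.+?)\\s*(?=First\\s*Visit\\s*->))";
pattern += "|(?:First\\s*Visit\\s*->\\s*Referrer\\s*:\\s*(.+?)\\s*(?=First\\s*Visit\\s*:))";
pattern += "|(?:First\\s*Visit\\s*:\\s*(\\d+)\\s*Day\\(s\\)\\s*)";
pattern += "|(?:Last\\s*Visit\\s*:\\s*(\\d+/\\d+/\\d+)\\s*)";
pattern += "|(?:ISP\\s*:\\s*(.+?)\\s*(?=Country\\s*:))";
pattern += "|(?:Country\\s*:\\s*(.+?)\\s*(?=(?:Country|Browser)\\s*:))";
pattern += "|(?:Browser\\s*:\\s*(.+?)\\s*(?=Screen\\s*Res\\s*:))";
pattern += "|(?:Screen\\s*Res\\s*:\\s*(.+?)\\s*(?=Javascript\\s*:))";
pattern += "|(?:Javascript\\s*:\\s*(.+?)\\s*(?=Page\\s*Views\\s*:))";
pattern += "|(?:Page\\s*Views\\s*:\\s*(\\d+)\\s*)";
pattern += "|(?:File\\s*Downloaded\\s*:\\s*(\\d+)\\s*)";
pattern += "|(?:Daily\\s*Visits\\s*:\\s*(\\d+)\\s*)";
pattern += "|(?:Visit\\s*Length\\s*:\\s*((?:\\d+ (?:hours|minutes|seconds)\\s*)+))";
// Field names in capture-group order (group 1 is "Visitor ID", and so on).
var names = ["Visitor ID", "HostName", "IP", "First Visit -> Entry Page",
             "First Visit -> Referrer", "First Visit", "Last Visit", "ISP", "Country",
             "Browser", "Screen Res", "Javascript", "Page Views", "File Downloaded",
             "Daily Visits", "Visit Length"];
var regex = new RegExp(pattern, "g");
// readData() stands in for however you obtain the text.
// Collapse runs of whitespace (non-breaking spaces included) to single spaces.
var content = readData().replace(/\s+/g, " ");
// Match field by field: JavaScript resets the captures of a repeated group on
// every iteration, so wrapping the alternatives in ^(?:...)+ would lose values.
for (var match of content.matchAll(regex)) {
  // Exactly one capture group participates in each match.
  var i = match.findIndex(function (value, index) { return index > 0 && value !== undefined; });
  console.log(names[i - 1] + ": " + match[i].trim());
}
// add alternatives (and names) for the remaining fields in the same way...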
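A note on why the original `preg_replace` call misbehaves: when it is given arrays, PHP applies the patterns one after another, each to the result of the previous replacement. The later `\bEntry\s+Page\s*:` pattern therefore also matches the tail of the already rewritten `First Visit -> Entry Page:` label and pushes it onto a new line. A way around this, sketched below under some assumptions, is to scan the text once with a single alternation of all labels and cut the values out between consecutive matches. The helper names `escapeRegExp`, `buildLabelRegex`, and `formatReport` are made up for this illustration, and the sketch assumes that no field value itself contains a known label followed by a colon.

```javascript
// Labels taken from the question. Their order here does not matter,
// because buildLabelRegex sorts them longest-first.
const labels = [
  "Visitor ID", "HostName", "IP",
  "First Visit -> Entry Page", "First Visit -> Referrer", "First Visit",
  "Last Visit", "ISP", "Country", "Browser", "Screen Res", "Javascript",
  "Page Views", "File Downloaded", "Daily Visits", "Visit Length",
  "Entry Page", "Exit Page", "Referring URL", "From Campaign",
];

// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one global regex that matches any label followed by a colon.
// Words inside a label may be separated by any amount of whitespace.
function buildLabelRegex(names) {
  const alternatives = names
    .slice()
    .sort((a, b) => b.length - a.length)
    .map((name) => name.split(/\s+/).map(escapeRegExp).join("\\s+"));
  return new RegExp("\\b(" + alternatives.join("|") + ")\\s*:", "g");
}

// Return one "Label: value" line per field, in the order found in the input.
function formatReport(raw, names) {
  const lines = [];
  let currentLabel = null;
  let valueStart = 0;
  const flush = (valueEnd) => {
    if (currentLabel !== null) {
      const value = raw.slice(valueStart, valueEnd).replace(/\s+/g, " ").trim();
      lines.push(currentLabel + ": " + value);
    }
  };
  for (const match of raw.matchAll(buildLabelRegex(names))) {
    flush(match.index); // the previous value ends where this label starts
    currentLabel = match[1].replace(/\s+/g, " ");
    valueStart = match.index + match[0].length;
  }
  flush(raw.length); // the last value runs to the end of the text
  return lines.join("\n");
}

// Sample input copied from the question.
const sample =
  "Visitor ID : 123456789 HostName: 127.0.01 IP : 127.0.0.1 " +
  "First Visit -> Entry Page : First Visit Entry Page Title Example " +
  "First Visit -> Referrer: http://somepage.com First Visit : 302 Day(s) " +
  "Last Visit : 09/23/2011 ISP: Initech Country: Some country " +
  "Country: Some country Browser: Chrome " +
  "Screen Res: Unknow 4 Billion colors (32 bit) Javascript: Enabled " +
  "Page Views: 1 File Downloaded: 0 Daily Visits: 1 " +
  "Visit Length: 0 minutes 0 seconds Entry Page: Entry page title " +
  "Exit Page: Exit page title Referring URL: No";

console.log(formatReport(sample, labels));
```

Because the text is scanned in a single pass, a match that starts at `First Visit -> Entry Page :` consumes the whole label, so the shorter `Entry Page` alternative never gets a chance to split it. The fields come out in input order; if you need a fixed order, collect them into a map first. The same idea should carry over to PHP with `preg_split` or `preg_match_all`.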