Extracting information from plain text using regexp or a more effective method

Time: 2011-09-24 07:47:37

Tags: php regex web-scraping

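I need to extract data from the plain text that is left after stripping the HTML tags from a web page. The tags are stripped because the page consists of tabular data, but with tables nested inside tables nested inside tables, and so on (very ugly HTML). After cleaning the code (using HTML Tidy) and removing the tags, the site returns information like this: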

Visitor ID :   123456789   HostName: 127.0.01     IP :  127.0.0.1  First Visit -> Entry Page :   First   Visit    Entry    Page    Title    Example    First Visit -> Referrer: http://somepage.com   First Visit :  302 Day(s)    Last Visit :   09/23/2011    ISP: Initech   Country:  Some country Country:  Some  country    Browser: Chrome   Screen Res: Unknow 4 Billion colors (32 bit)   Javascript: Enabled   Page Views: 1     File Downloaded: 0  Daily Visits: 1 Visit Length: 0 minutes 0 seconds Entry Page: Entry page title Exit Page: Exit page title   Referring URL: No

(As you can see, a very long and random mess.)

I want to turn it into this:

Visitor ID: 123456789
HostName: 127.0.01
IP: 127.0.0.1
First Visit: 302 Day(s)
First Visit -> Entry Page: First Visit Entry Page Title Example
First Visit -> Referrer: http://somepage.com
Last Visit: 09/23/2011
ISP: Initech
Country: Some country
Country: Some country
Browser: Chrome
Screen Res: Unknow 4 Billion colors (32 bit) 
Javascript: Enabled
Page Views: 1
File Downloaded: 0
Daily Visits: 1
Visit Length: 1 minute(s) 26 second 
Entry Page: Entry page title
Exit Page: Exit page title
Referring URL: No

I am currently using regexps to remove the extra whitespace and to try to sort the data. So far it almost works:

$patterns       = array("/HostName\s*:/",
                        "/IP\s*:/",
                        "/First\s+Visit\s+->\s+Entry\s+Page\s*:/",
                        "/First\s+Visit\s+->\s+Referrer\s*:/",
                        "/First\s+Visit\s*:/",
                        "/\bLast\s+Visit\s*:/",
                        "/\bISP\s*:/",
                        "/\bCountry\s*:/",
                        "/\bBrowser\s*:/",
                        "/\bScreen\s*Res\s*:/",
                        "/\bJavascript\s*:/",
                        "/\bPage\s+Views\s*:/",
                        "/\bFile\s+Downloaded\s*:/",
                        "/\bDaily\s+Visits\s*:/",
                        "/\bVisit\s+Length\s*:/",
                        "/\bEntry\s+Page\s*:/",
                        "/\bExit\s+Page\s*:/",
                        "/\bReferring\s+URL\s*:/",
                        "/\bFrom\s+Campaign\s*:/"   );

$replacements   = array("\nHostName:",
                        "\nIP:",
                        "\nFirst Visit -> Entry Page:",
                        "\nFirst Visit -> Referrer:",
                        "\nFirst Visit:",
                        "\nLast Visit:",
                        "\nISP:",
                        "\nCountry:",
                        "\nBrowser:",
                        "\nScreen Res:",
                        "\nJavascript:",
                        "\nPage Views:",
                        "\nFile Downloaded:",
                        "\nDaily Visits:",
                        "\nVisit Length:",
                        "\nEntry Page:",
                        "\nExit Page:",
                        "\nReferring URL:",
                        "\nFrom Campaign:"  );
ksort( $patterns );
ksort( $replacements );

$fixed_text      = preg_replace ( $patterns, $replacements, $ugly_mess );

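However, this does not work as expected. Note that some of the field names are similar and the regular expressions do not handle them correctly, so the output looks like this: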

Visitor ID: 123456789 
HostName: 127.0.0.1 
IP: 127.0.0.1 
Last Visit: 302 Day(s) 
First Visit: 10 June 2010 
First Visit -> 
Entry Page: First Visit Entry Page Title Example
First Visit -> Referrer: http://somepage
.com
ISP: Initech 
Country: Some Country 
Country: Some Country 
Browser: Chrome
Screen Res: Unknow 4 Billion colors (32 bit) 
Javascript: Enabled  
Page Views: 1
File Downloaded: 0 
Daily Visits: 1
Visit Length: 1 minute(s) 26 second 
Entry Page: Entry page title
Exit Page: Exit page title
Referring URL: No  

I may be going about this the wrong way, which is why I am asking for suggestions or fixes to the current code. Any ideas?

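1 Answer:

Answer 0: (score: 0)

Use a match instead of a replace, with one pattern that captures every field. I am using JavaScript, but you can easily change it back to PHP.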

  // Assumes the fields always appear in the order shown in the question, so
  // the sub-patterns are concatenated. A repeated alternation, (?:a|b|...)+,
  // would not work in JavaScript: capture groups inside a repeated group are
  // reset on every iteration, so earlier fields would come back undefined.
  var pattern = "^";
  pattern += "Visitor\\s*ID\\s*:\\s*(\\d+)\\s*";
  pattern += "HostName\\s*:\\s*([^ ]+)\\s*";
  pattern += "IP\\s*:\\s*([^ ]+)\\s*";
  pattern += "First\\s*Visit\\s*->\\s*Entry\\s*Page\\s*:\\s*(.+?)\\s*(?=First\\s*Visit\\s*->)";
  pattern += "First\\s*Visit\\s*->\\s*Referrer\\s*:\\s*(.+?)\\s*(?=First\\s*Visit\\s*:)";
  pattern += "First\\s*Visit\\s*:\\s*(\\d+)\\s*Day\\(s\\)\\s*";
  pattern += "Last\\s*Visit\\s*:\\s*(\\d+/\\d+/\\d+)\\s*";
  pattern += "ISP\\s*:\\s*(.+?)\\s*(?=Country\\s*:)";
  // Country can appear more than once; the group keeps the last value.
  pattern += "(?:Country\\s*:\\s*(.+?)\\s*(?=(?:Country|Browser)\\s*:))+";
  pattern += "Browser\\s*:\\s*(.+?)\\s*(?=Screen\\s*Res\\s*:)";
  pattern += "Screen\\s*Res\\s*:\\s*(.+?)\\s*(?=Javascript\\s*:)";
  pattern += "Javascript\\s*:\\s*(.+?)\\s*(?=Page\\s*Views\\s*:)";
  pattern += "Page\\s*Views\\s*:\\s*(\\d+)\\s*";
  pattern += "File\\s*Downloaded\\s*:\\s*(\\d+)\\s*";
  pattern += "Daily\\s*Visits\\s*:\\s*(\\d+)\\s*";
  pattern += "Visit\\s*Length\\s*:\\s*((?:\\d+ (?:hours|minutes|seconds)\\s*)+)";
  var regex = new RegExp(pattern);

  // readData() is a placeholder for however the tag-stripped text is obtained.
  // Collapse each run of whitespace to one space (removing every space, as
  // the original line did, would break the patterns above).
  var content = readData().replace(/\s+/g, " ").trim();
  var match = content.match(regex);
  if (match) {
    console.log("Visitor ID: " + match[1]);
    console.log("HostName: " + match[2]);
    console.log("IP: " + match[3]);
    // continue on...
  }
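
If you would rather stay close to the replace-based code in the question, a single-pass replace may be enough. The sketch below is only an illustration under a few assumptions: the label list is copied from the question's patterns (plus "Visitor ID"), values never contain a label followed by a colon, and the names LABELS, LABEL_RE and formatFields are made up for this example. Because all labels live in one alternation and the text is scanned once, "First Visit -> Entry Page :" is consumed as a whole and the shorter "Entry Page" pattern never gets a chance to split it, which is what appears to happen when the patterns are applied one after another.

```javascript
// Field labels taken from the question. Longer labels are listed first as a
// precaution, so a short label can never win over a longer one that contains it.
const LABELS = [
  "First Visit -> Entry Page",
  "First Visit -> Referrer",
  "First Visit",
  "Last Visit",
  "Visitor ID",
  "HostName",
  "IP",
  "ISP",
  "Country",
  "Browser",
  "Screen Res",
  "Javascript",
  "Page Views",
  "File Downloaded",
  "Daily Visits",
  "Visit Length",
  "Entry Page",
  "Exit Page",
  "Referring URL",
  "From Campaign",
];

// One alternation for all labels; a space inside a label matches any whitespace run.
const LABEL_RE = new RegExp(
  "\\b(" +
    LABELS.map((label) => label.split(" ").join("\\s+")).join("|") +
    ")\\s*:",
  "g"
);

// Returns the fields one per line, in the order they appear in the input.
function formatFields(text) {
  return text
    .replace(/\s+/g, " ") // collapse whitespace runs
    .replace(LABEL_RE, (whole, label) => "\n" + label + ":") // line break before each label
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "")
    .join("\n");
}

const demo =
  "Visitor ID :   123456789   First Visit -> Entry Page :   Some   Title   " +
  "First Visit :  302 Day(s)   Entry Page: Home";
console.log(formatFields(demo));
// Visitor ID: 123456789
// First Visit -> Entry Page: Some Title
// First Visit: 302 Day(s)
// Entry Page: Home
```

This keeps the fields in input order; it does not reorder them. The same idea should carry over to PHP with a single pattern passed to preg_replace_callback, though I have not tested that version.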