Question

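I need to extract a string from HTML code. I have a regular expression. After I open the file (or after I make a "get" request), I need to find the pattern.

So, I have some HTML code and I want to find a string like this: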

<input type="hidden" name="qid" ... anything is possible bla="blabla" ... value="8">

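I want to find the string qid, then find the string value="435345" after it, and extract 435345.

For now I am just trying to find this string (that part is done), and afterwards I will do a replacement (still to do), but this code does not find the pattern. What is wrong?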

open(URLS_OUT, $foundResults);
@lines = <URLS_OUT>;
$content = join('', @lines);

$content =~ /<qid\"\s*value=[^>][0-9]+/;
print 'Yes'.$1.'\n';

close(URLS_OUT);

Or this code:

my $content = $response->content(); 

while ($content =~ /<qid\"\s*value=[^>][0-9]+/g)
    {
        print 'Yes'.$1.'\n';
    }

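I have checked that the file is not empty and that it is opened correctly (I printed it out), but my program cannot find the pattern. What is wrong? I checked the regular expression with this reference (and some others): http://gskinner.com/RegExr/ and it shows that the regular expression is correct and finds what I need.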

Answer 1

Update your regular expression like this:

/<qid\"\s*value=([^>][0-9]+)/

That is, add "(" and ")" to capture the data in $1.
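To see what the parentheses change, here is a minimal sketch. The helper `first_capture` and both input strings are assumptions made for illustration. It suggests two things worth checking: the `[^>]` in front of `[0-9]+` also lands inside the group, so `$1` may begin with the opening quote, and the pattern still requires the literal text `<qid"`, which a line shaped like the one in the question does not contain.

```perl
use strict;
use warnings;

# Hypothetical helper: return $1 if the pattern matches, otherwise undef.
sub first_capture {
    my ($text, $regex) = @_;
    return $text =~ $regex ? $1 : undef;
}

my $no_group   = qr/<qid\"\s*value=[^>][0-9]+/;     # pattern from the question
my $with_group = qr/<qid\"\s*value=([^>][0-9]+)/;   # same pattern with a capture group

# Made-up input that contains the literal text the pattern expects.
my $literal = '<qid" value="435345">';
# Input shaped like the line in the question.
my $sample  = '<input type="hidden" name="qid" bla="blabla" value="8">';

for my $case ( [ 'no group, literal',   $literal, $no_group   ],
               [ 'with group, literal', $literal, $with_group ],
               [ 'with group, sample',  $sample,  $with_group ] ) {
    my ($label, $text, $regex) = @$case;
    my $got = first_capture($text, $regex);
    print "$label: ", defined $got ? $got : '(nothing captured)', "\n";
}
```

Only the second case prints a captured value, and that value starts with a double quote because `[^>]` consumed it.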

Answer 2

Your idea of how:

$content =~ /<qid\"\s*value=[^>][0-9]+/;

works is wrong. Please study basic Regex usage in Perl.

BTW: you should not parse HTML with regular expressions. There are lots of examples on the web and on SO of how to do this properly. Look them up!

For learning purposes, your regular expression would look like this (according to your comments):

my $content = q{
 <input type="hidden" id="qid" name="qid" bla="blabla" value="8">
 <input type="hidden" id="qid" name="qid" bla="blabla" value="98">
 <input type="hidden" id="qid" name="qid" bla="blabla" value="788">
 <input type="hidden" id="qid" name="qid" bla="blabla" value="128">
 <input type="hidden" id="qid" name="qid" bla="blabla" value="8123">
};
my $regex = qr{ name=     # find the attribute 'name'
                "qid"     # with a content of "qid"
                .+?       # now search along until the next 'value'
                value=    # the following attribute 'value' 
                "(\d+)    # find the number and capture it
              }x;   ## allow the regex to be formatted   

while( $content =~ /$regex/g ) { # /g - search along
   print "Yes $1 \n"
}

Once you have this working, please look into how to read the content with HTML-Parser.
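One detail the loop above changes without comment: it prints with a double-quoted string. In Perl, a single-quoted `'\n'` is not a newline but a backslash followed by `n`, so the `print` statements in the question would not end their output lines. A small sketch, with hypothetical helper names:

```perl
use strict;
use warnings;

# Hypothetical helpers that build the output line both ways.
sub single_quoted { my ($v) = @_; return 'Yes' . $v . '\n'; }   # '\n' is two characters
sub double_quoted { my ($v) = @_; return 'Yes' . $v . "\n"; }   # "\n" is a newline

for my $line ( single_quoted(8), double_quoted(8) ) {
    printf "length %d, ends with newline: %s\n",
        length($line), ( $line =~ /\n\z/ ? 'yes' : 'no' );
}
```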

Answer 3

Use HTML::Parser to deal with messy real-world HTML.

#! /usr/bin/env perl

use strict;
use warnings;

use HTML::Parser;

sub start {
  my($attr,$attrseq) = @_;
  while (defined(my $name = shift @$attrseq)) {  # first ...="qid"
    last if $attr->{$name} eq "qid";
  }
  while (defined(my $name = shift @$attrseq)) {  # then value="<num>"
    if ($name eq "value" && $attr->{$name} =~ /\A[0-9]+\z/) {
      print "Yes", $attr->{$name}, "\n";
    }
  }
}

my $p = HTML::Parser->new(
  api_version => 3,
  start_h => [\&start, "attr, attrseq"],
);
$p->parse_file(*DATA);

__DATA__
<input type="hidden" name="qid" value="8">
<input type="hidden" name="qidx" value="000000">
<foo type="hidden" name="qid" value="9">
<foo type="hidden" name="qid" value="000000x">
<foo type="hidden" name="QID" value="000000">
<bar type="hidden" NAME="qid" value="10">
<baz type="hidden" name="qid" VALUE="11">
<quux type="hidden" NAME="qid" VALUE="12">

Output:

Yes8
Yes9
Yes10
Yes11
Yes12

Answer 4

For $1 to contain a value, you need to use a capture group. Try:

$content =~ /<qid\"\s*value=([^>][0-9]+)/;
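A related point, sketched under assumptions: `$1` is only meaningful after a successful match, so it helps to test the match before printing anything. The helper name `qid_value` is hypothetical, and the pattern inside it is modelled on the shape of the sample line in the question rather than taken from this answer.

```perl
use strict;
use warnings;

# Hypothetical helper: in list context a match returns its captures,
# or an empty list when the pattern does not match.
sub qid_value {
    my ($content) = @_;
    my ($value) = $content =~ /name="qid"[^>]*\bvalue="([0-9]+)"/;
    return $value;    # undef when nothing matched
}

for my $html ( '<input type="hidden" name="qid" bla="blabla" value="8">',
               '<input type="hidden" name="other" value="8">' ) {
    my $value = qid_value($html);
    print defined $value ? "Yes $value\n" : "No match\n";
}
```

Assigning the match to `my ($value)` avoids reading a stale or undefined `$1` when the pattern fails.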

Answer 5

For the example you provided, your regular expression should look like this:

$content =~ m{
               \"       # match a double quote
               qid      # match the string: qid
               \"       # match a double quote
               [^>]*    # match anything but the closing >
               value    # match the string: value
               \=       # match an equal sign
               \"       # match a double quote
               (\d+)    # capture a string of digits
               \"       # match a double quote
             }msx;
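As a usage sketch, the pattern above can be wrapped in a small function that returns every captured value. The function name `extract_qid_values` and the sample markup are assumptions for illustration. With the `/g` flag in list context, the match returns all captures at once.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every value="<digits>" that follows "qid"
# inside the same tag.
sub extract_qid_values {
    my ($content) = @_;
    my @values = $content =~ m{
        \"qid\"     # the quoted string "qid"
        [^>]*       # anything up to, but not past, the closing >
        value=\"    # the attribute name, equal sign and opening quote
        (\d+)       # capture a string of digits
        \"          # closing quote
    }gmsx;
    return @values;
}

# Made-up markup: two tags carry "qid", one does not.
my $content = <<'HTML';
<input type="hidden" name="qid" bla="blabla" value="8">
<input type="hidden" name="other" value="99">
<input type="hidden" name="qid" value="435345">
HTML

print "Yes $_\n" for extract_qid_values($content);
```

Because `[^>]*` cannot cross a closing `>`, the tag without `"qid"` is skipped instead of lending its value to a neighbouring tag.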

Why can't the regexp find the number?
