Regex code in VB.NET to extract HTML between 2 comments is not working

Asked: 2012-04-07 00:33:11

Tags: regex vb.net html-parsing

I am trying to extract the part of an HTML document that lies between 2 comments.

Here is the test code:

Sub Main()

    Dim base_dir As String = "D:\"
    Dim test_file As String = base_dir & "72.htm"

    Dim start_comment As String = "<!-- start of content -->"
    Dim end_comment As String = "<!-- end of content -->"

    Dim regex_pattern As String = start_comment & ".*" & end_comment
    Dim input_text As String = start_comment & "some more html text" & end_comment 

    Dim match As Match = Regex.Match(input_text, regex_pattern)


    If match.Success Then
        Console.WriteLine("found {0}", match.Value)
    Else
        Console.WriteLine("not found")
    End If

    Console.ReadLine()

End Sub

The above works.

When I try to load the actual data from disk, the following code fails.

Sub Main()

    Dim base_dir As String = "D:\"
    Dim test_file As String = base_dir & "72.htm"

    Dim start_comment As String = "<!-- start of content -->"
    Dim end_comment As String = "<!-- end of content -->"

    Dim regex_pattern As String = start_comment & ".*" & end_comment
    Dim input_text As String = System.IO.File.ReadAllText(test_file).Replace(vbCrLf, "") 

    Dim match As Match = Regex.Match(input_text, regex_pattern)


    If match.Success Then
        Console.WriteLine("found {0}", match.Value)
    Else
        Console.WriteLine("not found")
    End If

    Console.ReadLine()

End Sub

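The HTML file contains the start and end comments, with a lot of HTML in between. Some of the content in the HTML file is in Arabic.

Thanks and regards.

2 Answers:

Answer 0: (score: 2)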

Try passing RegexOptions.Singleline to Regex.Match(...), like this:

Dim match As Match = Regex.Match(input_text, regex_pattern, RegexOptions.Singleline)

This makes the dot (.) match newline characters as well.
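To see the difference in isolation, here is a minimal sketch. It assumes the file still contains bare line feeds after Replace(vbCrLf, ""), which would explain the failure but is only a guess about the actual file. The input string, the module name and the use of Regex.Escape and a lazy capture group are illustrative additions, not part of the original code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SinglelineSketch
    Sub Main()
        Dim start_comment As String = "<!-- start of content -->"
        Dim end_comment As String = "<!-- end of content -->"

        ' Hypothetical input with bare line feeds (vbLf), which Replace(vbCrLf, "") would not remove.
        Dim input_text As String = "<html>" & vbLf & start_comment & vbLf
        input_text &= "<p>some more html text</p>" & vbLf & end_comment & vbLf & "</html>"

        ' Regex.Escape guards against metacharacters in the markers;
        ' (.*?) is lazy, so the match stops at the first end comment.
        Dim regex_pattern As String = Regex.Escape(start_comment) & "(.*?)" & Regex.Escape(end_comment)

        ' By default the dot does not match a line feed, so this attempt fails.
        Dim without_option As Match = Regex.Match(input_text, regex_pattern)
        ' With Singleline the dot matches every character, line feeds included.
        Dim with_option As Match = Regex.Match(input_text, regex_pattern, RegexOptions.Singleline)

        Console.WriteLine("default options: {0}", without_option.Success) ' False
        Console.WriteLine("Singleline: {0}", with_option.Success)         ' True

        If with_option.Success Then
            ' Group 1 holds only the text between the two comments.
            Console.WriteLine(with_option.Groups(1).Value.Trim())
        End If
    End Sub
End Module
```

If only the inner HTML is wanted, reading Groups(1) avoids having to strip the comment markers from match.Value afterwards.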

Answer 1: (score: 0)

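I don't know VB.NET, but does . match newlines there, or do you need to set an option for that? Consider using [\s\S] instead of . so that newlines are included.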
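For what it is worth, in .NET the dot does not match a line feed unless an option is set, so the [\s\S] form works without any option. A rough sketch, where the multi-line input is a made-up stand-in for the file contents and the module name is hypothetical:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CharacterClassSketch
    Sub Main()
        Dim start_comment As String = "<!-- start of content -->"
        Dim end_comment As String = "<!-- end of content -->"

        ' Hypothetical multi-line input standing in for the contents of the HTML file.
        Dim input_text As String = start_comment & vbCrLf & "<p>line one</p>" & vbCrLf
        input_text &= "<p>line two</p>" & vbCrLf & end_comment

        ' [\s\S] means "whitespace or non-whitespace", i.e. any character at all,
        ' so no RegexOptions value is needed. The lazy *? stops at the first end comment.
        Dim regex_pattern As String = Regex.Escape(start_comment) & "[\s\S]*?" & Regex.Escape(end_comment)

        Dim result As Match = Regex.Match(input_text, regex_pattern)

        If result.Success Then
            Console.WriteLine("found {0}", result.Value)
        Else
            Console.WriteLine("not found")
        End If
    End Sub
End Module
```

With this pattern the Replace(vbCrLf, "") call from the question is no longer needed, since line breaks are matched like any other character.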