I am trying to extract a portion of HTML that sits between two comments.
Here is the test code:
Imports System.Text.RegularExpressions

Sub Main()
    Dim base_dir As String = "D:\"
    Dim test_file As String = base_dir & "72.htm"
    Dim start_comment As String = "<!-- start of content -->"
    Dim end_comment As String = "<!-- end of content -->"
    Dim regex_pattern As String = start_comment & ".*" & end_comment
    Dim input_text As String = start_comment & "some more html text" & end_comment
    Dim match As Match = Regex.Match(input_text, regex_pattern)
    If match.Success Then
        Console.WriteLine("found {0}", match.Value)
    Else
        Console.WriteLine("not found")
    End If
    Console.ReadLine()
End Sub
The above works.
When I try to load the actual data from disk, the following code fails.
Sub Main()
    Dim base_dir As String = "D:\"
    Dim test_file As String = base_dir & "72.htm"
    Dim start_comment As String = "<!-- start of content -->"
    Dim end_comment As String = "<!-- end of content -->"
    Dim regex_pattern As String = start_comment & ".*" & end_comment
    Dim input_text As String = System.IO.File.ReadAllText(test_file).Replace(vbCrLf, "")
    Dim match As Match = Regex.Match(input_text, regex_pattern)
    If match.Success Then
        Console.WriteLine("found {0}", match.Value)
    Else
        Console.WriteLine("not found")
    End If
    Console.ReadLine()
End Sub
The HTML file contains the start and end comments, with a lot of HTML in between. Some of the content in the HTML file is in Arabic.
Thanks and regards.
Answer 0: (score: 2)
Try passing RegexOptions.Singleline to Regex.Match(...), like this:
Dim match As Match = Regex.Match(input_text, regex_pattern, RegexOptions.Singleline)
This makes the dot (.) match newline characters as well.
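One possible reason the disk version fails, assuming the file uses LF-only line endings: Replace(vbCrLf, "") removes only CR+LF pairs, so bare line feeds stay in the text, and by default the dot does not match a line feed. The sketch below uses a made-up input string (not the real 72.htm) and a hypothetical module name to show the difference.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SinglelineDemo
    Sub Main()
        Dim start_comment As String = "<!-- start of content -->"
        Dim end_comment As String = "<!-- end of content -->"
        Dim regex_pattern As String = start_comment & ".*" & end_comment

        ' Hypothetical sample with LF-only line breaks; Replace(vbCrLf, "") leaves them in place.
        Dim input_text As String = (start_comment & vbLf & "some html" & vbLf & end_comment).Replace(vbCrLf, "")

        ' Default options: "." does not match vbLf, so no match is found.
        Console.WriteLine(Regex.IsMatch(input_text, regex_pattern)) ' prints False

        ' Singleline: "." matches every character, including vbLf.
        Console.WriteLine(Regex.IsMatch(input_text, regex_pattern, RegexOptions.Singleline)) ' prints True
    End Sub
End Module
```

With RegexOptions.Singleline in place, the Replace call is no longer needed for the match to succeed.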
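Answer 1: (score: 0)
I don't know vb.net, but does . match newline characters, or do you need to set an option for that? Consider using [\s\S] instead of . so that newlines are included.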
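A minimal sketch of the [\s\S] suggestion in VB.NET, using a made-up input string and a hypothetical module name. The class [\s\S] means "whitespace or non-whitespace", so it matches every character without any regex option being set.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CharClassDemo
    Sub Main()
        Dim start_comment As String = "<!-- start of content -->"
        Dim end_comment As String = "<!-- end of content -->"

        ' [\s\S] replaces "." so that line breaks between the comments are matched too.
        Dim regex_pattern As String = start_comment & "[\s\S]*" & end_comment

        ' Hypothetical sample containing LF line breaks.
        Dim input_text As String = start_comment & vbLf & "some html" & vbLf & end_comment

        Dim match As Match = Regex.Match(input_text, regex_pattern)
        Console.WriteLine(match.Success) ' prints True
    End Sub
End Module
```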