Question

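I'm trying to extract a quote from this quote URL: 'http://www.quotedb.com/quote/quote.php?action=random_quote'. I need to extract the quote and, optionally, the person being quoted. Here is a sample response from the generator.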

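document.write('When nothing seems to help, I go and look at a stonecutter hammering away at his rock perhaps a hundred times without as much as a crack showing in it. Yet at the hundred and first blow it will split in two, and I know it was not that blow that did it, but all that had gone before.<br>');
document.write('More quotes from Jacob August Riis');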

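I know I need to parse the response to extract the quote itself, but I'm not sure how to do that. I know how to download the string, but not how to extract the quote from it. This is all I have so far: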

    Dim client As New System.Net.WebClient
    Dim grab = client.DownloadString("http://www.quotedb.com/quote/quote.php?action=random_quote")

Any help is greatly appreciated!

Answer 1

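Someone else may come up with a more elegant regular expression, but this should work. It only takes a couple of regular expressions to extract the parts of the returned data you're interested in.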

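' Requires Imports System.Text.RegularExpressions. Regex.Match yields an empty group value when nothing matches.
Dim quote = Regex.Match(grab, "document\.write\('(.*?)<br>'\);").Groups(1).Value
' The author is optional, so this is "" when the response has no author link.
Dim author = Regex.Match(grab, "document\.write\('<i>.*?>(.*?)</a></i>'\);").Groups(1).Value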
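
The two patterns above can be wrapped in small helpers that also cope with a response that has no author line. This is a minimal sketch rather than a tested client for the site: the helper names `ExtractQuote` and `ExtractAuthor`, the sample response text, and its href value are made up for illustration, and the markup is only inferred from the patterns.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module QuoteParserSketch
    ' Hypothetical helper: returns the quote text, or "" when the pattern does not match.
    Function ExtractQuote(response As String) As String
        ' Singleline lets "." span line breaks in case the quote itself contains one.
        Dim m As Match = Regex.Match(response, "document\.write\('(.*?)<br>'\);", RegexOptions.Singleline)
        Return If(m.Success, m.Groups(1).Value.Trim(), "")
    End Function

    ' Hypothetical helper: returns the author name, or "" when there is no author link.
    Function ExtractAuthor(response As String) As String
        Dim m As Match = Regex.Match(response, "document\.write\('<i>.*?>(.*?)</a></i>'\);")
        Return If(m.Success, m.Groups(1).Value.Trim(), "")
    End Function

    Sub Main()
        ' Assumed sample: the text and href are placeholders, not real output from the site.
        Dim sample As String =
            "document.write('Example quote text.<br>');" & vbLf &
            "document.write('<i>More quotes from <a href=""quotes.php?author=example"">Example Author</a></i>');"

        Console.WriteLine("Quote: " & ExtractQuote(sample))

        Dim author As String = ExtractAuthor(sample)
        Console.WriteLine("Author: " & If(author = "", "(unknown)", author))
    End Sub
End Module
```

Returning an empty string instead of indexing into the match collection avoids an exception when the author line is missing, which fits the "optional" author in the question.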

Answer 2

I'm not a fan of parsing HTML with regex, but since every response can be expected to follow the same syntax, we can assume it works in this case.

Dim pattern As String = <![CDATA[document\.write\('(?<quote>.*)<br\>'\);\ndocument\.write\('.*href=\"(?<url>[^\"]*)\">(?<author>[^<]*)</a>.*'\).*]]>.Value

Dim quoteRegex As New Regex(pattern, RegexOptions.Compiled Or RegexOptions.IgnoreCase Or RegexOptions.Singleline)

Dim client As New System.Net.WebClient
Dim grab = client.DownloadString("http://www.quotedb.com/quote/quote.php?action=random_quote")

Dim matches As MatchCollection = quoteRegex.Matches(grab)
For Each m As Match In matches
    Console.WriteLine("Quote: {0}", m.Groups("quote"))
    Console.WriteLine("Author: {0}", m.Groups("author"))
    Console.WriteLine("URL: {0}", m.Groups("url"))
Next

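This finds the quote (the text in the first document.write(), ignoring the quotation marks and the <br> tag), the quote's author (the display text of the anchor tag), and the URL for more quotes (the anchor's href attribute).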

I declared the pattern using an XML literal so that I don't have to escape all the quote characters.

Requires Imports System.Text.RegularExpressions.
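
To illustrate the XML-literal trick, here is a small sketch comparing a CDATA literal with an ordinary VB string literal, where every double quote has to be doubled. The module and function names are hypothetical, and the fragment is a shortened piece of the pattern, not the full expression. XML literals rely on LINQ to XML (System.Xml.Linq), which typical VB project templates already reference.

```vbnet
Imports System

Module PatternLiteralSketch
    ' Pattern fragment written as an XML CDATA literal: quotes appear as-is.
    Function PatternFromXmlLiteral() As String
        Return <![CDATA[href="(?<url>[^"]*)"]]>.Value
    End Function

    ' The same fragment as a normal string literal: each " must be written as "".
    Function PatternFromStringLiteral() As String
        Return "href=""(?<url>[^""]*)"""
    End Function

    Sub Main()
        ' Both forms produce the same pattern text.
        Console.WriteLine(PatternFromXmlLiteral() = PatternFromStringLiteral()) ' prints True
    End Sub
End Module
```

Either form works with Regex. The CDATA version is mainly easier to read when the pattern contains many quotes.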

How to extract a quote from a quote generator in VB.NET
