VB.NET: scraping usernames from a website

Time: 2014-03-29 08:10:58

Tags: regex vb.net scrape

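I am trying to scrape usernames from a website, and I followed this tutorial:

https://www.youtube.com/watch?v=FpAvBOhDrYk (Part 1)

https://www.youtube.com/watch?src_vid=FpAvBOhDrYk (Part 2)

I followed everything in it, but I cannot get it to work. Here is the VB.NET code I am using: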

  1. Imports System.Text.RegularExpressions

    Public Class Form1

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Dim Request As System.Net.HttpWebRequest = System.Net.HttpWebRequest.Create("http://statigr.am/tag/anime")
        Dim response As System.Net.HttpWebResponse = Request.GetResponse
    
        Dim rs As System.IO.StreamReader = New System.IO.StreamReader(response.GetResponseStream())
    
        Dim rssourcecode As String = rs.ReadToEnd
    
        '<a href="/hannahotaku">hannahotaku</a>
    
        Dim r As New System.Text.RegularExpressions.Regex("<a href=""/.*"">hannahotaku</a>")
        Dim matches As MatchCollection = r.Matches(rssourcecode)
    
    
        For Each itemcode As Match In matches
            ListBox1.Items.Add(itemcode.Value.Split("""").GetValue(1))
    
        Next
    
    
    End Sub

    End Class
    
  2. As you can see, I am using the website Statigram. The source I am trying to scrape is:

    <a href="/hannahotaku">hannahotaku</a>
    

    Please let me know what I am doing wrong. I want to scrape the part that sits between the quotes of the href in:
    (<a href="/**whatever username here**"></a>)
    

2 answers:

Answer 0 (score: 0)

If you want to capture the whole link:

(<a href="\/.+?">hannahotaku<\/a>)

If you want to capture the username:

<a href="\/(.+?)">hannahotaku<\/a>

From what I can see, in VB.NET it would probably be:

<a href=""/(.+?)"">hannahotaku</a>

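The lazy quantifier (+?) makes the group match only as much as it needs and nothing extra, and the plus sign requires at least one character, so the username cannot be completely empty.

P.S. I am not very familiar with VB.NET, so let me know if some adaptation is needed.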
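A note on the lazy quantifier, offered as a sketch rather than a correction: a regex match begins at the leftmost position where the pattern can succeed, so when several links sit on one line, `.+?` can still start at an earlier link. The example below uses a made-up sample string and a hypothetical helper `FirstCapture` (neither comes from the original post) to compare the lazy pattern with a negated character class `[^"]+`, which is an alternative technique and not part of this answer.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module QuantifierDemo
    ' Hypothetical helper: returns group 1 of the first match,
    ' or an empty string if nothing matched.
    Function FirstCapture(ByVal source As String, ByVal pattern As String) As String
        Dim m As Match = Regex.Match(source, pattern)
        If m.Success Then Return m.Groups(1).Value
        Return String.Empty
    End Function

    Sub Main()
        ' Assumed sample markup: two links on the same line.
        Dim html As String = "<a href=""/alice"">alice</a> <a href=""/hannahotaku"">hannahotaku</a>"

        ' Lazy quantifier: the match still starts at the first <a href="/ on the line,
        ' so group 1 runs across the first link into the second one.
        Console.WriteLine(FirstCapture(html, "<a href=""/(.+?)"">hannahotaku</a>"))

        ' Negated class: group 1 cannot cross a double quote,
        ' so only the username of the matching link is captured.
        Console.WriteLine(FirstCapture(html, "<a href=""/([^""]+)"">hannahotaku</a>"))
    End Sub
End Module
```

With this sample, the first call captures text spanning both links, while the second captures only `hannahotaku`. When every link is on its own line, both patterns behave the same, because `.` does not match a newline by default.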

DEMO

Answer 1 (score: 0)

Use this regex instead:

"<div><div>([^<]+)</div>"

In the For loop, use itemcode.Groups(1).Value instead of itemcode.Value.Split("""").GetValue(1). This gives you the part between the div tags.

To retrieve the matches, try writing them to a file:
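To make the role of `Groups(1)` concrete, here is a minimal sketch. The sample markup and the helper name `CaptureAll` are assumptions made for illustration; the real page structure may well differ.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module GroupDemo
    ' Hypothetical helper: collects the text captured by group 1
    ' for every match of the pattern in the source string.
    Function CaptureAll(ByVal source As String, ByVal pattern As String) As List(Of String)
        Dim results As New List(Of String)
        For Each m As Match In Regex.Matches(source, pattern)
            ' Groups(0) is the whole match; Groups(1) is the first parenthesised part.
            results.Add(m.Groups(1).Value)
        Next
        Return results
    End Function

    Sub Main()
        ' Assumed sample markup, not taken from the actual site.
        Dim html As String = "<div><div>hannahotaku</div></div><div><div>animefan42</div></div>"

        For Each userName As String In CaptureAll(html, "<div><div>([^<]+)</div>")
            Console.WriteLine(userName)
        Next
    End Sub
End Module
```

Reading the capture group avoids depending on where quote characters happen to fall in the matched text, which is what the `Split` approach relies on.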

Imports System.IO
Imports System.Text.RegularExpressions

Public Class Form1

Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    ' Download the page source.
    Dim Request As System.Net.HttpWebRequest = System.Net.HttpWebRequest.Create("http://statigr.am/tag/anime")
    Dim response As System.Net.HttpWebResponse = Request.GetResponse

    Dim rs As System.IO.StreamReader = New System.IO.StreamReader(response.GetResponseStream())

    Dim rssourcecode As String = rs.ReadToEnd

    ' Group 1 captures the text between the div tags.
    Dim r As New System.Text.RegularExpressions.Regex("<div><div>([^<]+)</div>")
    Dim matches As MatchCollection = r.Matches(rssourcecode)

    ' Write one captured value per line; Using closes the file when the block ends.
    Using addInfo As StreamWriter = File.CreateText("c:\Textfile.txt")
        For Each itemcode As Match In matches
            addInfo.WriteLine(itemcode.Groups(1).Value)
        Next
    End Using

End Sub

End Class