I'm new to VB.NET 2008. What I want to do is extract some elements from the HTML below:
<TABLE>
<TR>
<TD></TD>
<TH scope="col">PAT. NO.</TH><TD></TD><TH scope="col">Title</TH>
</TR>
<TR>
<TD valign=top>
10
</TD>
<TD valign=top>
<A HREF=/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=10&p=1&f=G&l=50&d=PTXT&S1=*a&OS=*a&RS=*a>8,519,110</A>
</TD>
<TD valign=baseline>
<IMG border=0 src="/netaicon/PTO/ftext.gif" alt="Full-Text">
</TD>
<TD valign=top>
<A HREF=/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=10&p=1&f=G&l=50&d=PTXT&S1=*a&OS=*a&RS=*a>mRNA cap analogs</A>
</TD>
So I want my text box to display the following:
/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=10&p=1&f=G&l=50&d=PTXT&S1=*a&OS=*a&RS=*a
8,519,110
mRNA cap analogs
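The HTML markup above repeats for more table rows, and I want to get all of those rows. I have read that we can use "GetAttribute" to get HTML elements, but I want to extract only the specific parts shown above.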
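Answer 0 (score: 1)
Without understanding why you want to do this, it's a bit hard to give you a good solution.
I'll offer two options:
1) VB.NET - It's not clear how you are setting the attributes in the HTML. You should be able to do something like this (note: this is from my memory of VB.NET and handwritten here, not written in VS.NET):
HTML view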
<asp:HyperLink id="FirstLink" runat="server" />
...
Code-behind
FirstLink.NavigateUrl = yourUrlVariableHere
...
YourInputBox.Text = String.Concat(yourUrlVariableHere, yourOtherVariablesHere ...)
2) jQuery -
Basically, you want to get the attributes and then display them:
$(function(){
var anchor1 = $("#firstAnchor").attr("href");
var imageSrc = $("#my-image").attr("src");
$("#my-display").html(anchor1+ "<br/>" + imageSrc );
});
Full sample here
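If the markup is available as a string rather than as a live page where each element has its own id, the same "get the attribute, then display it" idea can be sketched in plain JavaScript. This is only a rough sketch under some assumptions: `extractLinks` and `sampleHtml` are hypothetical names, the sample URL is shortened for readability, and the regular expression handles only simple markup like the table in the question (quoted or unquoted `href` values, no nested tags inside the link text).

```javascript
// Hypothetical helper: returns [{ href, text }, ...] for each <a> element
// found in an HTML string. Regex-based, so it only suits simple markup.
function extractLinks(html) {
  // Captures a double-quoted, single-quoted, or unquoted href value,
  // followed by the text between <a ...> and </a> (case-insensitive).
  const anchorPattern =
    /<a\s+[^>]*?href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))[^>]*>([\s\S]*?)<\/a>/gi;
  const links = [];
  let match;
  while ((match = anchorPattern.exec(html)) !== null) {
    links.push({
      href: match[1] || match[2] || match[3] || "",
      text: match[4].trim(),
    });
  }
  return links;
}

// Shortened sample modeled on one row of the table in the question.
const sampleHtml = [
  "<TD valign=top>",
  "<A HREF=/netacgi/nph-Parser?Sect1=PTO2&r=10>8,519,110</A>",
  "</TD>",
  "<TD valign=top>",
  "<A HREF=/netacgi/nph-Parser?Sect1=PTO2&r=10>mRNA cap analogs</A>",
  "</TD>",
].join("\n");

const links = extractLinks(sampleHtml);
// One line for the URL, then one line per link text, as the question asks.
const displayLines = [links[0].href].concat(links.map((link) => link.text));
console.log(displayLines.join("\n"));
```

For a page that is already loaded in a browser, reading `href` through the DOM (as in the jQuery snippet above) is usually more robust than a regular expression.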
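Answer 1 (score: 1)
I have a routine that I use all the time to extract data from HTML tables (sorry, I can't credit the original author; I found this code and don't know where it came from). It takes the HTML as a string, parses the tables in it, and loads the cells into a DataSet.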
Imports System.Data
Imports System.Text.RegularExpressions

' Place this function inside a class (Shared members are not allowed in a Module).
Public Shared Function ConvertHtmlTablesToDataSet(html As String) As DataSet
    Dim ds As New DataSet()
    Const tableExpression As String = "<table[^>]*>(.*?)</table>"
    Const headerExpression As String = "<th[^>]*>(.*?)</th>"
    Const rowExpression As String = "<tr[^>]*>(.*?)</tr>"
    Const columnExpression As String = "<td[^>]*>(.*?)</td>"
    ' Singleline lets "." span newlines; IgnoreCase matches <TABLE> as well as <table>.
    Dim options As RegexOptions = RegexOptions.Singleline Or RegexOptions.IgnoreCase

    For Each table As Match In Regex.Matches(html, tableExpression, options)
        Dim dt As New DataTable()
        Dim iCurrentRow As Integer = 0
        ' Case-insensitive check, so upper-case <TH> cells are detected too.
        Dim headersExist As Boolean = Regex.IsMatch(table.Value, headerExpression, options)
        If headersExist Then
            ' Use the header cell contents as column names.
            For Each header As Match In Regex.Matches(table.Value, headerExpression, options)
                dt.Columns.Add(header.Groups(1).Value)
            Next
        Else
            ' No header cells: create one generic column per cell in the first row.
            Dim firstRow As Match = Regex.Match(table.Value, rowExpression, options)
            Dim columnCount As Integer = Regex.Matches(firstRow.Value, columnExpression, options).Count
            For iColumn As Integer = 1 To columnCount
                dt.Columns.Add("Column " & iColumn.ToString())
            Next
        End If
        Try
            For Each row As Match In Regex.Matches(table.Value, rowExpression, options)
                ' Skip the first row when it holds the headers.
                If Not (iCurrentRow = 0 AndAlso headersExist) Then
                    Dim dr As DataRow = dt.NewRow()
                    Dim iCurrentColumn As Integer = 0
                    For Each column As Match In Regex.Matches(row.Value, columnExpression, options)
                        dr(iCurrentColumn) = column.Groups(1).Value
                        iCurrentColumn += 1
                        ' Ignore cells beyond the number of known columns.
                        If iCurrentColumn = dt.Columns.Count Then Exit For
                    Next
                    dt.Rows.Add(dr)
                End If
                iCurrentRow += 1
            Next
            ds.Tables.Add(dt)
        Catch ex As Exception
            ' Surface the failure to the caller instead of halting with Stop.
            Throw New InvalidOperationException("Failed to parse an HTML table.", ex)
        End Try
    Next
    Return ds
End Function