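Answer 0 (score: 3)
HTML is a rich language, so you will almost certainly have to build a custom parser. You could use a Script Transformation as a data source that calls the Html Agility Pack to parse the source into a format suitable for further transformations and destinations. The Agility Pack supports HTML -> XML conversion, XPath, XSLT and so on, so you shouldn't have to write much custom code.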
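As a rough illustration of that approach, here is a minimal sketch that loads markup with the Html Agility Pack and pulls out the text of every table cell with XPath. It assumes the HtmlAgilityPack NuGet package is referenced; the `TableCellReader` class and `ReadCells` method are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // Assumption: the HtmlAgilityPack NuGet package is referenced.

public static class TableCellReader
{
    // Hypothetical helper: returns the trimmed text of every <td> in the markup.
    public static List<string> ReadCells(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var cells = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//td");
        if (nodes == null)
        {
            return cells;
        }

        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &amp; and trim surrounding whitespace.
            cells.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return cells;
    }
}
```

Inside a script component acting as a source, each returned value would typically be written to a row of the component's output buffer; the exact buffer name depends on how the outputs are configured.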
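Answer 1 (score: 1)
Steve Homer's Html Agility Pack answer is definitely worth investigating. I have never tried it myself, but the CodePlex description looks encouraging. That said, here is what I have done in the past with a C# Script Task to scrape an HTML page and return status codes from an intranet web page: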
    // Requires: using System.Text.RegularExpressions;
    static bool GetTextBetweenTextBlocks(string input_expression, string left_text, string right_text, out string matched_text)
    {
        // Define the regular expression that needs to be found.
        // Regex.Escape makes the delimiters match literally, so characters such as
        // '?' and '.' inside them are not interpreted as regex syntax.
        string regex_find = Regex.Escape(left_text) + "(?'text'.*?)" + Regex.Escape(right_text);

        // Match the string.
        Match string_output = Regex.Match(input_expression, regex_find);

        // Output results. The named group holds only the text between the delimiters.
        if (string_output.Success)
        {
            matched_text = string_output.Groups["text"].Value;
            return true;
        }

        matched_text = "";
        return false;
    }
    // Requires: using System; using System.IO; using System.Net;
    public void Main()
    {
        // Declare variables.
        int CaseSensitiveVariable = Convert.ToInt32(Dts.Variables["CaseSensitiveVariableFromPackage"].Value.ToString());
        string Internal_URL = "http://www.MySite.com/SomeWebPage.asp?cn=" + CaseSensitiveVariable.ToString("X");
        Boolean fireAgainFlag = true;
        Boolean StatusIWantToCheck = false;
        string SomethingIWantToCheck = "";

        // Try-Catch block.
        try
        {
            // The WebRequest.
            HttpWebRequest oWebrequest = (HttpWebRequest)WebRequest.Create(Internal_URL);
            oWebrequest.Credentials = System.Net.CredentialCache.DefaultCredentials;
            oWebrequest.UserAgent = "My SSIS Server Name";
            oWebrequest.Method = "POST";
            oWebrequest.Timeout = (1000 * 60 * 10); // 10 minutes, in milliseconds.
            oWebrequest.ProtocolVersion = HttpVersion.Version10;

            // The WebResponse. The using blocks dispose the response and reader.
            String sReturnString;
            using (HttpWebResponse oWResponse = (HttpWebResponse)oWebrequest.GetResponse())
            using (StreamReader sr = new StreamReader(oWResponse.GetResponseStream()))
            {
                sReturnString = sr.ReadToEnd();
            }

            // Parse the response for the section of interest. The status flag is set
            // to true further down if one of the expected values is found in it.
            bool includes_what_I_want_to_check = GetTextBetweenTextBlocks(sReturnString.Replace("\n", ""), "<td>Is it there? Let's check for this.</td>", "</td>", out SomethingIWantToCheck);

            if (includes_what_I_want_to_check)
            {
                // Log what I want to check to the SSIS Events Log.
                Dts.Events.FireInformation(0, "Something I Want To Check", SomethingIWantToCheck, "", 0, ref fireAgainFlag);

                // Case-insensitive check for either expected value.
                if (SomethingIWantToCheck.IndexOf("Do I have this value?", StringComparison.OrdinalIgnoreCase) >= 0
                    || SomethingIWantToCheck.IndexOf("Or Maybe I have this value?", StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    StatusIWantToCheck = true;
                }
            }
            else
            {
                // Log response and fail. Return so the success result at the end
                // of Main does not overwrite the failure.
                Dts.Events.FireError(0, "I could not find what I wanted in the Web Response", sReturnString.Replace("\n", ""), "", 0);
                Dts.TaskResult = (int)ScriptResults.Failure;
                return;
            }
        }
        catch (WebException e)
        {
            Dts.Events.FireError(0, "WebException", e.Message, "", 0);
        }

        // Log variable and write value to the package variable.
        Dts.Events.FireInformation(0, "Status I Want to Check", StatusIWantToCheck.ToString(), "", 0, ref fireAgainFlag);
        Dts.Variables["StatusIWantToCheck"].Value = StatusIWantToCheck;

        // Return success.
        Dts.TaskResult = (int)ScriptResults.Success;
    }
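OK. The code block above is full of things you may or may not want. It performs an HTTP POST to a web page, reads the response, searches the text for a specific block, and handles the relevant data with if/else logic. It also includes examples of writing variable values out to the package to track what is going on. I rely on logging to troubleshoot errors, especially while I am tweaking the code. The Script Task is also set to fail if certain blocks of text are not found.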
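To experiment with the extract-then-check step outside of SSIS, something like the following console sketch may help. It is only an illustration: the `DelimitedTextDemo` class, the `TryExtractBetween` and `ContainsIgnoreCase` helpers, and the sample markup are all made up for this example, and only types from the .NET base class library are used.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimitedTextDemo
{
    // Hypothetical helper: finds the text between two literal delimiters.
    public static bool TryExtractBetween(string input, string left, string right, out string value)
    {
        // Regex.Escape makes '?', '.', and similar characters match literally.
        string pattern = Regex.Escape(left) + "(?<text>.*?)" + Regex.Escape(right);

        // Singleline lets '.' match newlines, so the input needs no pre-cleaning.
        Match match = Regex.Match(input, pattern, RegexOptions.Singleline);
        value = match.Success ? match.Groups["text"].Value : "";
        return match.Success;
    }

    // Hypothetical helper: case-insensitive substring test.
    public static bool ContainsIgnoreCase(string text, string expected)
    {
        return text.IndexOf(expected, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    public static void Main()
    {
        // Made-up response body standing in for the intranet page.
        string html = "<tr><td>Plan?</td><td>Enterprise</td></tr>";

        string value;
        if (TryExtractBetween(html, "<td>Plan?</td><td>", "</td>", out value))
        {
            Console.WriteLine(value);                                    // prints "Enterprise"
            Console.WriteLine(ContainsIgnoreCase(value, "enterprise"));  // prints "True"
        }
        else
        {
            Console.WriteLine("Marker not found");
        }
    }
}
```

Keeping the extraction and the comparison in small pure functions like these makes them easy to check in isolation before wiring them into the Script Task's logging and result handling.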