Answer 0 (score: 3)
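The question is quite vague, but this may get you started.
HTML is a rich language, so you will almost certainly have to build a custom parser. You can use a script transformation acting as a data source that calls the HTML Agility Pack to parse the page into a format suitable for further transformations or destinations. The Agility Pack supports HTML-to-XML conversion, XPath, XSLT, and so on, so you should not have to write much custom code.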
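As a rough illustration of the Agility Pack approach, the sketch below loads an HTML string and pulls the text of every table cell with XPath. It assumes the HtmlAgilityPack NuGet package is referenced. The class name `CellExtractor`, the method name `GetCellTexts`, and the sample markup are made up for illustration; inside SSIS you would call something like this from a script component rather than from a standalone class.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, assumed to be referenced

// Hypothetical helper, not part of SSIS or the Agility Pack itself.
public static class CellExtractor
{
    // Returns the trimmed inner text of every <td> element in the given HTML.
    public static List<string> GetCellTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes("//td");
        if (cells == null)
        {
            return results;
        }

        foreach (HtmlNode cell in cells)
        {
            // DeEntitize turns entities such as &amp; back into plain characters.
            results.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim());
        }
        return results;
    }
}
```

Each returned string could then be written to an output column, so the rest of the data flow never has to deal with raw markup.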
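Answer 1 (score: 1)
Steve Homer's HTML Agility Pack answer is definitely worth investigating. I have never tried it myself, but the description on CodePlex looks encouraging. That said, here is what I have done in the past to scrape an HTML page with a C# Script Task and return a status code from an intranet web page: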
// Requires: using System.Text.RegularExpressions;
static bool GetTextBetweenTextBlocks(string input_expression, string left_text, string right_text, out string matched_text)
{
    // Build the pattern. Both delimiters are escaped so that characters such as
    // '?' and '.' are matched literally instead of acting as regex operators.
    string regex_find = Regex.Escape(left_text) + "(?<text>.*?)" + Regex.Escape(right_text);

    // Find the first match; the lazy quantifier stops at the nearest right_text.
    Match match = Regex.Match(input_expression, regex_find);

    // Return the captured text, or an empty string when nothing matched.
    if (match.Success)
    {
        matched_text = match.Groups["text"].Value;
        return true;
    }

    matched_text = "";
    return false;
}
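This function returns the first occurrence of a text string that appears between two other text strings. You can replace it with something more useful to suit your specific needs.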
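When delimiter strings are concatenated into a regular expression, characters such as `?` and `.` act as regex operators unless they are escaped, so a delimiter like "Is it there?" would not match its own literal text. The sketch below is a standalone variant that escapes both delimiters with `Regex.Escape` and reads the result from a named group. The names `TextBetween` and `TryExtract` are hypothetical and exist only for this example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical standalone helper for illustration.
public static class TextBetween
{
    // Extracts the first run of text found between 'left' and 'right'.
    // Returns false and an empty string when no match exists.
    public static bool TryExtract(string input, string left, string right, out string between)
    {
        // Escape the delimiters so they are treated as literal text.
        string pattern = Regex.Escape(left) + "(?<text>.*?)" + Regex.Escape(right);

        // Singleline lets '.' span line breaks, so the input need not be flattened first.
        Match match = Regex.Match(input, pattern, RegexOptions.Singleline);

        between = match.Success ? match.Groups["text"].Value : "";
        return match.Success;
    }
}
```

A call such as `TextBetween.TryExtract(html, "<td>Label?</td><td>", "</td>", out value)` would then hand back the contents of the cell that follows the label cell, provided the markup is laid out exactly that way.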
// Requires: using System; using System.IO; using System.Net;
public void Main()
{
    // Declare variables.
    int CaseSensitiveVariable = Convert.ToInt32(Dts.Variables["CaseSensitiveVariableFromPackage"].Value.ToString());
    string Internal_URL = "http://www.MySite.com/SomeWebPage.asp?cn=" + CaseSensitiveVariable.ToString("X");
    bool fireAgainFlag = true;
    bool StatusIWantToCheck = false;
    string SomethingIWantToCheck = "";
    // Assume success; any failure path below overrides this.
    int taskResult = (int)ScriptResults.Success;

    try
    {
        // The WebRequest.
        HttpWebRequest oWebrequest = (HttpWebRequest)WebRequest.Create(Internal_URL);
        oWebrequest.Credentials = CredentialCache.DefaultCredentials;
        oWebrequest.UserAgent = "My SSIS Server Name";
        oWebrequest.Method = "POST";
        oWebrequest.Timeout = (1000 * 60 * 10); // 10 minutes, in milliseconds
        oWebrequest.ProtocolVersion = HttpVersion.Version10;

        // The WebResponse. The using blocks release the response and reader
        // even if reading throws.
        string sReturnString;
        using (HttpWebResponse oWResponse = (HttpWebResponse)oWebrequest.GetResponse())
        using (StreamReader sr = new StreamReader(oWResponse.GetResponseStream()))
        {
            sReturnString = sr.ReadToEnd();
        }

        // Strip newlines once so the search and the error log use the same text.
        string flattened = sReturnString.Replace("\n", "");

        // Look for the text that follows the label cell.
        bool includes_what_I_want_to_check = GetTextBetweenTextBlocks(flattened, "<td>Is it there? Let's check for this.</td>", "</td>", out SomethingIWantToCheck);
        if (includes_what_I_want_to_check)
        {
            // Log what I want to check to the SSIS Events Log.
            Dts.Events.FireInformation(0, "Something I Want To Check", SomethingIWantToCheck, "", 0, ref fireAgainFlag);

            // Compare case-insensitively; lowercasing only one side would never match.
            if (SomethingIWantToCheck.IndexOf("Do I have this value?", StringComparison.OrdinalIgnoreCase) >= 0
                || SomethingIWantToCheck.IndexOf("Or Maybe I have this value?", StringComparison.OrdinalIgnoreCase) >= 0)
            {
                StatusIWantToCheck = true;
            }
        }
        else
        {
            // Log the response and mark the task as failed.
            Dts.Events.FireError(0, "I could not find what I wanted in the Web Response", flattened, "", 0);
            taskResult = (int)ScriptResults.Failure;
        }
    }
    catch (WebException e)
    {
        // Log the error and mark the task as failed.
        Dts.Events.FireError(0, "WebException", e.Message, "", 0);
        taskResult = (int)ScriptResults.Failure;
    }

    // Log variable and write value to the package variable.
    Dts.Events.FireInformation(0, "Status I Want to Check", StatusIWantToCheck.ToString(), "", 0, ref fireAgainFlag);
    Dts.Variables["StatusIWantToCheck"].Value = StatusIWantToCheck;

    // Report the outcome recorded above.
    Dts.TaskResult = taskResult;
}
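OK. The code block above is full of things you may or may not want. It performs an HTTP POST to a web page, reads the response, searches the text for a specific block, and handles the relevant data with an if/else clause. It also includes examples of writing variable values out to the package log so you can trace what is happening. I rely on logging to troubleshoot errors, especially while I am tweaking the code. The Script Task is also set to fail if certain text blocks cannot be found.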
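For the if/else step, keep in mind that lowercasing the scraped text and then comparing it against mixed-case literals can never match. A small sketch of a case-insensitive check follows. The class `StatusCheck`, the method `ContainsAny`, and the sample phrases are illustrative assumptions, not part of the Script Task API.

```csharp
using System;

// Hypothetical helper for illustration.
public static class StatusCheck
{
    // Returns true when 'text' contains at least one of the phrases,
    // ignoring differences in letter case.
    public static bool ContainsAny(string text, params string[] phrases)
    {
        foreach (string phrase in phrases)
        {
            if (text.IndexOf(phrase, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return true;
            }
        }
        return false;
    }
}
```

The result of a call like `StatusCheck.ContainsAny(scrapedText, "enterprise", "pro shipper")` could be assigned directly to the package variable that holds the status flag.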
Good luck with whatever solution you try to implement. If you have any questions about this snippet, let me know.