HtmlAgilityPack: searching for a text string across nodes

Time: 2016-02-23 21:52:54

Tags: c# search text html-agility-pack

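I want to be able to search an HTML document scraped from a URL and verify that the page contains a specific piece of text. Both the text and the URL are supplied by the user and can vary. I fetch the URL with an HttpWebRequest:
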
string quote = txtQuote.Text;
string sourceURL = txtURL.Text;
string data = string.Empty; // declared here so it is still in scope after the if block

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(sourceURL);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

if (response.StatusCode == HttpStatusCode.OK)
{
    Stream receiveStream = response.GetResponseStream();
    StreamReader readStream = null;

    if (response.CharacterSet == null)
    {
        readStream = new StreamReader(receiveStream);
    }
    else
    {
        readStream = new StreamReader(receiveStream,
            Encoding.GetEncoding(response.CharacterSet));
    }

    data = readStream.ReadToEnd();

    response.Close();
    readStream.Close();
}

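I also keep a list of HTML entities and their various possible encodings in my database. I load it into a DataTable so that I can convert any encoding to the standard HTML entity and replace non-breaking spaces with ordinary spaces: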

DataTable encodings = new DataTable();
string getEncodings = "select * from htmlentities";
SqlCommand cmdGetEncodings = new SqlCommand(getEncodings, dbcon);
encodings.Load(cmdGetEncodings.ExecuteReader());
dbcon.Close();

foreach (DataRow row in encodings.Rows)
{
    string htmlentity = row[1].ToString(); // standard named entity
    string deccode = row[2].ToString();    // decimal form of the same character
    string hexcode = row[3].ToString();    // hexadecimal form of the same character

    data = data.Replace(deccode, htmlentity);
    data = data.Replace(hexcode, htmlentity);
    data = data.Replace("&nbsp;", " "); // non-breaking space becomes an ordinary space
}

Then I use HtmlAgilityPack to load the scraped and modified HTML into a new document and retrieve its inner text:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(data);

HtmlNode root = doc.DocumentNode;
string innerText = root.InnerText;

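Now I would like to know: what is the best way to verify accurately that the quote is contained in innerText? One approach I tried is:

if (innerText.IndexOf(quote) != -1)
{
    Label1.Text = "found";
}
else
{
    Label1.Text = "not found";
}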

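But this is not accurate: it fails to find text in innerText that spans nodes (for example, text spread over several <p> elements). An example quote and URL that return "not found":

"The nimble cover-point fielding of his youth had been reduced to a standing position, stopping only those balls that were hit close enough to strike him directly," is how Charlie Connelly puts it in Gilbert, his wonderful novel about Grace's life. "Throughout Australia's first innings he was acutely aware of the jeers from the crowd whenever the ball went past him." At the end of the match, which England drew thanks to Ranjitsinhji's 93, Grace told Jackson: "This is it, Jack, I won't play again." Then there was Don Bradman. The story is so famous it hardly needs retelling. "I dearly wanted to do well," Bradman admitted. He was bowled second ball by Eric Hollies, "a perfect-length googly" that just touched the inside edge of his bat and then knocked off the bail. If he had scored just four, his average would have been 100.

URL: http://www.theguardian.com/sport/2016/feb/23/test-cricket-farewells-brendon-mccullum

However, if I search for the first paragraph only:

"The nimble cover-point fielding of his youth had been reduced to a standing position, stopping only those balls that were hit close enough to strike him directly," is how Charlie Connelly puts it in Gilbert, his wonderful novel about Grace's life. "Throughout Australia's first innings he was acutely aware of the jeers from the crowd whenever the ball went past him." At the end of the match, which England drew thanks to Ranjitsinhji's 93, Grace told Jackson: "This is it, Jack, I won't play again."

it returns "found". Is there a way to check for the text even when it spans nodes?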

1 answer:

Answer 0: (score: 1)

So, if you only plan to scrape http://www.theguardian.com, here is a simple solution, since the Guardian's HTML is very clean.

var hdoc = new HtmlDocument();
hdoc.LoadHtml(data); // LoadHtml parses an HTML string; use hdoc.Load(...) if you have a file path or stream instead
var articleNodes = hdoc.DocumentNode.SelectNodes(@"//p"); // the 'p' nodes contain the article text
var quote = "my quote";
var article = string.Empty;
if (articleNodes != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode node in articleNodes)
    {
        article += node.InnerText + " "; // add a space so adjacent paragraphs do not run together
    }
}

return article.Contains(quote);

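Now, if you plan to do this for any given URL, it gets trickier.
Since you do not know the HTML layout of those URLs, the "best" solution (by best I mean the simplest and most worthwhile) is the following: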

var hdoc = new HtmlDocument();
hdoc.LoadHtml(data); // LoadHtml parses an HTML string; use hdoc.Load(...) if you have a file path or stream instead
var articleNodes = hdoc.DocumentNode;
var quote = "my quote";

// Start with the text of the whole document. (Iterating over InnerText itself
// would loop over single characters, so it is appended once instead.)
var text = articleNodes.InnerText + " "; // add a space so we do not mess up the text

foreach (var htmlNode in articleNodes.ChildNodes)
{
    text += htmlNode.InnerText + " ";

    foreach (var childNode in htmlNode.ChildNodes)
    {
        text += childNode.InnerText + " ";

        foreach (var childrensChildren in childNode.ChildNodes)
        {
            text += childrensChildren.InnerText + " ";
        }
    }
}

return text.Contains(quote);

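Ultimately, because you do not know the HTML of the URLs you are given, the number of nested foreach statements may have to grow or shrink. You also need some null checks on the nodes before running any foreach statement. There may be better solutions; this is just my 2 cents.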
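As a rough sketch of how the fixed nesting depth could be avoided: HtmlAgilityPack's Descendants() walks the tree to any depth, so the text nodes can be collected in one pass. The class name PageText, the method name Extract, and the list of skipped tags are my own assumptions for illustration, not part of the library. Note that joining with a space (as in the code above) also puts a space between inline fragments such as "word" and a following "!", so the comparison afterwards should tolerate whitespace differences.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper (name and shape are assumptions, not library API).
public static class PageText
{
    // Elements whose text is not part of the visible article (assumed list).
    private static readonly HashSet<string> Skipped =
        new HashSet<string> { "script", "style", "noscript" };

    // Collects the text nodes at any depth into one space-separated string.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var pieces = doc.DocumentNode
            .Descendants()                                   // every node below the root, any depth
            .Where(n => n.NodeType == HtmlNodeType.Text)     // keep text nodes only
            .Where(n => n.ParentNode == null
                        || !Skipped.Contains(n.ParentNode.Name))
            .Select(n => HtmlEntity.DeEntitize(n.InnerText)) // "&amp;" becomes "&", and so on
            .Where(t => !string.IsNullOrWhiteSpace(t));      // drop whitespace-only nodes

        return string.Join(" ", pieces);
    }
}
```

Calling PageText.Extract(data) would then stand in for the nested loops, whatever the depth of the page's markup.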

Working example: this returns true. I copied and pasted part of the article into the quote variable and checked whether our article string contains it.

string urlAddress = "http://www.theguardian.com/sport/2016/feb/23/test-cricket-farewells-brendon-mccullum";

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        string data = string.Empty;
        if (response.StatusCode == HttpStatusCode.OK)
        {
            Stream receiveStream = response.GetResponseStream();
            StreamReader readStream = null;

            if (response.CharacterSet == null)
            {
                readStream = new StreamReader(receiveStream);
            }
            else
            {
                readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
            }

            data = readStream.ReadToEnd();

            response.Close();
            readStream.Close();
        }

        var hdoc = new HtmlDocument();
        hdoc.LoadHtml(data); 
        var articleNodes = hdoc.DocumentNode.SelectNodes(@"//p"); // the 'p' nodes contain the article text
        var quote ="Sinatra couldn’t stand the song. His daughter Tina once said that her father thought it was “self-serving and self-indulgent”. By the end of the ’70s he was in the habit of introducing it by explaining how little he liked it. “I hate this song. I hate this song!” he said before performing it at Atlantic City in 1979. “I got it up to here, this goddamn song.” Of course when Sinatra died, pretty much every single TV and radio news show played him out with My Way, “the most obvious, ";
        var article = string.Empty;
        if (articleNodes != null) // SelectNodes returns null when no <p> elements match
        {
            foreach (HtmlNode node in articleNodes)
            {
                article += node.InnerText + " "; // add a space so adjacent paragraphs do not run together
            }
        }

        bool containsQuote = article.Contains(quote); // true if the quote is in the article
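
One likely reason a quote that spans paragraphs is reported as missing is that the pasted quote and the extracted text differ only in whitespace (a line break in one, a space or nothing in the other) or in entity encoding. Below is a minimal sketch of a comparison that ignores those differences by decoding entities and removing all whitespace from both sides. The names QuoteMatcher, Squash and ContainsQuote are hypothetical; only WebUtility.HtmlDecode and Regex come from the framework. I have not verified that this is the cause for the page in the question.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper (names are assumptions for illustration).
public static class QuoteMatcher
{
    // Any run of whitespace, with the non-breaking space listed explicitly.
    private static readonly Regex Whitespace = new Regex(@"[\s\u00A0]+");

    // Decodes HTML entities and strips all whitespace, so "a.&nbsp;b",
    // "a.\n\nb" and "a.b" all reduce to the same string.
    public static string Squash(string text)
    {
        string decoded = WebUtility.HtmlDecode(text ?? string.Empty);
        return Whitespace.Replace(decoded, string.Empty);
    }

    // True when the quote occurs in the page text, ignoring whitespace and
    // entity differences. An empty or whitespace-only quote never matches.
    public static bool ContainsQuote(string pageText, string quote)
    {
        string needle = Squash(quote);
        return needle.Length > 0 && Squash(pageText).Contains(needle);
    }
}
```

With this, QuoteMatcher.ContainsQuote(article, quote) could replace article.Contains(quote). The trade-off is that word boundaries are ignored, which in practice should matter very little for quotes of sentence length.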