Question

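I have a search bar on my website that finds all pages on the site containing a given keyword. It works by querying the Indexing Server catalog.

My problem is this: suppose I search for the term "ASP.NET" and, say, 3 pages contain occurrences of "ASP.NET".

I want to display the line in which the keyword "ASP.NET" was found, so that the user gets some context.

Can anyone help me with this? It's really urgent. Thanks in advance!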

Answer 1

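Use System.Xml.Linq to read the page into an XDocument. Query the XDocument for the text with LINQ, then take the XElement that is returned and interrogate that element further.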
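A minimal sketch of this idea follows. It assumes the page is well-formed XHTML, because `XDocument.Parse` throws an `XmlException` on markup that is not valid XML, which much real-world HTML is not. The `XhtmlSearch` class, the `FindLeafElements` helper, and the sample markup are hypothetical names invented for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class XhtmlSearch
{
    // Hypothetical helper: returns the innermost elements (those with no child
    // elements) whose text contains the keyword, ignoring case.
    public static List<XElement> FindLeafElements(string xhtml, string keyword)
    {
        XDocument doc = XDocument.Parse(xhtml); // throws XmlException if not well-formed
        return doc.Descendants()
            .Where(e => !e.HasElements &&
                        e.Value.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            .ToList();
    }
}

class XhtmlSearchDemo
{
    static void Main()
    {
        // Sample markup standing in for a downloaded page (an assumption).
        const string xhtml =
            "<html><body>" +
            "<p>ASP.NET is a web framework.</p>" +
            "<p>This paragraph is unrelated.</p>" +
            "<div><span>Hosting asp.net apps on IIS.</span></div>" +
            "</body></html>";

        foreach (XElement hit in XhtmlSearch.FindLeafElements(xhtml, "ASP.NET"))
            Console.WriteLine("<{0}>: {1}", hit.Name.LocalName, hit.Value.Trim());
    }
}
```

Restricting the query to leaf elements keeps the result to the smallest enclosing tag; without that filter, every ancestor up to the root would match as well, since `XElement.Value` concatenates the text of all descendants.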

Answer 2

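Try parsing the document, finding the occurrences of the search term, and then extracting the surrounding text. You could do this by taking all the text within the same tag, or all the text within the same sentence. Regular expressions can handle either.

What works best depends on your needs and on how your content is structured. You could also include the surrounding sentences so that the extracted text reaches a minimum length.

Here is an example that tries to extract the sentences containing the word "question" from this very question. It is by no means perfect, but it illustrates the concept and should get you started: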

using System;
using System.Net;
using System.Text.RegularExpressions;

class Program
{
    private const string url =
        "http://stackoverflow.com/questions/1655313/get-the-static-text-contents-of-a-web-page";
    private const string keyword = "question";

    // Matches a run of text between two tags that contains the keyword ({0}).
    private const string regexTemplate = ">([^<>]*?{0}[^<>]*?)<";

    static void Main(string[] args)
    {
        string html;
        using (WebClient client = new WebClient())
        {
            html = client.DownloadString(url);
        }

        // Escape the keyword so that characters such as '.' in "ASP.NET" match literally.
        Regex regex = new Regex(
            string.Format(regexTemplate, Regex.Escape(keyword)), RegexOptions.IgnoreCase);

        // Group 1 holds the text between the tags, without the '>' and '<'.
        foreach (Match match in regex.Matches(html))
            Console.WriteLine(match.Groups[1].Value);
    }
}
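The sentence-based variant mentioned above, with neighbouring sentences added until a minimum length is reached, could be sketched roughly as follows. It assumes the tags have already been stripped and that splitting after '.', '!' or '?' followed by whitespace is good enough for the content at hand. `SnippetExtractor`, `Extract`, and `minLength` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SnippetExtractor
{
    // Returns one snippet per sentence containing the keyword, widened with
    // neighbouring sentences until it is at least minLength characters long
    // or there are no more sentences to add.
    public static List<string> Extract(string text, string keyword, int minLength)
    {
        // Naive sentence split: break at whitespace that follows '.', '!' or '?'.
        string[] sentences = Regex.Split(text.Trim(), @"(?<=[.!?])\s+");
        var snippets = new List<string>();

        for (int i = 0; i < sentences.Length; i++)
        {
            if (sentences[i].IndexOf(keyword, StringComparison.OrdinalIgnoreCase) < 0)
                continue;

            int start = i, end = i;
            int length = sentences[i].Length;

            // Grow the window, first forwards and then backwards, one sentence at a time.
            while (length < minLength && (start > 0 || end < sentences.Length - 1))
            {
                if (end < sentences.Length - 1)
                {
                    end++;
                    length += sentences[end].Length + 1; // +1 for the joining space
                }
                if (length < minLength && start > 0)
                {
                    start--;
                    length += sentences[start].Length + 1;
                }
            }

            snippets.Add(string.Join(" ", sentences, start, end - start + 1));
        }
        return snippets;
    }
}

class SnippetDemo
{
    static void Main()
    {
        // Sample plain text standing in for the tag-stripped page content (an assumption).
        const string text =
            "I have a search bar. It queries the index. ASP.NET hosts the site. Results list pages.";

        foreach (string snippet in SnippetExtractor.Extract(text, "ASP.NET", 40))
            Console.WriteLine(snippet);
    }
}
```

Because the keyword is located with `IndexOf` rather than a regular expression, it needs no escaping here. The split does not break "ASP.NET" apart, since the dot inside it is not followed by whitespace.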

Get the static text contents of a web page
