Question

Imagine a very large HTML file that, naturally, contains a lot of HTML tags. I cannot load the whole file into memory.

My goal is to extract the indexes of every <p> and </p> string in the file. How should I implement this? Please give me some pointers.

Answer 1

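Using a file stream, you should be able to load the file in chunks of a few KB. As you load each chunk, keep an index of the current file position. Scan the chunk for the strings you are looking for, and add each match's offset within the chunk to that index. Keep a list of all the indexes you find.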
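The chunked scan described above can be sketched roughly as follows. This is only an illustration under some assumptions: the names `ChunkScanner` and `FindOffsets` are made up for this example, the pattern is treated as plain ASCII bytes, and matching is exact and case-sensitive (so `<p class="x">` or `<P>` would not be found). The one detail the description leaves out is that a match may straddle two chunks, so the sketch carries the last few bytes of each chunk over into the next one.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ChunkScanner
{
    // Returns the byte offset of every occurrence of `pattern` in `stream`,
    // reading at most `chunkSize` new bytes at a time.
    public static List<long> FindOffsets(Stream stream, string pattern, int chunkSize = 4096)
    {
        byte[] needle = Encoding.ASCII.GetBytes(pattern);
        if (needle.Length == 0) throw new ArgumentException("pattern must not be empty");

        int overlap = needle.Length - 1;              // bytes carried across chunk boundaries
        byte[] buffer = new byte[chunkSize + overlap];
        var offsets = new List<long>();
        int carried = 0;                              // bytes kept from the previous chunk
        long bufferStart = 0;                         // file offset of buffer[0]

        int read;
        while ((read = stream.Read(buffer, carried, chunkSize)) > 0)
        {
            int available = carried + read;

            // Naive search over the bytes currently in the buffer.
            for (int i = 0; i + needle.Length <= available; i++)
            {
                int j = 0;
                while (j < needle.Length && buffer[i + j] == needle[j]) j++;
                if (j == needle.Length) offsets.Add(bufferStart + i);
            }

            // Keep the tail so a match split across two chunks is still found.
            int keep = Math.Min(overlap, available);
            Array.Copy(buffer, available - keep, buffer, 0, keep);
            bufferStart += available - keep;
            carried = keep;
        }
        return offsets;
    }
}
```

For a real file one would pass `File.OpenRead(path)` and call the method once for `"<p>"` and once for `"</p>"`. Only one buffer of roughly `chunkSize` bytes is held in memory at any time; the list of offsets is the only thing that grows with the file.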

Answer 2

Answer 3

An example using a file stream:

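using System;
using System.Collections.Generic;
using System.IO;

/// <summary>
/// Get a collection of index,string for everything inside p tags in the html file
/// </summary>
/// <param name="htmlFilename">filename of the html file</param>
/// <returns>collection of index,string</returns>
private Dictionary<long, string> GetHtmlIndexes(string htmlFilename)
{
    //init result
    Dictionary<long, string> result = new Dictionary<long, string>();

    try
    {
        //'using' closes the reader even if an exception is thrown
        using (StreamReader sr = new StreamReader(htmlFilename))
        {
            //characters consumed before the current line; ReadLine strips the
            //line terminators, so they are NOT counted here
            long offsetIndex = 0;
            string line;
            while ((line = sr.ReadLine()) != null) //assuming html isn't condensed into 1 single line
            {
                //note: "<p" also matches tags such as <pre>; only the first match per line is handled
                int openingIndex = line.IndexOf("<p", StringComparison.OrdinalIgnoreCase);
                //end of the opening tag, searched from the tag start so an earlier '>' is ignored
                int tagEndIndex = openingIndex > -1 ? line.IndexOf('>', openingIndex) : -1;
                if (tagEndIndex > -1)
                {
                    int contentIndex = tagEndIndex + 1; //first character after <p> or <p ...>
                    int closingIndex = line.IndexOf("</p>", contentIndex, StringComparison.OrdinalIgnoreCase);

                    //if there is no </p> on this line, the tag finishes on a subsequent line
                    //and we only get the content from this line
                    string pTagContent = closingIndex > -1
                        ? line.Substring(contentIndex, closingIndex - contentIndex)
                        : line.Substring(contentIndex);

                    //assuming the 'index' you require is the offset of the content in the file
                    result.Add(offsetIndex + contentIndex, pTagContent);
                }

                //advance only after the line has been processed
                offsetIndex += line.Length;
            } //end file loop
        }
    }
    catch (Exception ex)
    {
        //handle error ex (log or rethrow rather than silently swallowing it)
    }

    return result;
}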

This has some limitations, as you can see from the comments. I suspect it would be neater using LINQ. I hope this gives you a starting point.

Answer 4

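If your HTML is pure XHTML, you can treat it as an XML document. Load the XHTML into a System.Xml.XmlDocument, then use the GetElementsByTagName("p") method to return a list of the <p> tags. This is safer and easier than parsing the HTML directly.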
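Note that XmlDocument builds the whole document in memory, which the question rules out. As a sketch of the same idea in streaming form, the example below swaps in System.Xml.XmlReader (not the XmlDocument call named above) and reports the line and column of each <p> start and end tag through IXmlLineInfo. The names `XhtmlParagraphs` and `FindParagraphTags` are invented for this example, and it assumes the input really is well-formed XML; ordinary HTML will make the reader throw. Positions are line/column pairs, not byte offsets.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XhtmlParagraphs
{
    // Streams through the XML and records where each <p> start tag and </p> end tag occurs.
    // Line and Column are 0 when the reader cannot supply position information.
    public static List<(int Line, int Column, bool IsEnd)> FindParagraphTags(TextReader input)
    {
        var found = new List<(int Line, int Column, bool IsEnd)>();

        // Ignore a DOCTYPE instead of throwing; entities declared in a DTD are not resolved.
        var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Ignore };

        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            var lineInfo = reader as IXmlLineInfo;
            while (reader.Read())
            {
                bool isStart = reader.NodeType == XmlNodeType.Element;
                bool isEnd = reader.NodeType == XmlNodeType.EndElement;
                if ((isStart || isEnd) && reader.LocalName == "p")
                {
                    bool hasInfo = lineInfo != null && lineInfo.HasLineInfo();
                    int line = hasInfo ? lineInfo.LineNumber : 0;
                    int column = hasInfo ? lineInfo.LinePosition : 0;
                    found.Add((line, column, isEnd));
                }
            }
        }
        return found;
    }
}
```

With a file, one would pass `new StreamReader(path)`. A self-closing `<p/>` produces only a start entry, since the reader reports no separate end element for it.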

Answer 5

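I would start by creating an HTML tokenizer; with IEnumerable, yield return and so on, that would be straightforward. It could read the file char-by-char using StreamReader.Read, and a state machine built on a switch would track the current state and produce a sequence of tokens or Tuples.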

I found an old HTML tokenizer here (part of Chris Anderson's old BlogX blog engine) that could be adapted to serve as the basis of a streamable solution to the problem.
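A minimal sketch of such a tokenizer might look like the following. It is not the BlogX tokenizer mentioned above; `TagTokenizer` and `ReadTags` are hypothetical names, and the state machine has only two states, so it does not cope with a `>` inside an attribute value, with comments, or with script blocks. Offsets are counted in characters read from the reader, which equals the byte offset only for single-byte encodings.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TagTokenizer
{
    private enum State { Text, InTag }

    // Lazily yields (offset, tag text) for every tag, reading one character at a time.
    public static IEnumerable<(long Offset, string Tag)> ReadTags(TextReader reader)
    {
        var state = State.Text;
        var tag = new StringBuilder();
        long position = 0;   // index of the character currently being processed
        long tagStart = 0;   // index of the '<' that opened the current tag

        int c;
        while ((c = reader.Read()) != -1)
        {
            char ch = (char)c;
            switch (state)
            {
                case State.Text:
                    if (ch == '<')
                    {
                        state = State.InTag;
                        tagStart = position;
                        tag.Clear();
                        tag.Append(ch);
                    }
                    break;

                case State.InTag:
                    tag.Append(ch);
                    if (ch == '>')
                    {
                        state = State.Text;
                        yield return (tagStart, tag.ToString());
                    }
                    break;
            }
            position++;
        }
    }
}
```

Because the method is an iterator, nothing is buffered beyond the tag currently being read. A caller can filter the stream as it goes, for example keeping only the entries whose `Tag` is `"<p>"` or `"</p>"` while enumerating `TagTokenizer.ReadTags(new StreamReader(path))`.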

How do I retrieve all the indexes of a string in a large file?

5 answers: