Question

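I have an HTML document that, once parsed, contains only formatted text. Is it possible to get its text the way I would if I selected it with the mouse, copied it, and pasted it into a new text document?

I know that in Microsoft.Office.Interop I can use the .ActiveSelection property to select the contents of an open Word document.

I need to find a way to load the HTML (possibly into a browser object), then copy all of its content and assign it to a string.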

var doc = new HtmlAgilityPack.HtmlDocument();
// Read the file using the Windows-1251 code page.
var documentText = File.ReadAllText("myhtmlfile.html", Encoding.GetEncoding(1251));
documentText = this.PerformSomeChangesOverDocument(documentText);
doc.LoadHtml(documentText);
var stringWriter = new StringWriter();
AgilityPackEntities.AgilityPack.ConvertTo(doc.DocumentNode, stringWriter);
stringWriter.Flush();
var titleNode = doc.DocumentNode.SelectNodes("//title");
if (titleNode != null)
{
    var titleToBeRemoved = titleNode[0].InnerText;
    document.DocumentContent = stringWriter.ToString().Replace(titleToBeRemoved, string.Empty);
}
else
{
    document.DocumentContent = stringWriter.ToString();
}

Then I return the document object. The problem is that the string is not always formatted the way I want it to be.

Answer 1

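You should be able to use a StreamReader and, as you read each line, simply write it out with a StreamWriter.

Something like this will read to the end of the file and save the contents to a new file. If you need extra logic on the file's contents, I have inserted a comment showing where to put it.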

private void button4_Click(object sender, EventArgs e)
{
    // The using blocks flush and close both files, even if an exception is thrown.
    using (var reader = new System.IO.StreamReader("C:\\XXX\\XXX\\XXX\\test.html"))
    using (var writer = new System.IO.StreamWriter("C:\\XXX\\XXX\\XXX\\test2.html"))
    {
        string line;
        // Read until the end of the file.
        while ((line = reader.ReadLine()) != null)
        {
            // You can insert extra logic here if you need to omit lines or change them.
            writer.WriteLine(line);
        }
    }
}

You can also save it into a string and then do whatever you like with it. You can keep the same formatting by using newlines.
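As a minimal sketch of the "save it into a string" idea, assuming the file is small enough to fit in memory and is in an encoding that File.ReadAllLines detects by default: the class and method names below are hypothetical and are not part of the original code.

```csharp
using System;
using System.IO;

public static class HtmlFileText
{
    // Hypothetical helper: reads every line of the file and joins the lines
    // with "\n", so the original line structure survives in a single string.
    public static string ReadAsSingleString(string path)
    {
        string[] lines = File.ReadAllLines(path);
        return string.Join("\n", lines);
    }
}
```

If the file uses a specific code page, such as the 1251 encoding in the question, File.ReadAllLines also has an overload that takes an Encoding.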

Edit: the following takes your code into account.

private void button4_Click(object sender, EventArgs e)
{
    var filetext = new System.Text.StringBuilder();
    bool hasText = false;
    using (var reader = new System.IO.StreamReader("C:\\XXXX\\XXXX\\XXXX\\test.html"))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Skip lines that start with a tag; they are treated as formatting.
            if (line.StartsWith("<"))
            {
                continue;
            }
            // Separate kept lines with a newline, but do not begin with one.
            if (hasText)
            {
                filetext.Append("\n");
            }
            filetext.Append(line);
            hasText = true;
        }
    }
    // Trace is in the System.Diagnostics namespace.
    System.Diagnostics.Trace.WriteLine(filetext.ToString());
}
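
Note that checking whether a line starts with "<" only drops whole lines; tags in the middle of a line are kept. A rough alternative, not the approach in the code above, is to strip tags with regular expressions and then decode entities. This is only a sketch: regular expressions are not a real HTML parser, the result only approximates what copy and paste from a browser gives, and the class and method names are hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextSketch
{
    // Hypothetical helper: crude HTML-to-text conversion for simple documents.
    public static string StripTags(string html)
    {
        // Remove <script> and <style> elements together with their contents.
        string noScripts = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Turn <br> and a few closing block-level tags into newlines.
        string withBreaks = Regex.Replace(
            noScripts,
            @"<br\s*/?>|</(p|div|h[1-6]|li|tr)\s*>",
            "\n",
            RegexOptions.IgnoreCase);

        // Drop every remaining tag.
        string noTags = Regex.Replace(withBreaks, @"<[^>]+>", string.Empty);

        // Decode entities such as &amp; and trim surrounding whitespace.
        return WebUtility.HtmlDecode(noTags).Trim();
    }
}
```

The string returned by StripTags could then be assigned wherever the plain text is needed, for example in place of the filetext value above.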
