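The problem I'm running into is that the component variable does not change from the second .mht file onward. The sample files are (file1.mht, file2.mht, file3.mht), and their contents are (aaaaaa, bbbbbb, cccccc), in file order. Expected output: file1.mht aaaaaa file2.mht bbbbbb file3.mht cccccc
Current result: file1.mht aaaaaa file2.mht aaaaaa file3.mht aaaaaa file1.mht aaaaaa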
using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Data;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using System.Configuration;
using System.Collections.Specialized;
namespace ConsoleApp3
{
    class Program
    {
        static void Main(string[] args)
        {
            DirectoryInfo mht_file = new DirectoryInfo(@"C:\Users\manchunl\Desktop\ADVI-test\");
            string mht_text = "";
            foreach (FileInfo f in mht_file.GetFiles("*.mht"))
            {
                try
                {
                    using (StreamReader sr = new StreamReader(f.FullName))
                    {
                        string line;
                        while ((line = sr.ReadLine()) != null)
                        {
                            if (line.EndsWith("="))
                            {
                                line = line.Substring(0, line.Length - 1);
                            }
                            mht_text += line;
                        }
                    }
                    int start_index = mht_text.IndexOf("<HTML ");
                    int end_index = mht_text.IndexOf("</HTML>");
                    mht_text = mht_text.Substring(start_index, end_index + 7 - start_index);
                    mht_text = mht_text.Replace("=0D", "");
                    mht_text = mht_text.Replace("=00", "");
                    mht_text = mht_text.Replace("=0A", "");
                    mht_text = mht_text.Replace("=3D", "=");
                    HtmlDocument doc = new HtmlDocument();
                    doc.LoadHtml(mht_text);
                    var table = doc.DocumentNode.SelectSingleNode("//table[3]");
                    string component = table.SelectSingleNode(".//tr[4]").SelectSingleNode(".//td[2]").InnerHtml;
                    Console.WriteLine(f.FullName + " " + component);
                    File.AppendAllText(@"C:\Users\manchunl\Desktop\ADVI-test\result\dataCollection.txt", f.FullName + component + Environment.NewLine);
                }
                catch (Exception e)
                {
                }
            }
            Console.ReadKey();
        }
    }
}
Answer 0 (score: 0)
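General advice: minimize variable scope. mht_text is only used inside the loop, so it should be declared there and not shared between iterations.
Your bug is that string mht_text = ""; is declared outside the loop. As a result, it is not empty at the start of the second iteration.
First iteration: mht_text = "<HTML>aaaaaa</HTML>".
Second iteration: mht_text = "<HTML>aaaaaa</HTML><HTML>bbbbbb</HTML>".
IndexOf returns the position of the first match, so start_index and end_index still locate the first HTML block, and component is read from file1's content every time.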