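I have a large number of ".txt" files that contain ordinary text mixed with XML markup. The files are very large and there are very many of them, so I want to extract just the XML, without the surrounding text. I know that the markup starts with <body> and ends with </body>. I only need the <body> ... </body> elements and everything nested inside them.
Example file: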
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
<body>
...
</body>
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
<body>
...
</body>
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
<body>
...
</body>
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
I tried to use XDocument doc = XDocument.Parse(str); but I get an exception:
Data at the root level is invalid. Line 1, position 1.
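Answer 0: (score: 0)
Try the code below. It works if every line of markup starts with "<" and no line of plain text does. If that is not the case, we may need to use a regular expression instead.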
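// Requires: using System.IO; using System.Text;
// FILENAME is the path of the input file.
var sb = new StringBuilder();
using (var reader = new StreamReader(FILENAME, Encoding.UTF8))
{
    string inputLine;
    while ((inputLine = reader.ReadLine()) != null)
    {
        // Keep only the lines that look like markup.
        if (inputLine.Trim().StartsWith("<"))
        {
            sb.Append(inputLine).Append('\n');
        }
    }
}
string str = sb.ToString();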
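If some lines inside a body block do not start with "<", one alternative is to track whether the reader is currently inside a block. The following is only a rough sketch under assumptions that go beyond the question: the names BodyBlockReader and ReadBodyBlocks are hypothetical, the opening tag is assumed to be exactly <body> at the start of a line, and the closing </body> is assumed to end a line. Because it reads line by line, it never holds a whole file in memory.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class BodyBlockReader
{
    // Hypothetical helper: yields each <body>...</body> block in turn.
    // Assumes a line starts with <body> to open a block and a line ends
    // with </body> to close it. An unterminated block at end of input is dropped.
    public static IEnumerable<string> ReadBodyBlocks(TextReader reader)
    {
        StringBuilder current = null; // non-null while inside a block
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string trimmed = line.Trim();
            if (current == null &&
                trimmed.StartsWith("<body>", StringComparison.OrdinalIgnoreCase))
            {
                current = new StringBuilder();
            }
            if (current != null)
            {
                // Every line inside the block is kept, markup or not.
                current.Append(line).Append('\n');
                if (trimmed.EndsWith("</body>", StringComparison.OrdinalIgnoreCase))
                {
                    yield return current.ToString();
                    current = null;
                }
            }
        }
    }
}
```

It could be called as foreach (string block in BodyBlockReader.ReadBodyBlocks(new StreamReader(FILENAME, Encoding.UTF8))), and each block should then be something XDocument.Parse accepts on its own, provided the block itself is well-formed.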
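Answer 1: (score: 0)
Although it is not necessarily a good idea, as long as you are sure the content is properly escaped (i.e. any literal < in the non-XML text is written as &lt;, and so on), XML allows mixed content inside an element, that is, a combination of text data and nested elements.
For example, the following is valid XML: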
<FileContent>
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
<body>
...
</body>
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
<body>
...
</body>
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
<body>
...
</body>
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
</FileContent>
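So if you simply wrap the file content in a pair of tags, you can load it. You can then use XPath to access the body elements.
For example, something like this (untested):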
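// Requires: using System.Linq;
public string GetBodyTagContent(string fileContent)
{
    var xmlDoc = new System.Xml.XmlDocument();
    // Wrap the content in a single root element so that it parses as one document.
    xmlDoc.LoadXml("<FileContent>" + fileContent + "</FileContent>");
    // XmlNodeList is a non-generic collection, so the range variable needs an explicit type.
    // InnerText returns only the text; use n.OuterXml to keep the nested markup.
    return string.Join(",", from System.Xml.XmlNode n in xmlDoc.SelectNodes("//body") select n.InnerText);
}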
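Since the question started from XDocument, the same wrapping idea can also be sketched with LINQ to XML. This is an illustrative variant rather than part of the answer above: the names BodyExtractor and GetBodyElements are hypothetical, and it still assumes the surrounding text contains no unescaped < or & characters (otherwise XDocument.Parse throws an XmlException).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class BodyExtractor
{
    // Hypothetical helper: returns the markup of every <body> element,
    // including the nested elements, one string per element.
    public static List<string> GetBodyElements(string fileContent)
    {
        // A single wrapper root turns the mixed text and markup into one document.
        XDocument doc = XDocument.Parse("<FileContent>" + fileContent + "</FileContent>");
        return doc.Descendants("body")
                  .Select(e => e.ToString(SaveOptions.DisableFormatting))
                  .ToList();
    }
}
```

Returning the element markup rather than the inner text matches the goal of keeping the XML and dropping the plain text. For very large files this loads the whole file into memory, so it may not be practical without splitting the input first.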
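Answer 2: (score: 0)
html contains the file content.
resultList will hold the list of body contents.
Brief explanation: the pattern matches all the text between a pair of body tags. The trailing *? is a non-greedy (lazy) quantifier, so each body block is matched separately, instead of one match spanning everything between the first <body> and the last </body>.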
// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
Regex regx = new Regex("<body>(?<bodyContents>.*?)</body>", options);
Match matchResult = regx.Match(html);
List<string> resultList = new List<string>();
while (matchResult.Success)
{
    var d = matchResult.Groups["bodyContents"].Value;
    resultList.Add(d.Trim());
    matchResult = matchResult.NextMatch();
}
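The regular expression works for this specific pattern (text between body tags), but it will fail if the body tag has attributes or if the HTML is not well-formed.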
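The attribute limitation can be relaxed somewhat by letting the opening tag carry arbitrary attributes. The sketch below is one possible variation, not a general HTML parser: the names BodyRegex and GetBodyContents are hypothetical, and the pattern still breaks on nested body elements, on attribute values that contain ">", and on markup that is not well-formed.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BodyRegex
{
    // <body\b[^>]*> accepts an opening tag with or without attributes;
    // .*? stays lazy so each block is matched separately.
    private static readonly Regex BodyPattern = new Regex(
        @"<body\b[^>]*>(?<bodyContents>.*?)</body\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns the trimmed contents of every body block.
    public static List<string> GetBodyContents(string html)
    {
        var resultList = new List<string>();
        foreach (Match m in BodyPattern.Matches(html))
        {
            resultList.Add(m.Groups["bodyContents"].Value.Trim());
        }
        return resultList;
    }
}
```

The \b after "body" keeps the pattern from matching longer tag names that merely begin with "body".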