Get only the XML from a text file

Date: 2016-10-07 12:06:38

Tags: c# xml

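I have a lot of ".txt" files that contain ordinary text mixed with XML markup. The files are very large and there are a great many of them, so I want to extract just the XML, without the surrounding text. I know that the markup starts with <body> and ends with </body>. I only need <body> and everything nested inside it.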

Example file:

exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
 <body>
 ...
 </body>

exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText

exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText

exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
<body>
 ...
</body>

exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText

exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
<body>
 ...
</body>

exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText

exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText

I tried to use XDocument doc = XDocument.Parse(str);, but I get an exception:

Data at the root level is invalid. Line 1, position 1.

3 answers:

Answer 0: (score: 0)

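Try the code below. It will work if every XML line starts with "<" and no plain-text line does. If that is not the case, we may need to use a regular expression.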

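            // Requires: using System.IO; using System.Text;
            // Keep only the lines whose first non-blank character is '<'.
            var sb = new StringBuilder();
            using (var reader = new StreamReader(FILENAME, Encoding.UTF8))
            {
                string inputLine;
                while ((inputLine = reader.ReadLine()) != null)
                {
                    if (inputLine.TrimStart().StartsWith("<"))
                    {
                        sb.Append(inputLine).Append('\n');
                    }
                }
            }
            string str = sb.ToString();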
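Note that the filtered string still holds several <body> elements side by side, so XDocument.Parse would reject it for having more than one root element. A minimal sketch of one way around that, assuming the kept lines are well-formed XML on their own; the class name, the method name and the wrapper element name root are made up for illustration:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class LineFilterSketch
{
    // Hypothetical helper: keeps the lines that start with '<' and wraps
    // them in an artificial <root> element so the result has a single root.
    public static XDocument ParseFilteredLines(TextReader reader)
    {
        var sb = new StringBuilder("<root>\n");
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.TrimStart().StartsWith("<", StringComparison.Ordinal))
            {
                sb.Append(line).Append('\n');
            }
        }
        sb.Append("</root>");
        return XDocument.Parse(sb.ToString());
    }

    public static void Main()
    {
        string sample =
            "some text\n <body>\n <p>one</p>\n </body>\nmore text\n<body>\n<p>two</p>\n</body>\n";
        XDocument doc = ParseFilteredLines(new StringReader(sample));
        // Count the <body> elements that survived the filter.
        Console.WriteLine(doc.Root.Elements("body").Count());
    }
}
```

Passing a TextReader rather than a file name keeps the sketch easy to try on a string; for a real file, a StreamReader can be passed in the same way.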

Answer 1: (score: 0)

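Although it is not necessarily a good idea, this can work as long as you are sure the content is properly escaped (i.e. < is written as &lt; in the non-XML content, and so on): XML allows mixed content in an element, that is, a combination of text data and nested elements.

For example, the following is valid XML: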

<FileContent>
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
 <body>
 ...
 </body>

exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText

exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText

exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
<body>
 ...
</body>

exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText

exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
<body>
 ...
</body>

exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText

exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
exampleTextexampleTextexampleTextexampleTextexampleTextexampleText
</FileContent>

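So if you simply wrap the file content in a pair of tags, you can load it. You can then use XPath to access the body elements.

For example, something like this (untested):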

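// Requires: using System.Linq; using System.Xml;
public string GetBodyTagContent(string fileContent)
{
    var xmlDoc = new XmlDocument();
    // Wrap the mixed content in a single root element so it can be parsed.
    xmlDoc.LoadXml("<FileContent>" + fileContent + "</FileContent>");
    // XmlNodeList is non-generic, so the range variable needs an explicit type.
    // Use n.OuterXml instead of n.InnerText to keep the <body> markup itself.
    return string.Join(",", from XmlNode n in xmlDoc.SelectNodes("//body") select n.InnerText);
}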
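The same wrap-and-parse idea can also be sketched with LINQ to XML, which is what the question was already trying to use. This is only an illustration: the class and method names are hypothetical, and it still assumes the text around the body elements contains no bare < or & characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class WrapAndParseSketch
{
    // Hypothetical helper: wraps the mixed content in an artificial root
    // element and returns each <body> element with its markup included.
    public static List<string> GetBodyElements(string fileContent)
    {
        XElement root = XElement.Parse("<FileContent>" + fileContent + "</FileContent>");
        return root.Descendants("body")
                   .Select(b => b.ToString(SaveOptions.DisableFormatting))
                   .ToList();
    }

    public static void Main()
    {
        string sample =
            "text before <body><p>one</p></body> text between <body><p>two</p></body> text after";
        foreach (string body in GetBodyElements(sample))
        {
            Console.WriteLine(body);
        }
        // Prints:
        // <body><p>one</p></body>
        // <body><p>two</p></body>
    }
}
```

Returning the serialized elements rather than InnerText keeps the nested markup, which is what the question asks for.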

Answer 2: (score: 0)

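In the code below, html holds the file content and resultList will hold the list of body contents.

A short explanation: the pattern matches all the text between a pair of body tags. The trailing *? is a non-greedy (lazy) quantifier, so each body block is matched separately instead of one match spanning everything from the first <body> to the last </body> tag.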

// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
Regex regx = new Regex("<body>(?<bodyContents>.*?)</body>", options);
Match matchResult = regx.Match(html);
List<string> resultList = new List<string>();
while (matchResult.Success)
{
    // Collect the trimmed text captured between each pair of body tags.
    string d = matchResult.Groups["bodyContents"].Value;
    resultList.Add(d.Trim());
    matchResult = matchResult.NextMatch();
}

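The regex works for this specific pattern (the text between body tags), but it will fail if the body tag has attributes or if the html is not well formed.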
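If attributes on the body tag are a concern, the pattern can be loosened a little. The following is a rough sketch, not a general HTML parser: the class and method names are made up, and the pattern still breaks on nested body elements or on an attribute value that contains a > character.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BodyRegexSketch
{
    // <body\b[^>]*> accepts an opening tag with or without attributes;
    // .*? stays lazy so each block is matched separately.
    private static readonly Regex BodyRegex = new Regex(
        @"<body\b[^>]*>(?<bodyContents>.*?)</body\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns the trimmed content of every body block.
    public static List<string> ExtractBodies(string text)
    {
        var result = new List<string>();
        foreach (Match m in BodyRegex.Matches(text))
        {
            result.Add(m.Groups["bodyContents"].Value.Trim());
        }
        return result;
    }

    public static void Main()
    {
        string sample = "text <body class=\"a\">\n one \n</body> text <BODY>two</BODY> text";
        foreach (string body in ExtractBodies(sample))
        {
            Console.WriteLine(body);
        }
        // Prints:
        // one
        // two
    }
}
```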