Question

I have been trying to extract the text from a text file whenever it is not between "<" and ">". I also want each extracted word printed on a new line.

Here is the exercise: Write a program that extracts from an XML file the text only (without the tags).

Sample input: <?xml version="1.0"><student><name>Peter</name><age>21</age><interests count="3"><interest>Games</interest><interest>C#</interest>

Expected output:

Peter 21 Games C# Java

My current output looks like this:

Peter

21


 Games

C#

 Java

There are blank lines in between.

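This is what my code currently looks like. Any help would be greatly appreciated! In case you are wondering, this is self-study, not homework, so I am not required to do it and I am not cheating.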

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    namespace Chapter_15_Question_10
    {
        class Program
        {
            static void Main(string[] args)
            {
                Console.WriteLine("This app extracts the words not in tags");

                StreamReader reader = new StreamReader(
                    @"C:\Users\Nate\Documents\Visual Studio 2015\Projects\Chapter 15\Chapter 15 Question 10\Chapter 15 Question 10\TextFile1.txt");

                StringBuilder sb = new StringBuilder();

                using (reader)
                {
                    string line = reader.ReadToEnd();
                    bool isOpen = false;
                    for (int i = 1; i < line.Length; i++)
                    {
                        if (line[i-1] == '<')
                        {
                            isOpen = true;
                        }

                        if (line[i-1] == '>')
                        {
                            isOpen = false;
                        }

                        if (isOpen)
                        {
                            continue;
                        }

                        if (!(isOpen) && (line[i] != '<'))
                            Console.Write(line[i]);
                        if (line[i] == '<')
                            Console.WriteLine();
                    }
                }
            }
        }
    }

Answer 1

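Don't try to parse XML yourself by reading it line by line and scanning for delimiters. .NET provides a set of classes for reading XML.

What you are looking for are the text nodes.

Assuming this XML: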

var xml = "<?xml version=\"1.0\"?><student><name>Peter</name><age>21</age><interests count=\"3\"><interest>Games</interest><interest>C#</interest></interests></student>";

This version uses the newer System.Xml.Linq namespace, which lets you read the XML with LINQ-style queries.

var doc = XDocument.Parse(xml); // Use XDocument.Load instead of parse to read from a file
foreach (var text in doc.DescendantNodes().Where(n => n.NodeType == System.Xml.XmlNodeType.Text))
{
    Console.WriteLine(text);
}

This version, by contrast, uses the System.Xml namespace, where you can write the query in XPath.

var doc = new XmlDocument();
doc.LoadXml(xml); // Use doc.Load to read from a file
foreach (XmlNode text in doc.SelectNodes("//text()"))
{
    Console.WriteLine(text.Value);
}

Answer 2

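While I agree with the others that you should use one of the .NET XML classes, I assume this is homework and perhaps your teacher does not want you to. So here is your code, modified: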

for (int i = 0; i < line.Length; i++) {
    if (line[i] == '<') {
        isOpen = true;
    }
    else if (line[i] == '>') {
        isOpen = false;
    }
    else if (!isOpen) {
        Console.Write(line[i]);
        if (i < line.Length - 1 && '<' == line[i+1]) {
            Console.WriteLine();
        }
    }
}

Answer 3

This is a perfect use case for Regex:

using System;
using System.Text.RegularExpressions;

namespace RegexTest
{
    class Program
    {
        static void Main(string[] args) {
            string pattern = @"(?<=>)[^<]+(?=<)";
            Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
            string example = @"<?xml version=""1.0""><student><name>Peter</name><age>21</age><interests count=""3""><interest>Games</interest><interest>C#</interest>";
            MatchCollection results = rgx.Matches(example);
            foreach (Match m in results)
            {
                Console.WriteLine(m.Value);
            }
        }
    }
}

Result:

Peter
21
Games
C#

It returns any text found between a > and a <, as long as there is at least one character.
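Because "at least one character" also covers whitespace, XML that is indented or split across lines would produce matches consisting only of spaces and newlines. The following is a minimal sketch of one way to filter those out, assuming that trimming each match is acceptable; the class and method names (TagTextDemo, ExtractText) are hypothetical and not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TagTextDemo
{
    // Hypothetical helper: returns the trimmed, non-empty text runs
    // found between a '>' and the next '<'.
    public static List<string> ExtractText(string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, @"(?<=>)[^<]+(?=<)"))
        {
            // Whitespace-only matches (indentation, line breaks) trim to "".
            string value = m.Value.Trim();
            if (value.Length > 0)
                results.Add(value);
        }
        return results;
    }

    static void Main()
    {
        string input = "<student>\n  <name> Peter </name>\n  <age>21</age>\n</student>";
        foreach (string s in ExtractText(input))
            Console.WriteLine(s);
    }
}
```

With the sample string in Main, the whitespace between tags is dropped and only Peter and 21 are printed, each on its own line.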

Answer 4

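A few years ago I solved the same homework exercise, and my old solution does not look that good to me now. Since the idea is to practice "manual" text parsing, I can suggest the following approach, which relies more on string operations: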

<cfhttpparam name="body_html" type="formfield" value="#attributes.content#">

I stress that this is not the optimal solution, but it is fine for educational purposes like this.
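The listing shown with this answer is not C# and does not appear to be the code the answer describes. As a stand-in, here is a minimal sketch of what a string-operation approach could look like, assuming String.IndexOf and String.Substring are used to jump from each '>' to the next '<'. It is not the answerer's original code, and the names ManualParseSketch and ExtractText are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class ManualParseSketch
{
    // Hypothetical helper: collects the text between each '>' and the
    // following '<' using only IndexOf and Substring.
    public static List<string> ExtractText(string input)
    {
        var pieces = new List<string>();
        int pos = 0;
        while (pos < input.Length)
        {
            int close = input.IndexOf('>', pos);       // end of the current tag
            if (close < 0) break;
            int open = input.IndexOf('<', close + 1);  // start of the next tag
            if (open < 0) break;                       // text after the last tag is ignored

            string text = input.Substring(close + 1, open - close - 1).Trim();
            if (text.Length > 0)
                pieces.Add(text);

            pos = open; // continue scanning from the next tag
        }
        return pieces;
    }

    static void Main()
    {
        string input = "<student><name>Peter</name><age>21</age></student>";
        Console.WriteLine(string.Join(Environment.NewLine, ExtractText(input)));
    }
}
```

Like the other hand-written approaches in this thread, this sketch assumes that '<' and '>' only ever appear as tag delimiters, so it would not cope with comments, CDATA sections, or attribute values containing those characters.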

Answer 5

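If you are willing to go this route, note that the sequence

><

will produce a line break every time. The code below avoids that by only writing when some text has been collected: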

bool writeON = false;
StringBuilder sb = new StringBuilder();
foreach (char c in line)
{
    if (c == '>')
        writeON = true;
    else  if (c == '<')
    {
        writeON = false;
        if (sb.Length > 0)
            Debug.WriteLine(sb.ToString());
        sb.Clear();
    }
    else if (writeON)
        sb.Append(c);
}
Debug.WriteLine("done");

How to extract text between "<>"
