How to remove a substring from each line of a text file

Asked: 2015-07-13 17:27:30

Tags: c# regex

Let's say I have data in the following format:

<sdf<xml>....</xml>...
.........<smc<xml>....
...</xml>...<ueo<xml>.
.... and goes on......

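My goal is to read this data from a file line by line and, wherever an <xml> tag is found, remove the 4 characters immediately before it. In the example above, <sdf, <smc and <ueo would be removed.

Here is what I have written so far. The current regex is wrong and does not work: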

while((line = reader.ReadLine()) !=null)
{
  writer.WriteLine(Regex.Replace(line, @"(?i)</(xml)(?!>)",</$1>),, string.Empty);         
}

4 answers:

Answer 0 (score: 2)

Your general idea and loop structure are fine. Only the regex match needs a little work:

while ((line = reader.ReadLine()) != null)
    writer.WriteLine(Regex.Replace(line, @"....<xml>", "<xml>"));

If you want to handle any pattern of the form <...<tag>, you can use:

while ((line = reader.ReadLine()) != null)
    writer.WriteLine(Regex.Replace(line, @"<[^<>]{3}<([^<>]+)>", "<$1>"));
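
To see how the two patterns above behave, here is a minimal, self-contained sketch. The class name and the in-memory array (used in place of the reader/writer pair) are assumptions made for illustration; the sample lines are modeled on the data in the question.

```csharp
using System;
using System.Text.RegularExpressions;

class StripPrefixDemo
{
    static void Main()
    {
        // Sample lines modeled on the question's data; a real program
        // would read them with reader.ReadLine() instead.
        string[] lines =
        {
            "<sdf<xml>....</xml>...",
            ".........<smc<xml>....",
            "...</xml>...<ueo<xml>."
        };

        foreach (string line in lines)
        {
            // Pattern 1: drop any 4 characters that sit directly before a literal <xml>.
            string fixedTag = Regex.Replace(line, @"....<xml>", "<xml>");

            // Pattern 2: drop a '<' plus 3 non-bracket characters before any <tag>.
            string anyTag = Regex.Replace(line, @"<[^<>]{3}<([^<>]+)>", "<$1>");

            Console.WriteLine(fixedTag);
            Console.WriteLine(anyTag);
        }
    }
}
```

Note that the first pattern removes whatever 4 characters precede <xml>, even if they do not start with '<', while the second only fires when the prefix itself looks like '<' followed by three non-bracket characters.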

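Answer 1 (score: 0)

You can try this:

while ((line = reader.ReadLine()) != null)
{
  // Remove the 4 characters that precede an opening tag which has a matching closing tag.
  // The extra string.Empty argument was dropped: it made WriteLine treat the result as a format string.
  writer.WriteLine(Regex.Replace(line, @"(?is).{4}(?=<(\w+)\b[^>]*>.*?</\1>)", ""));
}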
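
One thing worth checking with this pattern is that the lookahead requires the closing tag to appear in the same input string. When the file is read line by line, as in the question, a tag that opens on one line and closes on the next would not be matched. The sketch below illustrates this with two of the question's sample lines; the class name is an assumption.

```csharp
using System;
using System.Text.RegularExpressions;

class LookaheadDemo
{
    static void Main()
    {
        // Pattern from the answer above: 4 characters followed by <tag>...</tag>.
        const string pattern = @"(?is).{4}(?=<(\w+)\b[^>]*>.*?</\1>)";

        // Closing tag on the same line: the lookahead succeeds and "<sdf" is removed.
        Console.WriteLine(Regex.Replace("<sdf<xml>....</xml>...", pattern, ""));

        // Closing tag only on a later line: the lookahead fails, so the line is unchanged.
        Console.WriteLine(Regex.Replace(".........<smc<xml>....", pattern, ""));
    }
}
```

If tags can span lines, one option would be to run the replacement on the whole file content rather than on individual lines.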

Answer 2 (score: 0)

using System;
using System.Text;

namespace ConsoleApplication3
{
    class Program
    {
        private static string testData = "<sdf<xml><something/></xml><smc<xml><something/><ueo<xml><something /></xml>";
        static void Main(string[] args)
        {
            Func<string, string> stripInvalidXml = input => {
                Func<int, bool> shouldSkip = index =>
                {
                    var xI = index + 4; //add 4 to see what's after the current 4 characters
                    if (xI > input.Length - 5) //make sure reading the 5 characters of <xml> starting at xI stays within the input
                        return false;
                    if (input.Substring(xI, 5) == "<xml>") //check if the characters 4 indexes after the current character are <xml>
                        return true; //skip the current index
                    return false; //don't skip
                };
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < input.Length; ++i)
                {
                    //loop through each character and see if the characters 4 after are <xml>
                    char c = input[i];
                    if (shouldSkip(i))
                        i += 3; //we are already on the first character, so add 3 more to skip 4 characters in total
                    else
                        sb.Append(c);
                }
                return sb.ToString();
            };
            Console.WriteLine(stripInvalidXml(testData));
            Console.ReadKey(true);
        }

    }
}

Answer 3 (score: 0)

Try:

writer.WriteLine(Regex.Replace(s, @"<.{3}(<\w*>)", "$1"));

This assumes the solution should also match tags that are not explicitly named <xml></xml>.
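
As a rough illustration of that assumption, the sketch below applies the pattern to two hypothetical inputs that use a tag name other than xml. The class name and the inputs are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

class AnyTagDemo
{
    static void Main()
    {
        const string pattern = @"<.{3}(<\w*>)";

        // A plain tag with a different name: the 4-character prefix "<abc" is removed.
        Console.WriteLine(Regex.Replace("<abc<item>text</item>", pattern, "$1"));

        // A tag with attributes does not fit <\w*>, so this input is left as is.
        Console.WriteLine(Regex.Replace("<abc<item id=\"1\">text</item>", pattern, "$1"));
    }
}
```

If tags with attributes need to be handled as well, the capture group would have to allow more than word characters between the angle brackets.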