How do I split text into lines based on a regular expression?

Asked: 2016-06-30 15:04:07

Tags: c# regex

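I have snippets of text that I want to split into lines. The problem is that they are already formatted, so I can't split them the way I normally would: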

 _text = text.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .ToArray();

Here is the sample text:

 adj 1: around the middle of a scale of evaluation of physical
        measures; "an orange of average size"; "intermediate
        capacity"; "a plane with intermediate range"; "medium
        bombers" [syn: {average}, {intermediate}]
 2: (of meat) cooked until there is just a little pink meat
    inside
 n 1: a means or instrumentality for storing or communicating
      information
 2: the surrounding environment; "fish require an aqueous
    medium"
 3: an intervening substance through which signals can travel as
    a means for communication
 4: (bacteriology) a nutrient substance (solid or liquid) that
    is used to cultivate micro-organisms [syn: {culture medium}]
 5: an intervening substance through which something is
    achieved; "the dissolving medium is called a solvent"
 6: a liquid with which pigment is mixed by a painter
 7: (biology) a substance in which specimens are preserved or
    displayed
 8: a state that is intermediate between extremes; a middle
    position; "a happy medium"

The format is always the same:

  • possibly 1-3 letters
  • a number from 1 to 10
  • a colon
  • a space
  • text that may span multiple lines.

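So in this case, an entry break must be 1-3 characters, followed by a 1-2 digit number, followed by a colon.

Can someone give me some advice on how to do this with a split or some other method?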

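Update: I have Steven's answer, but I'm not quite sure how to use it in my function. Below I show my original code together with the answer Steven suggested, but one part is missing and I'm not sure what belongs there: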

    public parser(string text)
    {
        //_text = text.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
            // .ToArray();

        string pattern = @"(\w{1,3} )?1?\d: (?<line>[^\r\n]+)(\r?\n\s+(?<line>[^\r\n]+))*";
        foreach (Match m in Regex.Matches(text, pattern))
        {
            if (m.Success)
            {
                string entry = string.Join(Environment.NewLine,
                    m.Groups["line"].Captures.Cast<Capture>().Select(x => x.Value));
                // ...
            }
        }
    }

For testing purposes, here is the text in a different format:

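" medium\n adj 1: around the middle of a scale of evaluation of physical\n        measures; \"an orange of average size\"; \"intermediate\n        capacity\"; \"a plane with intermediate range\"; \"medium\n        bombers\" [syn: {average}, {intermediate}]\n 2: (of meat) cooked until there is just a little pink meat\n    inside\n n 1: a means or instrumentality for storing or communicating\n      information\n 2: the surrounding environment; \"fish require an aqueous\n    medium\"\n 3: an intervening substance through which signals can travel as\n    a means for communication\n 4: (bacteriology) a nutrient substance (solid or liquid) that\n    is used to cultivate micro-organisms [syn: {culture medium}]\n 5: an intervening substance through which something is\n    achieved; \"the dissolving medium is called a solvent\"\n 6: a liquid with which pigment is mixed by a painter\n 7: (biology) a substance in which specimens are preserved or\n    displayed\n 8: a state that is intermediate between extremes; a middle\n    position; \"a happy medium\"\n 9: someone who serves as an intermediary between the living and the dead; \"he consulted several mediums\" [syn: {spiritist}]\n 10: transmissions that are disseminated widely to the public\n     [syn: {mass medium}]\n 11: an occupation for which you are especially well suited;\n     [syn: {metier}]\n [also: {media} (pl)]\n"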

2 answers:

Answer 0 (score: 2)

Regex is a good fit for this. For example:

public parser(string text)
{
    string pattern = @"(?<line> (\w{1,3} )?1?\d: [^\r\n]+)(\r?\n(?! (\w{1,3} )?1?\d: [^\r\n]+)\s+(?<line>[^\r\n]+))*";
    var entries = new List<string>();
    foreach (Match m in Regex.Matches(text, pattern))
        if(m.Success)
            entries.Add(string.Join(" ", 
                m.Groups["line"].Captures.Cast<Capture>().Select(x=>x.Value)));
    _text = entries.ToArray();
}
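
To try the pattern outside the constructor, here is a minimal sketch that wraps it in a static helper. The class name `EntrySplitter`, the method name `SplitEntries`, and the final `Trim()` call are assumptions made for illustration and are not part of the answer; the pattern itself is copied unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class EntrySplitter
{
    // Pattern copied verbatim from the answer. The first "line" capture holds the
    // header line (" adj 1: ..."); each later "line" capture holds one indented
    // continuation line. The negative lookahead stops the repetition as soon as
    // the next line looks like a new header.
    private const string Pattern =
        @"(?<line> (\w{1,3} )?1?\d: [^\r\n]+)(\r?\n(?! (\w{1,3} )?1?\d: [^\r\n]+)\s+(?<line>[^\r\n]+))*";

    public static string[] SplitEntries(string text)
    {
        var entries = new List<string>();
        foreach (Match m in Regex.Matches(text, Pattern))
        {
            // Join the header line and its continuation lines with single spaces.
            var lines = m.Groups["line"].Captures.Cast<Capture>().Select(c => c.Value);
            // Trim() is an addition here: it drops the leading space of the header line.
            entries.Add(string.Join(" ", lines).Trim());
        }
        return entries.ToArray();
    }
}
```

Calling `EntrySplitter.SplitEntries(" adj 1: first line\n        continued\n 2: second\n")` should give two entries, `adj 1: first line continued` and `2: second`. Note that the pattern expects every header line to start with exactly one space, as in the sample text.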

Answer 1 (score: 2)

Try this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace ConsoleApplication106
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.txt";
        static void Main(string[] args)
        {
            string inputLine = "";
            List<Data> data = new List<Data>();
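            // Anchored, letters-only prefix: with \w* the header "10:" would be read as prefix "1", index 0.
            string pattern = @"^(?:(?'prefix'[A-Za-z]{1,3})\s+)?(?'index'\d+):\s*(?'text'.*)$";
            using (StreamReader reader = new StreamReader(FILENAME))
            {
                while ((inputLine = reader.ReadLine()) != null)
                {
                    inputLine = inputLine.Trim();
                    Match match = Regex.Match(inputLine, pattern);
                    if (match.Success)
                    {
                        // Header line: start a new entry.
                        Data newData = new Data();
                        data.Add(newData);
                        newData.prefix = match.Groups["prefix"].Value;
                        newData.index = int.Parse(match.Groups["index"].Value);
                        newData.text = match.Groups["text"].Value;
                    }
                    else if (data.Count > 0 && inputLine.Length > 0)
                    {
                        // Continuation line: append it to the previous entry instead of parsing it.
                        data[data.Count - 1].text += " " + inputLine;
                    }
                }
            }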
        }
    }
    public class Data
    {
        public string prefix { get; set; }
        public int index { get; set; }
        public string text { get; set; }
    }
}
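
Since the question also asks about a split-based approach, here is a rough sketch of that idea for text that is already in memory. It splits only at line breaks that are followed by an entry header (optional 1-3 letters, a 1-2 digit number, a colon, and a space), then collapses the wrapped whitespace inside each piece. The names `HeaderSplitter` and `SplitAtHeaders` are hypothetical, and the header pattern is an assumption based on the format described in the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating a Regex.Split approach.
public static class HeaderSplitter
{
    // Matches a line break only when the next line starts with an entry header.
    // The lookahead (?=...) is zero-width, so the header itself stays in the piece
    // that follows, and there are no capturing groups for Split to return.
    private static readonly Regex EntryBreak =
        new Regex(@"\r?\n(?=\s*(?:[A-Za-z]{1,3} )?\d{1,2}: )");

    public static string[] SplitAtHeaders(string text)
    {
        return EntryBreak.Split(text)
            // Collapse line breaks and indentation inside an entry to single spaces.
            .Select(piece => Regex.Replace(piece, @"\s+", " ").Trim())
            .Where(piece => piece.Length > 0)
            .ToArray();
    }
}
```

Any text before the first header (such as the leading `medium` in the test string) comes back as the first element, so the caller may want to skip it. A continuation line that happens to begin with something shaped like a header would be split incorrectly, which is a limitation of any approach that relies on the header pattern alone.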