Regular expression to capture the text around a literal

Date: 2012-03-30 14:27:59

Tags: c# regex

My text looks like this:

Title A
some description on a few lines, there may be empty lines here
some description on a few lines
Status: some random text
Title B
some description on a few lines, there may be empty lines here
some description on a few lines
Status: some other random text
Title C
some description on a few lines, there may be empty lines here
some description on a few lines
Status: some other random text

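I want to parse the text based on the literal Status: and get an array of items, each containing a title, the description lines, and a status. I am using C# 4.0.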

4 answers:

Answer 0 (score: 1)

This is how I would do it (assuming the text is read from a file):

// Requires: using System.IO; using System.Text.RegularExpressions;
Regex regStatus = new Regex(@"^Status:");
// Titles in the sample look like "Title A" (no colon), so match "Title" plus whitespace
Regex regTitle = new Regex(@"^Title\s");
string line;
string[] descriptionLine;
string[] statusLine;
string[] titleLine;
using (TextReader reader = File.OpenText("file.txt"))
{
    // ReadLine returns null once the end of the file is reached
    while ((line = reader.ReadLine()) != null)
    {
        if (regStatus.IsMatch(line))
        {
            // status line: split into words; the first element is "Status:" and can be dropped
            statusLine = line.Split(' ');
            // do stuff with array
        }
        else if (regTitle.IsMatch(line))
        {
            // title line: split into words; the first element is "Title" and can be dropped
            titleLine = line.Split(' ');
            // do stuff with array
        }
        else
        {
            // description line, so just split into array
            descriptionLine = line.Split(' ');
            // do stuff with array
        }
    }
}

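You can then take the arrays and store them in some class as needed; I will leave that for you to work out. The code just uses a simple regex to check whether a line starts with "Status:" or "Title". To be honest, even that is not needed. You could do this instead:

if (line.StartsWith("Status:")) {}
if (line.StartsWith("Title ")) {}

to check whether each line starts with the status or the title marker.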

Answer 1 (score: 1)

If the content is structured the way you describe, you can buffer the text:

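// Requires: using System.Collections.Generic; using System.Text; using System.Text.RegularExpressions;
string myRegEx = "^Status:.*$";
var items = new List<string>();       // one entry per Title..Status block
var buffer = new StringBuilder();

// loop through each line in text ("lines" is assumed to hold the input, one line per element)
foreach (string line in lines)
{
    if (Regex.IsMatch(line, myRegEx))
    {
        buffer.AppendLine(line);      // keep the status line with its block
        items.Add(buffer.ToString()); // save the buffer into the list
        buffer.Clear();               // clear the buffer
    }
    else
    {
        buffer.AppendLine(line);      // save the text into the buffer
    }
}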

Answer 2 (score: 1)

Declare an item type

public class Item
{
    public string Title { get; set; }
    public string Status { get; set; }
    public string Description { get; set; }
}

Then split the text into lines

string[] lines = text.Split(new[] { "\r\n" }, StringSplitOptions.None);

Or read the lines from a file with

string[] lines = File.ReadAllLines(path);

Create the list of items that will store the result

var result = new List<Item>();

Now we can do the parsing

Item item = null; // initialized so the compiler accepts the uses below; set on the first title line
for (int i = 0; i < lines.Length; i++) {
    string line = lines[i];
    if (line.StartsWith("Title ")) {
        item = new Item();
        result.Add(item);
        item.Title = line.Substring(6);
    } else if (line.StartsWith("Status: ")) {
        item.Status = line.Substring(8);
    } else { // Description
        if (item.Description != null) {
            item.Description += "\r\n";
        }
        item.Description += line;
    }
}

Note that this solution has no error handling. The code assumes the input text is always well formed.
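
Since the loop above assumes well-formed input, here is a minimal sketch of what a more defensive variant might look like. It is not part of the original answer: the ItemParser class, its Parse method, the choice to throw FormatException, and the decision to treat an item as closed once its status line is seen are all assumptions made for illustration. It also accepts "\n" and "\r" line endings in addition to "\r\n".

```csharp
using System;
using System.Collections.Generic;

public class Item
{
    public string Title { get; set; }
    public string Status { get; set; }
    public string Description { get; set; }
}

// Hypothetical helper, not from the original answer.
public static class ItemParser
{
    public static List<Item> Parse(string text)
    {
        var result = new List<Item>();
        Item item = null; // the item currently being filled, or null between items

        // "\r\n" is listed first so a Windows line break is not split twice.
        string[] lines = text.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);

        foreach (string line in lines)
        {
            if (line.StartsWith("Title ", StringComparison.Ordinal))
            {
                item = new Item { Title = line.Substring(6).Trim() };
                result.Add(item);
            }
            else if (item == null)
            {
                // Outside any item: blank lines are skipped, anything else is an error.
                if (!string.IsNullOrWhiteSpace(line))
                    throw new FormatException("Text found outside an item: " + line);
            }
            else if (line.StartsWith("Status:", StringComparison.Ordinal))
            {
                item.Status = line.Substring(7).Trim();
                item = null; // assumption: the status line closes the item
            }
            else
            {
                // Description line (may be empty); join lines with a line break.
                item.Description = item.Description == null
                    ? line
                    : item.Description + Environment.NewLine + line;
            }
        }
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        string text = "Title A\nline 1\n\nline 2\nStatus: open\n\nTitle B\nStatus: closed\n";
        foreach (Item it in ItemParser.Parse(text))
            Console.WriteLine("{0} [{1}]", it.Title, it.Status);
    }
}
```

With the sample string in Main this prints "A [open]" followed by "B [closed]". Whether stray text should throw or simply be ignored depends on how much you trust the input; throwing is only one possible choice.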

Answer 3 (score: 0)

string data = @"Title A 


Status: Nothing But Net! 
Title B 
some description on a few lines, there may be empty lines here 
some description on a few lines 
Status: some other random text 
Title C 
Can't stop the invisible Man 
Credo Quia Absurdium Est
Status: C Status";

string pattern = @"
^(?:Title\s+)
 (?<Title>[^\s]+)
 (?:[\r\n\s]+)
 (?<Description>.*?)
  (?:^Status:\s*)
  (?<Status>[^\r\n]+)
";

// Requires: using System; using System.Linq; using System.Text.RegularExpressions;
// IgnorePatternWhitespace just lets us spread the pattern over multiple lines and comment it.
Regex.Matches(data, pattern, RegexOptions.Singleline | RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace)
     .OfType<Match>()
     .Select (mt => new
        {
            Title = mt.Groups["Title"].Value,
            Description = mt.Groups["Description"].Value.Trim(),
            Status = mt.Groups["Status"].Value.Trim()
        })
      .ToList() // This is here just to do the display of the output
      .ForEach(item => Console.WriteLine ("Title {0}: ({1}) and this description:{3}{2}{3}", item.Title, item.Status, item.Description, Environment.NewLine));

Output:

Title A: (Nothing But Net!) and this description:


Title B: (some other random text) and this description:
some description on a few lines, there may be empty lines here 
some description on a few lines

Title C: (C Status) and this description:
Can't stop the invisible Man 
Credo Quia Absurdium Est
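
To illustrate what RegexOptions.IgnorePatternWhitespace allows, here is a sketch of an equivalent pattern with inline # comments. The CommentedPattern class, its Extract method, and the "title|status|description" output format are hypothetical names chosen for this example only; the pattern is simplified slightly (\S+ instead of [^\s]+, \s+ instead of [\r\n\s]+, no non-capturing groups) and is intended to behave the same way on the sample data.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class, not from the original answer.
public static class CommentedPattern
{
    // With IgnorePatternWhitespace, unescaped whitespace in the pattern is
    // skipped and '#' starts a comment that runs to the end of the line.
    private const string Pattern = @"
        ^Title\s+              # 'Title' at the start of a line (needs Multiline)
        (?<Title>\S+)          # the title token, e.g. 'A'
        \s+                    # whitespace and line breaks after the title
        (?<Description>.*?)    # lazily, everything before the status line (needs Singleline)
        ^Status:\s*            # 'Status:' at the start of a line
        (?<Status>[^\r\n]+)    # the rest of the status line
    ";

    // Returns one 'title|status|description' string per matched item.
    public static List<string> Extract(string data)
    {
        var rows = new List<string>();
        RegexOptions options = RegexOptions.Singleline
                             | RegexOptions.Multiline
                             | RegexOptions.IgnorePatternWhitespace;

        foreach (Match m in Regex.Matches(data, Pattern, options))
        {
            rows.Add(m.Groups["Title"].Value + "|"
                   + m.Groups["Status"].Value.Trim() + "|"
                   + m.Groups["Description"].Value.Trim());
        }
        return rows;
    }

    public static void Main()
    {
        string data = "Title A\r\nfirst line\r\nStatus: open\r\nTitle B\r\nStatus: closed";
        foreach (string row in Extract(data))
            Console.WriteLine(row);
    }
}
```

Keeping the comments inside the pattern documents each group next to the text it matches, at the cost of having to escape any literal space or # the pattern might need later.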