Extracting objects and their attributes with a regular expression

Asked: 2015-05-20 14:15:50

Tags: c# regex

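Hello, I'm working with a List<T> of results that contain strings. To simplify, I'll use sentences like the ones below, but the scheme is the same.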

01:01 A car consists of : wheels, engine, seats, 2 screws, a cotton lamp
01:02 A bike consists of : wheels
01:03 A car consists of : wheels, engine, seats, speakers, 5 screws, an indicator light
01:04 A small truck consists of : wheels, engine, seats, bed

So the pseudo-matcher for each line would be:

00-99:0-99(space)A|An(space){get the car/bike or any other as object}(space)consists(space)of(space):{get the elements in here exploding the commas as attributes}

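Right now I use a foreach loop that goes through my list and writes each line to a text box.

foreach (Message _msg in _objects.Messages) {
    richTextBox1.AppendText(_msg.Text);
}

Pseudo-code for the display I'd like, passing each sentence through a parse function before it is added to my text box:

foreach (Message _msg in _objects.Messages) {
    richTextBox1.AppendText(parsefunction(_msg.Text));
}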

parse function
{
    count the exploded elements and list them
    remove the unwanted parts of the text
}

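After extracting the objects and attributes, I want to sum them, taking into account whether they carry a count, and strip the a/an from them. This is the part where I'm stuck.

The desired output sums any duplicates and shows the number of occurrences: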

2x Car
4x Wheels
3x Engine
3x Seats
7x Screws
1x Cotton Lamp
1x Bike
1x Speakers
1x Indicator Light
1x Small Truck
1x Bed

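Could you point me to at least the regex? Maybe I can work out the rest myself and share it when I'm done. I think it has to be a function that gets called inside the loop.

1 answer:

Answer 0 (score: 1)

Here's what I came up with (I'm sure it can be improved):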

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static List<KeyValuePair<string, string[]>> ParseData(List<string> data)
{
    Regex regex = new Regex(@"^[\d]{2}:[\d]{2} A[n]? ([a-zA-Z\s]+) consists of : ([a-zA-Z,\s0-9]+)$");
    var elementMap = new List<KeyValuePair<string, string[]>>();

    for (int i = 0; i < data.Count; i++)
    {
        var match = regex.Match(data[i]);
        var attributes = match.Groups[2].Value.Split(new string[] { ", " }, StringSplitOptions.RemoveEmptyEntries);

        if (match.Success && match.Groups[1].Value.Length > 0)
            elementMap.Add(new KeyValuePair<string, string[]>(match.Groups[1].Value, attributes));
    }

    return elementMap;
}

public static Dictionary<string, int> GetIndexedData(List<KeyValuePair<string, string[]>> data)
{
    Dictionary<string, int> displayObjects = new Dictionary<string, int>();

    foreach (KeyValuePair<string, string[]> item in data)
    {
        if (displayObjects.ContainsKey(item.Key))
            displayObjects[item.Key]++;
        else
            displayObjects.Add(item.Key, 1);

        foreach (string key2 in item.Value)
        {
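            string[] attributeValues = key2.Split(' ');
            int add = 1;
            string addValue = key2;
            int c = 0;

            // A leading number is a count; the remaining words form the name.
            if (attributeValues.Length > 1 && int.TryParse(attributeValues[0], out c))
            {
                add = c;
                addValue = string.Join(" ", attributeValues, 1, attributeValues.Length - 1);
            }

            // Strip a leading article. StartsWith is safe even when the
            // name is shorter than the prefix being tested.
            if (addValue.StartsWith("a ", StringComparison.Ordinal))
                addValue = addValue.Substring(2);
            else if (addValue.StartsWith("an ", StringComparison.Ordinal))
                addValue = addValue.Substring(3);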

            if (displayObjects.ContainsKey(addValue))
                displayObjects[addValue] += add;
            else
                displayObjects.Add(addValue, add);
        }
    }

    return displayObjects;
}

Usage:

List<string> data = new List<string>();
data.Add("01:01 A car consists of : wheels, engine, seats, 2 screws, a cotton lamp");
data.Add("01:02 A bike consists of : wheels");
data.Add("01:03 A car consists of : wheels, engine, seats, speakers, 5 screws, an indicator light");
data.Add("01:04 A small truck consists of : wheels, engine, seats, bed");
var elementMap = ParseData(data);

var displayObjects = GetIndexedData(elementMap);

foreach (string key in displayObjects.Keys)
{
    Console.WriteLine(key + ": " + displayObjects[key]);
}
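
The loop above prints entries as "name: count", while the question asks for lines such as "2x Car". A minimal sketch of that formatting step follows, assuming title-casing through the invariant culture is acceptable; CountFormatter, FormatEntry and FormatAll are hypothetical names, not part of the answer's code.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CountFormatter
{
    // Hypothetical helper: formats one entry as "<count>x <Title Cased Name>".
    public static string FormatEntry(string name, int count)
    {
        // ToTitleCase capitalizes the first letter of each word. The name is
        // lower-cased first because ToTitleCase leaves all-caps words alone.
        TextInfo textInfo = CultureInfo.InvariantCulture.TextInfo;
        return count + "x " + textInfo.ToTitleCase(name.ToLowerInvariant());
    }

    // Formats every entry of a name-to-count dictionary, in enumeration order.
    public static List<string> FormatAll(Dictionary<string, int> counts)
    {
        var lines = new List<string>();
        foreach (KeyValuePair<string, int> entry in counts)
            lines.Add(FormatEntry(entry.Key, entry.Value));
        return lines;
    }
}
```

Each string returned by CountFormatter.FormatAll(displayObjects) could then be appended to the text box or written to the console.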

Basically, this regex pattern (^[\d]{2}:[\d]{2} A[n]? ([a-zA-Z\s]+) consists of : ([a-zA-Z,\s0-9]+)$) will match anything built exactly the way you described. All you have to do is:

var match = regex.Match(data[i]);
// 'match.Groups[1].Value' is the name of the item
// 'match.Groups[2].Value' is the comma-separated list

// The following line will split all the attributes on ', ' therefore leaving them as just the words. (`wheels`, `engine`, `seats`)
var attributes = match.Groups[2].Value.Split(new string[] { ", " }, StringSplitOptions.RemoveEmptyEntries);

Do whatever you want with all of that information.
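
One thing you might do with each attribute is separate an optional leading count or article from the name. The sketch below uses a second, small regex for that; the pattern and the names AttributeParser and Parse are my own assumptions, not something from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class AttributeParser
{
    // Optional prefix: either a count ("2 ") or an article ("a " / "an "),
    // followed by the attribute name itself.
    private static readonly Regex AttributePattern = new Regex(
        @"^(?:(?<count>[0-9]+)\s+|an?\s+)?(?<name>.+)$",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the name as Key and the count as Value.
    // Attributes without an explicit count are counted once.
    public static KeyValuePair<string, int> Parse(string attribute)
    {
        Match match = AttributePattern.Match((attribute ?? string.Empty).Trim());
        if (!match.Success)
            throw new ArgumentException("Empty attribute.", "attribute");

        int count = 1;
        if (match.Groups["count"].Success)
            count = int.Parse(match.Groups["count"].Value, CultureInfo.InvariantCulture);

        return new KeyValuePair<string, int>(match.Groups["name"].Value, count);
    }
}
```

Because the article alternative requires whitespace after a or an, a word such as antenna is left untouched. A count too large for an int would make int.Parse throw, which this sketch does not handle.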

This makes the following assumptions:

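  1. The data will always start with two digits ([\d]{2}), a colon (:), another two digits ([\d]{2}), a space ( ), an A (A) with an optional n ([n]?) (for A or An), and another space ( ); all of this at the very start of the line (^).
  2. The name of the object ([a-zA-Z\s]+) can contain:
    1. Letters (a-zA-Z)
    2. Whitespace (\s)
    3. At least one such character, and as many as possible
  3. The next words are a space ( ), consists of, a space ( ), and a colon followed by a space (: ).
  4. The attributes ([a-zA-Z,\s0-9]+) can contain:
    1. Letters (a-zA-Z)
    2. Commas (,)
    3. Whitespace (\s)
    4. Digits (0-9)
    5. At least one such character, and as many as possible
  5. The attributes run to the end of the string ($).
  6. Finally, attributes is not null or empty: there is at least one character in it.

Also, there is no error checking here. You should add it as needed.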
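
As one possible shape for that error checking, the sketch below wraps the match in a Try-style method and collects the lines that do not fit the pattern. It uses named groups in place of numbered ones; SafeLineParser, TryParseLine, FindRejectedLines and the group names are hypothetical choices of mine.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SafeLineParser
{
    // Same structure as the pattern above, written with named groups.
    private static readonly Regex LinePattern = new Regex(
        @"^\d{2}:\d{2} An? (?<name>[a-zA-Z\s]+) consists of : (?<attributes>[a-zA-Z,\s0-9]+)$");

    // Returns false, rather than throwing, when the line does not match.
    public static bool TryParseLine(string line, out string name, out string[] attributes)
    {
        name = string.Empty;
        attributes = new string[0];

        Match match = LinePattern.Match(line ?? string.Empty);
        if (!match.Success)
            return false;

        name = match.Groups["name"].Value;
        attributes = match.Groups["attributes"].Value
            .Split(new[] { ", " }, StringSplitOptions.RemoveEmptyEntries);
        return true;
    }

    // Collects the lines that could not be parsed so they can be logged or shown.
    public static List<string> FindRejectedLines(IEnumerable<string> lines)
    {
        var rejected = new List<string>();
        foreach (string line in lines)
        {
            string name;
            string[] attributes;
            if (!TryParseLine(line, out name, out attributes))
                rejected.Add(line);
        }
        return rejected;
    }
}
```

Whether a rejected line should be skipped, logged, or treated as a fatal error depends on where the messages come from, so that decision is left to the caller.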