File format
POS ID PosScore NegScore SynsetTerms Gloss
a 00001740 0.125 0 able#1 "able to swim"; "she was able to program her computer";
a 00002098 0 0.75 unable#1 "unable to get to town without a car";
a 00002312 0 0 dorsal#2 abaxial#1 "the abaxial surface of a leaf is the underside or side facing away from the stem"
a 00002843 0 0 basiscopic#1 facing or on the side toward the base
a 00002956 0 0.23 abducting#1 abducent#1 especially of muscles; drawing away from the midline of the body or from an adjacent part
a 00003131 0 0 adductive#1 adducting#1 adducent#1 especially of muscles;
From this file I want to extract the ID, PosScore, NegScore and SynsetTerms fields. Extracting ID, PosScore and NegScore is straightforward; I use the following code to get them.
Regex expression = new Regex(@"(\t(\d+)|(\w+)\t)");
var results = expression.Matches(input);
foreach (Match match in results)
{
    Console.WriteLine(match);
}
Console.ReadLine();
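This gives the correct result, but the SynsetTerms field is a problem because some lines contain two or more words. How can I split out the words and pair each one with its PosScore and NegScore?
For example, the fifth data row has two words, abducting#1 and abducent#1, and both share the same scores.
So what regular expression would give me each word with its scores, one per line, like this: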
Word PosScore NegScore
abducting#1 0 0.23
abducent#1 0 0.23
Answer 0 (score: 5)
A non-regex version based on string splitting is probably easier:
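// needs: using System.Globalization; (for CultureInfo)
var data =
    lines.Split(new[] {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
        .Skip(1)
        .Select(line => line.Split('\t'))
        .SelectMany(parts => parts[4].Split().Select(word => new
        {
            ID = parts[1],
            Word = word,
            // invariant culture so "0.125" parses the same way on every locale
            PosScore = decimal.Parse(parts[2], CultureInfo.InvariantCulture),
            NegScore = decimal.Parse(parts[3], CultureInfo.InvariantCulture)
        }));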
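To make the splitting approach above easier to try, here is a minimal self-contained sketch. It assumes the real file is tab-separated, as the answer does. The class name SynsetSplitSketch, the helper Flatten, the length check on each row, and the two-line sample string are all made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class SynsetSplitSketch
{
    // Hypothetical helper: returns one (Word, Pos, Neg) tuple per synset term.
    public static List<(string Word, decimal Pos, decimal Neg)> Flatten(string text)
    {
        return text
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Skip(1)                                // skip the header row
            .Select(line => line.Split('\t'))
            .Where(parts => parts.Length >= 5)      // assumption: silently ignore short rows
            .SelectMany(parts => parts[4]
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(word => (
                    Word: word,
                    Pos: decimal.Parse(parts[2], CultureInfo.InvariantCulture),
                    Neg: decimal.Parse(parts[3], CultureInfo.InvariantCulture))))
            .ToList();
    }

    public static void Main()
    {
        // Sample rows written with explicit tabs (an assumption about the real file).
        string sample =
            "POS\tID\tPosScore\tNegScore\tSynsetTerms\tGloss\n" +
            "a\t00002956\t0\t0.23\tabducting#1 abducent#1\tespecially of muscles\n";

        // Prints one line per term; number formatting follows the current culture.
        foreach (var row in Flatten(sample))
            Console.WriteLine($"{row.Word}\t{row.Pos}\t{row.Neg}");
    }
}
```

Every term in a row repeats that row's scores, which is the Word / PosScore / NegScore layout the question asks for.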
Answer 1 (score: 1)
You can use this regex:
^(?<pos>\w+)\s+(?<id>\d+)\s+(?<pscore>\d+(?:\.\d+)?)\s+(?<nscore>\d+(?:\.\d+)?)\s+(?<terms>(?:.*?#[^\s]*)+)\s+(?<gloss>.*)$
You can build a list like this:
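// RegexOptions.Multiline makes ^ and $ match at each line, not only at the ends of the input
var lst = Regex.Matches(input, regex, RegexOptions.Multiline)
    .Cast<Match>()
    .Select(x =>
        new
        {
            pos = x.Groups["pos"].Value,
            pscore = x.Groups["pscore"].Value,
            nscore = x.Groups["nscore"].Value,
            terms = Regex.Split(x.Groups["terms"].Value, @"\s+"),
            gloss = x.Groups["gloss"].Value
        }
    );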
Now you can iterate over it:
foreach (var temp in lst)
{
    Console.WriteLine(temp.pos);
    // you can now iterate over terms
    foreach (var t in temp.terms)
    {
        Console.WriteLine(t);
    }
}
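If the goal is the Word / PosScore / NegScore layout from the question, the pattern from this answer can be combined with SelectMany. The following is only a sketch under some assumptions: the class name SynsetRegexSketch and the helper WordScoreRows are invented for illustration, RegexOptions.Multiline is used so that ^ and $ apply to each line, and the gloss is assumed not to contain a '#' character (otherwise the greedy terms group would swallow part of it).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SynsetRegexSketch
{
    // The pattern from the answer, unchanged.
    private const string Pattern =
        @"^(?<pos>\w+)\s+(?<id>\d+)\s+(?<pscore>\d+(?:\.\d+)?)\s+(?<nscore>\d+(?:\.\d+)?)\s+(?<terms>(?:.*?#[^\s]*)+)\s+(?<gloss>.*)$";

    // Hypothetical helper: one "word pscore nscore" string per synset term.
    public static List<string> WordScoreRows(string input)
    {
        return Regex.Matches(input, Pattern, RegexOptions.Multiline)
            .Cast<Match>()
            .SelectMany(m => Regex
                .Split(m.Groups["terms"].Value.Trim(), @"\s+")
                .Select(term => term + " "
                    + m.Groups["pscore"].Value + " "
                    + m.Groups["nscore"].Value))
            .ToList();
    }

    public static void Main()
    {
        string sample =
            "POS ID PosScore NegScore SynsetTerms Gloss\n" +
            "a 00002956 0 0.23 abducting#1 abducent#1 especially of muscles\n";

        // The header row does not match the pattern, so it is skipped automatically.
        foreach (var row in WordScoreRows(sample))
            Console.WriteLine(row);
        // prints:
        // abducting#1 0 0.23
        // abducent#1 0 0.23
    }
}
```

The scores are kept as the matched strings here, so no culture-dependent parsing is involved; convert them with decimal.Parse and CultureInfo.InvariantCulture if numeric values are needed.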