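I'm new to regular expressions.
I have strings in formats such as:
1) 3.72 million people (country rank: 6th) (2004 estimate)
2) 10000 people (2007 estimate)
I want to extract the population and the date from both kinds of string. How can I do this with a regular expression in C#? Or do I need to write more than one regex?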
Answer 0 (score: 3)
Here's a starting point:
    (?<population>\d+(\.\d+)?)           # capturing group named "population":
                                         # a number, optionally followed by a
                                         # decimal point and at least one digit
    \s*                                  # followed by optional whitespace
    (?<magnitude>thousand|(m|b)illion)?  # optional capturing group named "magnitude"
                                         # that matches "thousand", "million", or "billion"
    \s*                                  # optional whitespace
    people                               # the literal "people"
    .*                                   # any number of characters
    \(                                   # a literal opening parenthesis...
    (?<year>\d{4})                       # ...followed by a four-digit year...
    \s                                   # ...followed by one whitespace character...
    estimate\)                           # ...followed by the literal "estimate)"
    \s*$                                 # followed by optional whitespace
                                         # and the end of the string
A simple driver program showing usage:
    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Text.RegularExpressions;

    class Program
    {
        /// Generate test strings
        static IEnumerable<string> Generator()
        {
            yield return "3.72 million people (country rank: 6th) (2004 estimate)";
            yield return "10000 people (2007 estimate)";
        }

        public static void Main()
        {
            string expression = @"
                (?<population>\d+(\.\d+)?)           # capturing group named 'population':
                                                     # a number, optionally followed by a
                                                     # decimal point and at least one digit
                \s*                                  # followed by optional whitespace
                (?<magnitude>thousand|(m|b)illion)?  # optional capturing group named 'magnitude'
                                                     # that matches 'thousand', 'million', or 'billion'
                \s*                                  # optional whitespace
                people                               # the literal 'people'
                .*                                   # any number of characters
                \(                                   # a literal opening parenthesis...
                (?<year>\d{4})                       # ...followed by a four-digit year...
                \s                                   # ...followed by one whitespace character...
                estimate\)                           # ...followed by the literal 'estimate)'
                \s*$                                 # followed by optional whitespace
                                                     # and the end of the string";
            RegexOptions options =
                RegexOptions.IgnorePatternWhitespace   // allow whitespace/comments in the pattern
                | RegexOptions.IgnoreCase
                | RegexOptions.ExplicitCapture;        // only capture named groups
            Regex r = new Regex(expression, options);

            foreach (var test in Generator())
            {
                Match match = r.Match(test);
                if (!match.Success)
                {
                    Console.WriteLine("Could not match {0}", test);
                }
                else
                {
                    // Parse with the invariant culture so "3.72" is read the same on every machine.
                    double population = double.Parse(
                        match.Groups["population"].Value, CultureInfo.InvariantCulture);

                    // magnitude is optional, but if present the population must be scaled
                    if (match.Groups["magnitude"].Success)
                    {
                        switch (match.Groups["magnitude"].Value.ToLowerInvariant())
                        {
                            case "thousand": population *= 1000; break;
                            case "million": population *= 1E6; break;
                            case "billion": population *= 1E9; break;
                            default: throw new FormatException("Unexpected value in magnitude group");
                        }
                    }

                    int year = int.Parse(match.Groups["year"].Value, CultureInfo.InvariantCulture);
                    Console.WriteLine("In {0}, population was {1} people.", year, population);
                }
            }
        }
    }
Output:
In 2004, population was 3720000 people.
In 2007, population was 10000 people.
Answer 1 (score: 2)
Try:
    (?<number>\d+(?:\.\d+)?)(?: million)? people(?: \(country rank: 6th\))? \((?<year>\d+) estimate\)
Tested at http://regexhero.net/tester/, it matches both of your sample strings.
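To show how the named groups from a pattern of this shape could be read back in C#, here is a minimal sketch. It assumes the decimal point is escaped and the fractional part is optional; the class name `Answer1Demo` and the method `TryExtract` are hypothetical names used only for illustration.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class Answer1Demo  // hypothetical name, for illustration only
{
    // Assumption: escaped dot and optional fraction, otherwise the same shape as the pattern above.
    static readonly Regex Pattern = new Regex(
        @"(?<number>\d+(?:\.\d+)?)(?: million)? people(?: \(country rank: 6th\))? \((?<year>\d+) estimate\)");

    // Returns true and fills number/year when the input matches; otherwise returns false.
    public static bool TryExtract(string input, out string number, out int year)
    {
        Match m = Pattern.Match(input);
        number = m.Success ? m.Groups["number"].Value : string.Empty;
        year = m.Success
            ? int.Parse(m.Groups["year"].Value, CultureInfo.InvariantCulture)
            : 0;
        return m.Success;
    }

    static void Main()
    {
        string[] samples =
        {
            "3.72 million people (country rank: 6th) (2004 estimate)",
            "10000 people (2007 estimate)"
        };
        foreach (string sample in samples)
        {
            string number;
            int year;
            if (TryExtract(sample, out number, out year))
                Console.WriteLine("number={0}, year={1}", number, year);
            else
                Console.WriteLine("No match: {0}", sample);
        }
    }
}
```

Note that " million" is matched but not captured here, so the `number` group holds only the digits and any scaling has to be handled separately.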
Answer 2 (score: 1)
If your input follows this pattern, try the regex below:
[population/number and text] people [some text] ([date] estimate)
Regex:
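    // ".*" (not ".+") so the pattern also matches when "(" comes right after "people "
    var match = Regex.Match(inputString,
        @"(?<number>[\.\d]+(\s+\w+)?)\s+people .*\((?<date>\d+)\s+estimate\)");
    var population = match.Groups["number"].Value;  // e.g. "3.72 million" or "10000"
    var date = match.Groups["date"].Value;          // e.g. "2004"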
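The `number` group above captures the figure together with any word that follows it (for example "3.72 million"), so it is still text. A minimal sketch of turning it into a numeric value might look like the following. It assumes only the words thousand, million and billion occur and that decimals use a dot; `PopulationText` and `ParsePopulation` are hypothetical names, not part of any library.

```csharp
using System;
using System.Globalization;

static class PopulationText  // hypothetical helper, for illustration only
{
    // Converts text such as "3.72 million" or "10000" into a decimal value.
    public static decimal ParsePopulation(string text)
    {
        string[] parts = text.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length == 0)
            throw new FormatException("Empty population text");

        decimal value = decimal.Parse(
            parts[0], NumberStyles.AllowDecimalPoint, CultureInfo.InvariantCulture);

        if (parts.Length == 1)
            return value;  // no magnitude word present

        // Assumption: these three words are the only magnitudes that appear.
        switch (parts[1].ToLowerInvariant())
        {
            case "thousand": return value * 1000m;
            case "million": return value * 1000000m;
            case "billion": return value * 1000000000m;
            default: throw new FormatException("Unexpected magnitude: " + parts[1]);
        }
    }

    static void Main()
    {
        Console.WriteLine(ParsePopulation("3.72 million"));
        Console.WriteLine(ParsePopulation("10000"));
    }
}
```

Using `decimal` rather than `double` is a design choice here: it keeps values like 3.72 exact when they are scaled.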
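Answer 3 (score: 1)
You may need two regexes, since you want to handle the two values differently. I copied your two lines, including the "1)" and "2)". Here is the one for the population (the pattern begins with a literal space):
     \d+(?!\w)\.?(?=\d*)\d*
It matches a space followed by one or more digits, but only if those digits are not followed by a word character; then an optional dot; then, only if zero or more digits come next, those digits. Words such as million or thousand are not handled, so you would have to replace them with the corresponding zeros yourself.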
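Then the date part:
    (?:\()\d{4}(?!\d)
It matches an opening parenthesis without capturing it, then four digits, provided the character after them is not a digit. Note that the parenthesis is still part of the overall match.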
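Hope that helps. To be honest, I don't know C# well; I tested this in JavaScript.
Edit: others have posted more complete answers that are actually in C#. Go look at those.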
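For reference, the two patterns from this answer could be tried from C# roughly as follows. This is only a sketch: it assumes the population pattern starts with a literal space as described, and the names `TwoPatternDemo`, `FindPopulation` and `FindYear` are hypothetical. It does not handle words such as "million".

```csharp
using System;
using System.Text.RegularExpressions;

static class TwoPatternDemo  // hypothetical name, for illustration only
{
    // Population: a literal space, digits not followed by a word character, optional fraction.
    static readonly Regex PopulationPattern = new Regex(@" \d+(?!\w)\.?(?=\d*)\d*");

    // Year: "(" followed by four digits that are not followed by a fifth digit.
    static readonly Regex YearPattern = new Regex(@"(?:\()\d{4}(?!\d)");

    public static string FindPopulation(string line)
    {
        Match m = PopulationPattern.Match(line);
        // The match includes the leading space, so trim it off.
        return m.Success ? m.Value.Trim() : string.Empty;
    }

    public static string FindYear(string line)
    {
        Match m = YearPattern.Match(line);
        // The non-capturing group still consumes "(", so strip it from the match.
        return m.Success ? m.Value.TrimStart('(') : string.Empty;
    }

    static void Main()
    {
        string[] lines =
        {
            "1) 3.72 million people (country rank: 6th) (2004 estimate)",
            "2) 10000 people (2007 estimate)"
        };
        foreach (string line in lines)
        {
            Console.WriteLine("population={0}, year={1}",
                FindPopulation(line), FindYear(line));
        }
    }
}
```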