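I am working in VB.NET and trying to extract the year and the country from random sentences, but only if both are present.
My input looks like this: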
This is just the title and has no year or country:
Preamble with only year 1999 and no country:
I was born in 1990 in Canada, I was born to love, and be loved.
She was born in 2000 in Malaysia and she likes fishing.
My mother was born in South Africa and she love all her sons and daughters, she was born in 1960.
My Dad was born in a small village in France in 1955. He loves my Mom.
and finally thanks from USA, without a year.
From the input above, I would like to get the following output:
***EMPTY
***EMPTY
1990 - Canada
2000 - Malaysia
1960 - South Africa
1955 - France
***EMPTY
I have spent the whole morning reading about regex, and I think it could do the job, but I gave up.
Can anyone help?
Thanks in advance.
Answer 0 (score: 5)
Assuming you can build a list of countries, you can combine it into an alternation, like so:
(Canada|Malaysia|France|South Africa)
A long list should be optimized, but that is another story (see below).
Then you can use a regex like this:
^(?=.*(\b\d{4}\b))(?=.*\b(Canada|Malaysia|France|South Africa)\b)
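This captures the year into Group 1 and the country into Group 2. The two lookaheads are anchored at the start of the line, so a line matches only when it contains both a four-digit number and a listed country, in either order. In the regex demo, see the captures in the right-hand pane.
Captures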
1990 Canada
2000 Malaysia
1960 South Africa
1955 France
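As a rough sketch of how the pattern above could be applied line by line, here is a small Python example. Python is used here only for illustration; the pattern relies on lookaheads, \b and \d, which the .NET Regex class also supports, so it should port to VB.NET. The names COUNTRIES, PATTERN and extract are hypothetical and not part of the original answer.

```python
import re

# Assumed country list; a real one would be much longer.
COUNTRIES = ["Canada", "Malaysia", "France", "South Africa"]

# Two lookaheads anchored at the start of the line:
# group 1 is a four-digit year, group 2 is a listed country.
PATTERN = re.compile(
    r"^(?=.*(\b\d{4}\b))(?=.*\b("
    + "|".join(re.escape(c) for c in COUNTRIES)
    + r")\b)"
)

def extract(line):
    """Return 'year - country' if the line has both, else '***EMPTY'."""
    m = PATTERN.search(line)
    return f"{m.group(1)} - {m.group(2)}" if m else "***EMPTY"

lines = [
    "Preamble with only year 1999 and no country:",
    "I was born in 1990 in Canada, I was born to love, and be loved.",
    "My Dad was born in a small village in France in 1955. He loves my Mom.",
    "and finally thanks from USA, without a year.",
]
for line in lines:
    print(extract(line))
```

Because "USA" is not in the assumed list, the last line is reported as empty, which matches the output requested in the question.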
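Optimizing the country list
First, you need to organize the list so that if one country name is a substring of another (for instance the two Guineas and Guinea-Bissau, Sudan and South Sudan, Dominica and the Dominican Republic), the longest name comes first, so that it has a chance to match.
You also need to know your input. For instance, do you need to account for variants of the same name, such as a shortened form alongside the United States of America?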
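The ordering issue can be seen in a minimal Python sketch. The helper build_alternation is a hypothetical name, and sorting by length is just one simple way to put longer names first; it is not necessarily how the answer's author would do it.

```python
import re

def build_alternation(names):
    """Join names into an alternation, longest first, so that a name
    is tried before any shorter name that is a prefix of it."""
    ordered = sorted(names, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(n) for n in ordered) + ")"

text = "She was born in Guinea-Bissau in 1980."

# Shorter name listed first: \b holds between "a" and "-",
# so the engine stops at "Guinea".
naive = re.search(r"\b(?:Guinea|Guinea-Bissau)\b", text).group(0)

# Longest name first: the full name gets its chance to match.
alternation = build_alternation(["Guinea", "Guinea-Bissau"])
fixed = re.search(r"\b" + alternation + r"\b", text).group(0)

print(naive)
print(fixed)
```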
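Also, you would want Fairyland and Fantasyland to become Fa(?:ir|ntas)yland, which helps the engine match (or fail) faster. With a list of 256 countries, building such an optimized alternation by hand is a challenge, but there are tools that can help. regex-opt and Regex::Assemble come to mind.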
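To give an idea of what such tools do, here is a simplified Python sketch that factors out common prefixes by building a trie. It is an illustration only, not the algorithm used by regex-opt or Regex::Assemble, and it does not factor common suffixes, so it yields Fa(?:iryland|ntasyland) rather than the shorter Fa(?:ir|ntas)yland. The function names are hypothetical.

```python
import re

def build_trie(words):
    """Build a nested-dict trie; the empty-string key marks end of word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Turn a trie node into a regex fragment with shared prefixes factored."""
    if "" in node and len(node) == 1:
        return ""  # nothing follows the end-of-word marker
    alternatives = []
    optional = False
    for ch in sorted(node):
        if ch == "":
            optional = True  # a word ends here but longer words continue
            continue
        alternatives.append(re.escape(ch) + trie_to_regex(node[ch]))
    if len(alternatives) == 1 and not optional:
        return alternatives[0]
    body = "(?:" + "|".join(alternatives) + ")"
    return body + "?" if optional else body

def optimize(words):
    return trie_to_regex(build_trie(words))

print(optimize(["Fairyland", "Fantasyland"]))
```

A side effect of the trie is that a name that is a prefix of another becomes an optional, greedy tail, so the longer name is tried first without any explicit sorting.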