Question

I have a file (.txt) that I want to parse. The lines look like this =>

Name on Company
Street 7 CITY phone: 1234 - 56 78 91 Webpage: www.webpage.se
http://www.webpage.se

Name on Restaurant
Street 11 CITY CITY phone: 7023 - 51 83 83 Webpage:
http://

The problem is with my regex when I try to match the city (in uppercase letters). So far I have come up with this =>

preg_match('/\b[A-ZÅÄÖ]{2,}[ \t][A-ZÅÄÖ]+|[A-ZÅÄÖ]{2,}\b/', $info, $city);

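As you can see, these are Swedish cities, so I am working with A-ZÅÄÖ. But if the last character of the city name is one of "ÅÄÖ", this regex does not work: in that case it only captures the characters before it.

Does anyone see the problem?

Thanks in advance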

Answer 1

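FWIW, this seems like a perfect place to use http://txt2re.com to develop and test your regex from a sample.

That said, nothing in the regex itself appears to make it skip the trailing ÅÄÖ characters. They are treated no differently from other alphabetic characters.

I suspect a Unicode issue. Perhaps the trailing Ä in the input data is stored as an A followed by a separate diaeresis combining character. The workaround is to normalize the Unicode string before applying the regex.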
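A minimal sketch of that normalization step, assuming the intl extension (which provides the Normalizer class) is installed. The helper name toNfc and the sample string are made up for illustration; if intl is missing, the helper just returns the input unchanged.

```php
<?php
// Hypothetical helper: compose "letter + combining mark" pairs into single
// code points (Unicode NFC) so that a class like [A-ZÅÄÖ] can match them.
function toNfc(string $s): string
{
    if (!class_exists('Normalizer')) {
        return $s; // intl not available: leave the string untouched
    }
    $normalized = Normalizer::normalize($s, Normalizer::FORM_C);
    return $normalized === false ? $s : $normalized;
}

// "A" followed by U+030A COMBINING RING ABOVE, i.e. a decomposed "Å".
$decomposed = "UMEA\u{030A}";
$composed = toNfc($decomposed);

// With intl loaded, the last two code points collapse into U+00C5 ("Å").
var_dump(strlen($decomposed), strlen($composed));
```

Run this on the input line before handing it to preg_match. It only helps if the data really is stored in decomposed form.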

Also, as Amber pointed out, the problem may lie in how \b defines a word boundary. The docs say: A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w. So changing the locale may help.

Alternatively, if the input is UTF-8, you can try setting the u pattern modifier.
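To illustrate the u modifier, here is a small sketch that runs the pattern from the question with and without it. It assumes the script file and the input are both UTF-8, and the sample line (with UMEÅ as the city) is made up. As far as I know, PHP also makes \w and \b Unicode-aware when u is set, which is what lets the trailing Å count as a word character.

```php
<?php
// Hypothetical sample line; this file must be saved as UTF-8.
$info = 'Street 7 UMEÅ phone: 1234 - 56 78 91 Webpage: www.webpage.se';

// Original pattern: works on bytes, not characters. "Å" is two bytes in
// UTF-8 and neither counts as a word character, so \b cannot match after it.
preg_match('/\b[A-ZÅÄÖ]{2,}[ \t][A-ZÅÄÖ]+|[A-ZÅÄÖ]{2,}\b/', $info, $byteMode);

// Same pattern with the u modifier: pattern and subject are read as UTF-8.
preg_match('/\b[A-ZÅÄÖ]{2,}[ \t][A-ZÅÄÖ]+|[A-ZÅÄÖ]{2,}\b/u', $info, $utf8Mode);

var_dump($byteMode[0]); // "UME" under the default C locale
var_dump($utf8Mode[0]); // "UMEÅ"
```

The first call reproduces the symptom from the question: the engine backs off to the last position where \b holds, which is right after the E.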

Answer 2

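Your problem is that \b is defined as matching the boundary between a character that is in \w and a character that is not in \w.

Your Swedish-specific characters are not in [a-zA-Z0-9_] (which is what \w is usually equivalent to).

You can replace \b with appropriate lookaround assertions.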
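One possible way to write those lookarounds, as a sketch rather than the definitive pattern. It assumes UTF-8 input and a UTF-8 script file, uses the u modifier so that the character class and \p{L} work on whole characters, and requires at least two letters in the optional second word (the original pattern allowed one). The sample line is made up.

```php
<?php
// Hypothetical sample line; this file must be saved as UTF-8.
$info = 'Street 11 NORRA UMEÅ phone: 7023 - 51 83 83 Webpage:';

// (?<!\p{L}) and (?!\p{L}) stand in for \b: no letter directly before
// or after the match. The optional group allows a second uppercase word.
$pattern = '/(?<!\p{L})[A-ZÅÄÖ]{2,}(?:[ \t][A-ZÅÄÖ]{2,})?(?!\p{L})/u';

if (preg_match($pattern, $info, $city)) {
    echo $city[0], PHP_EOL; // NORRA UMEÅ
}
```

Because the boundary check is spelled out with \p{L}, it no longer depends on what \w happens to mean in the current locale.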

preg_match with a regex is losing the last character
