Question

I've tried several different approaches, but none of them work.

Suppose my string is defined as follows:

s = '[မန္း],[aa]'.decode('utf-8')

Suppose I want to parse the two strings inside the square brackets. I compiled the following regular expression:

pattern = re.compile(r'\[(\w+)\]', re.UNICODE)

Then I look for occurrences using:

pattern.findall(s, re.UNICODE)

The result is basically just [] instead of the expected list of two matches. Furthermore, if I remove re.UNICODE from the findall call, I get a single string, namely the non-Unicode one:

[u'aa']

Of course,

s = '[bb],[aa]'.decode('utf-8')
pattern.findall(s)

returns

[u'bb', u'aa']

To make things more interesting,

s = '[မနbb],[aa]'.decode('utf-8')
pattern.findall(s)

returns

[u'\u1019\u1014bb', u'aa']

Answer 1

It's actually quite simple. \w matches alphanumeric characters, and not all of the characters in your initial string are alphanumeric.
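To see which characters are the problem, here is a small sketch in Python 3 (an assumption on my part, since the code in this thread is Python 2; in Python 3 every str is already Unicode). The names word and report are only illustrative. It lists the Unicode category of each character and whether str.isalnum() accepts it, which, as far as I know, is the same classification CPython's \w relies on, plus the underscore.

```python
import unicodedata

# The bracketed Myanmar text from the question, written with escapes
# so the code points are unambiguous.
word = "\u1019\u1014\u1039\u1038"

# One (code point, Unicode category, isalnum) tuple per character.
report = [
    ("U+%04X" % ord(ch), unicodedata.category(ch), ch.isalnum())
    for ch in word
]

for code_point, category, is_alnum in report:
    print(code_point, category, is_alnum)
```

The first two characters are letters, while the last two are combining marks, so a run of \w stops after the second character and never reaches the closing bracket.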

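If you still want to match everything inside the brackets, one solution is to match every character except the closing bracket (]). This can be done as follows: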

import re
s = '[မန္း],[aa]'.decode('utf-8')
pattern = re.compile(r'\[([^]]+)\]', re.UNICODE)
re.findall(pattern, s)

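[^]] creates a character class that matches any character except the ones listed after the caret (^), which here is only the closing bracket.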
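For comparison, a minimal Python 3 sketch (again an assumption, the thread itself targets Python 2) that runs both patterns against the same input. The variable names word_matches and class_matches are made up for this illustration, and the closing bracket inside the class is escaped only for readability.

```python
import re

# Same input as the question, with the Myanmar text written as escapes.
s = "[\u1019\u1014\u1039\u1038],[aa]"

# \w+ cannot cross the combining marks, so the first bracket pair is skipped.
word_matches = re.findall(r"\[(\w+)\]", s)

# [^\]]+ accepts any character that is not a closing bracket.
class_matches = re.findall(r"\[([^\]]+)\]", s)

print(word_matches)
print(class_matches)
```

The negated class does not care how Unicode classifies a character, which is why it also picks up the first bracketed string.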

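Also note that the re.UNICODE argument to re.compile is not needed here: the flag changes what classes such as \w match, and this pattern no longer uses any of them.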
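A related detail that probably explains the empty list in the question: the findall method of a compiled pattern takes (string, pos, endpos), so a flag passed as its second argument is read as a start position, not as a flag. The sketch below assumes Python 3 and uses the illustrative names start, past_end and normal.

```python
import re

s = "[bb],[aa]"
pattern = re.compile(r"\[(\w+)\]")

# Flags belong in re.compile (or the module-level re.findall). The second
# argument of a compiled pattern's findall is `pos`, the index to start at.
start = int(re.UNICODE)                    # integer value of the flag
past_end = pattern.findall(s, re.UNICODE)  # starts searching at index `start`
normal = pattern.findall(s)                # searches the whole string

print(start, len(s), past_end, normal)
```

Because the flag's integer value is larger than the length of the string, the search begins past the end and finds nothing, even for plain ASCII input.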

Answer 2

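First, note that the following only works in Python 2.x if you have saved the source file in UTF-8 and declared the source encoding at the top of the file. Otherwise, the source encoding is assumed to be ASCII: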

#coding: utf8
s = '[မန္း],[aa]'.decode('utf-8')

A simpler way is to write the Unicode string directly:

#coding: utf8
s = u'[မန္း],[aa]'

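Next, \w matches alphanumeric characters. With the re.UNICODE flag, it matches characters that the Unicode database classifies as alphanumeric. Not all of the characters in မန္း are alphanumeric. If you want whatever is between the brackets, use something like the following. Note the use of .*? for a non-greedy match. It is also a good habit to use Unicode strings for all text, and raw strings for regular expressions in particular.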

#coding:utf8
import re
s = u'[မန္း],[aa],[မနbb]'
pattern = re.compile(ur'\[(.*?)\]')
print re.findall(pattern,s)

Output:

[u'\u1019\u1014\u1039\u1038', u'aa', u'\u1019\u1014bb']

Note that Python 2 displays an unambiguous version of the strings inside a list, with escape codes for non-ASCII and non-printable characters.

To see the actual string content, print the strings, not the list:

for item in re.findall(pattern,s):
    print item

Output:

မန္း
aa
မနbb
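As for why the non-greedy .*? matters in the pattern above, here is a short Python 3 sketch (an assumption, since the answer's code is Python 2) comparing it with the greedy .* on the same input. The names lazy and greedy are illustrative only.

```python
import re

# Same three bracketed items as in the answer, written with escapes.
s = "[\u1019\u1014\u1039\u1038],[aa],[\u1019\u1014bb]"

# Non-greedy: each match stops at the first closing bracket it can reach.
lazy = re.findall(r"\[(.*?)\]", s)

# Greedy: the match runs from the first "[" to the very last "]".
greedy = re.findall(r"\[(.*)\]", s)

print(len(lazy), len(greedy))
for item in lazy:
    print(item)
```

With the greedy form the whole line collapses into a single match that still contains the inner brackets and commas, so the non-greedy quantifier (or a negated class such as [^]]+) is what keeps the items separate.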

Python Regex - Matching mixed Unicode and ASCII characters in a string
