Question

I want to return True or False depending on whether a string contains any of a few Unicode phrases (short sequences of characters).

words = ['你好朋友','我吃饭'] # equivalent to 'hello friend', 'I had lunch'
uwords = []
for word in words:
    uwords.append(unicode(word,'utf8'))

uwords # [u'\u4f60\u597d\u670b\u53cb', u'\u6211\u5403\u996d']

import re
string = '他吃饭，不是我' # 'he ate, not me'
usample = unicode(string, 'utf-8')
pattern = re.compile(u'[\b\u4f60\u597d\u670b\u53cb\b | \b\u6211\u5403\u996d\b]')
# pattern = re.compile(u'\u793e\u533a.*\u670d\u52a1') # [u'\u793e\u533a', u'\u670d\u52a1']
match = pattern.search(usample)

if match:
    print True
else:
    print False

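I expect this code to give False, but I get True. I think the pattern I pass to re.compile is wrong: it seems to match the Unicode characters individually rather than as a sequence.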

I assumed it would work the same way as the English case:

import re
string = 'rotten tomatoes are good'
pattern = re.compile('tomatoes are good | apples are good')
match = pattern.search(string)

if match:
    print True
else:
    print False

This one returns False, as I wanted.

Answer 1

Removing the spaces and the square brackets solves the problem:

words = ['你好朋友','我吃饭'] # equivalent to 'hello friend', 'I had lunch'
uwords = []
for word in words:
    uwords.append(unicode(word,'utf8'))

uwords # [u'\u4f60\u597d\u670b\u53cb', u'\u6211\u5403\u996d']

import re
string = '他吃饭，不是我' # 'he ate, not me'
usample = unicode(string, 'utf-8')
pattern = re.compile(u'\u4f60\u597d\u670b\u53cb|\u6211\u5403\u996d')
match = pattern.search(usample)

if match:
    print True
else:
    print False
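
Why the spaces matter can be seen with the English example from the question. This is a minimal sketch, assuming Python 3 (the thread itself uses Python 2); the variable names are made up for illustration. Spaces in a pattern are literal characters unless re.VERBOSE is set, so the spaced pattern only matches "good" when it is followed by a space.

```python
import re

text = 'rotten tomatoes are good'

# Spaces are literal: the first alternative needs a trailing space after
# "good", and the second needs a leading space before "apples".
spaced = re.compile('tomatoes are good | apples are good')

# Same alternatives with the spaces around '|' removed.
tight = re.compile('tomatoes are good|apples are good')

spaced_hit = bool(spaced.search(text))
tight_hit = bool(tight.search(text))
print(spaced_hit, tight_hit)  # False True
```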

Answer 2

The [abc] syntax means "match one of a, b, or c". Also, \b does not work for Chinese, so it is fine to remove it.
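
The difference between the bracketed pattern and a plain alternation can be checked directly. This is a small sketch, assuming Python 3 (where every str is Unicode, so no unicode() call or u prefix is needed); the names char_class and alternation are illustrative only.

```python
import re

sample = '他吃饭，不是我'

# Character class: matches any ONE of the listed characters.
# Inside [...], \b means backspace, and '|' and ' ' are ordinary characters.
char_class = re.compile(r'[\b你好朋友\b | \b我吃饭\b]')

# Alternation: matches either whole phrase as a sequence.
alternation = re.compile('你好朋友|我吃饭')

class_match = char_class.search(sample)
alt_match = alternation.search(sample)

print(class_match.group())  # 吃 (a single character from the class)
print(alt_match)            # None (neither full phrase occurs)
```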

Note that the code is shorter and more readable if you use Unicode string literals.

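The #coding statement declares the encoding of the source file, so the Unicode string literals are decoded correctly. Make sure to save the source file in the declared encoding.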

#coding:utf8
import re

uwords = [u'你好朋友',u'我吃饭'] # equivalent to 'hello friend', 'I had lunch'

usample = u'他吃饭，不是我' # 'he ate, not me'
usample2 = u'你好朋友， 你吃了吗？'

pattern = re.compile(u'你好朋友|我吃饭')

print bool(pattern.search(usample))
print bool(pattern.search(usample2))

Output:

False
True
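
If the phrases live in a list, as uwords does above, the pattern can be built from it instead of being typed by hand. The following is a sketch under stated assumptions, not part of the original answer: it targets Python 3, and build_phrase_pattern and contains_phrase are hypothetical helper names.

```python
import re

def build_phrase_pattern(phrases):
    """Compile one pattern that matches any of the given phrases literally."""
    if not phrases:
        # An empty alternation would match every string, so refuse it.
        raise ValueError('at least one phrase is required')
    # Longest first, so a longer phrase wins over a shorter one it starts with.
    ordered = sorted(phrases, key=len, reverse=True)
    # re.escape keeps regex metacharacters in a phrase from being interpreted.
    return re.compile('|'.join(re.escape(p) for p in ordered))

def contains_phrase(pattern, text):
    """Return True if any phrase occurs anywhere in text."""
    return bool(pattern.search(text))

pattern = build_phrase_pattern(['你好朋友', '我吃饭'])
results = [
    contains_phrase(pattern, '他吃饭，不是我'),
    contains_phrase(pattern, '你好朋友， 你吃了吗？'),
]
print(results)  # [False, True]
```

For plain substring checks like this, any(p in text for p in phrases) would give the same answers without a regex; the compiled pattern is mainly useful when the phrase list is long or reused many times.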

Regex for non-English Unicode phrases using re.compile