Question

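I have several text strings containing a mix of characters: Burmese letters, Latin letters, and digits. I need to be able to split the text into separate categories. Here is an example: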

အေရာင္ဆန္းဆန္းေလး ေတြ ပါတဲ့ Enhancing Eyes shawdow palette ေလးပါ ။ 
Price - 17000 ks. Call 625555555

I can identify the numbers with a regular expression:

re.findall(r"\d+", data)

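But I can't figure out how to separate the two alphabets. The resulting split doesn't need to be coherent; I just need 2 separate dumps, one string of Burmese and one string of English. Does anyone have a suggestion for how to identify them?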

Answer 1

It seems you want output like the following.

>>> import re
>>> s = '''အေရာင္ဆန္းဆန္းေလး ေတြ ပါတဲ့ Enhancing Eyes shawdow palette ေလးပါ ။ 
Price - 17000 ks. Call 625555555'''
>>> re.findall(r'\d+|[^A-Za-z]+|[A-Za-z\s]+', s)
['အေရာင္ဆန္းဆန္းေလး ေတြ ပါတဲ့ ', 'Enhancing Eyes shawdow palette ', 'ေလးပါ ။ \n', 'Price ', '- 17000 ', 'ks', '. ', 'Call ', '625555555']
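If you need the two dumps rather than a list of tokens, one possible approach is to classify each character by its Unicode name. This is only a sketch: the helper name dump_by_script is made up for illustration, and it assumes every Burmese character has a Unicode name starting with "MYANMAR" and every English letter one starting with "LATIN". Digits and punctuation are simply dropped.

```python
import unicodedata

def dump_by_script(text):
    """Collect characters into 'MYANMAR' and 'LATIN' dumps by Unicode name prefix."""
    buckets = {"MYANMAR": [], "LATIN": []}
    for ch in text:
        if ch.isspace():
            # Keep whitespace as a word separator in both dumps.
            for chars in buckets.values():
                chars.append(" ")
            continue
        name = unicodedata.name(ch, "")  # "" for characters without a name
        for script, chars in buckets.items():
            if name.startswith(script):
                chars.append(ch)
    # Collapse the runs of spaces left behind by dropped characters.
    return {script: " ".join("".join(chars).split())
            for script, chars in buckets.items()}

dumps = dump_by_script("ပါတဲ့ Enhancing Eyes ေလးပါ\nPrice - 17000 ks.")
print(dumps["LATIN"])
print(dumps["MYANMAR"])
```

This is slower than a regex on long inputs, but it does not depend on getting a character range right.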

Answer 2

Like this:

import re
teststring = """အေရာင္ဆန္းဆန္းေလး ေတြ ပါတဲ့ Enhancing Eyes shawdow palette ေလးပါ ။ 
Price - 17000 ks. Call 625555555"""

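Numbers = re.findall(r"\d+", teststring)
Latin = re.findall(r"[A-Za-z]+", teststring)
# Anything that is not an ASCII letter, digit, or space; note this also picks up punctuation and newlines
Burmese = re.findall(r"[^A-Za-z0-9 ]+", teststring)
print(Numbers, Latin, Burmese)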
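A possible refinement, if the punctuation and newlines in the Burmese list are unwanted, is to match the Burmese characters directly instead of by exclusion. The sketch below assumes the text stays within the main Unicode Myanmar block (U+1000 to U+109F); the function name split_scripts is hypothetical. It uses [0-9] rather than \d because in Python 3 \d also matches Myanmar digits, and it joins the matches so you get one string per category.

```python
import re

# Assumption: all Burmese characters fall in the main Myanmar block (U+1000-U+109F).
MYANMAR = re.compile(r"[\u1000-\u109F]+")
LATIN = re.compile(r"[A-Za-z]+")
DIGITS = re.compile(r"[0-9]+")  # ASCII digits only; \d would also match Myanmar digits

def split_scripts(text):
    """Return (burmese, latin, numbers), each a space-joined string of matches."""
    burmese = " ".join(MYANMAR.findall(text))
    latin = " ".join(LATIN.findall(text))
    numbers = " ".join(DIGITS.findall(text))
    return burmese, latin, numbers

sample = "ပါတဲ့ Enhancing Eyes ေလးပါ ။\nPrice - 17000 ks. Call 625555555"
burmese, latin, numbers = split_scripts(sample)
print(burmese)
print(latin)
print(numbers)
```

Text that uses the extended Myanmar blocks would need additional ranges in the MYANMAR pattern.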

Parsing text with Latin letters in Python
