Question

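I have many files that contain both Urdu and English text, and I need to find the words that are written in Urdu only. For English I know a regular expression such as r'[a-zA-Z]' works without trouble, but how do I write a regular expression that matches Urdu?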

Suppose this is the string:

test="working جنگ test  بندی کروانا not good"

Please advise.

Answer 1

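Combining this question with this information and applying it to Urdu, the following appears to be the solution:

Arabic-Indic digit code points: U+0660 - U+0669

Arabic script code points (the Unicode Arabic block, which covers the Urdu letters): U+0600 - U+06FF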
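As a quick sanity check of the range above, the following Python 3 sketch prints the code point of each distinct Urdu letter in the sample string and confirms that it lies between U+0600 and U+06FF. The helper name `in_arabic_block` is hypothetical and introduced only for illustration.

```python
def in_arabic_block(ch):
    """Return True if ch lies in the Unicode Arabic block (U+0600 to U+06FF)."""
    return 0x0600 <= ord(ch) <= 0x06FF


test = "working جنگ test  بندی کروانا not good"

# Print the code point of each distinct character that falls inside the block.
for ch in sorted(set(test)):
    if in_arabic_block(ch):
        print(f"U+{ord(ch):04X} {ch}")
```

Latin letters and spaces fall outside the range, so only the Urdu letters are printed.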

In Python 3 this is very simple.

Use this expression:

r'[\u0600-\u06ff]'

Example:

>>> test="working جنگ test  بندی کروانا not good"
>>> test
'working جنگ test  بندی کروانا not good'
>>> import re
>>> re.findall(r'[\u0600-\u06ff]',test)
['ج', 'ن', 'گ', 'ب', 'ن', 'د', 'ی', 'ک', 'ر', 'و', 'ا', 'ن', 'ا']

By adding the + quantifier (one or more repetitions), you get whole words instead of single letters.

>>> re.findall(r'[\u0600-\u06ff]+',test)
['جنگ', 'بندی', 'کروانا']
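For repeated use, the expression can be compiled once and wrapped in a small function. This is a minimal Python 3 sketch, assuming that a run of characters from the Arabic block is an acceptable definition of an Urdu word; the names `URDU_WORD` and `find_urdu_words` are hypothetical.

```python
import re

# Assumption: an "Urdu word" is any run of characters from U+0600 to U+06FF.
URDU_WORD = re.compile(r'[\u0600-\u06ff]+')


def find_urdu_words(text):
    """Return every run of Arabic-block characters found in text, in order."""
    return URDU_WORD.findall(text)


test = "working جنگ test  بندی کروانا not good"
print(find_urdu_words(test))  # ['جنگ', 'بندی', 'کروانا']
```

Note that the block also contains digits and punctuation such as the Arabic comma (U+060C), so those characters would be matched as part of a word as well.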

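Update: a version that works in Python 2.7

In Python 2.x, Unicode is harder to work with. You must prefix the regular expression with ur to mark it as a raw Unicode string, and then it will find the correct glyphs. The string being searched must be a unicode literal (u"...") too, and the first line of your script should be the encoding declaration, as in this script: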

# -*- coding: utf-8 -*-
import re

test = u"working جنگ test  بندی کروانا not good"
# Join every matched Arabic-block character into one unicode string (Python 2 syntax).
myurdu = u"".join(re.findall(ur'[\u0600-\u06ff]', test))
print myurdu

Output:

جنگبندیکروانا

For details, see "declaring an encoding" and "Unicode support in Python". Consider switching to Python 3: if you have to process a lot of Urdu, its Unicode support is much better.
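Since the question is about files, here is a hedged Python 3 sketch of the same search applied to a file. In Python 3 neither the encoding declaration nor the u/ur prefixes are needed, but the file should be opened with an explicit encoding. The sketch assumes the files are UTF-8; the function name `urdu_words_in_file` and the file name `sample.txt` are hypothetical.

```python
import re
import tempfile
from pathlib import Path

URDU_RUN = re.compile(r'[\u0600-\u06ff]+')


def urdu_words_in_file(path):
    """Return all runs of Arabic-block characters in a UTF-8 text file."""
    text = Path(path).read_text(encoding='utf-8')  # assumption: file is UTF-8
    return URDU_RUN.findall(text)


# Demo with a temporary file so that the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'sample.txt'
    sample.write_text("working جنگ test  بندی کروانا not good", encoding='utf-8')
    words = urdu_words_in_file(sample)

print(words)
print("".join(words))
```

If the files use another encoding, the `encoding` argument would have to be changed accordingly.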

Answer 2

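Another way to solve the problem is to split the string on spaces and keep the tokens that contain Urdu characters: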

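# -*- coding: utf-8 -*-
import re

test = u"working جنگ test  بندی کروانا not good"
tokens = test.split(' ')
for w in tokens:
    # Print each token that contains at least one Arabic-block character.
    status = re.search(ur'[\u0600-\u06ff]+', w)
    if status:
        print w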

This works with Python 2.7.
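The same token-filtering idea can be sketched in Python 3, where plain string literals are already Unicode. This is an illustrative translation, not the answer author's code; the names `URDU_CHAR` and `tokens_with_urdu` are hypothetical.

```python
import re

URDU_CHAR = re.compile(r'[\u0600-\u06ff]')


def tokens_with_urdu(text):
    """Return the whitespace-separated tokens containing an Arabic-block character."""
    # split() without arguments collapses repeated spaces, so no empty tokens appear.
    return [tok for tok in text.split() if URDU_CHAR.search(tok)]


test = "working جنگ test  بندی کروانا not good"
for word in tokens_with_urdu(test):
    print(word)
```

Unlike the findall approach, this keeps a mixed token whole: a token that combines Latin and Urdu characters is returned in full rather than reduced to its Urdu part.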

How to find Urdu words in a string using Python
