Question

I am using regexp_tokenize() to return tokens from Arabic text without any punctuation marks:

import re,string,sys
from nltk.tokenize import  regexp_tokenize

def PreProcess_text(Input):
  tokens=regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
  return tokens

H = raw_input('H:')
Cleand= PreProcess_text(H)
print  '\n'.join(Cleand)

It works fine, but the problem shows up when I try to print the text.

Output for the text ايمان،سعد:

    ?يم
    ?ن
    ?
    ?
    ?

But if the text is in English, the correct result is printed, even when it contains Arabic punctuation marks.

Output for the text hi،eman:

     hi
     eman

Answer 1

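When you use raw_input, the characters you type are returned encoded as bytes, not as a Unicode string.

You need to convert the input to a Unicode string with: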

H.decode('utf8')
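To see why the undecoded input prints as question marks, here is a minimal sketch using the standard library's re module directly instead of regexp_tokenize (an assumption made to keep the example self-contained). It assumes the terminal sends UTF-8, and the variable names are illustrative. Every Arabic punctuation mark in the pattern starts with the byte 0xD8, and so do many Arabic letters, so a byte-level character class cuts letters in half:

```python
# -*- coding: utf-8 -*-
import re

text = u'ايمان،سعد'
# Stand-in for what raw_input() returns in Python 2 on a UTF-8 terminal.
raw = text.encode('utf8')

# A byte-level pattern: the character class holds the individual UTF-8
# bytes of the punctuation marks, not the marks themselves.
byte_pattern = re.compile(b'[' + u'،؟!.؛'.encode('utf8') + b']\\s*')

# Split on the pattern and drop empty pieces, as gaps=True would.
byte_pieces = [p for p in byte_pattern.split(raw) if p]

# Each piece now starts with an orphaned continuation byte, which cannot
# be decoded and is shown as a replacement character.
shown = [p.decode('utf8', 'replace') for p in byte_pieces]
```

The five pieces correspond to the five garbled lines in the question: the lead byte of each affected letter was consumed as a delimiter, leaving a dangling byte that the terminal cannot display.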

You can keep your regular expression:

tokens=regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
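Putting the two steps together might look like the sketch below. It is not the original code: re.split stands in for regexp_tokenize(..., gaps=True), the function name preprocess_text and the encoding parameter are hypothetical, and UTF-8 input is assumed. The pattern is written as a Unicode literal here so that the character class holds whole characters rather than UTF-8 bytes, which matters in Python 2 where a plain string literal is a byte string.

```python
# -*- coding: utf-8 -*-
import re

# Unicode pattern: Arabic comma, question mark, semicolon, plus '!' and '.',
# each optionally followed by whitespace.
PUNCTUATION = re.compile(u'[،؟!.؛]\\s*', re.UNICODE)

def preprocess_text(raw_bytes, encoding='utf8'):
    """Decode the raw input bytes, then split on punctuation."""
    text = raw_bytes.decode(encoding)
    # Drop empty strings, mirroring how gaps=True discards empty tokens.
    return [tok for tok in PUNCTUATION.split(text) if tok]

# Stand-in for the byte string raw_input() returns in Python 2.
source = u'ايمان،سعد'
tokens = preprocess_text(source.encode('utf8'))
```

Joining tokens with u'\n' and printing them then yields one intact word per line, provided the terminal itself can display UTF-8.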
