Question

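Given a string that mixes Arabic and English, I want to remove any English characters or words and keep only the Arabic sentence. The code below does not work. How should I change it?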

import string

text = 'انا أحاول أن أعرف من انت this is not'
maintext = ''.join(ch for ch in text if ch not in set(string.punctuation))
text = filter(lambda x: x==' ' or x not in string.printable , maintext)
print(text)

Thanks

Answer 1

You can try using re.sub here:

# -*- coding: utf-8 -*-
import re

text = 'انا أحاول أن أعرف من انت this is not'
output = re.sub(r'\s*[A-Za-z]+\b', '' , text)
output = output.rstrip()
print(output)

This prints:

انا أحاول أن أعرف من انت

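As a side note, we don't want two Arabic words on either side of an English word to end up fused together or separated by a doubled space, so the pattern \s*[A-Za-z]+\b also consumes any whitespace that precedes the English word. Whitespace that follows a removed word is left in place, which means the right-hand side of the result can end with trailing whitespace, so we call rstrip() to remove it.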
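To illustrate the whitespace handling described above, here is a minimal sketch that wraps the same pattern in a helper. The function name strip_english and the extra sample strings are assumptions made for illustration, and strip() is used instead of rstrip() so that a leftover leading space is also removed when the English words come first.

```python
import re

def strip_english(text):
    # Hypothetical helper: remove each run of ASCII letters together
    # with any whitespace that precedes it.
    cleaned = re.sub(r'\s*[A-Za-z]+\b', '', text)
    # strip() covers leftover whitespace at either end of the string.
    return cleaned.strip()

# English word between two Arabic words: the words stay separated by one space.
print(strip_english('انا hello أحاول'))
# English words at the start: the leftover leading space is stripped.
print(strip_english('this is انا'))
```

Because \b requires a word boundary, a run of ASCII letters that is directly attached to digits or to Arabic letters is not matched by this pattern.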

Answer 2

Here is my version:

import re

text = 'انا أحاول أن أعرف من انت this is not'
# Delete every ASCII letter; the spaces around the removed words remain.
maintext = re.sub(r'[a-zA-Z]', '', text)
print(maintext)

Answer 3

All the other answers suggest using a regex, but you can do this without one, using only ascii_letters from the string module:

import string

text = 'انا أحاول أن أعرف من انت this is not'
text = "".join([char for char in text if char not in string.ascii_letters]).strip()
print(text)

Output

انا أحاول أن أعرف من انت
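One thing to watch with the character-by-character approach: when an English word sits between two Arabic words, both surrounding spaces survive, and strip() only trims the ends. A possible sketch that also collapses those inner gaps is shown below; the function name drop_ascii_letters and the sample string are assumptions for illustration, not part of the answer above.

```python
import string

def drop_ascii_letters(text):
    # Hypothetical helper: keep every character that is not an ASCII letter.
    kept = "".join(ch for ch in text if ch not in string.ascii_letters)
    # split() with no arguments splits on any run of whitespace, so
    # re-joining with single spaces removes doubled and trailing spaces.
    return " ".join(kept.split())

print(drop_ascii_letters('انا hello أحاول'))
```

Note that this also normalizes any other whitespace in the input, such as tabs or newlines, to single spaces.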

Removing English words from an Arabic string
