Word boundary regex does not match whole words in Devanagari script

Time: 2019-06-10 05:46:19

Tags: regex python-3.x unicode python-unicode

This snippet works for English, but with Devanagari script it also matches parts of words.

import re

articles = ['a','an','the']
regex = r"\b(?:{})\b".format("|".join(articles))
sent = 'Davis is theta'
re.split(regex,sent)
>> ['Davis is theta']

With Devanagari stopwords, the output is not what I expected:

stopwords = ['कम','र','छ']
regex = r"\b(?:{})\b".format("|".join(stopwords))
sent = "रामको कम्पनी छ"
re.split(regex,sent)
>> ['', 'ामको ', '्पनी ', '']

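I am using Python 3. Is this a bug, or am I missing something?

I suspect that \b only matches around [a-zA-Z0-9], whereas I am using Unicode text. Is there an alternative for this task?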

1 Answer:

Answer 0: (score: 1)

You may want to use findall instead of split, with this code:

import re

stopwords = ['कम','र','छ']
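# (?<!\S) anchors each match at the start of a whitespace-delimited token,
# so the tail of a skipped multi-character stopword is not returned.
reg = re.compile(r'(?<!\S)(?!(?:{})(?!\S))\S+'.format("|".join(stopwords)))

sent = 'रामको कम्पनी छ'
print(reg.findall(sent))  # ['रामको', 'कम्पनी']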

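This regex avoids word boundaries, which do not work reliably with Unicode text such as Devanagari: Python's re module does not treat combining marks (vowel signs and the virama) as word characters, so \b also matches inside a word. The pattern relies on \S and whitespace lookaround checks to delimit words instead.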
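
To see why \b misfires, here is a small sketch (not part of the original answer; `word_char_flags` is a hypothetical helper name) that lists the Unicode category of each code point in कम्पनी and whether `\w` accepts it under Python's built-in re module:

```python
import re
import unicodedata


def word_char_flags(text):
    """Return (Unicode category, matched-by-\\w) for every code point in text."""
    # Hypothetical helper for illustration only.
    return [
        (unicodedata.category(ch), re.fullmatch(r"\w", ch) is not None)
        for ch in text
    ]


word = "कम्पनी"  # taken from the sentence in the question

for ch, (category, is_word_char) in zip(word, word_char_flags(word)):
    print("U+{:04X} {} \\w={}".format(ord(ch), category, is_word_char))

# Because the virama after 'कम' is not a word character, \b sees a
# boundary there and the stopword is found inside the longer word.
print(re.search(r"\bकम\b", word) is not None)  # True
```

The letters come out as category Lo and match `\w`, while the virama (Mn) and the vowel sign (Mc) do not, so a boundary is reported directly after कम.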

Check: Python unicode regular expression matching failing with some unicode characters -bug or mistake?
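
If you would rather keep the split semantics from the question, one possible variant is to keep the stopword alternation but swap \b for whitespace lookarounds. This is only a sketch under my own assumptions, not something from the answer above: `split_on_stopwords` is a hypothetical name, words are assumed to be separated by whitespace, and the stopword list is assumed to be non-empty.

```python
import re


def split_on_stopwords(sentence, stopwords):
    """Split sentence on stopwords that stand alone between whitespace."""
    # Hypothetical helper; escape each stopword so it is matched literally.
    alternation = "|".join(re.escape(word) for word in stopwords)
    # (?<!\S) and (?!\S) require whitespace or a string edge on each side,
    # which replaces \b without depending on \w.
    pattern = re.compile(r"(?<!\S)(?:{})(?!\S)".format(alternation))
    return pattern.split(sentence)


print(split_on_stopwords("रामको कम्पनी छ", ["कम", "र", "छ"]))
# ['रामको कम्पनी ', '']
print(split_on_stopwords("Davis is theta", ["a", "an", "the"]))
# ['Davis is theta']
```

Only the standalone छ is removed here; र and कम are left alone because they are followed by non-whitespace characters. Punctuation attached to a word would block a match, so it would need to be stripped first.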