How do I split a string based on a list of glossary terms?

Time: 2019-01-14 07:07:58

Tags: regex split nlp tokenize

Given a list of glossary terms:

glossaries = ['USA', '34']

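The goal is to split a string using the glossary items as delimiters while keeping those items in the output. For example, given a string and the glossary, calling an _isolate_glossaries() function: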

glossaries = ['USA', '34']
word = '1934USABUSA'
_isolate_glossaries(word, glossaries)

should output:

['19', '34', 'USA', 'B', 'USA']

I have tried:

import re

def isolate_glossary(word, glossary):
    # Return the word unchanged if it equals the glossary term or does not contain it.
    if word == glossary or glossary not in word:
        return [word]
    # Split on the term; the capturing group keeps the term itself in the result.
    segments = re.split(r'({})'.format(re.escape(glossary)), word)
    # Drop the empty strings produced when the term starts or ends the word.
    return [segment for segment in segments if segment]

def _isolate_glossaries(word, glossaries):
    word_segments = [word]
    for gloss in glossaries:
        word_segments = [out_segment
                         for segment in word_segments 
                         for out_segment in isolate_glossary(segment, gloss)] 
    return word_segments

It works, but it looks rather convoluted, with so many levels of looping and regex splitting. Is there a better way to split a string based on a glossary?

1 answer:

Answer 0: (score: 2)

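To split a string by the items in a list, dynamically build a regular expression that joins the items with the pipe character | and wraps them all in a capturing group (with a non-capturing group, the items themselves would not be included in the output):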

import re

segments = re.split('({})'.format('|'.join(glossaries)), word)
print([x for x in segments if x])  # filter out empty strings

See the live demo here.
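The one-liner above can be wrapped in a function. The following is a minimal sketch rather than part of the original answer: the name isolate_glossaries, the re.escape call, and the longest-first sort are assumptions added here. Escaping matters if a glossary term contains regex metacharacters (for example 'C++'), and sorting by length makes a longer term win when a shorter term is its prefix. Note that a single alternation picks the leftmost match, whereas the loop in the question gives priority to earlier glossary entries, so the results may differ when terms overlap.

```python
import re

def isolate_glossaries(word, glossaries):
    """Split `word` on every glossary term, keeping the terms in the output."""
    # Ignore empty terms; longest first so 'USA' is tried before 'US'.
    terms = sorted((t for t in glossaries if t), key=len, reverse=True)
    if not terms:
        return [word] if word else []
    # re.escape makes metacharacters such as '+' or '.' match literally.
    pattern = '({})'.format('|'.join(re.escape(t) for t in terms))
    # The capturing group keeps the matched terms in the result; the filter
    # drops the empty strings re.split emits around adjacent or edge matches.
    return [part for part in re.split(pattern, word) if part]

print(isolate_glossaries('1934USABUSA', ['USA', '34']))
# ['19', '34', 'USA', 'B', 'USA']
print(isolate_glossaries('C++USA', ['C++', 'USA']))
# ['C++', 'USA']
```

If the same glossary is applied to many words, compiling the pattern once with re.compile and reusing it avoids rebuilding the regular expression on every call.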