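String splitting issue with multi-word expressions

Date: 2010-10-20 02:42:27

Tags: python regex

I have a series of strings like:

'i would like a blood orange'

I also have a list of strings like:

["blood orange", "loan shark"]

Operating on the string, I want the following list:

["i", "would", "like", "a", "blood orange"]

What is the best way to get the above list? I've been using re throughout my code, but I'm stumped on this one.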

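3 answers:

Answer 0 (score: 4)

This is a fairly straightforward generator implementation: split the string into words, group together the words that form phrases, and yield the results.

(There may be a cleaner way to handle skip, but for some reason I'm drawing a blank.)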

def split_with_phrases(sentence, phrase_list):
    words = sentence.split(" ")
    phrases = set(tuple(s.split(" ")) for s in phrase_list)
    # default=0 keeps this working when phrase_list is empty.
    max_phrase_length = max((len(p) for p in phrases), default=0)

    # Find a phrase within words starting at the specified index.  Return the
    # phrase as a tuple, or None if no phrase starts at that index.
    def find_phrase(start_idx):
        # Iterate backwards, so we'll always find longer phrases before shorter ones.
        # Otherwise, if we have a phrase set like "hello world" and "hello world two",
        # we'll never match the longer phrase because we'll always match the shorter
        # one first.
        for phrase_length in range(max_phrase_length, 0, -1):
            test_word = tuple(words[start_idx:start_idx+phrase_length])
            if test_word in phrases:
                return test_word
        return None

    skip = 0
    for idx in range(len(words)):
        if skip:
            # This word was returned as part of a previous phrase; skip it.
            skip -= 1
            continue

        phrase = find_phrase(idx)
        if phrase is not None:
            # The current word is already consumed, so skip only the
            # remaining words of the phrase.
            skip = len(phrase) - 1
            yield " ".join(phrase)
            continue

        yield words[idx]

print([s for s in split_with_phrases('i would like a blood orange',
    ["blood orange", "loan shark"])])
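
On the remark that there may be a cleaner way to handle skip: one possibility, offered only as a sketch, is to drop the skip counter and advance the index by hand in a while loop. The function name split_with_phrases_loop is made up for this illustration, it returns a list instead of yielding, and it splits on any whitespace rather than on single spaces.

```python
def split_with_phrases_loop(sentence, phrase_list):
    """Split sentence into words, keeping any phrase from phrase_list intact."""
    words = sentence.split()
    phrases = {tuple(p.split()) for p in phrase_list}
    max_len = max((len(p) for p in phrases), default=0)

    result = []
    idx = 0
    while idx < len(words):
        # Try the longest candidate first so longer phrases win over shorter ones.
        for length in range(min(max_len, len(words) - idx), 0, -1):
            candidate = tuple(words[idx:idx + length])
            if candidate in phrases:
                result.append(" ".join(candidate))
                idx += length  # jump past the whole phrase
                break
        else:
            # No phrase starts at this index; emit the single word.
            result.append(words[idx])
            idx += 1
    return result


print(split_with_phrases_loop("i would like a blood orange today",
                              ["blood orange", "loan shark"]))
```

Because the index jumps past a matched phrase directly, there is no counter to keep in sync, and a word that follows a phrase is not lost.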

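Answer 1 (score: 1)

Ah, this is crazy, crude and ugly. But it looks like it works. You may want to clean it up and optimize it, but some of the ideas here might be useful.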

list_to_split = ['i would like a blood orange', 'i would like a blood orange ttt blood orange']
input_list = ["blood orange", "loan shark"]

for item in input_list:
    for str_lst in list_to_split:
        if item in str_lst:
            tmp = str_lst.split(item)
            lst = []
            for i, itm in enumerate(tmp):
                if itm != '':
                    lst.append(itm)
                # Put the phrase back between the pieces, not after the last one.
                if i < len(tmp) - 1:
                    lst.append(item)
            print(lst)

Output:

['i would like a ', 'blood orange']
['i would like a ', 'blood orange', ' ttt ', 'blood orange']
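
Since the question mentions re, the same split-and-keep idea can probably be written with re.split and a capturing group, which returns the matched phrases along with the text between them. This is a sketch rather than a drop-in replacement: the function name split_keeping_phrases is invented here, and it assumes the phrase list is not empty.

```python
import re


def split_keeping_phrases(text, phrases):
    """Split text on any of the phrases, keeping the phrases in the result."""
    # Longest phrases first, so a longer phrase is preferred over its prefix.
    ordered = sorted(phrases, key=len, reverse=True)
    # The capturing group makes re.split return the matched phrases too.
    pattern = "(" + "|".join(re.escape(p) for p in ordered) + ")"
    # Drop the empty strings re.split produces at the edges.
    return [piece for piece in re.split(pattern, text) if piece]


print(split_keeping_phrases("i would like a blood orange ttt blood orange",
                            ["blood orange", "loan shark"]))
```

Like the loop version, this leaves the text between phrases unsplit, so each piece would still need a str.split() to get individual words.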

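Answer 2 (score: 1)

A quick, dirty, completely unoptimized approach might be to replace the compounds in your string with a version that uses a different separator (preferably one that does not occur anywhere else in your target string or in the compound words), then split, and then swap the original separator back in. A more efficient approach would be to iterate through the string only once, matching the compound words where appropriate, but you may have to watch out for cases such as nested compounds, depending on your array.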


#!/usr/bin/python
import re

my_string = "i would like a blood orange"
compounds = ["blood orange", "loan shark"]

# Protect each compound by swapping its spaces for "&", which is assumed
# not to occur in the string or in the compounds.
for compound in compounds:
    my_string = my_string.replace(compound, compound.replace(" ", "&"))

# Split on whitespace, then restore the spaces inside each compound.
my_segs = [seg.replace("&", " ") for seg in re.split(r"\s+", my_string)]
print(my_segs)
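
The single-pass idea could look roughly like the sketch below, which builds one regex alternation out of the compounds and falls back to plain words. The function name tokenize_with_compounds is invented for illustration, and the \b anchors assume every compound starts and ends with a word character. Sorting longest first is one way to deal with nested compounds.

```python
import re


def tokenize_with_compounds(text, compounds):
    """Return the words of text, with each listed compound kept as one token."""
    # Longest compounds first, so nested compounds prefer the longer match.
    ordered = sorted(compounds, key=len, reverse=True)
    alternatives = [r"\b" + re.escape(c) + r"\b" for c in ordered]
    # Fall back to any run of non-whitespace when no compound matches.
    pattern = re.compile("|".join(alternatives + [r"\S+"]))
    return pattern.findall(text)


print(tokenize_with_compounds("i would like a blood orange",
                              ["blood orange", "loan shark"]))
```

This scans the string once and needs no placeholder character, so it does not matter whether "&" appears in the input.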

Edit: Glenn Maynard's solution is better.