Remove duplicate words but keep repeated numbers in a sentence

Time: 2019-01-18 14:12:43

Tags: python regex

I am trying to work out how to remove duplicate words from many sentences without removing one- or two-digit numbers.

I previously used the following command to remove duplicates while preserving order, but it also removes the repeated numbers.

df['reporting_name'] = df['reporting_name'].str.split().apply(lambda x: OrderedDict.fromkeys(x).keys() if x is not None else None).str.join(' ')
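To show why this drops the repeated numbers, here is a minimal plain-Python sketch of the same idea applied to a single sentence. It does not use pandas, and the helper name `dedupe_all` is made up for illustration.

```python
from collections import OrderedDict


def dedupe_all(sentence):
    # Hypothetical helper: keeps only the first occurrence of every token,
    # so a number is treated like any other word.
    return ' '.join(OrderedDict.fromkeys(sentence.split()))


result = dedupe_all("GF Command Room 1 Unit 1 Flow Temperature Temperature")
print(result)  # GF Command Room 1 Unit Flow Temperature
```

The second "1" is lost because it is just another duplicate key, which is the behaviour I want to avoid.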

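So I think I need some regular expression that splits on a word followed by a number (including the space), like this. Perhaps there is another, more general solution.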

Input

"East Zone Mbc26 East Zone 1 2nd S11B Smds Smoke Damper 1 Status"
"GF Command Room 1 Unit 1 Flow Temperature Temperature"

Expected output

"East Zone Mbc26 Zone 1 2nd S11B Smds Smoke Damper 1 Status"
"GF Command Room 1 Unit 1 Flow Temperature"

Remove the duplicate words, keep the numbers, and preserve the word order.

When a repeated word carries an identifier, for example "Zone 1", keep both "Zone" and "Zone 1".

1 answer:

Answer 0: (score: 1)

This should do the trick if you want to keep the first occurrence of every non-digit word, plus any repeated word that is immediately followed by a number. You can tweak the digit check if you only want to protect numbers of at most two digits.

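def cleanup(s):
    # Yield the words of s, dropping repeated words but keeping every number.
    tokens = s.split()
    seen = set()
    # Pair each word with its successor (None after the last word).
    for word, nextword in zip(tokens, tokens[1:] + [None]):
        if word.isdigit():
            # Numbers are always kept, even when repeated.
            yield word
            continue
        if word not in seen:
            seen.add(word)
            yield word
        elif nextword and nextword.isdigit():
            # A repeated word followed by a number (e.g. "Zone 1") is kept.
            yield word


print(' '.join(cleanup("East Zone Mbc26 East Zone 1 2nd S11B Smds Smoke Damper 1 Status")))
print(' '.join(cleanup("GF Command Room 1 Unit 1 Flow Temperature Temperature")))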

Output

East Zone Mbc26 Zone 1 2nd S11B Smds Smoke Damper 1 Status
GF Command Room 1 Unit 1 Flow Temperature
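
If only one- or two-digit numbers should be protected, as the question asks, the digit check can be narrowed. The following is a sketch of one way to do that, not a tested drop-in replacement: the function name `cleanup_short_numbers` and the `max_digits` parameter are assumptions introduced here, and longer numbers are simply treated like ordinary words.

```python
def cleanup_short_numbers(s, max_digits=2):
    """Variant of cleanup(): only numbers with at most max_digits digits are protected."""

    def is_short_number(token):
        # None marks the end of the sentence; longer numbers count as normal words.
        return token is not None and token.isdigit() and len(token) <= max_digits

    seen = set()
    tokens = s.split()
    kept = []
    for word, nextword in zip(tokens, tokens[1:] + [None]):
        if is_short_number(word):
            # Short numbers are always kept, even when repeated.
            kept.append(word)
        elif word not in seen:
            seen.add(word)
            kept.append(word)
        elif is_short_number(nextword):
            # Repeated word followed by a short number, e.g. "Zone 1".
            kept.append(word)
    return ' '.join(kept)


print(cleanup_short_numbers("East Zone Mbc26 East Zone 1 2nd S11B Smds Smoke Damper 1 Status"))
print(cleanup_short_numbers("Pump 1234 Pump 1234 Fan 1 Fan 1"))
```

To run it over the column from the question, something along the lines of `df['reporting_name'].apply(cleanup_short_numbers)` should work, assuming the column contains no missing values; rows with `None` would need a guard first.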