正则表达式:在一行中的字典中查找相同的单词

时间:2013-05-07 09:07:23

标签: regex dictionary editor

我在txt文件中有一个字典,格式如下:

house house$casa | casa, vivienda, hogar | edificio, casa | vivienda

$符号分隔翻译的期限。

我想找到在同一行出现多次的字典单词,通过带有文本编辑器的正则表达式,如Sublimetext,Notepad ++,...我不想要一个php函数,因为我必须手动检查如果我必须删除那些重复的话。在上面的例子中,正则表达式应该找到house,casa和vivienda。我的目标是获得以下结果:

house$casa | vivienda, hogar | edificio

我尝试使用以下表达式,但它无法正常工作:

(\b\w+\b)\W+\1

1 个答案:

答案 0 :(得分:0)

FWIW,这是一个如何在Python中执行此操作的粗略示例:

import re

def distinct_words(block, seen, delim):
    """ makes a list of words distinct, given a set of words seen earlier """

    unique_words = []

    for word in re.split(delim, block):
        if not word in seen:
            seen[word] = True
            unique_words.append(word)

    return unique_words

def process_line(line):
    """ removes all duplicate words from a dictionary line """

    # safeguard
    if '$' not in line: return line

    # split line at the '$'
    original, translated = line.split('$')

    # make original words distinct
    distinct_original = distinct_words(original, {}, r' +')

    # make translated words distinct, but keep block structure

    # split the translated part at '|' into blocks
    # split each block at ', ' into words
    seen = {}
    distinct_translated = [
        distinct_list for distinct_list in (
            distinct_words(block, seen, r', +') for block in (
                re.split(r'\s*\|\s*', translated)
            )
        )
        if len(distinct_list) > 0
    ]

    # put everything back together again
    part_original = ' '.join(distinct_original)
    part_translated = [', '.join(block) for block in distinct_translated]
    part_translated = ' | '.join(part_translated)
    result = part_original + '$' + part_translated

    return result

def process_dictionary(filename):
    """ processes a dictionary text file, modifies the file in place """

    lines = open(filename,'r').readlines()    
    lines_out = [process_line(line) for line in lines]
    contents_out = '\n'.join(lines_out)
    open(filename,'w').write(contents_out)

显然你会打电话给process_dictionary(),就像这样:

process_dictionary('dict_en_es.txt')

但是为了这个例子,假设你有一行:

line = "house house$casa | casa, vivienda, hogar | edificio, casa | vivienda"
line_out = process_line(line)
print line_out

打印想要的结果:

house$casa | vivienda, hogar | edificio