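Regular expression for matching duplicate strings

Time: 2019-04-24 23:30:34

Tags: python regex python-3.x

I'm using regular expressions to search the product name and product description from a CSV in order to filter for voltages, and what I want to do is remove duplicate values from the search results. I have already tried sets, lists, etc., and I'm struggling to understand why I can't remove the duplicate words from the search. I don't understand how set works: it seems to split every value into single characters (1, 2, v, o, l, t) instead of removing only the whole duplicate words that were found. When I run the code, I get: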

12 Volt
12 Volt
40 Volt
2 Volt
18 Volt
18 Volt
240 Volt
240 Volt
110 Volt
110 Volt
110 Volt
36 Volt

What I'm trying to get is a list of unique values, e.g. 12 Volt, 40 Volt, 18 Volt, 240 Volt, and so on.
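On the point about set appearing to split values into characters: this usually happens when a single string is handed to set(), because a string is itself an iterable of characters. A minimal sketch of the difference, using sample values invented for illustration rather than the real feed:

```python
# Sample matches, invented for illustration (not taken from the real feed).
matches = ["12 Volt", "12 Volt", "40 Volt"]

# Passing ONE string to set() iterates over its characters.
chars = set("12 Volt")
print(sorted(chars))    # [' ', '1', '2', 'V', 'l', 'o', 't']

# Passing a LIST of strings keeps each string whole and drops the repeats.
unique = set(matches)
print(sorted(unique))   # ['12 Volt', '40 Volt']

# add() always inserts its argument as one element, never character by character.
seen = set()
seen.add("12 Volt")
print(seen)             # {'12 Volt'}
```

So set(list_of_matches) or seen.add(match) keeps whole words, while set(match) does not.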

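import csv
import re

# merchant_feed (path to the feed CSV) and clean_text() are defined elsewhere in the script.
def volts_search():
    # Write the header row of the filter file.
    with open('filters/volts_filter.csv', 'w') as headerOut:
        headerOut.write("name,sort_order,status,image,regex,value\n")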

    with open(merchant_feed, 'r') as csv_filein, open('filters/volts_filter.csv', 'a') as fileOut:
        reader = csv.DictReader(csv_filein, delimiter=',', quotechar='"')
        for row in reader:
            program_name = clean_text(row['program_name'])
            product_name = clean_text(row['product_name'])
            product_description = clean_text(row['description'])
            merchant_category = clean_text(row['merchant_category'])
            product_id = row['product_id']
            product_brand = clean_text(row['brand'])

            filter_name = "Filter By Volts:"
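            # An inline (?i) is only valid at the very start of a pattern (an error since Python 3.11),
            # so case-insensitivity is passed as a flag instead.
            v = re.findall(r"\d+\.\d+v|\d+\.\d+ v|\d+ v|\d+v", product_name + product_description, flags=re.IGNORECASE)

            volt = re.findall(r"\d+volt|\d+ volt", product_name + product_description, flags=re.IGNORECASE)

            volts = re.findall(r"\d+\.\d+volts|\d+volts", product_name + product_description, flags=re.IGNORECASE)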

            seen = set()
            for filter_search in volt:
                if filter_search in product_name + product_description:
                    if filter_search in seen: continue
                    seen.add(filter_search)

                    print(filter_search)
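
One likely reason the duplicates survive in the code above: seen = set() is created inside the for row in reader loop, so it is emptied for every row and only removes repeats within a single product. Below is a sketch of the alternative, with the set created once before the loop. The rows list, the unique_volts helper, the simplified pattern and the "N Volt" normalisation are all assumptions for illustration, standing in for the CSV reader and clean_text:

```python
import re

# Hypothetical stand-in for product_name + product_description of each CSV row.
rows = [
    "Cordless drill 12 volt with charger",
    "Battery pack 12 Volt",
    "Chainsaw 40 volt, spare 12volt battery",
]

# Simplified pattern (an assumption, not the original three patterns):
# a number, an optional space, then "volt" or "volts", case-insensitive.
VOLT_RE = re.compile(r"(\d+(?:\.\d+)?)\s?volts?\b", flags=re.IGNORECASE)

def unique_volts(texts):
    seen = set()       # created once, shared by all rows
    ordered = []       # keeps first-seen order, which a bare set does not
    for text in texts:
        for number in VOLT_RE.findall(text):
            label = f"{number} Volt"   # "12volt" and "12 VOLT" both become "12 Volt"
            if label not in seen:
                seen.add(label)
                ordered.append(label)
    return ordered

print(unique_volts(rows))   # ['12 Volt', '40 Volt']
```

Normalising the label before the membership test matters here, since "12volt" and "12 Volt" would otherwise count as different values.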

1 answer:

Answer 0 (score: 0)

RegEx

This expression might help you remove duplicate entries in the CSV file using string replacement:

([\s\S]+)\1{1,} 


[Image: diagram showing how the backreference works]
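
To make the backreference concrete: \1 matches whatever text the first group just captured, so the expression finds a chunk immediately followed by one or more copies of itself. A small sketch with re.sub, using sample strings invented for illustration. It also shows a caveat: the group may capture a single character, so the "11" in "110" counts as a repeat too. LINE_RE is a hypothetical line-anchored variant, not part of the original answer, and both versions only collapse repeats that are adjacent:

```python
import re

# The expression from the answer: a captured chunk followed by one or more copies of itself.
ANSWER_RE = re.compile(r"([\s\S]+)\1{1,}")

# Adjacent duplicate lines collapse to a single copy.
text = "12 Volt\n12 Volt\n40 Volt\n"
collapsed = ANSWER_RE.sub(r"\1", text)
print(repr(collapsed))    # '12 Volt\n40 Volt\n'

# Caveat: a single repeated character also matches, so "110" loses a digit.
damaged = ANSWER_RE.sub(r"\1", "12 Volt\n110 Volt\n")
print(repr(damaged))      # '12 Volt\n10 Volt\n'

# Hypothetical variant: the group must be a whole line, and so must each repeat.
LINE_RE = re.compile(r"^(.+)$(?:\n\1$)+", flags=re.MULTILINE)
safer = LINE_RE.sub(r"\1", "12 Volt\n110 Volt\n110 Volt\n36 Volt\n")
print(repr(safer))        # '12 Volt\n110 Volt\n36 Volt\n'
```

For values that repeat in non-adjacent positions, a set-based pass over the matches is probably the simpler tool.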