Compare two lists of lists and get partial/full matches in Python

Asked: 2019-11-16 04:34:20

Tags: python regex list pattern-matching text-extraction

I have a CSV file and a text file. The CSV has 3 columns. Sample CSV data:

Pack Type   Component Type          Component Material
Blister     Foil                    Aluminium
Blister     Base Web                PVC/PVDC
Bottle      Cylindrically Bottles
Bottle      Screw Type Cap          Polypropylene

Sample text data:

The tablets are filled into cylindrically shaped bottles made of white coloured
polyethylene. The volumes of the bottles depend on the tablet strength and amount of
tablets, ranging from 20 to 175 ml. The screw type cap is made of white coloured
polypropylene and is equipped with a tamper proof ring.

I have two lists. List 1 comes from the CSV and list 2 comes from the text file.

list 1 = [['Bottle', 'Screw Type Cap', 'Polypropylene'], ['Bottle', 'Safety Ring', ''], ['Blister', 'Base Web', 'PVC'], ['Blister', 'Base Web', 'PVD/PVDC'], ['Bottle', 'Square Shaped Bottle', 'Polyethylene'], ['Bottle', 'Child Resistant (CR) Cap', 'Polypropylene']]

list 2 = [['The', 'tablets', 'are', 'filled', 'into', 'cylindrically', 'shaped', 'bottles', 'made', 'of', 'white', 'coloured', 'polyethylene.', 'The', 'volumes', 'of', 'the', 'bottles', 'depend', 'on', 'the', 'tablet', 'strength', 'and', 'amount', 'of', 'tablets,', 'ranging', 'from', '20', 'to', '175', 'ml.', 'The', 'screw', 'type', 'cap', 'is', 'made', 'of', 'white', 'coloured', 'polypropylene', 'and', 'is', 'equipped', 'with', 'a', 'tamper', 'proof', 'ring.'], ['PVC/PVDC', 'blister', 'pack'], ['Blisters', 'are', 'made', 'in', 'a', 'thermo-forming', 'process', 'from', 'a', 'PVC/PVDC', 'base', 'web.', 'Each', 'tablet', 'is', 'filled', 'into', 'a', 'separate', 'blister', 'and', 'a', 'lidding', 'foil', 'of', 'aluminium', 'is', 'welded', 'on.', 'The', 'blisters', 'are', 'opened', 'by', 'pressing', 'the', 'tablets', 'through', 'the', 'lidding', 'foil.', 'PVDC', 'foil', 'is', 'in', 'contact', 'with', 'the', 'tablets.']]

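I want to search for the strings of each sub-list of list 1 in each sub-list of list 2, that is, match the tokens of list 1 against list 2. If all tokens of one sub-list of list 1 are found inside a sub-list of list 2, that should count as a match. For every such match I want to return the whole matching sub-list of list 2 (the paragraph) together with the matching sub-list of list 1.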

Expected output:

paragraph: ['Blisters', 'are', 'made', 'in', 'a', 'thermo-forming', 'process', 'from', 'a', 'PVC/PVDC', 'base', 'web.', 'Each', 'tablet', 'is', 'filled', 'into', 'a', 'separate', 'blister', 'and', 'a', 'lidding', 'foil', 'of', 'aluminium', 'is', 'welded', 'on.', 'The', 'blisters', 'are', 'opened', 'by', 'pressing', 'the', 'tablets', 'through', 'the', 'lidding', 'foil.', 'PVDC', 'foil', 'is', 'in', 'contact', 'with', 'the', 'tablets.'], Pack Type: Blister, Component Type: 'Base Web', Component Material:  'PVD/PVDC'
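
A minimal sketch of the matching rule described above, under some assumptions that are not part of the original code: the helper names `tokenize`, `row_tokens` and `full_matches` are hypothetical, tokens are compared case-insensitively, and every non-alphanumeric character (including `/` and a trailing `.`) is treated as a separator. Empty CSV fields simply contribute no tokens, so they are ignored.

```python
import re

def tokenize(text):
    """Lowercase text and split it into alphanumeric words.

    'web.' becomes ['web'] and 'PVC/PVDC' becomes ['pvc', 'pvdc'].
    """
    return re.findall(r"[a-z0-9]+", text.lower())

def row_tokens(row):
    """Collect the tokens of all fields in a CSV row; empty fields add nothing."""
    return {tok for field in row for tok in tokenize(field)}

def full_matches(rows, paragraphs):
    """Return (paragraph, row) pairs where every row token occurs in the paragraph."""
    results = []
    for paragraph in paragraphs:
        para_tokens = {tok for word in paragraph for tok in tokenize(word)}
        for row in rows:
            needed = row_tokens(row)
            # A row with no tokens at all is skipped rather than matched everywhere.
            if needed and needed <= para_tokens:
                results.append((paragraph, row))
    return results

# Shortened sample data taken from the lists above.
rows = [['Blister', 'Base Web', 'PVC'], ['Bottle', 'Safety Ring', '']]
paragraphs = [['Blisters', 'are', 'made', 'from', 'a', 'PVC/PVDC', 'base', 'web.',
               'Each', 'tablet', 'is', 'filled', 'into', 'a', 'separate', 'blister']]

for paragraph, row in full_matches(rows, paragraphs):
    print(row)
```

Because the comparison is on exact words, 'Blister' does not match 'Blisters' here; it only matches in this sample because the singular 'blister' also occurs in the paragraph. Handling plurals would need stemming or a similar step, which this sketch leaves out.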

Questions:

Q1. How do I make 'Base Web' match 'base', 'web.' in list 2?
Q2. In the CSV, the 3rd row has no data in the 3rd column. When such a row is encountered, I want to ignore the empty 3rd column and match on the values of the remaining 2 columns.
Q3. I want partial matches to be extracted as well.
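
One possible way to approach Q3 is to score each row by the fraction of its tokens that occur in a paragraph and keep everything above a cut-off. This is only a sketch: the names `match_score` and `ranked_matches` and the default `threshold` of 0.5 are assumptions chosen for illustration, and the tokenizer (lowercase, split on non-alphanumeric characters) is the same kind of assumption.

```python
import re

def tokenize(text):
    """Lowercase text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def match_score(row, paragraph):
    """Return the fraction of the row's tokens found in the paragraph (0.0 to 1.0)."""
    needed = {tok for field in row for tok in tokenize(field)}
    if not needed:
        return 0.0  # a completely empty row never matches
    present = {tok for word in paragraph for tok in tokenize(word)}
    return len(needed & present) / len(needed)

def ranked_matches(rows, paragraphs, threshold=0.5):
    """Return (score, paragraph_index, row) tuples at or above threshold, best first."""
    hits = []
    for i, paragraph in enumerate(paragraphs):
        for row in rows:
            score = match_score(row, paragraph)
            if score >= threshold:
                hits.append((score, i, row))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits

# Sample data taken from the lists above (one sentence of the first paragraph).
rows = [['Bottle', 'Screw Type Cap', 'Polypropylene'], ['Bottle', 'Safety Ring', '']]
paragraph = ['The', 'screw', 'type', 'cap', 'is', 'made', 'of', 'white', 'coloured',
             'polypropylene', 'and', 'is', 'equipped', 'with', 'a', 'tamper', 'proof', 'ring.']

for score, index, row in ranked_matches(rows, [paragraph]):
    print(index, row, score)
```

A score of 1.0 corresponds to a full match, so the same function covers both cases. In this sample the first row scores 4 out of 5 tokens because 'bottle' does not appear in the sentence, and the second row falls below the cut-off with only 'ring' found.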

Code so far:

import csv

# Read the text file and split it into paragraphs on blank lines.
filepath = r'C:\Users\0903882.txt'
with open(filepath) as f:
    data = f.read()
paragraphs = data.split("\n\n")

# List 2: one list of whitespace-separated words per paragraph.
all_words = []
for paragraph in paragraphs:
    words = paragraph.split()
    all_words.append(words)

print(all_words)

# Read the CSV metadata; each line is parsed as tab-delimited.
inputfile = r"C:\Users\metadata.csv"
inputm = []
with open(inputfile, "r") as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        inputm.append(row)

# List 1: split the first field of each row on commas.
final_ref = []
for lists in inputm:
    final_ref.append(str(lists[0]).split(','))

print(final_ref)

This gives me the two lists to compare.

0 answers:

No answers