I have an input file like the following:
op.txt
user id query
4d67373f-ca45-4137-efd0-0da69c78123d , bookmy show
4d67373f-ca45-4137-efd0-0da69c78123d , book my show
4d67373f-ca45-4137-efd0-0da69c78123d , book my show
4d67373f-ca45-4137-efd0-0da69c78123d , book my show
7fda21a5-c432-4d95-f93d-6275b68bb396 , 8 gb pen drive
7fda21a5-c432-4d95-f93d-6275b68bb396 , 16 gb pen drive
dba91160-dec4-454c-f34a-c29d6d95c459 , DVD PLATERS
dba91160-dec4-454c-f34a-c29d6d95c459 , DVD PLAYERS
dba91160-dec4-454c-f34a-c29d6d95c459 , DVD PLAYERS
dba91160-dec4-454c-f34a-c29d6d95c459 , IPOD
dba91160-dec4-454c-f34a-c29d6d95c459 , IPOD
dba91160-dec4-454c-f34a-c29d6d95c459 , IPOD
dba91160-dec4-454c-f34a-c29d6d95c459 , IPAD
d900ec5f-bd71-4e2b-84d0-6a2105050923 , minoxidil
d900ec5f-bd71-4e2b-84d0-6a2105050923 , minoxidil 5
775f1159-e310-42b6-d3b0-5ea3fb959568 , printed backcase for xperia L
775f1159-e310-42b6-d3b0-5ea3fb959568 , printed backcase for xperia zr
775f1159-e310-42b6-d3b0-5ea3fb959568 , printed backcase for xperia zr
9b98a9be-bb63-4310-87d5-592a66ae602a , leggings
9b98a9be-bb63-4310-87d5-592a66ae602a , leggings
9b98a9be-bb63-4310-87d5-592a66ae602a , jeggings
83618338-70a0-4512-c763-0307fe5acef0 , woman jacket
83618338-70a0-4512-c763-0307fe5acef0 , woman jacket
83618338-70a0-4512-c763-0307fe5acef0 , man jacket
83618338-70a0-4512-c763-0307fe5acef0 , man jacket
From this I currently get the following output:
dvd platers > dvd players
ipod > ipad
bookmy show > book my show
leggings > jeggings
woman jacket > man jacket
minoxidil > minoxidil 5
printed backcase for xperia l > printed backcase for xperia zr
8 gb pen drive > 16 gb pen drive
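The main goal is to find all the queries made by each user and store them in a list. From there I need to compute the edit distance between consecutive queries. If the edit distance is at most 2, I need to print the pair. My code finds these fine, but it should not react to changes in numbers; it should only compare the words. For example, if a user types "8 gb pen drive" and, a while later, changes their mind and types "16 gb pen drive", I do not want that pair printed.
Here is my code: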
def min_edit_dist(s1, s2):
    # Levenshtein distance, filled in over a (len(s1)+1) x (len(s2)+1) table
    m = len(s1) + 1
    n = len(s2) + 1
    tbl = {}
    for i in range(m): tbl[i, 0] = i
    for j in range(n): tbl[0, j] = j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i, j] = min(tbl[i, j-1] + 1, tbl[i-1, j] + 1, tbl[i-1, j-1] + cost)
    # index the last cell explicitly so empty strings work too
    return tbl[m-1, n-1]

with open("op.txt") as text:
    d = {}
    for line in text:
        line = line.strip("\n")
        for lines in line.split("\n"):
            try:
                key, val = lines.split(",")
                d.setdefault(key,[]).append(val.lower())
            except ValueError:
                # skip lines that do not split into exactly two fields (e.g. the header)
                pass

values = d.values()
keys = d.keys()
for v in values:
    # compare each query with the next one made by the same user
    for i in range(0, len(v)-1):
        if v[i] != v[i+1]:
            if min_edit_dist(v[i], v[i+1]) <= 2:
                print(v[i] + " > " + v[i+1])
I only need the following output:
dvd platers > dvd players
ipod > ipad
bookmy show > book my show
leggings > jeggings
woman jacket > man jacket
printed backcase for xperia l > printed backcase for xperia zr
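To see why the two numeric pairs get through the threshold, here is a small self-contained check. It is only a sketch: Python 3 is assumed, and `levenshtein` and `pairs` are illustrative names rather than part of the code above.

```python
def levenshtein(s1, s2):
    """Edit distance with unit-cost insert, delete and substitute."""
    rows, cols = len(s1) + 1, len(s2) + 1
    tbl = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        tbl[i][0] = i  # delete all of s1[:i]
    for j in range(cols):
        tbl[0][j] = j  # insert all of s2[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            tbl[i][j] = min(tbl[i][j - 1] + 1,         # insert
                            tbl[i - 1][j] + 1,         # delete
                            tbl[i - 1][j - 1] + cost)  # substitute or match
    return tbl[-1][-1]


pairs = [
    ("ipod", "ipad"),
    ("woman jacket", "man jacket"),
    ("8 gb pen drive", "16 gb pen drive"),
    ("minoxidil", "minoxidil 5"),
]
for old, new in pairs:
    print(old, ">", new, "distance:", levenshtein(old, new))
```

The last two pairs differ only in digits (plus a space), yet both come out at distance 2, which is why a pure `<= 2` test prints them.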
Answer 0 (score: 1)
You need to filter the value of val at this point:
key, val = lines.split(",")
d.setdefault(key,[]).append(val.lower())
To filter the digits out of the string, try:
key, val = lines.split(",")
val = ''.join(letter for letter in val if not letter.isdigit()) # filter out digit chars
d.setdefault(key,[]).append(val.lower())
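This runs a generator expression over each extracted val string and joins all the characters that pass the filter. It is not a very efficient solution, but it should fit your needs.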
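As a rough end-to-end sketch of this idea (Python 3 assumed; `strip_digits`, `find_rewrites` and `max_dist` are names made up for illustration, not the asker's code), the digit filter can be applied only for the comparison so that the printed text keeps the original query. One assumption goes beyond the snippet above: whitespace is collapsed after the digits are dropped, because otherwise "minoxidil" and "minoxidil 5" would still differ by a trailing space and be printed.

```python
def min_edit_dist(s1, s2):
    """Levenshtein distance, keeping only the previous row of the table."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def strip_digits(query):
    """Drop digit characters, then collapse the whitespace they leave behind."""
    no_digits = "".join(ch for ch in query if not ch.isdigit())
    return " ".join(no_digits.split())


def find_rewrites(lines, max_dist=2):
    """Return 'old > new' strings for consecutive queries of the same user."""
    queries = {}
    for line in lines:
        key, sep, val = line.partition(",")
        if not sep:  # no comma: header or blank line, skip it
            continue
        queries.setdefault(key.strip(), []).append(val.strip().lower())

    results = []
    for user_queries in queries.values():
        for old, new in zip(user_queries, user_queries[1:]):
            # compare the digit-free forms, but report the original text
            a, b = strip_digits(old), strip_digits(new)
            if a != b and min_edit_dist(a, b) <= max_dist:
                results.append(old + " > " + new)
    return results


if __name__ == "__main__":
    with open("op.txt") as text:
        for pair in find_rewrites(text):
            print(pair)
```

With this variant, "8 gb pen drive" and "16 gb pen drive" both reduce to "gb pen drive", so the pair is treated as unchanged and skipped.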