Ignoring numbers (digits) in a string

Time: 2015-02-04 09:13:59

Tags: python

I have an input file like the following:

op.txt

          user id                        query
4d67373f-ca45-4137-efd0-0da69c78123d , bookmy show
4d67373f-ca45-4137-efd0-0da69c78123d , book my show
4d67373f-ca45-4137-efd0-0da69c78123d , book my show
4d67373f-ca45-4137-efd0-0da69c78123d , book my show
7fda21a5-c432-4d95-f93d-6275b68bb396 , 8 gb pen drive
7fda21a5-c432-4d95-f93d-6275b68bb396 , 16 gb pen drive
dba91160-dec4-454c-f34a-c29d6d95c459 , DVD PLATERS
dba91160-dec4-454c-f34a-c29d6d95c459 , DVD PLAYERS
dba91160-dec4-454c-f34a-c29d6d95c459 , DVD PLAYERS
dba91160-dec4-454c-f34a-c29d6d95c459 , IPOD
dba91160-dec4-454c-f34a-c29d6d95c459 , IPOD
dba91160-dec4-454c-f34a-c29d6d95c459 , IPOD
dba91160-dec4-454c-f34a-c29d6d95c459 , IPAD
d900ec5f-bd71-4e2b-84d0-6a2105050923 , minoxidil
d900ec5f-bd71-4e2b-84d0-6a2105050923 , minoxidil 5
775f1159-e310-42b6-d3b0-5ea3fb959568 , printed backcase for xperia L
775f1159-e310-42b6-d3b0-5ea3fb959568 , printed backcase for xperia zr
775f1159-e310-42b6-d3b0-5ea3fb959568 , printed backcase for xperia zr
9b98a9be-bb63-4310-87d5-592a66ae602a , leggings
9b98a9be-bb63-4310-87d5-592a66ae602a , leggings
9b98a9be-bb63-4310-87d5-592a66ae602a , jeggings
83618338-70a0-4512-c763-0307fe5acef0 , woman jacket
83618338-70a0-4512-c763-0307fe5acef0 , woman jacket
83618338-70a0-4512-c763-0307fe5acef0 , man jacket
83618338-70a0-4512-c763-0307fe5acef0 , man jacket

From this I get the following output:

dvd platers >  dvd players
ipod >  ipad
bookmy show >  book my show
leggings >  jeggings
woman jacket >  man jacket
minoxidil >  minoxidil 5
printed backcase for xperia l >  printed backcase for xperia zr
8 gb pen drive >  16 gb pen drive

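The main goal is to collect all queries issued by each user and store them in a list. From there I need to compute the edit distance between consecutive queries from the same user. If the edit distance is at most 2, I need to print the pair. My code finds these pairs correctly, but it should not react to changes in numbers; it should only compare the words. For example, if a user types "8 gb pen drive" and after a while changes their mind and types "16 gb pen drive", I do not want that pair printed.

Here is my code: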

def min_edit_dist(s1, s2):
    # Levenshtein distance, using a table keyed by (i, j)
    m = len(s1) + 1
    n = len(s2) + 1
    tbl = {}
    for i in range(m): tbl[i, 0] = i
    for j in range(n): tbl[0, j] = j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i, j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)
    # index by the table size so that empty strings also work
    return tbl[m-1, n-1]

# group the queries by user id
d = {}
with open("op.txt") as text:
    for line in text:
        line = line.strip("\n")
        try:
            key, val = line.split(",")
            d.setdefault(key, []).append(val.lower())
        except ValueError:
            # skip lines without exactly one comma, such as the header
            pass

# compare each query with the next one from the same user
for v in d.values():
    for i in range(0, len(v)-1):
        if v[i] != v[i+1]:
            if min_edit_dist(v[i], v[i+1]) <= 2:
                print(v[i] + " > " + v[i+1])

I only need the following output:

dvd platers >  dvd players
ipod >  ipad
bookmy show >  book my show
leggings >  jeggings
woman jacket >  man jacket
printed backcase for xperia l >  printed backcase for xperia zr
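
To illustrate why the two unwanted pairs show up, here is a minimal sketch (not the asker's code; edit_distance is a hypothetical helper written only for this note) that computes the same Levenshtein distance with a rolling row. Both digit-only changes stay within the threshold of 2, so a plain character-level comparison cannot tell them apart from a typo fix.

```python
def edit_distance(a, b):
    """Levenshtein distance, keeping only the previous row of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(cur[j - 1] + 1,      # insertion
                           prev[j] + 1,         # deletion
                           prev[j - 1] + cost)) # substitution or match
        prev = cur
    return prev[-1]


pairs = [
    ("8 gb pen drive", "16 gb pen drive"),
    ("minoxidil", "minoxidil 5"),
    ("ipod", "ipad"),
]
distances = [edit_distance(a, b) for a, b in pairs]
for (a, b), dist in zip(pairs, distances):
    print(a, ">", b, "distance:", dist)
```

The first two pairs differ only in digits (and a space), yet their distance is no larger than that of a pair such as "woman jacket" and "man jacket", which is wanted in the output.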

1 answer:

Answer 0 (score: 1)

You need to filter the value of val at this point:
key, val = lines.split(",")
d.setdefault(key,[]).append(val.lower())

To filter the digits out of the string, try

key, val = lines.split(",")
val = ''.join(letter for letter in val if not letter.isdigit())  # filter out digit chars
d.setdefault(key,[]).append(val.lower())

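This runs a generator expression over each extracted val string and joins the characters that pass the filter. It is not a very efficient solution, but it should meet your needs.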
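
One caveat, as far as I can tell: removing only the digit characters turns " minoxidil 5" into " minoxidil " (with a trailing space), which still differs from " minoxidil" by one character and would still be printed. A minimal sketch of a variant that also collapses the leftover whitespace follows; strip_digits is a hypothetical helper name, and using the re module here is my own choice rather than part of the answer above.

```python
import re


def strip_digits(query):
    """Remove digit characters, then collapse the whitespace left behind."""
    no_digits = re.sub(r"\d+", "", query)
    # split() with no arguments drops leading, trailing and repeated spaces
    return " ".join(no_digits.split()).lower()


samples = [
    " 8 gb pen drive",
    " 16 gb pen drive",
    " minoxidil",
    " minoxidil 5",
    " DVD PLATERS",
]
cleaned = [strip_digits(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

With this normalization, queries that differ only in numbers become identical strings, so the existing v[i] != v[i+1] check skips them before the edit distance is even computed. Note that the printed queries would then no longer contain their digits or the leading space.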