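I am fairly new to Python. I wrote the code below to find, for each index in `queries`, the closest occurrence of the same character in a string, and I would like to find a way to optimize it.
Example:
Input string: `s = 'adarshravi'`
and `queries = [2, 4]`
(These are the indices of the characters whose duplicates should be found. The output should contain the index of the closest duplicate, or -1 if the character is not repeated.)
The output for the queries above will be:
[0, -1]
Explanation of the output:
For index 2, the character is `a`. The string contains two other `a`s, one at index 0 and one at index 7, and the closer of the two is the one at index 0. The character at index 4 is `s`, which is not repeated in the string, so the result is -1.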
def closest(s, queries):
    s = s.lower()
    listIdx = []
    for i in queries:
        foundidx = []
        srchChr = s[i]
        for j in range(0, len(s)):
            if s[j] == srchChr:
                foundidx.append(j)
        if len(foundidx) < 2:
            listIdx.append(-1)
        else:
            lastIdx = -1
            dist = 0
            foundidx.remove(i)
            for fnditem in foundidx:
                if dist == 0:
                    lastIdx = fnditem
                    dist = abs(fnditem - i)
                else:
                    if abs(fnditem - i) < dist:
                        lastIdx = fnditem
                        dist = abs(fnditem - i)
            listIdx.append(lastIdx)
    return listIdx
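As a reference point for the example above, the expected behaviour can be written as a compact brute-force function. This is only a sketch, not the asker's code: the name `closest_bruteforce` is made up for illustration, and it assumes that ties go to the smaller index, which is what the loop above does.

```python
def closest_bruteforce(s, queries):
    """Return, for each query index, the index of the nearest equal character, or -1."""
    s = s.lower()
    result = []
    for i in queries:
        # Every other position holding the same character as s[i].
        others = [j for j, ch in enumerate(s) if ch == s[i] and j != i]
        # min() returns the first minimal element, so ties go to the smaller index.
        result.append(min(others, key=lambda j: abs(j - i)) if others else -1)
    return result


print(closest_bruteforce('adarshravi', [2, 4]))  # [0, -1]
```

A function like this is slow for long inputs, but it is handy as an oracle when checking faster versions.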
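Answer 0 (score: 3)
We can generate the candidate indices in order of increasing distance from `k`, for example:
from itertools import zip_longest

def ranges(k, n):
    for t in zip_longest(range(k-1, -1, -1), range(k+1, n)):
        yield from filter(lambda x: x is not None, t)
This generates indices like: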
>>> list(ranges(3, 10))
[2, 4, 1, 5, 0, 6, 7, 8, 9]
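To make the ordering easier to see, here is a sketch of the same idea written as an explicit loop instead of `zip_longest`. The name `ranges_explicit` is hypothetical and the function is an illustration, not part of the answer.

```python
def ranges_explicit(k, n):
    """Yield every index of range(n) except k, nearest to k first (left before right)."""
    for step in range(1, n):
        left, right = k - step, k + step
        if left >= 0:
            yield left
        if right < n:
            yield right


print(list(ranges_explicit(3, 10)))  # [2, 4, 1, 5, 0, 6, 7, 8, 9]
```

Because the left neighbour is yielded before the right one at each distance, a search that stops at the first match prefers the left index on a tie.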
We can use `ranges` to look for the closest matching character:
def close(text, idx):
    ci = text[idx]
    return next(filter(lambda i: ci == text[i], ranges(idx, len(text))), -1)
This then yields:
>>> close('adarshravi', 0)
2
>>> close('adarshravi', 1)
-1
>>> close('adarshravi', 2)
0
>>> close('adarshravi', 3)
6
>>> close('adarshravi', 4)
-1
`closest` is then simply a "mapping" of the `close` function over the list of indices:
from functools import partial

def closest(text, indices):
    return map(partial(close, text), indices)
For example:
>>> list(closest('adarshravi', range(5)))
[2, -1, 0, 6, -1]
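Answer 1 (score: 2)
def closest_duplicates(s, queries):
    result = []
    for index in queries:
        result.append(closest_duplicate(s, s[index], index))
    return result
The function below handles the search for a single index.
It starts with two indices, one moving left and one moving right. The loop never needs to run more than `len(s) - 1` times. As soon as either index finds the character we return that position, and we stop once both indices have run off the ends of the string. If nothing is found, -1 is returned.
def closest_duplicate(s, letter, index):
    min_distance = -1
    for i in range(1, len(s)):
        left_i = index - i
        right_i = index + i
        # both indices are outside the string: nothing left to check
        if left_i < 0 and right_i >= len(s):
            break
        if left_i > -1 and s[left_i] == letter:
            min_distance = left_i
            break
        if right_i < len(s) and s[right_i] == letter:
            min_distance = right_i
            break
    return min_distance
The tests are below: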
if __name__ == '__main__':
    s = 'adarshravi'
    indexes = [2, 4]
    result = closest_duplicates(s, indexes)
    print(result)
    batman = 'ilovebatmanandbatman'
    indx = [1, 2, 5, 6]
    result = closest_duplicates(batman, indx)
    print(result)
    batman = 'iloveabatmanbatmanandbatman'
    indx = [7]
    result = closest_duplicates(batman, indx)
    print(result)
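The prints above do not show what to expect, so here is a sketch that cross-checks an outward scan against a brute-force search for every index of the three test strings. Both `scan_outward` and `brute_force` are hypothetical names; `scan_outward` is a condensed restatement of the two-index idea, not the answer's exact code.

```python
def scan_outward(s, index):
    """Walk left and right from index in lockstep; return the first match, else -1."""
    for step in range(1, len(s)):
        left, right = index - step, index + step
        if left >= 0 and s[left] == s[index]:
            return left
        if right < len(s) and s[right] == s[index]:
            return right
    return -1


def brute_force(s, index):
    """Compare against every other position; ties go to the smaller index."""
    others = [j for j, ch in enumerate(s) if ch == s[index] and j != index]
    return min(others, key=lambda j: abs(j - index)) if others else -1


for text in ('adarshravi', 'ilovebatmanandbatman', 'iloveabatmanbatmanandbatman'):
    # Both strategies should agree at every position of every test string.
    assert all(scan_outward(text, i) == brute_force(text, i) for i in range(len(text)))

print([scan_outward('ilovebatmanandbatman', i) for i in [1, 2, 5, 6]])  # [-1, -1, 14, 9]
```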
Answer 2 (score: 1)
This collects the indices of all the characters of interest before we start looking for the closest match. That way we avoid redundant computation, and we can also do a simple lookup when a character occurs only once or twice:
from collections import defaultdict

my_str = 'shroijsfrondhslmbs'
query = [4, 2, 11]

def closest_matches(in_str, query):
    closest = []
    character_positions = defaultdict(list)
    valid_chars = {in_str[idx] for idx in query}
    for i, character in enumerate(in_str):
        if character not in valid_chars:
            continue
        character_positions[character].append(i)
    for idx in query:
        char = in_str[idx]
        # == and != compare values; `is` would test object identity
        if len(character_positions[char]) == 1:
            closest.append(-1)
            continue
        elif len(character_positions[char]) == 2:
            closest.append(next(idx_i for idx_i in character_positions[char] if idx_i != idx))
            continue
        shortest_dist = min(abs(idx_i - idx) for idx_i in character_positions[char] if idx_i != idx)
        closest_match = next(idx_i for idx_i in character_positions[char] if abs(idx_i - idx) == shortest_dist)
        closest.append(closest_match)
    return closest

closest_matches(my_str, query)
Output: [-1, 8, -1]
s = 'adarshravi'
queries = [2, 4]
closest_matches(s, queries)
Output: [0, -1]
Some timings:
%timeit closest_matches(my_str, query)
Result: 8.98 µs ± 30.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Compared with William's answer (the `map`-based `closest`):
%timeit list(closest(my_str, query))
Result: 55.8 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Compared with your original code:
%timeit closest(my_str, query)
Result: 11.4 µs ± 352 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So you were already doing pretty well!
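`%timeit` is an IPython magic, so it is only available in IPython or Jupyter. Outside of those, a similar measurement can be sketched with the standard `timeit` module. The function `closest_matches_demo` is a stand-in defined here only so the block runs on its own; it is an assumption, not the code timed above, and the numbers will differ from machine to machine.

```python
import timeit
from collections import defaultdict


def closest_matches_demo(in_str, query):
    """Stand-in for the function being timed (position lists per character)."""
    positions = defaultdict(list)
    for i, ch in enumerate(in_str):
        positions[ch].append(i)
    out = []
    for idx in query:
        others = [j for j in positions[in_str[idx]] if j != idx]
        out.append(min(others, key=lambda j: abs(j - idx)) if others else -1)
    return out


my_str = 'shroijsfrondhslmbs'
query = [4, 2, 11]

# repeat=7 mirrors the 7 runs reported by %timeit; number is the loop count per run.
runs = timeit.repeat(lambda: closest_matches_demo(my_str, query), repeat=7, number=10_000)
best_us = min(runs) / 10_000 * 1e6
print(f'best of 7: {best_us:.2f} µs per loop')
```

Taking the minimum of the runs is a common choice because the slower runs are usually caused by other activity on the machine rather than by the code itself.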
Answer 3 (score: 1)
This works by creating (index, character) tuples and then comparing the absolute difference of two indices whenever the characters in the tuples are the same. When `s_lst` is created, the tuples in `queries` are left out so that a character is never matched with itself.
s = 'adarshravi'
queries = [2, 4]
queries = [(i, s[i]) for i in queries]
s_lst = [(i, v) for i, v in enumerate(s) if any(v in x for x in queries)]
s_lst = [i for i in s_lst if not any(i[0] in x for x in queries)]
res = []
for i in queries:
    if not any(i[1] in x for x in s_lst):
        res.append(-1)
    else:
        close = None
        for j in s_lst:
            if j[1] != i[1]:
                continue  # only compare entries holding the same character
            if close is None or abs(j[0] - i[0]) < abs(close[0] - i[0]):
                close = j
        res.append(close[0])
print(res)
# [0, -1]
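Answer 4 (score: 0)
The solution I arrive at below may well not be the optimal one, but I want to show how I would go about optimizing the code if I were given this task. Also, I have not executed any of the code, so you may find some syntax errors.
===========================================================================
Let's say `len(s) == n` and `len(queries) == m`.
Your current code does the following: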
For each query, q:
1. find the character of the query, c
2. find the indices of other characters in the string that match c
3. find the closest index to the original index with the same character as the original index
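Steps 1-3 are executed `m` times, because there are `m` queries.
Steps 2 and 3 each have to walk the whole string `s` (in the worst case, `s` consists of a single repeated character), so each of them takes `n` steps.
So you perform roughly `2n + 1` steps per query, and roughly `(2n + 1) * m` steps overall. That is (almost) the runtime complexity of your algorithm. In big-O notation, the complexity is `O(n*m)`.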
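The `n * m` growth can be made concrete with a small counter. This is only an illustrative sketch; `count_comparisons` is a hypothetical helper that counts the character comparisons of step 2 when every query rescans the whole string.

```python
def count_comparisons(s, queries):
    """Count the character comparisons made when every query rescans all of s."""
    comparisons = 0
    for i in queries:
        for j in range(len(s)):
            comparisons += 1  # stands for one s[j] == s[i] test
    return comparisons


s = 'adarshravi'
print(count_comparisons(s, [2, 4]))         # 20, i.e. len(s) * len(queries)
print(count_comparisons(s, range(len(s))))  # 100, i.e. n * n when m == n
```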
Let's extract steps 2 and 3 into their own functions:
def findIdxListByPos(s, i):
    foundidx = []
    srchChr = s[i]
    for j in range(0, len(s)):
        if s[j] == srchChr:
            foundidx.append(j)
    return foundidx

def findClosestIndex(foundidx, i):
    # this check is not needed: if the character appears only once,
    # foundidx is empty after remove(i), the "for fnditem in foundidx" loop
    # does nothing and -1 is returned, so you can remove it
    if len(foundidx) < 2:
        return -1
    lastIdx = -1
    dist = 0
    foundidx.remove(i)
    for fnditem in foundidx:
        if dist == 0:
            lastIdx = fnditem
            dist = abs(fnditem - i)
        else:
            if abs(fnditem - i) < dist:
                lastIdx = fnditem
                dist = abs(fnditem - i)
    return lastIdx

def closest(s, queries):
    s = s.lower()
    listIdx = []
    for i in queries:
        foundidx = findIdxListByPos(s, i)
        lastIdx = findClosestIndex(foundidx, i)
        listIdx.append(lastIdx)
    return listIdx
As you can see, `findIdxListByPos` always looks at every position in the string.
Now suppose you have `queries = [1, 1]`: you compute the same `foundidx` and the same `lastIdx` twice. We can save that computation and reuse it. That is, you keep `foundidx` and `lastIdx` in other variables that are not lost after each query. You can do this with dictionaries, using the query character as the key for `foundidx` and the query index as the key for `lastIdx`. If a key has already been computed, there is no need to compute it again; just reuse it.
Your code would look like this:
def findIdxListByPos(s, i):
    foundidx = []
    srchChr = s[i]
    for j in range(0, len(s)):
        if s[j] == srchChr:
            foundidx.append(j)
    return foundidx

def findClosestIndex(foundidx, i):
    lastIdx = -1
    dist = 0
    for fnditem in foundidx:
        # foundidx is shared between queries, so skip i instead of removing it
        if fnditem == i:
            continue
        if dist == 0:
            lastIdx = fnditem
            dist = abs(fnditem - i)
        else:
            if abs(fnditem - i) < dist:
                lastIdx = fnditem
                dist = abs(fnditem - i)
    return lastIdx

def calculateQueryResult(s, i, allFoundIdx):
    srchChr = s[i]
    if srchChr not in allFoundIdx:
        allFoundIdx[srchChr] = findIdxListByPos(s, i)
    foundidx = allFoundIdx[srchChr]
    return findClosestIndex(foundidx, i)

def closest(s, queries):
    s = s.lower()
    listIdx = []
    allFoundIdx = {}     # character -> list of its positions
    queriesResults = {}  # query index -> closest index
    for i in queries:
        if i not in queriesResults:
            queriesResults[i] = calculateQueryResult(s, i, allFoundIdx)
        listIdx.append(queriesResults[i])
    return listIdx
This change increases the memory used by your algorithm and slightly changes its runtime complexity.
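The same reuse of earlier results can also be sketched with `functools.lru_cache` from the standard library instead of hand-written dictionaries. This is an illustrative alternative, not the code above; `make_closest`, `positions` and `closest_to` are hypothetical names. Position lists are stored as tuples so that a cached value cannot be modified by accident.

```python
from functools import lru_cache


def make_closest(s):
    """Build a cached per-index lookup for one fixed string."""
    s = s.lower()

    @lru_cache(maxsize=None)
    def positions(ch):
        # Computed at most once per distinct character.
        return tuple(j for j, c in enumerate(s) if c == ch)

    @lru_cache(maxsize=None)
    def closest_to(i):
        # Computed at most once per distinct query index.
        others = [j for j in positions(s[i]) if j != i]
        return min(others, key=lambda j: abs(j - i)) if others else -1

    return closest_to


closest_to = make_closest('adarshravi')
print([closest_to(i) for i in [2, 2, 4]])  # [0, 0, -1]
print(closest_to.cache_info().hits)        # 1 (the repeated query for index 2)
```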
Now, the worst case is when there are no duplicates in your queries. What does a query list without duplicates look like in the worst case? You query every element of `s`, and all the elements of `s` are different!
queries = [0, 1, 2, ..., n-1]
So `len(queries) == n`, i.e. `n == m`, and your algorithm now has complexity `O(n*n) = O(n^2)`.
Now you can see that in this worst case your `allFoundIdx` dictionary will contain every character together with all of its positions in the string. So, memory-wise, this is equivalent to precomputing that dictionary up front for all the characters in the string. Computing everything up front does not improve the runtime complexity, but it does not make it worse either.
def findClosestIndex(foundidx, i):
    lastIdx = -1
    dist = 0
    for fnditem in foundidx:
        # foundidx is shared between queries, so skip i instead of removing it
        if fnditem == i:
            continue
        if dist == 0:
            lastIdx = fnditem
            dist = abs(fnditem - i)
        else:
            if abs(fnditem - i) < dist:
                lastIdx = fnditem
                dist = abs(fnditem - i)
    return lastIdx

def calculateAllFoundIdx(s):
    allFoundIdx = {}
    for i in range(0, len(s)):
        srchChr = s[i]
        # you should read about the setdefault method of dictionaries. It
        # returns the list stored under srchChr, storing an empty list first
        # if there is no value for that key yet
        allFoundIdx.setdefault(srchChr, []).append(i)
    return allFoundIdx

def closest(s, queries):
    s = s.lower()
    listIdx = []
    queriesResults = {}
    # this has complexity O(n)
    allFoundIdx = calculateAllFoundIdx(s)
    # this still has complexity O(n^2) because findClosestIndex still has O(n)
    # and the for loop executes it n times
    for i in queries:
        if i not in queriesResults:
            srchChr = s[i]
            foundidx = allFoundIdx[srchChr]
            queriesResults[i] = findClosestIndex(foundidx, i)
        listIdx.append(queriesResults[i])
    return listIdx
This algorithm is still `O(n^2)`, but now you only need to optimize the `findClosestIndex` function, because there is no way around iterating over all the queries.
So, in `findClosestIndex` you receive a list of numbers (the positions of some character in the original string) that is sorted in increasing order (because of the way it is constructed), plus another number for which you want to find the closest one (this number is guaranteed to be in the list).
Because the list is sorted, the closest number must be either the previous or the next one in the list. Any other number is "farther away" than these two.
So, basically, you want to find the index of this number in the list, take the previous and the next elements, compare their distances and return the closer one.
To find a number in a sorted list, use binary search; you just have to be careful with the indices to get the final result:
def binSearch(foundidx, idx):
    hi = len(foundidx) - 1
    lo = 0
    while lo <= hi:
        m = (hi + lo) // 2  # integer division, m is used as a list index
        if foundidx[m] < idx:
            lo = m + 1
        elif foundidx[m] > idx:
            hi = m - 1
        else:
            return m
    # should never get here as we are sure the idx is in foundidx
    return -1

def findClosestIndex(foundidx, idx):
    if len(foundidx) == 1:
        return -1
    pos = binSearch(foundidx, idx)
    if pos == 0:
        return foundidx[pos + 1]
    if pos == len(foundidx) - 1:
        return foundidx[pos - 1]
    prevDist = abs(foundidx[pos - 1] - idx)
    postDist = abs(foundidx[pos + 1] - idx)
    # return the string index stored at the neighbouring list position
    if prevDist <= postDist:
        return foundidx[pos - 1]
    return foundidx[pos + 1]

def calculateAllFoundIdx(s):
    allFoundIdx = {}
    for i in range(0, len(s)):
        srchChr = s[i]
        # you should read about the setdefault method of dictionaries. It
        # returns the list stored under srchChr, storing an empty list first
        # if there is no value for that key yet
        allFoundIdx.setdefault(srchChr, []).append(i)
    return allFoundIdx

def closest(s, queries):
    s = s.lower()
    listIdx = []
    queriesResults = {}
    # this has complexity O(n)
    allFoundIdx = calculateAllFoundIdx(s)
    # this has now complexity O(n*log(n)) because findClosestIndex now has O(log(n))
    for i in queries:
        if i not in queriesResults:
            srchChr = s[i]
            foundidx = allFoundIdx[srchChr]
            queriesResults[i] = findClosestIndex(foundidx, i)
        listIdx.append(queriesResults[i])
    return listIdx
Now `findClosestIndex` has complexity `O(log(n))`, so `closest` now has complexity `O(n*log(n))`.
The worst case is now the one where all the elements in `s` are the same and `queries = [0, 1, ..., len(s) - 1]`.
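The hand-written binary search can also be replaced by the standard library's `bisect` module. The sketch below is one possible way to do that, not the answer's own code; `build_positions` and `closest_bisect` are hypothetical names, and ties are resolved in favour of the left neighbour.

```python
from bisect import bisect_left


def build_positions(s):
    """Map each character to the ascending list of indices where it occurs."""
    positions = {}
    for i, ch in enumerate(s):
        positions.setdefault(ch, []).append(i)
    return positions


def closest_bisect(s, queries):
    s = s.lower()
    positions = build_positions(s)
    result = []
    for i in queries:
        found = positions[s[i]]
        pos = bisect_left(found, i)      # i is in found, so found[pos] == i
        best = -1
        if pos > 0:
            best = found[pos - 1]        # nearest occurrence to the left
        if pos + 1 < len(found):
            right = found[pos + 1]       # nearest occurrence to the right
            # Strict '<' keeps the left neighbour on a tie.
            if best == -1 or right - i < i - best:
                best = right
        result.append(best)
    return result


print(closest_bisect('adarshravi', [2, 4]))    # [0, -1]
print(closest_bisect('adarshravi', range(5)))  # [2, -1, 0, 6, -1]
```

Building the position lists takes one pass over the string, and each query then costs one binary search in the list for its character.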
Answer 5 (score: -1)
s = 'adarshravi'
result = list()
indexes = [2, 4]
for index in indexes:
    c = s[index]
    back = index - 1
    forward = index + 1
    r = -1
    # move one step left and one step right until a match or both ends are passed
    while (back >= 0 or forward < len(s)):
        if back >= 0 and c == s[back]:
            r = back
            break
        if forward < len(s) and c == s[forward]:
            r = forward
            break
        back -= 1
        forward += 1
    result.append(r)
print(result)