Question

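I'm fairly new to Python. I wrote the following code to find, for each index in queries, the closest other occurrence of the character at that index in a string, and I would like to find a way to optimize it:

Example:

Input string: s = 'adarshravi'

and queries = [2, 4] (these are the indices of the characters whose duplicates should be found; the output should contain the index of the closest duplicate, or -1 if the character is not repeated)

The output for the above queries would be: [0, -1]

Explanation of the output:

For index 2, the character in the string is a. There are two other a's in the string, one at index 0 and another at index 7, so the closer of the two is the one at index 0. The character at index 4 is s, which is not repeated in the string, hence -1.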

def closest(s, queries):

    s = s.lower()
    listIdx = []

    for i in queries:
        foundidx = []
        srchChr = s[i]

        for j in range(0, len(s)):
            if s[j] == srchChr:
                foundidx.append(j)

        if len(foundidx) < 2:
            listIdx.append(-1)
        else:
            lastIdx = -1
            dist = 0
            foundidx.remove(i)
            for fnditem in foundidx:
                if dist == 0:
                    lastIdx = fnditem
                    dist = abs(fnditem - i)
                else:
                    if abs(fnditem - i) < dist:
                        lastIdx = fnditem
                        dist = abs(fnditem - i)
            listIdx.append(lastIdx)
    return listIdx

Answer 1

We can construct a list of indices, like this:

from itertools import zip_longest

def ranges(k, n):
    for t in zip_longest(range(k-1, -1, -1), range(k+1, n)):
        yield from filter(lambda x: x is not None, t)

This generates the indices in the following order:

>>> list(ranges(3, 10))
[2, 4, 1, 5, 0, 6, 7, 8, 9]
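
The same visiting order can also be produced with an explicit loop over distances, which may be easier to follow than the zip_longest version. This is only a sketch; the name outward_indices is made up for illustration and is not part of the answer's code.

```python
def outward_indices(k, n):
    """Yield the indices of range(n) other than k, nearest first, left before right."""
    for dist in range(1, n):
        left, right = k - dist, k + dist
        if left < 0 and right >= n:
            break  # both sides have run off the ends of the range
        if left >= 0:
            yield left
        if right < n:
            yield right


print(list(outward_indices(3, 10)))  # [2, 4, 1, 5, 0, 6, 7, 8, 9]
```

For k = 3 and n = 10 this yields the same list as ranges(3, 10) above.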

We can use the above to check for the closest character:

def close(text, idx):
    ci = text[idx]
    return next(filter(lambda i: ci == text[i], ranges(idx, len(text))), -1)

This then yields:

>>> close('adarshravi', 0)
2
>>> close('adarshravi', 1)
-1
>>> close('adarshravi', 2)
0
>>> close('adarshravi', 3)
6
>>> close('adarshravi', 4)
-1

closest is then just a "mapping" of the close function over the list of indices:

from functools import partial

def closest(text, indices):
    return map(partial(close, text), indices)

For example:

>>> list(closest('adarshravi', range(5)))
[2, -1, 0, 6, -1]

Answer 2

def closest_duplicates(s, queries):
    result = []
    for index in queries:
        result.append(closest_duplicate(s, s[index], index))
    return result

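This one searches for a single item:

The code below starts with two indices: one moving left from the query position and the other moving right. We never need to run this loop more than len(s) - 1 times. When both indices have run off the ends, or as soon as the character is found, we return the index. If it is not found, we return -1.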

def closest_duplicate(s, letter, index):
    min_distance = -1
    for i in range(1, len(s)):
        left_i = index - i
        right_i = index + i
        # stop once both pointers are outside the string
        if left_i < 0 and right_i >= len(s):
            break

        if left_i > -1 and s[left_i] == letter:
            min_distance = left_i
            break
        if right_i < len(s) and s[right_i] == letter:
            min_distance = right_i
            break
    return min_distance

The tests are below:

if __name__ == '__main__':
    s = 'adarshravi'
    indexes = [2, 4]
    result = closest_duplicates(s, indexes)
    print(result)
    batman = 'ilovebatmanandbatman'
    indx = [1,2,5,6]
    result = closest_duplicates(batman, indx)
    print(result)
    batman = 'iloveabatmanbatmanandbatman'
    indx = [7]
    result = closest_duplicates(batman, indx)
    print(result)
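
The tests above only print the results without stating what they should be. One way to check any of the implementations on this page is to compare it with a brute-force reference on every index of a string. The sketch below assumes the tie-breaking rule used in the question (the left neighbour wins when two duplicates are equally close); brute_force_closest and agrees_with_reference are hypothetical helper names introduced here.

```python
def brute_force_closest(s, queries):
    """Reference answer: scan the whole string for every query."""
    result = []
    for i in queries:
        # Candidates are all other positions holding the same character.
        candidates = [j for j, ch in enumerate(s) if ch == s[i] and j != i]
        # Nearest wins; on a tie the smaller (left) index is preferred.
        result.append(min(candidates, key=lambda j: (abs(j - i), j), default=-1))
    return result


def agrees_with_reference(impl, s):
    """Compare impl against the reference for every index of s."""
    queries = list(range(len(s)))
    return list(impl(s, queries)) == brute_force_closest(s, queries)


print(brute_force_closest('adarshravi', [2, 4]))  # [0, -1]
```

A call such as agrees_with_reference(closest_duplicates, 'ilovebatmanandbatman') would then check the function above on all positions of that string at once.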

Answer 3

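This collects the indices of all characters of interest before we start looking for the closest match. That way we avoid redundant computation, and we can also do a simple lookup when a character occurs only once or twice: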

from collections import defaultdict
my_str = 'shroijsfrondhslmbs'
query = [4, 2, 11]

def closest_matches(in_str, query):
    closest = []
    character_positions = defaultdict(list)
    valid_chars = {in_str[idx] for idx in query}
    for i, character in enumerate(in_str):
        if character not in valid_chars:
            continue
        character_positions[character].append(i)
    for idx in query:
        char = in_str[idx]
        # compare integers with == and !=; "is" tests object identity
        if len(character_positions[char]) == 1:
            closest.append(-1)
            continue
        elif len(character_positions[char]) == 2:
            closest.append(next(idx_i for idx_i in character_positions[char] if idx_i != idx))
            continue
        shortest_dist = min(abs(idx_i - idx) for idx_i in character_positions[char] if idx_i != idx)
        closest_match = next(idx_i for idx_i in character_positions[char] if abs(idx_i - idx) == shortest_dist)
        closest.append(closest_match)
    return closest

closest_matches(my_str, query)

Output: [-1, 8, -1]

s = 'adarshravi'
queries = [2, 4]
closest_matches(s, queries)

Output: [0, -1]

Some timings:

%timeit closest_matches(my_str, query)

Result: 8.98 µs ± 30.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Compared with William's answer:

%timeit list(closest(my_str, query))

Result: 55.8 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Compared with your original code:

%timeit closest(my_str, query)

Result: 11.4 µs ± 352 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So you were already doing pretty well!
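
%timeit is an IPython magic, so the measurements above only run inside IPython or Jupyter. In a plain script, a comparable measurement could be sketched with the standard timeit module. The function timed here is a compact stand-in written for this example (an assumption, not the exact code above), and the numbers will differ from those quoted depending on the machine and the Python version.

```python
import timeit
from collections import defaultdict


def closest_matches(in_str, query):
    """Compact stand-in for the function above, used only as a timing target."""
    positions = defaultdict(list)
    for i, ch in enumerate(in_str):
        positions[ch].append(i)
    out = []
    for idx in query:
        others = [p for p in positions[in_str[idx]] if p != idx]
        out.append(min(others, key=lambda p: abs(p - idx), default=-1))
    return out


my_str = 'shroijsfrondhslmbs'
query = [4, 2, 11]

# repeat() returns one total time per run; divide by `number` for the per-call time.
runs = timeit.repeat(lambda: closest_matches(my_str, query), number=10_000, repeat=5)
best = min(runs) / 10_000
print(f"best of 5: {best * 1e6:.2f} µs per call")
```

Taking the minimum of several runs is the usual way to reduce noise from other processes.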

Answer 4

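This works by creating tuples of (index, character) and then comparing the absolute difference between two indices whenever the characters in the tuples are the same. When s_lst is built, the tuples from queries are left out to avoid matching a character with itself.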

s = 'adarshravi'
queries = [2, 4]
queries = [(i, s[i]) for i in queries]

s_lst = [(i, v) for i, v in enumerate(s) if any(v in x for x in queries)]
s_lst = [i for i in s_lst if not any(i[0] in x for x in queries)]

res = []
for i in queries:
    if not any(i[1] in x for x in s_lst):
        res.append(-1)
    else:
        close = None
        for j in s_lst:
            # only consider entries holding the same character
            if j[1] != i[1]:
                continue
            if close is None or abs(j[0] - i[0]) < abs(close[0] - i[0]):
                close = j
        res.append(close[0])

print(res)
# [0, -1]

Answer 5

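There may well be a better solution to this problem than the one I arrive at below, but I want to show how I would go about optimizing the code if I were given this task. Also, I have not executed any of this code, so you may find some syntax errors.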


Let's say len(s) == n and len(queries) == m.

Your current code does the following:

For each query, q:
  1. find the character of the query, c
  2. find the indices of other characters in the string that match c
  3. find the closest index to the original index with the same character as the original index

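Steps 1-3 are executed m times, because there are m queries. Steps 2 and 3 each have to traverse the whole string s (in the worst case, s consists of one repeated character), so each of them takes n steps.

So you perform roughly 2n + 1 steps per query, and roughly (2n + 1) * m steps overall. That is (almost) the runtime complexity of your algorithm. In big-O notation, the complexity is O(n*m).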
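
To make that estimate concrete, the sketch below counts loop iterations instead of measuring time. count_steps is a hypothetical helper that mirrors the two loops of the question's code; it is an illustration of the growth rate, not a profiler.

```python
def count_steps(s, queries):
    """Count the loop iterations performed by the question's approach."""
    steps = 0
    for i in queries:
        # Step 2: scan the whole string for matches of s[i].
        found = []
        for j in range(len(s)):
            steps += 1
            if s[j] == s[i]:
                found.append(j)
        # Step 3: one iteration per match other than the query index itself.
        steps += len(found) - 1
    return steps


n = 1000
worst = 'a' * n  # every character matches, so step 3 is as long as possible
print(count_steps(worst, range(10)))   # 19990
print(count_steps(worst, range(100)))  # 199900
```

In this count each query costs 2n - 1 iterations, so ten times as many queries means ten times as many steps, which is the O(n*m) behaviour described above.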

Let's extract steps 2 and 3 into their own functions:

def findIdxListByPos(s, i):
  foundidx = []
  srchChr = s[i]

  for j in range(0, len(s)):
      if s[j] == srchChr:
        foundidx.append(j)

  return foundidx

def findClosestIndex(foundidx, i):
  # this is not needed because if the character appeared only once,
  # foundidx will be empty and the "for fnditem in foundidx" will not
  # do anything, so you can remove it
  if len(foundidx) < 2:
      return -1

  lastIdx = -1
  dist = 0
  foundidx.remove(i)

  for fnditem in foundidx:
    if dist == 0:
      lastIdx = fnditem
      dist = abs(fnditem - i)
    else:
      if abs(fnditem - i) < dist:
        lastIdx = fnditem
        dist = abs(fnditem - i)

  return lastIdx

def closest(s, queries):
  s = s.lower()
  listIdx = []

  for i in queries:
    foundidx = findIdxListByPos(s, i)
    lastIdx = findClosestIndex(foundidx, i)

    listIdx.append(lastIdx)

  return listIdx

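As you can see, in findIdxListByPos you always look at every position in the string.

Now suppose you have queries = [1, 1]. Then you compute the same foundidx and the same lastIdx twice. We can save that computation and reuse it. That is, you keep foundidx and lastIdx in other variables that are not lost after each query. You can do this with a dictionary that uses the query's character as the key. If you have already computed the value for that key, you do not need to compute it again; you just reuse it.

Your code would then look like this: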

def findIdxListByPos(s, i):
  foundidx = []
  srchChr = s[i]

  for j in range(0, len(s)):
      if s[j] == srchChr:
        foundidx.append(j)

  return foundidx

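def findClosestIndex(foundidx, i):
  lastIdx = -1
  dist = 0

  for fnditem in foundidx:
    # skip the query index itself; foundidx is cached and shared between
    # queries, so it must not be modified with remove()
    if fnditem == i:
      continue
    if dist == 0:
      lastIdx = fnditem
      dist = abs(fnditem - i)
    else:
      if abs(fnditem - i) < dist:
        lastIdx = fnditem
        dist = abs(fnditem - i)

  return lastIdx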

def calculateQueryResult(s, i, allFoundIdx):
  srchChr = s[i]
  if srchChr not in allFoundIdx:
    allFoundIdx[srchChr] = findIdxListByPos(s, i)

  foundidx = allFoundIdx[srchChr]

  return findClosestIndex(foundidx, i)

def closest(s, queries):
  s = s.lower()
  listIdx = []
  allFoundIdx = {}
  queriesResults = {}

  for i in queries:
    if i not in queriesResults:
      queriesResults[i] = calculateQueryResult(s, i, allFoundIdx)

    listIdx.append(queriesResults[i])

  return listIdx

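This change increases the memory used by the algorithm and slightly changes its runtime complexity.

Now, in the worst case, your queries contain no duplicates. What happens when no query is repeated? You query every element of s, and all the elements of s are different!

queries = [0, 1, 2, ..., n - 1], so len(queries) == n, so n == m, and your algorithm now has complexity O(n*n) = O(n^2).

Now you can see that in the worst case your allFoundIdx dictionary will contain every character together with all of its positions in the string. So memory-wise this is equivalent to precomputing that dictionary for all values in the string up front. Precomputing everything does not improve the runtime complexity, but it does not make it worse either.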

def findClosestIndex(foundidx, i):
  lastIdx = -1
  dist = 0

  for fnditem in foundidx:
    # skip the query index itself; foundidx is shared between queries,
    # so it must not be modified with remove()
    if fnditem == i:
      continue
    if dist == 0:
      lastIdx = fnditem
      dist = abs(fnditem - i)
    else:
      if abs(fnditem - i) < dist:
        lastIdx = fnditem
        dist = abs(fnditem - i)

  return lastIdx

def calculateAllFoundIdx(s):
  allFoundIdx = {}
  for i in range(0, len(s)):
    srchChr = s[i]

    # you should read about the setdefault method of dictionaries. It
    # returns the list stored under srchChr, storing a new empty list
    # first if the key is missing, so we can append to it directly
    allFoundIdx.setdefault(srchChr, []).append(i)

  return allFoundIdx

def closest(s, queries):
  s = s.lower()
  listIdx = []
  queriesResults = {}

  # this has complexity O(n)
  allFoundIdx = calculateAllFoundIdx(s)

  # this still has complexity O(n^2) because findClosestIndex still has O(n)
  # the for loop executes it n times
  for i in queries:
    if i not in queriesResults:
      srchChr = s[i]
      foundidx = allFoundIdx[srchChr]
      queriesResults[i] = findClosestIndex(foundidx, i)

    listIdx.append(queriesResults[i])

  return listIdx

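This algorithm is still O(n^2), but now you only need to optimize the findClosestIndex function, because there is no way to avoid iterating over all the queries.

So, in findClosestIndex you get a list of numbers (the positions of some character in the original string) that is sorted in increasing order (because of how it was built), plus another number for which you want to find the closest one (this number is guaranteed to be in the list).

The closest number (because the list is sorted) must be either the previous or the next one in the list. Every other number is "farther away" than these two.

So, basically, you want to find the index of this number in the list, then take the previous and the next elements in the list, compare their distances, and return the closer one.

To find a number in a sorted list you can use binary search; you just need to be careful with the indices to get the final result: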

def binSearch(foundidx, idx):
  hi = len(foundidx) - 1
  lo = 0

  while lo <= hi:
    # integer division, so that m can be used as a list index
    m = (hi + lo) // 2
    if foundidx[m] < idx:
      lo = m + 1
    elif foundidx[m] > idx:
      hi = m - 1
    else:
      return m

  # should never get here as we are sure the idx is in foundidx
  return -1

def findClosestIndex(foundidx, idx):
  if len(foundidx) == 1:
    return -1

  pos = binSearch(foundidx, idx)

  if pos == 0:
    return foundidx[pos + 1]

  if pos == len(foundidx) - 1:
    return foundidx[pos - 1]

  prevDist = abs(foundidx[pos - 1] - idx)
  postDist = abs(foundidx[pos + 1] - idx)

  # return the position in the string, not the position inside foundidx
  if prevDist <= postDist:
    return foundidx[pos - 1]

  return foundidx[pos + 1]

def calculateAllFoundIdx(s):
  allFoundIdx = {}
  for i in range(0, len(s)):
    srchChr = s[i]

    # you should read about the setdefault method of dictionaries. It
    # returns the list stored under srchChr, storing a new empty list
    # first if the key is missing, so we can append to it directly
    allFoundIdx.setdefault(srchChr, []).append(i)

  return allFoundIdx

def closest(s, queries):
  s = s.lower()
  listIdx = []
  queriesResults = {}

  # this has complexity O(n)
  allFoundIdx = calculateAllFoundIdx(s)

  # this has now complexity O(n*log(n)) because findClosestIndex now has O(log(n))
  for i in queries:
    if i not in queriesResults:
      srchChr = s[i]
      foundidx = allFoundIdx[srchChr]
      queriesResults[i] = findClosestIndex(foundidx, i)

    listIdx.append(queriesResults[i])

  return listIdx

Now findClosestIndex has complexity O(log(n)), so closest now has complexity O(n*log(n)).

The worst case is now when all the elements of s are the same and queries = [0, 1, ..., len(s) - 1].
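
For comparison, the same precompute-then-binary-search idea can be written more compactly with the standard library: collections.defaultdict for the position lists and bisect.bisect_left in place of a hand-written binary search. This is a sketch rather than a drop-in replacement; the name closest_bisect is made up for this example, and ties are resolved in favour of the left neighbour, as in the code above.

```python
from bisect import bisect_left
from collections import defaultdict


def closest_bisect(s, queries):
    s = s.lower()
    # One O(n) pass: positions[c] is the increasing list of indices where c occurs.
    positions = defaultdict(list)
    for i, ch in enumerate(s):
        positions[ch].append(i)

    result = []
    for i in queries:
        occ = positions[s[i]]
        pos = bisect_left(occ, i)  # i is in occ, so occ[pos] == i
        # Only the direct neighbours in the sorted list can be the nearest.
        neighbours = []
        if pos > 0:
            neighbours.append(occ[pos - 1])
        if pos + 1 < len(occ):
            neighbours.append(occ[pos + 1])
        # min() keeps the first of equal keys, so the left neighbour wins ties.
        result.append(min(neighbours, key=lambda j: abs(j - i), default=-1))
    return result


print(closest_bisect('adarshravi', [2, 4]))  # [0, -1]
```

Each query costs one binary search over the occurrences of a single character, so the total work is O(n + m*log(n)) for a string of length n and m queries.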

Answer 6

s = 'adarshravi'
result = list()
indexes = [2, 4]
for index in indexes:
    c = s[index]
    back = index - 1
    forward = index + 1
    r = -1
    while (back >= 0 or forward < len(s)):
        if back >= 0 and c == s[back]:
            r = back
            break
        if forward < len(s) and c == s[forward]:
            r = forward
            break
        back -= 1
        forward += 1
    result.append(r)

print(result)

Closest character in a string using Python

6 answers: