Question

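I've hit a wall on this one, and I was wondering whether some fresh brains could help me out.

I have a large list of four-element tuples in the format:

(ID number, Type, Start Index, End Index)

Earlier in the code, I searched through thousands of blocks of text for two specific types of substrings. These tuples store which block of text the substring was found in, which of the two types of substring it is, and the start and end index of that substring.

The final goal is to look through this list and find all instances where a type 1 substring occurs before a type 2 substring in a block of text with the same ID. Then I would like to store these objects in the format (ID, Type 1, Start, End, Type2, Start, End).

I've tried a bunch of approaches that were all extremely inefficient. I have the list sorted by ID and then by start index, and I have been trying different ways of popping items off the list for comparison. I have to imagine there is a more elegant solution. Are there any brilliant people out there who would like to help my tired brain?

Thanks in advance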

Answer 1

Solution:

result = [(l1 + l2[1:]) 
          for l1 in list1 
          for l2 in list2 
          if (l1[0] == l2[0] and l1[3] < l2[2])
          ]

...with test code:

list1 = [(1, 'Type1', 20, 30,),
         (2, 'Type1', 20, 30,),
         (3, 'Type1', 20, 30,),
         (4, 'Type1', 20, 30,),
         (5, 'Type1', 20, 30,),
         (6, 'Type1', 20, 30,), # does not have Type2

         (8, 'Type1', 20, 30,), # multiple
         (8, 'Type1', 25, 35,), # multiple
         (8, 'Type1', 50, 55,), # multiple
         ]

list2 = [(1, 'Type2', 40, 50,), # after
         (2, 'Type2', 10, 15,), # before
         (3, 'Type2', 25, 28,), # inside
         (4, 'Type2', 25, 35,), # inside-after
         (4, 'Type2', 15, 25,), # inside-before
         (7, 'Type2', 20, 30,), # does not have Type1

         (8, 'Type2', 40, 50,), # multiple
         (8, 'Type2', 60, 70,), # multiple
         (8, 'Type2', 80, 90,), # multiple
         ]

result = [(l1 + l2[1:]) 
          for l1 in list1 
          for l2 in list2 
          if (l1[0] == l2[0] and l1[3] < l2[2])
          ]

print('\n'.join(str(r) for r in result))

It is not clear what result you expect when Type1 and Type2 each occur multiple times within the same text ID. Please clarify.
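The comprehension above compares every Type1 tuple with every Type2 tuple, regardless of ID. As a rough sketch (not part of the original answer), the Type2 tuples could first be indexed by ID so that each Type1 tuple is only compared with tuples from its own text block. The helper name `pairs_by_id` is made up for illustration, and "before" is assumed to mean "Type1 ends before Type2 starts", as in the comprehension.

```python
from collections import defaultdict

def pairs_by_id(list1, list2):
    """Pair each Type1 tuple with every later Type2 tuple sharing its ID."""
    # Index the Type2 tuples by ID so each Type1 only scans its own block.
    type2_by_id = defaultdict(list)
    for t2 in list2:
        type2_by_id[t2[0]].append(t2)

    return [t1 + t2[1:]
            for t1 in list1
            for t2 in type2_by_id.get(t1[0], ())
            if t1[3] < t2[2]]  # assumption: Type1 ends before Type2 starts

list1 = [(1, 'Type1', 20, 30), (2, 'Type1', 20, 30), (8, 'Type1', 50, 55)]
list2 = [(1, 'Type2', 40, 50), (2, 'Type2', 10, 15), (8, 'Type2', 60, 70)]
result = pairs_by_id(list1, list2)
```

This should produce the same pairs as the nested comprehension, in the same order, while skipping comparisons between different IDs.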

Answer 2

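I don't know how many types you have. But if we assume you only have type 1 and type 2, then this sounds like a problem similar to a merge sort. Done merge-sort style, it takes a single pass through the list.

Take two indexes, one for type 1 and one for type 2 (I1, I2). Sort the list by id, start1. Start I1 at the first instance of type1 and I2 at zero. If I1.id < I2.id then increment I1. If I2.id < I1.id then increment I2. If I1.id = I2.id then check iStart.

I1 can only stop on a type 1 record and I2 can only stop on a type 2 record. Keep incrementing each index until it lands on an appropriate record.

You can make some assumptions to speed this up. When you find a block that succeeds, you can move I1 on to the next block. Whenever I2 < I1, you can start I2 at I1 + 1 (WOOPS, make sure you don't do this, because you would miss the failure case!). Whenever you detect an obvious failure case, move I1 and I2 to the next block (on appropriate recs, of course).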
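A minimal sketch of this merge-style pass might look like the following. It simplifies the description by first splitting the sorted records into a type 1 list and a type 2 list, so each index only ever lands on records of its own type. The function name `merge_pairs`, the type markers `'Type1'` and `'Type2'`, and the "ends before starts" test are assumptions, not something the answer specifies.

```python
def merge_pairs(records, type1='Type1', type2='Type2'):
    """Merge-style pass over records sorted by (id, start)."""
    records = sorted(records, key=lambda r: (r[0], r[2]))
    ones = [r for r in records if r[1] == type1]
    twos = [r for r in records if r[1] == type2]
    out = []
    i1 = i2 = 0
    while i1 < len(ones) and i2 < len(twos):
        id1, id2 = ones[i1][0], twos[i2][0]
        if id1 < id2:
            i1 += 1          # no type 2 record for this ID yet
        elif id2 < id1:
            i2 += 1          # no type 1 record for this ID yet
        else:
            # Same ID: find where this ID's type 2 block ends.
            j = i2
            while j < len(twos) and twos[j][0] == id1:
                j += 1
            # Pair each type 1 of this ID with the type 2s starting after it ends.
            while i1 < len(ones) and ones[i1][0] == id1:
                t1 = ones[i1]
                out.extend(t1 + t2[1:] for t2 in twos[i2:j] if t1[3] < t2[2])
                i1 += 1
            i2 = j           # both indexes now sit on the next block
    return out

records = [(1, 'Type1', 10, 15), (1, 'Type2', 20, 25),
           (2, 'Type2', 10, 15), (2, 'Type1', 20, 25),
           (2, 'Type2', 40, 45), (3, 'Type1', 5, 9)]
pairs = merge_pairs(records)
```

IDs that appear in only one of the two lists are skipped without any pairwise comparison, which is where the merge-style approach saves work.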

Answer 3

I recently did something like this. I may not be understanding your problem, but here goes.

I would use a dictionary:

from collections import defaultdict

# Type1 and Type2 are assumed to hold the two type markers used in myList.
masterdictType1 = defaultdict(dict)
masterdictType2 = defaultdict(dict)

# Only the first occurrence of each type is kept per ID.
for item in myList:
    if item[1] == Type1:
        if item[0] not in masterdictType1:
            masterdictType1[item[0]]['begin'] = item[2]  # start index
            masterdictType1[item[0]]['end'] = item[-1]  # end index
    if item[1] == Type2:
        if item[0] not in masterdictType2:
            masterdictType2[item[0]]['begin'] = item[2]  # start index
            masterdictType2[item[0]]['end'] = item[-1]  # end index

joinedDict=defaultdict(dict)

for id in masterdictType1:
    if id in masterdictType2:
        if masterdictType1[id]['begin']<masterdictType2[id]['begin']:
            joinedDict[id]['Type1Begin']=masterdictType1[id]['begin']
            joinedDict[id]['Type1End']=masterdictType1[id]['end']
            joinedDict[id]['Type2Begin']=masterdictType2[id]['begin']
            joinedDict[id]['Type2End']=masterdictType2[id]['end']

This gives you clarity, and it gives you something durable, since you can easily pickle the dictionary.
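As a small illustration of the durability point, a dictionary shaped like `joinedDict` can be round-tripped through the standard `pickle` module. The sample values below are invented for the example.

```python
import pickle
from collections import defaultdict

# Hypothetical sample data in the same shape as joinedDict above.
joined = defaultdict(dict)
joined[1]['Type1Begin'] = 20
joined[1]['Type1End'] = 30
joined[1]['Type2Begin'] = 40
joined[1]['Type2End'] = 50

blob = pickle.dumps(joined)      # serialize to bytes (could be written to a file)
restored = pickle.loads(blob)    # load it back later
```

The restored object is again a `defaultdict` with the same contents.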

Answer 4

Assuming there are lots of entries for each ID, I would do this (pseudocode):

    for each ID:
        for each type2 substring of that ID:
            store it in an ordered list, sorted by start point
        for each type1 substring of that ID:
            calculate the end point (or whatever)
            look it up in the ordered list
            if there's anything to the right, you have a hit
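The pseudocode above could be sketched in Python roughly as follows, using the standard `bisect` module for the lookup. The function name `hits_per_id` and the type markers are assumptions, and "before" is taken here to mean "type1 ends before type2 starts"; swap in the other comparison if that is what you need.

```python
from bisect import bisect_right
from collections import defaultdict

def hits_per_id(records, type1='Type1', type2='Type2'):
    # Collect the type2 (start, end) pairs per ID, then sort each list once.
    starts = defaultdict(list)
    for rec_id, kind, start, end in records:
        if kind == type2:
            starts[rec_id].append((start, end))
    for lst in starts.values():
        lst.sort()

    out = []
    for rec_id, kind, start, end in records:
        if kind != type1:
            continue
        ordered = starts.get(rec_id, [])
        # Position of the first type2 whose start is strictly greater than
        # this type1's end; everything to the right of it is a hit.
        pos = bisect_right(ordered, (end, float('inf')))
        for s2, e2 in ordered[pos:]:
            out.append((rec_id, type1, start, end, type2, s2, e2))
    return out

records = [(8, 'Type1', 20, 30), (8, 'Type2', 40, 50), (8, 'Type2', 25, 28),
           (8, 'Type1', 50, 55), (8, 'Type2', 60, 70)]
hits = hits_per_id(records)
```

This version does not require the input to be pre-sorted, at the cost of building the per-ID lists up front.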

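So if you have control of the initial sort, then instead of (ID, start) you want the list sorted by ID, then by type (2 before 1). Then, within each type, sort by the start point for type2s, and by the offset you are going to compare for type1s. I'm not sure whether by "A before B" you mean "A starts before B starts" or "A ends before B starts", but do whatever is appropriate.

Then you can do the whole thing in one run over the list. You don't need to actually construct an index of the type2s, because they are already in order. Since the type1s are sorted as well, you can do each lookup with a linear or binary search starting from the result of the previous search. Use a linear search if there are lots of type1s compared with type2s (so the results are close together), and a binary search if there are lots of type2s compared with type1s (so the results are sparse). Or just stick with linear search because it is simpler: this lookup is the inner loop, but its performance might not be critical.

If you don't have control of the sort, then I don't know whether it is faster to build the list of type2 substrings for each ID; or to sort the whole list into the required order before you start; or just to work with what you have, by writing a "lookup" that ignores the type1 entries while searching the type2s (which are already sorted as required). Test it, or just do whatever results in clearer code. Even without re-sorting, you can still use the merge-style optimisation unless "sorted by start index" is the wrong order for the type1s.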

Answer 5

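Can I check: by "before", do you mean immediately before (i.e. t1_a, t2_b, t2_c, t2_d should only give the pair (t1_a, t2_b)), or do you want all pairs where a type1 value occurs anywhere before a type2 value within the same block (i.e. (t1_a, t2_b), (t1_a, t2_c), (t1_a, t2_d) for the previous example)?

In either case, you should be able to do this with a single pass over the list (assuming it is sorted by id, then by start index).

Here's a solution assuming the second option (every pair):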

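import itertools, operator

# Type markers; the values match the sample output shown further down.
TYPE1, TYPE2 = 'type1', 'type2'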

def find_t1_t2(seq):
    """Find every pair of type1, type2 values where the type1 occurs 
    before the type2 within a block with the same id.

    Assumes sequence is ordered by id, then start location.
    Generates a sequence of tuples of the type1,type2 entries.
    """
    for group, items in itertools.groupby(seq, operator.itemgetter(0)):
        type1s=[]
        for item in items:
            if item[1] == TYPE1: 
                type1s.append(item)
            elif item[1] == TYPE2:
                for t1 in type1s:
                    yield t1 + item[1:]

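If it's just immediately before, it's even simpler: keep track of the previous item and yield a tuple whenever the previous item is type1 and the current one is type2.

Here's an example of usage, and the results returned: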

l=[[1, TYPE1, 10, 15],
   [1, TYPE2, 20, 25],  # match with first
   [1, TYPE2, 30, 35],  # match with first (2 total matches)

   [2, TYPE2, 10, 15],  # No match
   [2, TYPE1, 20, 25],
   [2, TYPE1, 30, 35],
   [2, TYPE2, 40, 45],  # Match with previous 2 type1s.
   [2, TYPE1, 50, 55],
   [2, TYPE2, 60, 65],  # Match with 3 previous type1 entries (5 total)
   ]

for x in find_t1_t2(l):
    print(x)

returns:

[1, 'type1', 10, 15, 'type2', 20, 25]
[1, 'type1', 10, 15, 'type2', 30, 35]
[2, 'type1', 20, 25, 'type2', 40, 45]
[2, 'type1', 30, 35, 'type2', 40, 45]
[2, 'type1', 20, 25, 'type2', 60, 65]
[2, 'type1', 30, 35, 'type2', 60, 65]
[2, 'type1', 50, 55, 'type2', 60, 65]
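For the "immediately before" interpretation mentioned earlier in this answer, a minimal sketch might look like this. The name `find_adjacent_t1_t2` is made up for illustration, and the `TYPE1`/`TYPE2` values are assumed to match the sample output above.

```python
import itertools, operator

TYPE1, TYPE2 = 'type1', 'type2'  # assumed marker values

def find_adjacent_t1_t2(seq):
    """Yield a pair only when a type1 entry is directly followed by a type2
    entry within the same id. Assumes seq is ordered by id, then start."""
    for _, items in itertools.groupby(seq, operator.itemgetter(0)):
        prev = None  # reset at each new id so pairs never span two blocks
        for item in items:
            if prev is not None and prev[1] == TYPE1 and item[1] == TYPE2:
                yield prev + item[1:]
            prev = item

sample = [[1, TYPE1, 10, 15],
          [1, TYPE2, 20, 25],   # directly after a type1: match
          [1, TYPE2, 30, 35],   # previous item is a type2: no match
          [2, TYPE2, 10, 15],
          [2, TYPE1, 20, 25],
          [2, TYPE1, 30, 35],
          [2, TYPE2, 40, 45],   # matches only the type1 at 30..35
          [2, TYPE1, 50, 55],
          [2, TYPE2, 60, 65]]   # matches the type1 at 50..55
adjacent = list(find_adjacent_t1_t2(sample))
```

On the same data as above, this yields three pairs instead of seven, since each type2 entry can match at most the single entry directly before it.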

Efficient comparison of lists of tuples

5 answers: