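Suppose T1 looks like this:
T1 = pd.DataFrame(data = {'val':['B','D','E','A','D','B','A','E','A','D','B']})
and P looks like this:
P = pd.DataFrame(data = {'val': ['E','A','D','B']})
How can I get the positions of P within T1?
In terms of min and max, I would like to see this returned (1-based positions):
min  max
  3    6
  8   11
If these DataFrames were represented as SQL tables, I could use the following SQL approach, which I would like to translate to pandas: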
DECLARE @Items INT = (SELECT COUNT(*) FROM @P);
SELECT MIN(t.KeyCol) AS MinKey,
MAX(t.KeyCol) AS MaxKey
FROM dbo.T1 AS t
INNER JOIN @P AS p ON p.Val = t.Val
GROUP BY t.KeyCol - p.KeyCol
HAVING COUNT(*) = @Items;
This SQL solution comes from Pesomannen's reply to http://sqlmag.com/t-sql/identifying-subsequence-in-sequence-part-2
Answer 0: (score: 0)
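Well, you can always use a workaround like this:
# concatenate the single-character values into plain strings
t1 = ''.join(T1.val)
p = ''.join(P.val)
start, res = 0, []
while True:
    try:
        # str.index raises ValueError once there are no more matches
        res.append(t1.index(p, start))
        start = res[-1] + 1
    except ValueError:
        break
This gives you the start indices; you can then compute the end indices arithmetically (start + len(p) - 1) and use iloc to access the DataFrame. Note that you should use 0-based indexing (not 1-based, as in the example).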
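The remaining steps (end indices and `iloc` access) might look roughly like the sketch below. It assumes, as the string-join trick itself does, that every value in `val` is a single character; the names `starts`, `result` and `first_match` are made up for illustration.

```python
import pandas as pd

T1 = pd.DataFrame(data={'val': ['B', 'D', 'E', 'A', 'D', 'B', 'A', 'E', 'A', 'D', 'B']})
P = pd.DataFrame(data={'val': ['E', 'A', 'D', 'B']})

# Assumption: each value is one character, so string offsets equal row positions.
t1 = ''.join(T1.val)
p = ''.join(P.val)

# Collect every 0-based start position of p inside t1 (overlapping matches allowed).
starts = []
pos = t1.find(p)
while pos != -1:
    starts.append(pos)
    pos = t1.find(p, pos + 1)

# The end position follows from the start and the pattern length.
result = pd.DataFrame({'min': starts,
                       'max': [s + len(p) - 1 for s in starts]})

# Slice the first match out of T1 by position (iloc excludes the stop, hence + 1).
first_match = T1.iloc[result.loc[0, 'min']:result.loc[0, 'max'] + 1]

# Add 1 to get the 1-based positions used in the question.
one_based = result + 1
```

With values longer than one character the joined strings could produce false matches, so this sketch would not carry over unchanged.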
Answer 1: (score: 0)
Admittedly, this doesn't use P, but it may suit your purpose.
groups = T1.groupby(T1.val).groups
pd.DataFrame({'min': [min(x) for x in groups.values()],
              'max': [max(x) for x in groups.values()]}, index=list(groups.keys()))
which yields:
max min
E 7 2
B 10 0
D 9 1
A 8 3
[4 rows x 2 columns]
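Answer 2: (score: 0)
I think I have solved this by following the same approach as the SQL solution, a kind of relational division: match on the values, group by the difference between the key columns, and keep the groups whose count equals the size of the subsequence: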
import pandas as pd
T1 = pd.DataFrame(data = {'val':['B','D','E','A','D','B','A','E','A','D','B']})
# use the index to create a new column that's going to be the key (zero based)
T1 = T1.reset_index()
# do the same for the subsequence that we want to find within T1
P = pd.DataFrame(data = {'val': ['E','A','D','B']})
P = P.reset_index()
# join on the val column
J = T1.merge(P,on=['val'],how='inner')
# group by difference in key columns calculating the min, max and count of the T1 key
FullResult = J.groupby(J['index_x'] - J['index_y'])['index_x'].agg(['min', 'max', 'count'])
# Final result is where the count is the size of the subsequence - in this case 4
FullResult[FullResult['count'] == len(P)]
Really enjoying using pandas!
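For reuse, the same join-and-group idea could be wrapped in a small helper. This is only a sketch: the function name `find_subsequence` and the column names `t_key`, `p_key` and `offset` are invented here, and the subsequence length is taken from the pattern rather than hard-coded.

```python
import pandas as pd

def find_subsequence(seq, pattern, col='val'):
    """Return 0-based min/max positions of each occurrence of pattern in seq."""
    # Turn the row positions into explicit key columns (hypothetical names).
    t = seq[[col]].reset_index(drop=True).rename_axis('t_key').reset_index()
    p = pattern[[col]].reset_index(drop=True).rename_axis('p_key').reset_index()
    # Match every row of seq with every pattern row holding the same value.
    joined = t.merge(p, on=col, how='inner')
    # Rows belonging to one occurrence share the same key difference.
    offset = (joined['t_key'] - joined['p_key']).rename('offset')
    stats = joined.groupby(offset)['t_key'].agg(['min', 'max', 'count'])
    # A full occurrence matched every pattern row exactly once.
    full = stats[stats['count'] == len(p)]
    return full[['min', 'max']].reset_index(drop=True)

T1 = pd.DataFrame(data={'val': ['B', 'D', 'E', 'A', 'D', 'B', 'A', 'E', 'A', 'D', 'B']})
P = pd.DataFrame(data={'val': ['E', 'A', 'D', 'B']})

matches = find_subsequence(T1, P)
# Add 1 to get the 1-based positions asked for in the question.
matches_one_based = matches + 1
```

Unlike the string-join workaround, this does not rely on the values being single characters, since the comparison happens in the merge.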