Question

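Suppose I have two dataframes, A and B, each containing two columns named x and y. I want to join them, not on rows where the x and y columns are equal in both dataframes, but on rows where A's x column is a substring of B's x column and the y values are the same. For example: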

if A[x][1]='mpla' and B[x][1]='mplampla'

I would like to be able to capture that.

In SQL it would be something like this:

select *
from A
join B
on A.x<=B.x and A.y<=B.y.

Can something like this be done in Python?

Answer 1

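You can match a single string against all the strings in a column at once, like this:

import numpy as np

# np.char.find is the public name for numpy.core.defchararray.find
np.char.find(B.x.values.astype(str), 'mpla') >= 0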

The catch is that you have to loop over all the elements of A. But if you can afford that, it should work.

See also: pandas + dataframe - select by partial string
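The loop over A mentioned above could look roughly like the sketch below. The sample frames and the names b_x, b_y and pairs are made up for illustration; the y equality check is taken from the question rather than from this answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names x and y follow the question.
A = pd.DataFrame({"x": ["mpla", "foo"], "y": [1, 2]})
B = pd.DataFrame({"x": ["mplampla", "xfoox", "bar"], "y": [1, 2, 3]})

b_x = B["x"].to_numpy().astype(str)
b_y = B["y"].to_numpy()

pairs = []  # (index label in A, index label in B) for every matching pair
for a_idx, a_x, a_y in zip(A.index, A["x"], A["y"]):
    # One vectorized substring test of a_x against all of B.x,
    # combined with the equal-y condition from the question.
    mask = (np.char.find(b_x, a_x) >= 0) & (b_y == a_y)
    pairs.extend((a_idx, b_idx) for b_idx in B.index[mask])
```

The cost grows with len(A) * len(B), so this is probably only reasonable when A is fairly small.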

Answer 2

You can try something like this:

import re
import numpy as np

# Assumption: str.contains takes a single pattern, not a Series, so build one regex from all values of A.x
pattern = '|'.join(map(re.escape, A.x))

B.x.where(B.x.str.contains(pattern), B.index) # keeps x where it matches; rows that don't match get their index label instead

B.x.where(B.x.str.match(pattern), B.index) # same, but str.match only matches at the start of the string, as if the regex began with "^"

You could also try

np.where(B.x.str.contains(pattern), B.index, np.nan)

You could also try:

matchingmask = B[B.x.str.contains(pattern)]

matchingframe = B.loc[matchingmask.index] #or

matchingcolumn = B.loc[matchingmask.index].x #or

matchingindex = B.loc[matchingmask.index].index

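All of this assumes you have the same index on both frames (I think).

You'll want to look at the string methods: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods

You'll also want to read up on regular expressions and the pandas where method: http://pandas.pydata.org/pandas-docs/dev/indexing.html#the-where-method-and-masking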
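To get an actual joined frame with both of the question's conditions, one possible approach (not taken from either answer, just a sketch) is to merge on y first and then filter the candidate pairs by the substring test. The sample data and the names candidates, is_substring and joined are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the column names x and y follow the question.
A = pd.DataFrame({"x": ["mpla", "foo"], "y": [1, 2]})
B = pd.DataFrame({"x": ["mplampla", "bar", "xfoox"], "y": [1, 2, 2]})

# Step 1: an ordinary equality join on y yields every candidate pair.
candidates = A.merge(B, on="y", suffixes=("_a", "_b"))

# Step 2: keep only the pairs where A's x is a substring of B's x.
is_substring = [a in b for a, b in zip(candidates["x_a"], candidates["x_b"])]
joined = candidates.loc[is_substring].reset_index(drop=True)
```

Merging on y before the substring test keeps the intermediate frame much smaller than a full cross join, assuming y is reasonably selective.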

Python: join two dataframes on columns that satisfy a condition
