Map a column from one DataFrame to another where all words are present

Asked: 2019-06-04 16:48:47

Tags: pandas python-2.7 numpy

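I am trying to map a column from one DataFrame (df2) onto another (df1), matching a row when every word in df2's ColB appears in df1's ColA.

Multiple matches are fine, since I can filter them out afterwards. Thanks in advance!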

df1
ColA
this is a sentence
with some words
in a column
and another
for fun

df2
ColB        ColC
this a      123
in column   456
fun times   789

Some attempts

dfResult = df1.apply(lambda x: np.all([word in x.df1['ColA'].split(' ') for word in x.df2['ColB'].split(' ')]),axis = 1)

dfResult = df1.ColA.apply(lambda sentence: all(word in sentence for word in df2.ColB))

Desired output


dfResult
ColA                 ColC
this is a sentence   123
with some words      NaN
in a column          456
and another          NaN
for fun              NaN

2 answers:

Answer 0 (score: 3)

Convert to sets and find subsets with NumPy broadcasting

Disclaimer: no guarantee that this is fast.

import numpy as np
import pandas as pd

A = df1.ColA.str.split().apply(set).to_numpy()  # If pandas version is < 0.24 use `.values`
B = df2.ColB.str.split().apply(set).to_numpy()  # instead of `.to_numpy()`
C = df2.ColC.to_numpy()

# When `dtype` is `object` Numpy falls back on performing
# the operation on each pair of values.  Since these are `set` objects
# `<=` tests for subset.
i, j = np.where(B <= A[:, None])
out = pd.array([np.nan] * len(A), pd.Int64Dtype())  # Empty nullable integers
# Use `out = np.empty(len(A), dtype=object)` if pandas version is < 0.24
out[i] = C[j]

df1.assign(ColC=out)

                 ColA  ColC
0  this is a sentence   123
1     with some words   NaN
2         in a column   456
3         and another   NaN
4             for fun   NaN
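
To make the broadcasting step easier to follow, here is a minimal sketch on a made-up two-row input (the toy sets below are illustrative, not the question's data). It shows that `B <= A[:, None]` compares every set in `B` against every set in `A`, so entry `[i, j]` of the result is true when `B[j]` is a subset of `A[i]`.

```python
import numpy as np

# Toy object arrays holding Python sets (illustrative data only).
A = np.empty(2, dtype=object)
A[0] = {"this", "is", "a", "sentence"}
A[1] = {"for", "fun"}

B = np.empty(2, dtype=object)
B[0] = {"this", "a"}
B[1] = {"fun", "times"}

# A[:, None] has shape (2, 1) and B has shape (2,), so the comparison
# broadcasts to shape (2, 2). For sets, `<=` is the subset test, hence
# mask[i, j] is True when B[j] is a subset of A[i].
mask = B <= A[:, None]

# Row and column indices of the matching (sentence, word set) pairs.
rows, cols = np.where(mask)
print(mask)
print(rows, cols)
```

Only `B[0]` is a subset of `A[0]`; `{"fun", "times"}` is not a subset of `{"for", "fun"}` because "times" is missing, which is why "for fun" gets NaN in the output above. If one sentence matches several rows of `df2`, the assignment `out[i] = C[j]` keeps the last match.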

Answer 1 (score: 1)

Using a loop and set.issubset

pd.DataFrame([[y if set(z.split()).issubset(set(x.split())) else np.nan for z,y in zip(df2.ColB,df2.ColC)] for x in df1.ColA ]).max(1)
Out[34]: 
0    123.0
1      NaN
2    456.0
3      NaN
4      NaN
dtype: float64
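
The same idea can be written as an explicit loop, which may be easier to read and to adapt. The following is a sketch under assumptions rather than the answer's exact method: the helper name `first_match` is hypothetical, and it returns the first matching `ColC` value, whereas the one-liner above takes the maximum across matches with `.max(1)`.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({"ColA": ["this is a sentence", "with some words",
                             "in a column", "and another", "for fun"]})
df2 = pd.DataFrame({"ColB": ["this a", "in column", "fun times"],
                    "ColC": [123, 456, 789]})

# Split each ColB entry into a word set once, paired with its ColC value.
targets = [(set(words.split()), value)
           for words, value in zip(df2["ColB"], df2["ColC"])]

def first_match(sentence):
    """Hypothetical helper: ColC of the first df2 row whose words all
    appear in `sentence`, or NaN when no row matches."""
    words = set(sentence.split())
    for target_words, value in targets:
        if target_words.issubset(words):
            return value
    return np.nan

result = df1.assign(ColC=df1["ColA"].map(first_match))
print(result)
```

Because the column mixes integers and NaN, `ColC` comes out as float64 here, as in the output above. To keep every match instead of the first, the helper could return a list of values.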