Pandas Python: Col[C] if value is in Col[A] and Col[B]

Date: 2019-01-25 13:31:44

Tags: python python-3.x string pandas

I have a dataframe like this:

    ColA             ColB                        ColC
"lorem ipsum"     ["lorem", "foo", "bar"]
"lorem ipsum"      NaN
NaN                ["lorem", "foo", "bar"]
NaN                 NaN

I am trying to get this output:

    ColA             ColB                        ColC
"lorem ipsum"     ["lorem", "foo", "bar"]       "lorem"

I tried using a list comprehension like this:

df["C"] = [elem for elem in df["B"] if elem in df["A"] ]

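But without success:

TypeError: unhashable type: 'list' if I format ColB as lists, and ValueError: Length of values does not match length of index if I use tuples.

Any help would be appreciated, thanks.

Edit + Edit 2: The two columns have only one word in common (or none), and I need to grab it and put it in column C. I also forgot to mention that ColA and ColB can have NaN as a value.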

4 Answers:

Answer 0: (score: 2)

Use a custom function with try+except and pass the DataFrame through pipe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':['lorem ipsum','lorem ipsum',np.nan, np.nan],
                   'B':[["lorem", "foo", "bar"], np.nan, ["lorem", "foo", "bar"], np.nan]})
print (df)
             A                  B
0  lorem ipsum  [lorem, foo, bar]
1  lorem ipsum                NaN
2          NaN  [lorem, foo, bar]
3          NaN                NaN

def test(df):
    out = []
    for a, b in zip(df["A"], df["B"]):
        try:
            out.append(next(y for y in b if y in a))
        except Exception:
            out.append('')
    return out

df["C"] = df.pipe(test)
print (df)
             A                  B      C
0  lorem ipsum  [lorem, foo, bar]  lorem
1  lorem ipsum                NaN       
2          NaN  [lorem, foo, bar]       
3          NaN                NaN       
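
The try/except above treats any failure as "no match", which can also hide unrelated bugs. A minimal sketch of a variant that checks the cell types explicitly instead, assuming ColA holds strings and ColB holds lists whenever they are not NaN (the helper name `first_common` is hypothetical and not part of the original answer):

```python
import numpy as np
import pandas as pd


def first_common(a, b):
    """Return the first item of list b that occurs in string a, else ''."""
    # NaN is a float, so it fails both isinstance checks and yields ''.
    if not isinstance(a, str) or not isinstance(b, list):
        return ''
    return next((y for y in b if y in a), '')


df = pd.DataFrame({'A': ['lorem ipsum', 'lorem ipsum', np.nan, np.nan],
                   'B': [["lorem", "foo", "bar"], np.nan,
                         ["lorem", "foo", "bar"], np.nan]})

# Pair the two columns row by row, as zip does in the loop above.
df["C"] = [first_common(a, b) for a, b in zip(df["A"], df["B"])]
print(df["C"].tolist())  # ['lorem', '', '', '']
```

Note that `y in a` is a substring test, so an item such as "or" would also match "lorem ipsum"; splitting `a` into words first would be needed for whole-word matching.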

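Another solution that does not work well: once NaN is replaced by the string "undefined", that string is iterated character by character, so a stray 'u' is reported as a match (the output below was produced with ["d", "foo", "bar"] as the first list):

df = df.fillna("undefined")
df["C"] = [next((y for y in b if y in a), '') for a, b in zip(df["A"], df["B"])]
print (df)
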
             A                  B  C
0  lorem ipsum      [d, foo, bar]   
1  lorem ipsum          undefined  u
2    undefined  [lorem, foo, bar]   
3    undefined          undefined  u
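
The stray 'u' values come from iterating over the filler string itself. A short illustration in plain Python, using the same generator expression as above:

```python
a = "lorem ipsum"
b = "undefined"          # what fillna("undefined") puts in place of NaN

# Iterating over a string yields its characters: 'u', 'n', 'd', ...
# 'u' occurs in "ipsum", so it is returned as a bogus match.
match = next((y for y in b if y in a), '')
print(repr(match))       # 'u'

# With a real list, only whole list items are tested.
match_list = next((y for y in ["lorem", "foo", "bar"] if y in a), '')
print(repr(match_list))  # 'lorem'
```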

Answer 1: (score: 1)

You can define a custom function and then use map:

# data adapted from @jezrael
df = pd.DataFrame({'A':['lorem ipsum', 'lorem ipsum', np.nan, np.nan, 'test string'],
                   'B':[["lorem", "foo", "bar"], np.nan, ["lorem", "foo", "bar"], np.nan, ["no", "match"]]})

def tester(val1, val2):
    if (val1 != val1) or (val2 != val2):
        return ''
    return next((x for x in val2 if x in val1), '')

df['C'] = list(map(tester, df['A'], df['B']))

The default argument of '' ensures you get an empty string where there is no match. We also take advantage of the fact that np.nan != np.nan.

Result:

print(df)

             A                  B      C
0  lorem ipsum  [lorem, foo, bar]  lorem
1  lorem ipsum                NaN       
2          NaN  [lorem, foo, bar]       
3          NaN                NaN       
4  test string        [no, match]       
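
The `val1 != val1` test in `tester` relies on NaN being unequal to itself, while strings and lists compare equal to themselves. A minimal sketch in plain Python (`math.nan` is used here as a stand-in; `np.nan` is likewise an ordinary float NaN, and `is_missing` is a hypothetical helper name):

```python
import math

nan = math.nan

print(nan != nan)                      # True: NaN never equals itself
print("lorem ipsum" != "lorem ipsum")  # False
print(["lorem"] != ["lorem"])          # False


def is_missing(value):
    """Hypothetical helper: True only for values unequal to themselves."""
    return value != value


checks = [is_missing(v) for v in ("lorem ipsum", ["lorem", "foo"], nan)]
print(checks)                          # [False, False, True]
```

The check does not catch None (None != None is False), so data that holds None instead of NaN would need a separate `is None` test.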

Answer 2: (score: 0)

After I replaced every NaN using fillna, the previous solution worked like a charm.

df = df.fillna("undefined")
df["C"] = [next((y for y in b if y in a), '') for a, b in zip(df["A"], df["B"])]

Thanks

Answer 3: (score: 0)

Besides the try/except solution, this can be done in a single line!

df = pd.DataFrame({'colA':['lorem ipsum','lorem ipsum',None,None],
                   'colB':[["lorem", "foo", "bar"],None,["lorem", "foo", "bar"],None]})

# all(x) is False when either cell is None, so such rows get ''.
df['colC'] = df.apply(lambda x: ''.join([w for w in x.colA.split()
                                         if w in x.colB]) if all(x) else '',
                      axis=1)

          colA               colB   colC
0  lorem ipsum  [lorem, foo, bar]  lorem
1  lorem ipsum               None       
2         None  [lorem, foo, bar]       
3         None               None       
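
The `all(x)` guard in the one-liner works because None is falsy. NaN, however, is truthy, so the same guard would let NaN rows through and the expression would then fail on a float. A small plain-Python illustration of that difference (the lists below are stand-ins for DataFrame rows, and `both_usable` is a hypothetical helper):

```python
import math

row_with_none = ["lorem ipsum", None]
row_with_nan = ["lorem ipsum", math.nan]

print(all(row_with_none))  # False: None is falsy, the row is skipped
print(all(row_with_nan))   # True: NaN is truthy, the row is NOT skipped


# A stricter guard that checks the types the expression actually needs.
def both_usable(a, b):
    return isinstance(a, str) and isinstance(b, list)


print(both_usable("lorem ipsum", ["lorem", "foo", "bar"]))  # True
print(both_usable("lorem ipsum", math.nan))                 # False
print(both_usable(None, ["lorem", "foo", "bar"]))           # False
```

If the data may contain NaN rather than None, a guard along the lines of `both_usable(x.colA, x.colB)` could replace `all(x)` in the lambda.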