I have a DataFrame like this:
ColA             ColB                       ColC
"lorem ipsum"    ["lorem", "foo", "bar"]
"lorem ipsum"    NaN
NaN              ["lorem", "foo", "bar"]
NaN              NaN
I am trying to get this output:
ColA             ColB                       ColC
"lorem ipsum"    ["lorem", "foo", "bar"]    "lorem"
I tried to use a list comprehension like this:
df["C"] = [elem for elem in df["B"] if elem in df["A"] ]
but without success. I get
TypeError: unhashable type: 'list'
if I format ColB as lists, and
ValueError: Length of values does not match length of index
if I use tuples.
Any help would be appreciated, thanks.
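Edit + Edit 2: There is only one word (or none) common to both columns, and I need to grab it and put it in column C. I also forgot to mention that ColA and ColB can have NaN as a value.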
Answer 0 (score: 2)
Use a custom function with try + except, and pass the DataFrame to it through pipe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['lorem ipsum', 'lorem ipsum', np.nan, np.nan],
                   'B': [["lorem", "foo", "bar"], np.nan, ["lorem", "foo", "bar"], np.nan]})
print(df)
             A                  B
0  lorem ipsum  [lorem, foo, bar]
1  lorem ipsum                NaN
2          NaN  [lorem, foo, bar]
3          NaN                NaN

def test(df):
    out = []
    for a, b in zip(df["A"], df["B"]):
        try:
            # first item of B that occurs in A
            out.append(next(y for y in b if y in a))
        except Exception:
            # NaN in either column, or no match at all
            out.append('')
    return out

df["C"] = df.pipe(test)
print(df)
             A                  B      C
0  lorem ipsum  [lorem, foo, bar]  lorem
1  lorem ipsum                NaN
2          NaN  [lorem, foo, bar]
3          NaN                NaN
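The two failure modes that the try/except absorbs can be made explicit. The following is a minimal sketch on plain Python lists rather than a DataFrame, assuming the same four rows as above. The helper name first_match and the use of math.nan in place of np.nan are illustrative choices, not part of the answer. It catches only TypeError (a NaN cell is a float, which can neither be iterated nor searched with in) and StopIteration (no item matched).

```python
import math

def first_match(a, b):
    """Return the first item of b that occurs in a, or '' if there is none."""
    try:
        return next(y for y in b if y in a)
    except TypeError:
        # a or b is NaN (a float): not iterable / does not support `in`
        return ''
    except StopIteration:
        # both cells are valid, but nothing matched
        return ''

col_a = ['lorem ipsum', 'lorem ipsum', math.nan, math.nan]
col_b = [['lorem', 'foo', 'bar'], math.nan, ['lorem', 'foo', 'bar'], math.nan]

result = [first_match(a, b) for a, b in zip(col_a, col_b)]
print(result)
```

Catching the specific exceptions keeps unrelated bugs from being silently turned into empty strings, at the cost of a slightly longer function.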
Another solution, which does not work well:
df = df.fillna("undefined")
df["C"] = [next((y for y in b if y in a), '') for a, b in zip(df["A"], df["B"])]
print(df)
             A                  B      C
0  lorem ipsum  [lorem, foo, bar]  lorem
1  lorem ipsum          undefined      u
2    undefined  [lorem, foo, bar]
3    undefined          undefined      u
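The stray 'u' values come from how Python treats strings: iterating over the placeholder string yields single characters, and in between two strings is a substring test. A small sketch of row 1 using plain strings (the variable names are illustrative):

```python
a = 'lorem ipsum'
b = 'undefined'   # what fillna put in place of NaN

# Iterating a string yields its characters one at a time
chars = list(b)
print(chars)

# `in` between two strings is a substring test, and 'u' occurs in 'ipsum'
print('u' in a)

# So the first "match" is the single character 'u'
match = next((y for y in b if y in a), '')
print(match)
```

Any string placeholder behaves this way, so the result depends on which letters the placeholder happens to share with column A.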
Answer 1 (score: 1)
You can define a custom function and then use map:
import numpy as np
import pandas as pd

# data adapted from @jezrael
df = pd.DataFrame({'A': ['lorem ipsum', 'lorem ipsum', np.nan, np.nan, 'test string'],
                   'B': [["lorem", "foo", "bar"], np.nan, ["lorem", "foo", "bar"], np.nan, ["no", "match"]]})

def tester(val1, val2):
    # NaN is not equal to itself, so this detects a missing value in either column
    if (val1 != val1) or (val2 != val2):
        return ''
    return next((x for x in val2 if x in val1), '')

df['C'] = list(map(tester, df['A'], df['B']))
The default argument '' passed to next ensures you get an empty string where there is no match. We also exploit the fact that np.nan != np.nan.
Result:
print(df)
             A                  B      C
0  lorem ipsum  [lorem, foo, bar]  lorem
1  lorem ipsum                NaN
2          NaN  [lorem, foo, bar]
3          NaN                NaN
4  test string        [no, match]
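Both tricks used by tester can be checked in isolation. This sketch uses only the standard library; math.nan stands in for np.nan, which is an assumption of the example (both are an ordinary float NaN). The helper name is_nan is hypothetical.

```python
import math

def is_nan(value):
    # NaN is the only float value that is not equal to itself
    return value != value

print(is_nan(math.nan))
print(is_nan('lorem ipsum'))
print(is_nan(['lorem', 'foo']))

# With a default, next() returns it instead of raising StopIteration
no_match = next((x for x in ['no', 'match'] if x in 'test string'), '')
print(repr(no_match))
```

The self-comparison works for the strings and lists in this data because they compare equal to themselves and return a plain bool.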
Answer 2 (score: 0)
The previous solution worked like a charm after I replaced every NaN using fillna.
df = df.fillna("undefined")
df["C"] = [next((y for y in b if y in a), '') for a, b in zip(df["A"], df["B"])]
Thanks
Answer 3 (score: 0)
A solution without try and except. Note that it only works when a single word matches!
import pandas as pd

df = pd.DataFrame({'colA': ['lorem ipsum', 'lorem ipsum', None, None],
                   'colB': [["lorem", "foo", "bar"], None, ["lorem", "foo", "bar"], None]})
# all(x) is False when either cell in the row is None
df.loc[:, 'colC'] = df.apply(
    lambda x: ''.join([w for w in x.colA.split() if w in x.colB]) if all(x) else '',
    axis=1)
colA colB colC
0 lorem ipsum [lorem, foo, bar] lorem
1 lorem ipsum None None
2 None [lorem, foo, bar] None
3 NaN None None
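This answer compares whole words (it splits colA on whitespace and looks each word up in the colB list), whereas the earlier answers test whether each colB item is a substring of colA. The difference shows up when a list item is only part of a word. A short sketch with made-up values:

```python
text = 'lorem ipsum'
candidates = ['or', 'lorem']   # hypothetical values; 'or' is only part of a word

# Substring test, as in `y in a` when a is a string
substring_hits = [c for c in candidates if c in text]

# Whole-word test, as in `w in x.colB` after splitting the text
word_hits = [w for w in text.split() if w in candidates]

print(substring_hits)
print(word_hits)
```

Which behavior is wanted depends on the data; for the single shared word described in the question, both approaches return the same result.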