DataFrame
import pandas as pd
df = pd.DataFrame({'A': [['gener'], ['gener'], ['system'], ['system'], ['gutter'], ['gutter'], ['gutter'], ['gutter'], ['gutter'], ['gutter'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum', 'toledo']], 'B': [['gutter'], ['gutter'], ['gutter', 'system'], ['gutter', 'guard', 'system'], ['ohio', 'gutter'], ['gutter', 'toledo'], ['toledo', 'gutter'], ['gutter'], ['gutter'], ['gutter'], ['how', 'to', 'instal', 'aluminum', 'gutter'], ['aluminum', 'gutter'], ['aluminum', 'gutter', 'color'], ['aluminum', 'gutter'], ['aluminum', 'gutter', 'adrian', 'ohio'], ['aluminum', 'gutter', 'bowl', 'green', 'ohio'], ['aluminum', 'gutter', 'maume', 'ohio'], ['aluminum', 'gutter', 'perrysburg', 'ohio'], ['aluminum', 'gutter', 'tecumseh', 'ohio'], ['aluminum', 'gutter', 'toledo', 'ohio']]}, columns=['A', 'B'])
What it looks like
I have a DataFrame with two columns of lists.
A B
0 [gener] [gutter]
1 [gener] [gutter]
2 [system] [gutter, system]
3 [system] [gutter, guard, system]
4 [gutter] [ohio, gutter]
5 [gutter] [gutter, toledo]
6 [gutter] [toledo, gutter]
7 [gutter] [gutter]
8 [gutter] [gutter]
9 [gutter] [gutter]
10 [aluminum] [how, to, instal, aluminum, gutter]
11 [aluminum] [aluminum, gutter]
12 [aluminum] [aluminum, gutter, color]
13 [aluminum] [aluminum, gutter]
14 [aluminum] [aluminum, gutter, adrian, ohio]
15 [aluminum] [aluminum, gutter, bowl, green, ohio]
16 [aluminum] [aluminum, gutter, maume, ohio]
17 [aluminum] [aluminum, gutter, perrysburg, ohio]
18 [aluminum] [aluminum, gutter, tecumseh, ohio]
19 [aluminum, toledo] [aluminum, gutter, toledo, ohio]
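Question
If I have columns of lists, is there a pandas function that lets me operate on the whole array of lists to check for an intersection and return either a boolean or the intersecting values as a new Series?
For example, I would like pandas to offer something equivalent to this: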
def intersection(df, col1, col2, return_type='boolean'):
    df = df[[col1, col2]]
    if return_type == 'boolean':
        # True if any item of col2's list appears in col1's list
        s = []
        for _, row in df.iterrows():
            s.append(any(phrase in row[col1] for phrase in row[col2]))
        return pd.Series(s, index=df.index)
    elif return_type == 'word':
        # Comma-separated words common to both lists
        s = []
        for _, row in df.iterrows():
            s.append(', '.join(set(row[col1]).intersection(row[col2])))
        return pd.Series(s, index=df.index)

# Create column C in df
df['C'] = intersection(df, 'A', 'B', 'word')
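...without having to write my own function or resort to for loops. I feel there must be a simpler way to compare the lists in two columns of the same row and see whether they intersect.
I can do this with a for loop, but it looks ugly to me.
A for loop that returns the boolean series: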
for _, row in df.iterrows():
    any(phrase in row['A'] for phrase in row['B'])
Produces:
False
False
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
Or, using sets to find the intersecting words:
for _, row in df.iterrows():
    ', '.join(set(row['A']).intersection(row['B']))
''
''
'system'
'system'
'gutter'
'gutter'
'gutter'
'gutter'
'gutter'
'gutter'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'toledo, aluminum'
Answer 0 (score: 9)
Check whether every item in df.A is contained in df.B:
>>> df.apply(lambda row: all(i in row.B for i in row.A), axis=1)
# OR: ~(df['A'].apply(set) - df['B'].apply(set)).astype(bool)
0 False
1 False
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 True
18 True
19 True
dtype: bool
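Note that the expression above is a subset test (every item of A is in B), whereas the loop in the question asks whether the two lists share at least one item. The two agree on the sample data but can differ in general. A minimal sketch of the "any overlap" variant, using a small hypothetical frame named demo rather than the data from the question:

```python
import pandas as pd

# Hypothetical stand-in data, not the frame from the question.
demo = pd.DataFrame({
    'A': [['gener'], ['system'], ['aluminum', 'toledo']],
    'B': [['gutter'], ['gutter', 'system'], ['aluminum', 'gutter']],
})

# True when the two lists in a row share at least one element.
overlaps = pd.Series(
    [not set(a).isdisjoint(b) for a, b in zip(demo['A'], demo['B'])],
    index=demo.index,
)

# Subset test from the answer, shown for comparison.
subset = demo.apply(lambda row: all(i in row.B for i in row.A), axis=1)

print(overlaps.tolist())  # [False, True, True]
print(subset.tolist())    # [False, True, False]
```

The last row shows the difference: 'toledo' is missing from B, so the subset test fails even though 'aluminum' is shared.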
To get the intersection:
df['intersection'] = [list(set(a).intersection(set(b))) for a, b in zip(df.A, df.B)]
>>> df
A B intersection
0 [gener] [gutter] []
1 [gener] [gutter] []
2 [system] [gutter, system] [system]
3 [system] [gutter, guard, system] [system]
4 [gutter] [ohio, gutter] [gutter]
5 [gutter] [gutter, toledo] [gutter]
6 [gutter] [toledo, gutter] [gutter]
7 [gutter] [gutter] [gutter]
8 [gutter] [gutter] [gutter]
9 [gutter] [gutter] [gutter]
10 [aluminum] [how, to, instal, aluminum, gutter] [aluminum]
11 [aluminum] [aluminum, gutter] [aluminum]
12 [aluminum] [aluminum, gutter, color] [aluminum]
13 [aluminum] [aluminum, gutter] [aluminum]
14 [aluminum] [aluminum, gutter, adrian, ohio] [aluminum]
15 [aluminum] [aluminum, gutter, bowl, green, ohio] [aluminum]
16 [aluminum] [aluminum, gutter, maume, ohio] [aluminum]
17 [aluminum] [aluminum, gutter, perrysburg, ohio] [aluminum]
18 [aluminum] [aluminum, gutter, tecumseh, ohio] [aluminum]
19 [aluminum, toledo] [aluminum, gutter, toledo, ohio] [aluminum, toledo]
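If a comma-separated string is wanted instead of a list, as in the 'word' mode of the function in the question, the same comprehension can be adapted. This is only a sketch on a hypothetical frame named demo; sorting is an added assumption to make the word order deterministic, since set iteration order is not guaranteed:

```python
import pandas as pd

# Hypothetical stand-in data, not the frame from the question.
demo = pd.DataFrame({
    'A': [['gener'], ['system'], ['aluminum', 'toledo']],
    'B': [['gutter'], ['gutter', 'system'],
          ['aluminum', 'gutter', 'toledo', 'ohio']],
})

# Join the shared words of each row; sorted() fixes the order.
demo['C'] = [
    ', '.join(sorted(set(a) & set(b)))
    for a, b in zip(demo['A'], demo['B'])
]

print(demo['C'].tolist())  # ['', 'system', 'aluminum, toledo']
```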
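Answer 1 (score: 1)
Just use the apply function that pandas provides; it works very well.
Since the intersection may involve more than two columns, a helper function can be prepared as below and then passed to DataFrame.apply (see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html; note that axis=1 means "across the series", so the function receives each row, while axis=0 means "along the series", so it receives each column, where a series is just one column of the DataFrame). Each row, spanning all the columns, is then passed to the applied function as an iterable Series object.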
def intersect(ss):
    ss = iter(ss)
    s = set(next(ss))  # start from the first column's list
    for t in ss:
        # `t` need not be a `set` here; a `list` or any `Iterable` is OK
        s.intersection_update(t)
    return s

res = df.apply(intersect, axis=1)
>>> res
0 {}
1 {}
2 {system}
3 {system}
4 {gutter}
5 {gutter}
6 {gutter}
7 {gutter}
8 {gutter}
9 {gutter}
10 {aluminum}
11 {aluminum}
12 {aluminum}
13 {aluminum}
14 {aluminum}
15 {aluminum}
16 {aluminum}
17 {aluminum}
18 {aluminum}
19 {aluminum, toledo}
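To illustrate the point about more than two columns, here is a sketch that runs the same helper over a hypothetical three-column frame named demo (the column C and its values are invented for the example):

```python
import pandas as pd

def intersect(ss):
    # Same helper as in the answer: intersect all lists in one row.
    ss = iter(ss)
    s = set(next(ss))
    for t in ss:
        s.intersection_update(t)
    return s

# Hypothetical three-column frame, not the data from the question.
demo = pd.DataFrame({
    'A': [['gutter'], ['aluminum', 'toledo']],
    'B': [['ohio', 'gutter'], ['aluminum', 'gutter', 'toledo']],
    'C': [['gutter', 'guard'], ['toledo', 'ohio']],
})

# axis=1 hands each row (one list per column) to the helper.
res3 = demo.apply(intersect, axis=1)

print(res3.tolist())  # [{'gutter'}, {'toledo'}]
```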
You can apply further operations to the result of the helper function, or adapt the helper in similar ways.
Hope this helps.
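As one possible follow-up, a Series of sets can be turned into booleans, counts, or sorted lists with Series.map. The Series below is a hypothetical stand-in for the result of the helper, built by hand so that the sketch runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for the Series of sets returned by the helper.
sets = pd.Series([set(), {'system'}, {'aluminum', 'toledo'}])

has_common = sets.map(bool)   # an empty set maps to False
n_common = sets.map(len)      # number of shared words per row
as_lists = sets.map(sorted)   # sets become sorted lists

print(has_common.tolist())  # [False, True, True]
print(n_common.tolist())    # [0, 1, 2]
print(as_lists.tolist())    # [[], ['system'], ['aluminum', 'toledo']]
```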