Checking whether a string from one pandas Series is contained in another Series

Time: 2018-02-28 23:30:47

Tags: python string pandas dataframe

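I have some strings in a pandas DataFrame. I want to check whether each string is present in the adjacent column of the same row.

In the example below, I want to check whether the string in the 'choice' Series is contained in the 'fruit' Series, and return true (1) or false (0) in a new column, 'choice_match'.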

Example DataFrame:

import pandas as pd
d = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'fruit': [
'apple, banana', 'apple', 'apple', 'pineapple', 'apple, pineapple',            'orange', 'apple, orange', 'orange', 'banana', 'apple, peach'],
'choice': ['orange', 'orange', 'apple', 'pineapple', 'apple', 'orange',  'orange', 'orange', 'banana', 'banana']}
df = pd.DataFrame(data=d)

Desired DataFrame:

import pandas as pd
d = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'fruit': [
'apple, banana', 'apple', 'apple', 'pineapple', 'apple, pineapple',   'orange', 'apple, orange', 'orange', 'banana', 'apple, peach'],
'choice': ['orange', 'orange', 'apple', 'pineapple', 'apple', 'orange',      'orange', 'orange', 'banana', 'banana'],
'choice_match': [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]}
df = pd.DataFrame(data=d)

4 Answers:

Answer 0 (score: 5)

In [75]: df['choice_match'] = (df['fruit']
                                 .str.split(r',\s*', expand=True)
                                 .eq(df['choice'], axis=0)
                                 .any(axis=1).astype(np.int8))

In [76]: df
Out[76]:
   ID     choice             fruit  choice_match
0   1     orange     apple, banana             0
1   2     orange             apple             0
2   3      apple             apple             1
3   4  pineapple         pineapple             1
4   5      apple  apple, pineapple             1
5   6     orange            orange             1
6   7     orange     apple, orange             1
7   8     orange            orange             1
8   9     banana            banana             1
9  10     banana      apple, peach             0

Step by step:

In [78]: df['fruit'].str.split(r',\s*', expand=True)
Out[78]:
           0          1
0      apple     banana
1      apple       None
2      apple       None
3  pineapple       None
4      apple  pineapple
5     orange       None
6      apple     orange
7     orange       None
8     banana       None
9      apple      peach

In [79]: df['fruit'].str.split(r',\s*', expand=True).eq(df['choice'], axis=0)
Out[79]:
       0      1
0  False  False
1  False  False
2   True  False
3   True  False
4   True  False
5   True  False
6  False   True
7   True  False
8   True  False
9  False  False

In [80]: df['fruit'].str.split(r',\s*', expand=True).eq(df['choice'], axis=0).any(axis=1)
Out[80]:
0    False
1    False
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9    False
dtype: bool

In [81]: df['fruit'].str.split(r',\s*', expand=True).eq(df['choice'], axis=0).any(axis=1).astype(np.int8)
Out[81]:
0    0
1    0
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    0
dtype: int8
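
For anyone who wants to run the pipeline above outside an IPython session, here is a self-contained sketch. It is not the answerer's exact code: it adds the numpy import the transcript relies on, passes `axis` to `any` as a keyword (which recent pandas releases require), and wraps the steps in a helper whose name, `match_choice`, is purely hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': range(1, 11),
    'fruit': ['apple, banana', 'apple', 'apple', 'pineapple', 'apple, pineapple',
              'orange', 'apple, orange', 'orange', 'banana', 'apple, peach'],
    'choice': ['orange', 'orange', 'apple', 'pineapple', 'apple',
               'orange', 'orange', 'orange', 'banana', 'banana'],
})

def match_choice(frame):
    """Hypothetical helper: 1 where `choice` equals one of the comma-separated fruits."""
    # One column per listed fruit; shorter rows are padded with None.
    parts = frame['fruit'].str.split(r',\s*', expand=True)
    # Compare every piece with the choice of the same row (aligned on the index).
    hits = parts.eq(frame['choice'], axis=0)
    # A row matches if any of its pieces is equal to the choice.
    return hits.any(axis=1).astype(np.int8)

df['choice_match'] = match_choice(df)
print(df['choice_match'].tolist())
```

The printed list should agree with the `choice_match` column shown in the transcript.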

Answer 1 (score: 5)

Here is one way:

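df['choice_match'] = df.apply(lambda row: row['choice'] in row['fruit'].split(', '),
                              axis=1).astype(int)

Explanation

  • df.apply with axis=1 iterates over each row and applies the logic; it accepts an anonymous lambda function.
  • row['fruit'].split(', ') turns the row's fruit value into a list. This is necessary so that, for example, apple is not counted as being contained in pineapple. Splitting on a comma followed by a space keeps the second item from retaining a leading space (' orange' would not equal 'orange').
  • astype(int) is needed to convert the Boolean values to integers for display.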
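
To make the point about splitting concrete, the sketch below compares a plain substring test with membership in the split list. The three-row frame is made-up illustration data, not the question's DataFrame, and the list comprehensions are just one possible way to write the row-wise check.

```python
import pandas as pd

# Illustrative data only (not the frame from the question).
df = pd.DataFrame({
    'fruit': ['pineapple', 'apple, pineapple', 'apple, orange'],
    'choice': ['apple', 'apple', 'orange'],
})

# Plain substring test: 'apple' is wrongly found inside 'pineapple'.
substring = [int(c in f) for f, c in zip(df['fruit'], df['choice'])]

# Membership in the split list: only whole items match.
# Splitting on ', ' (comma plus space) avoids leftover leading spaces.
membership = [int(c in f.split(', ')) for f, c in zip(df['fruit'], df['choice'])]

print(substring)   # [1, 1, 1]
print(membership)  # [0, 1, 1]
```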

Answer 2 (score: 4)

Option 1
Use Numpy's find
If find does not find the value, it returns -1

import numpy as np
from numpy.char import find  # numpy.char is the public namespace for these string functions

choice = df.choice.values.astype(str)
fruit = df.fruit.values.astype(str)

df.assign(choice_match=(find(fruit, choice) > -1).astype(np.uint))

   ID     choice             fruit  choice_match
0   1     orange     apple, banana             0
1   2     orange             apple             0
2   3      apple             apple             1
3   4  pineapple         pineapple             1
4   5      apple  apple, pineapple             1
5   6     orange            orange             1
6   7     orange     apple, orange             1
7   8     orange            orange             1
8   9     banana            banana             1
9  10     banana      apple, peach             0
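
One thing worth keeping in mind with this option is that find performs a substring search, so it does not distinguish whole items from parts of items. The sketch below uses `np.char.find` on made-up arrays (not the question's data) to show both the -1 convention and the substring behaviour.

```python
import numpy as np

# Illustrative data only.
fruit = np.array(['pineapple', 'apple, peach', 'apple, orange'])
choice = np.array(['apple', 'banana', 'orange'])

# np.char.find returns the index of the first occurrence, or -1 if absent.
positions = np.char.find(fruit, choice)
match = (positions > -1).astype(np.uint8)

print(positions.tolist())  # [4, -1, 7]
print(match.tolist())      # [1, 0, 1]
```

The first row matches because 'apple' occurs inside 'pineapple'. The question's data happens not to contain such a row, so Option 1 gives the desired result there.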

Option 2
Set logic
With sets, < is strict subset and <= is subset. Build a pd.Series of sets for each column and use <= to determine whether one column's sets are subsets of the other column's sets.

choice = df.choice.apply(lambda x: set([x]))
fruit = df.fruit.str.split(', ').apply(set)

df.assign(choice_match=(choice <= fruit).astype(np.uint))

   ID     choice             fruit  choice_match
0   1     orange     apple, banana             0
1   2     orange             apple             0
2   3      apple             apple             1
3   4  pineapple         pineapple             1
4   5      apple  apple, pineapple             1
5   6     orange            orange             1
6   7     orange     apple, orange             1
7   8     orange            orange             1
8   9     banana            banana             1
9  10     banana      apple, peach             0
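
The difference between the two set operators can be seen in a small sketch. The data is made up for illustration, and the comparison is written row by row with zip so that it relies only on Python's own set semantics.

```python
import pandas as pd

# Illustrative data only.
choice = pd.Series(['orange', 'apple', 'banana'])
fruit = pd.Series(['apple, banana', 'apple', 'apple, banana'])

choice_sets = choice.apply(lambda x: {x})           # one-element set per row
fruit_sets = fruit.str.split(', ').apply(set)       # set of listed fruits per row

# <= is "subset or equal"; < is "strict subset" (equal sets do not count).
subset = [int(c <= f) for c, f in zip(choice_sets, fruit_sets)]
strict = [int(c < f) for c, f in zip(choice_sets, fruit_sets)]

print(subset)  # [0, 1, 1]
print(strict)  # [0, 0, 1]
```

The second row shows why <= is the right operator here: {'apple'} is a subset of {'apple'} but not a strict subset.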

Option 3
Inspired by @Wen's answer
Use get_dummies and max

c = pd.get_dummies(df.choice)
f = df.fruit.str.get_dummies(', ')

df.assign(choice_match=pd.DataFrame.mul(*c.align(f, 'inner')).max(axis=1))

   ID     choice             fruit  choice_match
0   1     orange     apple, banana             0
1   2     orange             apple             0
2   3      apple             apple             1
3   4  pineapple         pineapple             1
4   5      apple  apple, pineapple             1
5   6     orange            orange             1
6   7     orange     apple, orange             1
7   8     orange            orange             1
8   9     banana            banana             1
9  10     banana      apple, peach             0
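
The indicator-matrix idea may be easier to follow with the steps unpacked. This is a sketch on made-up data, with the intermediate names `c_aligned` and `f_aligned` introduced only for illustration. The `astype(int)` call is an addition, since newer pandas versions return Boolean dummies from `pd.get_dummies`.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    'fruit': ['apple, banana', 'apple', 'apple, orange'],
    'choice': ['orange', 'apple', 'orange'],
})

c = pd.get_dummies(df['choice']).astype(int)   # one indicator column per chosen fruit
f = df['fruit'].str.get_dummies(', ')          # one indicator column per listed fruit

# Keep only the fruit names present in both frames, then multiply elementwise:
# a 1 survives only where the chosen fruit is also listed in that row.
c_aligned, f_aligned = c.align(f, join='inner')
choice_match = (c_aligned * f_aligned).max(axis=1)

print(choice_match.tolist())  # [0, 1, 1]
```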

Answer 3 (score: 3)

Hmm, I found an interesting way using get_dummies.

The expression below yields a Boolean mask that can then be assigned to the new column:

(df.fruit.str.replace(' ','').str.get_dummies(',')+df.choice.str.get_dummies()).gt(1).any(axis=1)
Out[726]: 
0    False
1    False
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9    False
dtype: bool
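
As a rough sketch of why the sum works and how the mask might be assigned back to the frame (the data is made up, and the assignment step is an assumption about the intended use, not code from the answer):

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    'fruit': ['apple, banana', 'apple', 'apple, orange'],
    'choice': ['orange', 'apple', 'orange'],
})

fruit_dummies = df['fruit'].str.replace(' ', '').str.get_dummies(',')
choice_dummies = df['choice'].str.get_dummies()

# Adding aligns on column names; a cell reaches 2 only when the fruit is
# both listed and chosen. Columns present in only one frame become NaN,
# and NaN > 1 evaluates to False.
total = fruit_dummies + choice_dummies
df['choice_match'] = total.gt(1).any(axis=1).astype(int)

print(df['choice_match'].tolist())  # [0, 1, 1]
```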