Distinguishing rows that have the same value in some columns

Time: 2017-04-16 16:19:19

Tags: python pandas dataframe

Hello, I have a DataFrame of this form:

    Episode    Number Rating Series
    4 Days Out   2.9    9.1  "Breaking Bad" (2008)
    Buyout       5.6    9.0 "Breaking Bad" (2008)
    Pilot        1.1    9.0 "Breaking Bad" (2008)
    Dog Fight    1.12   9.0 "Suits" (2011)
    We're Done   4.7    9.0 "Suits" (2011)
    Privilege    5.6    8.9 "Suits" (2011)
    Pilot        1.1    8.9 "Suits" (2011)

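I want to create a new column named watched for this DataFrame. I will supply the episode numbers (from the 'Number' column) in a list and apply np.where to it, so that the watched column holds 'yes' or 'no' values.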

watchlist=[1.1, 4.7, 2.9]
df['watched'] = np.where(df['Number'].isin(watchlist), 'no', 'yes')

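This creates a new column with the value 'no' in the rows for episodes 4.7, 2.9 and 1.1. The problem is that I want 'no' in only one of the two rows numbered 1.1, not both. Is there a way to distinguish the two rows whose 'Number' value is 1.1? (They have different values in the 'Series' column but the same value in the 'Episode' column.)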
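To make the problem reproducible, here is a minimal sketch that rebuilds the DataFrame and shows why a check on 'Number' alone matches both Pilot rows. It assumes the series names are stored without the surrounding quotes (as in the answers below); the names matches and pilot_rows are made up for illustration.

```python
import pandas as pd

# Rebuild the DataFrame from the question (series names written without quotes).
df = pd.DataFrame({
    "Episode": ["4 Days Out", "Buyout", "Pilot", "Dog Fight",
                "We're Done", "Privilege", "Pilot"],
    "Number": [2.9, 5.6, 1.1, 1.12, 4.7, 5.6, 1.1],
    "Rating": [9.1, 9.0, 9.0, 9.0, 9.0, 8.9, 8.9],
    "Series": ["Breaking Bad (2008)"] * 3 + ["Suits (2011)"] * 4,
})

watchlist = [1.1, 4.7, 2.9]

# isin only looks at 'Number', so it cannot tell the two series apart.
matches = df["Number"].isin(watchlist)

# Both 'Pilot' rows match, one in each series.
pilot_rows = df.index[matches & (df["Episode"] == "Pilot")].tolist()
print(pilot_rows)
```

Any fix therefore has to bring the 'Series' column into the condition as well.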

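2 Answers:

Answer 0: (score: 1)

For a single watchlist

You can use isin together with np.where on a selection: choose the series you want to check, and use a different watchlist for each series. For your DataFrame df: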

      Episode  Number  Rating               Series
0  4 Days Out    2.90     9.1  Breaking Bad (2008)
1      Buyout    5.60     9.0  Breaking Bad (2008)
2       Pilot    1.10     9.0  Breaking Bad (2008)
3   Dog Fight    1.12     9.0         Suits (2011)
4  We're Done    4.70     9.0         Suits (2011)
5   Privilege    5.60     8.9         Suits (2011)
6       Pilot    1.10     8.9         Suits (2011)

watchlist

[1.1, 4.7, 2.9]

Assume the watchlist applies only to Breaking Bad. Use np.where to select only the rows that match Breaking Bad (2008), then use isin to check whether the values in the Number column are in watchlist:

df['Breaking Bad Watched'] = df['Number'][np.where(df['Series'] == "Breaking Bad (2008)")[0]].isin(watchlist)

This gives:

      Episode  Number  Rating               Series Breaking Bad Watched
0  4 Days Out    2.90     9.1  Breaking Bad (2008)                 True
1      Buyout    5.60     9.0  Breaking Bad (2008)                False
2       Pilot    1.10     9.0  Breaking Bad (2008)                 True
3   Dog Fight    1.12     9.0         Suits (2011)                  NaN
4  We're Done    4.70     9.0         Suits (2011)                  NaN
5   Privilege    5.60     8.9         Suits (2011)                  NaN
6       Pilot    1.10     8.9         Suits (2011)                  NaN
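The same column can also be built with a boolean mask and df.loc instead of the positional indices returned by np.where. This is only a sketch, not the answer's original code; the name in_series is made up for illustration, and the DataFrame is trimmed to the two columns involved.

```python
import pandas as pd

df = pd.DataFrame({
    "Number": [2.9, 5.6, 1.1, 1.12, 4.7, 5.6, 1.1],
    "Series": ["Breaking Bad (2008)"] * 3 + ["Suits (2011)"] * 4,
})
watchlist = [1.1, 4.7, 2.9]

# Boolean mask: True for the rows that belong to the series being checked.
in_series = df["Series"] == "Breaking Bad (2008)"

# isin runs only on the selected rows; assigning the result aligns it by index,
# so rows outside the selection are left as NaN.
df["Breaking Bad Watched"] = df.loc[in_series, "Number"].isin(watchlist)
print(df)
```

Selecting by mask works on row labels, so it does not depend on the DataFrame having a default 0..n-1 index.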

Then use map to convert True/False to Yes/No:

d = {True: 'Yes', False: 'No'}
df['Breaking Bad Watched'] = df['Breaking Bad Watched'].map(d)

      Episode  Number  Rating               Series Breaking Bad Watched
0  4 Days Out    2.90     9.1  Breaking Bad (2008)                  Yes
1      Buyout    5.60     9.0  Breaking Bad (2008)                   No
2       Pilot    1.10     9.0  Breaking Bad (2008)                  Yes
3   Dog Fight    1.12     9.0         Suits (2011)                  NaN
4  We're Done    4.70     9.0         Suits (2011)                  NaN
5   Privilege    5.60     8.9         Suits (2011)                  NaN
6       Pilot    1.10     8.9         Suits (2011)                  NaN
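The NaN rows survive the conversion because map returns NaN for any value that has no key in the dictionary. A small sketch on a standalone Series shows this; the "Not tracked" label is a hypothetical placeholder, not something from the question.

```python
import numpy as np
import pandas as pd

# A column like 'Breaking Bad Watched': booleans for one series, NaN elsewhere.
flags = pd.Series([True, False, True, np.nan], dtype=object)

# Values without a key in the dictionary (here: NaN) come back as NaN.
labels = flags.map({True: "Yes", False: "No"})

# Optional: give the unmatched rows an explicit label (hypothetical wording).
filled = labels.fillna("Not tracked")
print(filled.tolist())
```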

------------------------ For a watchlist dictionary ------------------------

If you have a watchlist in which the series and the episode numbers are specified separately:

watchlist = {'Breaking Bad (2008)': [1.1, 4.7, 2.9], 'Suits (2011)': [4.7, 5.6]}

You can iterate over it as follows:

# Collect the names of the new columns in new_col_list
new_col_list = []

# Python 3: dict.items() replaces iteritems(), and print is a function
for series, wlist in watchlist.items():
    # Remember the name of the column created for this series
    new_col_list.append('{} Watched'.format(series))
    # Flag the episodes of this series that appear in its watchlist
    print(series, wlist)
    df['{} Watched'.format(series)] = df['Number'][np.where(df['Series'] == series)[0]].isin(wlist)

This gives you:

      Episode  Number  Rating               Series  \
0  4 Days Out    2.90     9.1  Breaking Bad (2008)   
1      Buyout    5.60     9.0  Breaking Bad (2008)   
2       Pilot    1.10     9.0  Breaking Bad (2008)   
3   Dog Fight    1.12     9.0         Suits (2011)   
4  We're Done    4.70     9.0         Suits (2011)   
5   Privilege    5.60     8.9         Suits (2011)   
6       Pilot    1.10     8.9         Suits (2011)   

  Breaking Bad (2008) Watched Suits (2011) Watched  
0                        True                  NaN  
1                       False                  NaN  
2                        True                  NaN  
3                         NaN                False  
4                         NaN                 True  
5                         NaN                 True  
6                         NaN                False  

new_col_list = ['Breaking Bad (2008) Watched', 'Suits (2011) Watched']

[1] If there are only a few names, write them out by hand: use pd.concat to join the two watched columns, then drop those columns:

df['Watched'] = pd.concat([df['Breaking Bad (2008) Watched'].dropna(), df['Suits (2011) Watched'].dropna()])
# Remove old Columns
df.drop(['Breaking Bad (2008) Watched','Suits (2011) Watched'], axis=1, inplace=True)

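[2] If you have a list of column names, use a simple list comprehension that iterates over the names in new_col_list and passes the resulting columns to pd.concat: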

df['Watched'] = pd.concat([df['{}'.format(i)].dropna() for i in new_col_list])
# Remove old Name Columns
df.drop(new_col_list, axis=1, inplace=True)

# Convert True False to Yes No
d = {True: 'Yes', False: 'No'}
df['Watched'] = df['Watched'].map(d)
# Final Output:
df:
      Episode  Number  Rating               Series Watched
0  4 Days Out    2.90     9.1  Breaking Bad (2008)     Yes
1      Buyout    5.60     9.0  Breaking Bad (2008)      No
2       Pilot    1.10     9.0  Breaking Bad (2008)     Yes
3   Dog Fight    1.12     9.0         Suits (2011)      No
4  We're Done    4.70     9.0         Suits (2011)     Yes
5   Privilege    5.60     8.9         Suits (2011)     Yes
6       Pilot    1.10     8.9         Suits (2011)      No
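The concat step works because each per-series piece keeps its original row labels and the pieces cover disjoint rows, so assigning the concatenated result aligns it back by index. Here is a condensed sketch of the whole dictionary approach that skips the intermediate columns; the name pieces is made up for illustration, and the DataFrame is trimmed to the two columns involved.

```python
import pandas as pd

df = pd.DataFrame({
    "Number": [2.9, 5.6, 1.1, 1.12, 4.7, 5.6, 1.1],
    "Series": ["Breaking Bad (2008)"] * 3 + ["Suits (2011)"] * 4,
})
watchlist = {"Breaking Bad (2008)": [1.1, 4.7, 2.9], "Suits (2011)": [4.7, 5.6]}

# One boolean piece per series; each piece keeps the original row labels.
pieces = [
    df.loc[df["Series"] == series, "Number"].isin(wlist)
    for series, wlist in watchlist.items()
]

# The pieces cover disjoint row labels, so concat yields one value per row,
# and column assignment aligns those values back by index.
df["Watched"] = pd.concat(pieces).map({True: "Yes", False: "No"})
print(df)
```

A series that is missing from the dictionary would end up as NaN in 'Watched', the same as in the step-by-step version.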

Sources

Source for isin:

[1] How to check if a value is in the list in selection from pandas data frame?   http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html

Source for concat:

[2] https://stackoverflow.com/a/10972557/2254228

Source for map:

[3] Convert Pandas series containing string to boolean

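Answer 1: (score: 1)

There is a simple and efficient way to do this (considerably faster than the current answer, as the timings at the end show). For your DataFrame df and the watchlist dictionary watchlist, you can use df.loc with multiple conditions.

First, create a placeholder column: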

df['Watched'] = 'No'

      Episode  Number  Rating               Series Watched
0  4 Days Out    2.90     9.1  Breaking Bad (2008)      No
1      Buyout    5.60     9.0  Breaking Bad (2008)      No
2       Pilot    1.10     9.0  Breaking Bad (2008)      No
3   Dog Fight    1.12     9.0         Suits (2011)      No
4  We're Done    4.70     9.0         Suits (2011)      No
5   Privilege    5.60     8.9         Suits (2011)      No
6       Pilot    1.10     8.9         Suits (2011)      No

Then iterate over the watchlist:

# Python 3: dict.items() replaces iteritems()
for key, values in watchlist.items():
    df.loc[(df['Number'].isin(values)) & (df['Series'] == key), 'Watched'] = 'yes'

This gives df:

      Episode  Number  Rating               Series Watched
0  4 Days Out    2.90     9.1  Breaking Bad (2008)     yes
1      Buyout    5.60     9.0  Breaking Bad (2008)      No
2       Pilot    1.10     9.0  Breaking Bad (2008)     yes
3   Dog Fight    1.12     9.0         Suits (2011)      No
4  We're Done    4.70     9.0         Suits (2011)     yes
5   Privilege    5.60     8.9         Suits (2011)     yes
6       Pilot    1.10     8.9         Suits (2011)      No

No extra columns, concatenation, or column dropping is needed.
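If the watchlist grows to many series, the loop can be replaced by a single merge. This is an alternative sketch and not part of the original answer: it flattens the dictionary into a small DataFrame (called pairs here, a made-up name) and uses the indicator option of merge to see which rows found a match. It assumes every (Series, Number) pair appears only once in the watchlist.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Episode": ["4 Days Out", "Buyout", "Pilot", "Dog Fight",
                "We're Done", "Privilege", "Pilot"],
    "Number": [2.9, 5.6, 1.1, 1.12, 4.7, 5.6, 1.1],
    "Series": ["Breaking Bad (2008)"] * 3 + ["Suits (2011)"] * 4,
})
watchlist = {"Breaking Bad (2008)": [1.1, 4.7, 2.9], "Suits (2011)": [4.7, 5.6]}

# Flatten the dictionary into one row per (Series, Number) pair.
# The pairs must be unique, otherwise the merge would duplicate rows of df.
pairs = pd.DataFrame(
    [(series, number) for series, numbers in watchlist.items() for number in numbers],
    columns=["Series", "Number"],
)

# A left merge keeps every row of df in order; the '_merge' indicator column
# says whether a matching (Series, Number) pair was found in pairs.
merged = df.merge(pairs, on=["Series", "Number"], how="left", indicator=True)
df["Watched"] = np.where(merged["_merge"] == "both", "Yes", "No")
print(df)
```

The merge compares the float episode numbers for exact equality, just like isin, so the numbers in the watchlist have to be written the same way as in the DataFrame.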

Timings reported with this answer:

Total time this answer = 0.00800013542175 s
Total time accepted answer = 2.624944121596675 s
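The answer does not say how these timings were taken. A rough way to compare the two approaches yourself is timeit. The sketch below is an assumption-laden reconstruction: concat_approach and loc_approach are hypothetical helper names, both return a Series instead of modifying df, and both use 'Yes'/'No' so the results can be compared directly. The numbers depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

WATCHLIST = {"Breaking Bad (2008)": [1.1, 4.7, 2.9], "Suits (2011)": [4.7, 5.6]}

df = pd.DataFrame({
    "Number": [2.9, 5.6, 1.1, 1.12, 4.7, 5.6, 1.1],
    "Series": ["Breaking Bad (2008)"] * 3 + ["Suits (2011)"] * 4,
})


def concat_approach(frame):
    # Per-series isin checks, concatenated and mapped to labels (first answer).
    pieces = [
        frame.loc[frame["Series"] == series, "Number"].isin(numbers)
        for series, numbers in WATCHLIST.items()
    ]
    return pd.concat(pieces).map({True: "Yes", False: "No"}).reindex(frame.index)


def loc_approach(frame):
    # Placeholder column plus one .loc assignment per series (second answer).
    out = pd.Series("No", index=frame.index)
    for series, numbers in WATCHLIST.items():
        out.loc[frame["Number"].isin(numbers) & (frame["Series"] == series)] = "Yes"
    return out


# Time 50 runs of each approach on the same DataFrame.
t_concat = timeit.timeit(lambda: concat_approach(df), number=50)
t_loc = timeit.timeit(lambda: loc_approach(df), number=50)
print("concat approach: {:.4f} s, loc approach: {:.4f} s".format(t_concat, t_loc))
```

On a seven-row DataFrame the difference is dominated by fixed overhead, so a larger DataFrame gives a more meaningful comparison.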