New column in a pandas DataFrame based on existing column values

Time: 2016-12-16 17:18:49

Tags: python pandas dataframe

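I have a DataFrame with a column named 'States' that lists US states. I need to create another column holding the region specifier 'Atlantic Coast'. I have lists of the states that belong to each region, so if a state in df['States'] matches a state in the list 'Atlantic_states', the specifier 'Atlantic Coast' should be inserted into the new column df['region specifier']. The code below shows the list I want to compare my DataFrame values against, followed by the output of the df['States'] column.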

 #list of states
 Atlantic_states = ['Virginia',
              'Massachusetts',
              'Maine',
              'New York',
              'Rhode Island',
              'Connecticut',
              'New Hampshire',
              'Maryland',
              'Delaware',
              'New Jersey',
              'North Carolina',
              'South Carolina',
              'Georgia',
              'Florida']
 print(df['States'])

 Out:
                 States
       0         Virginia
       1    Massachusetts
       2            Maine
       3         New York
       4     Rhode Island
       5      Connecticut
       6    New Hampshire
       7         Maryland
       8         Delaware
       9       New Jersey
       10  North Carolina
       11  South Carolina
       12         Georgia
       13         Florida
       14       Wisconsin
       15        Michigan
       16            Ohio
       17    Pennsylvania
       18        Illinois
       19         Indiana
       20       Minnesota
       21        New York
       22      Washington
       23          Oregon
       24      California

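2 Answers:

Answer 0 (score: 3)

While Andy's answer works, it is not the most efficient way of doing this. There is a handy method that can be called on almost all pandas Series-like objects: .isin(). It accepts lists, dicts and pandas Series.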

import numpy as np
import pandas as pd

df = pd.DataFrame(['Virginia','Massachusetts','Maine','New York','Rhode Island',
                   'Connecticut','New Hampshire','Maryland', 'Delaware',
                   'New Jersey','North Carolina', 'South Carolina','Georgia','Florida',
                   'Wisconsin','Michigan', 'Ohio','Pennsylvania','Illinois',
                   'Indiana','Minnesota','New York','Washington','Oregon',
                   'California'],
                  columns=['States'])

Atlantic_states = ['Virginia', 'Massachusetts', 'Maine', 'New York','Rhode Island',
                   'Connecticut', 'New Hampshire',  'Maryland', 'Delaware',
                   'New Jersey', 'North Carolina', 'South Carolina', 'Georgia',
                   'Florida']

df['Coast'] = np.where(df['States'].isin(Atlantic_states), 'Atlantic Coast',
                       'Unknown')
df.head()

Out[1]: 

    States          Coast
0   Virginia        Atlantic Coast
1   Massachusetts   Atlantic Coast
2   Maine           Atlantic Coast
3   New York        Atlantic Coast
4   Rhode Island    Atlantic Coast
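
To make the role of the mask explicit, here is a minimal sketch of what .isin() returns and how the same column could be filled with .loc instead of np.where. The four-row frame and the shortened Atlantic_states list are illustrative assumptions, not the question's full data.

```python
import pandas as pd

# Small illustrative frame; the values are a subset of the question's data.
df = pd.DataFrame({'States': ['Virginia', 'Maine', 'Ohio', 'Oregon']})
Atlantic_states = ['Virginia', 'Maine']

# .isin() returns a boolean Series aligned with df['States'].
mask = df['States'].isin(Atlantic_states)

# Start from a default label, then overwrite only the matching rows.
df['Coast'] = 'Unknown'
df.loc[mask, 'Coast'] = 'Atlantic Coast'

print(df)
```

Either form should give the same column; np.where is more compact when there are exactly two labels.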

Benchmarks

Here are some timings for mapping the first 10 letters of the alphabet onto some random int numbers:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(low=0, high=26, size=(1000000,1)),
                  columns=['numbers'])
letters = dict(zip(range(10), 'abcdefghij'))

Apply

%%timeit
def is_atlantic(state):
    return True if state in letters else False

df.numbers.apply(is_atlantic)

Out[]: 1 loops, best of 3: 432 ms per loop

Now map, as suggested by JohnE

%%timeit
df.numbers.map(letters)

Out[]: 10 loops, best of 3: 56.9 ms per loop

And finally isin (also suggested by Nickil Maveli)

%%timeit 
df.numbers.isin(letters)

Out[]: 10 loops, best of 3: 20.9 ms per loop

So in this test .isin() is roughly 20 times faster than .apply() and more than twice as fast as .map().

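Note: apply and isin here only return a boolean mask, whereas map fills in the desired strings. Even so, when the result is assigned to another column, isin takes about 2/3 of the time of map.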
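
As a rough illustration of the two assignments being compared in the note, the sketch below fills one column with map and another with isin plus np.where. The five-row frame and the 'known'/'unknown' labels are assumptions made for the example; no timings are implied.

```python
import numpy as np
import pandas as pd

# Hypothetical small inputs mirroring the benchmark setup above.
df = pd.DataFrame({'numbers': [0, 3, 9, 12, 25]})
letters = dict(zip(range(10), 'abcdefghij'))

# map: looks each value up in the dict; values with no matching key become NaN.
df['via_map'] = df['numbers'].map(letters)

# isin + np.where: build a boolean mask from the dict's keys,
# then choose one of two labels per row.
df['via_isin'] = np.where(df['numbers'].isin(list(letters)), 'known', 'unknown')

print(df)
```

Note that the two columns are not equivalent: map carries the looked-up value, while the isin version can only distinguish members from non-members.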

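Answer 1 (score: 2)

You have a few options. First, to directly answer the question as posed:

Option 1

Create a function that returns whether or not the state is in the Atlantic region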

def is_atlantic(state):
    return "Atlantic" if state in Atlantic_states else "Unknown"

Now you can use .apply() and assign the result to a new column

df['Region'] = df['State'].apply(is_atlantic)

This returns a DataFrame that looks like this:

    State           Region
0   Virginia        Atlantic
1   Massachusetts   Atlantic
2   Maine           Atlantic
3   New York        Atlantic
4   Rhode Island    Atlantic
5   Connecticut     Atlantic
6   New Hampshire   Atlantic
7   Maryland        Atlantic
8   Delaware        Atlantic
9   New Jersey      Atlantic
10  North Carolina  Atlantic
11  South Carolina  Atlantic
12  Georgia         Atlantic
13  Florida         Atlantic
14  Wisconsin       Unknown
15  Michigan        Unknown
16  Ohio            Unknown
17  Pennsylvania    Unknown
18  Illinois        Unknown
19  Indiana         Unknown
20  Minnesota       Unknown
21  New York        Atlantic
22  Washington      Unknown
23  Oregon          Unknown
24  California      Unknown

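Option 2

The first option becomes cumbersome if you have multiple lists to check against. Instead of multiple lists, I suggest creating a dictionary with the State as the key and the region as the value. With only 50 values, this should be easy enough to maintain.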

regions = {
    'Virginia': 'Atlantic',
    'Massachusetts': 'Atlantic',
    'Maine': 'Atlantic',
    'New York': 'Atlantic',
    'Rhode Island': 'Atlantic',
    'Connecticut': 'Atlantic',
    'New Hampshire': 'Atlantic',
    'Maryland': 'Atlantic',
    'Delaware': 'Atlantic',
    'New Jersey': 'Atlantic',
    'North Carolina': 'Atlantic',
    'South Carolina': 'Atlantic',
    'Georgia': 'Atlantic',
    'Florida': 'Atlantic',
    'Wisconsin': 'Midwest',
    'Michigan': 'Midwest',
    'Ohio': 'Midwest',
    'Pennsylvania': 'Midwest',
    'Illinois': 'Midwest',
    'Indiana': 'Midwest',
    'Minnesota': 'Midwest',
    'Washington': 'West',
    'Oregon': 'West',
    'California': 'West'
}
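
If the per-region lists already exist, a dictionary like the one above does not have to be typed by hand. The sketch below inverts a region-to-states mapping; region_lists and regions_from_lists are hypothetical names, and the lists are shortened for brevity (only Atlantic_states appears in the question).

```python
# Hypothetical per-region lists, shortened for illustration.
region_lists = {
    'Atlantic': ['Virginia', 'Maine', 'New York'],
    'Midwest': ['Wisconsin', 'Ohio'],
    'West': ['Washington', 'Oregon'],
}

# Invert region -> [states] into state -> region.
regions_from_lists = {
    state: region
    for region, states in region_lists.items()
    for state in states
}

print(regions_from_lists['Maine'])
```

If a state were listed under two regions, the later region would silently win, so the source lists should not overlap.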

You can use .apply() again, with a slightly modified function:

def get_region(state):
    return regions[state]

df['Region'] = df['State'].apply(get_region)

This time your DataFrame looks like this:

    State           Region
0   Virginia        Atlantic
1   Massachusetts   Atlantic
2   Maine           Atlantic
3   New York        Atlantic
4   Rhode Island    Atlantic
5   Connecticut     Atlantic
6   New Hampshire   Atlantic
7   Maryland        Atlantic
8   Delaware        Atlantic
9   New Jersey      Atlantic
10  North Carolina  Atlantic
11  South Carolina  Atlantic
12  Georgia         Atlantic
13  Florida         Atlantic
14  Wisconsin       Midwest
15  Michigan        Midwest
16  Ohio            Midwest
17  Pennsylvania    Midwest
18  Illinois        Midwest
19  Indiana         Midwest
20  Minnesota       Midwest
21  New York        Atlantic
22  Washington      West
23  Oregon          West
24  California      West
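
One caveat with looking up regions[state] directly is that a state missing from the dictionary raises a KeyError. A possible alternative, sketched here under the assumption that unmatched states should get a default label, is Series.map() followed by fillna(). The frame and dictionary are shortened for illustration, and 'Texas' is a made-up extra row.

```python
import pandas as pd

# Shortened, illustrative versions of the frame and dictionary above.
df = pd.DataFrame({'State': ['Virginia', 'Ohio', 'Oregon', 'Texas']})
regions = {'Virginia': 'Atlantic', 'Ohio': 'Midwest', 'Oregon': 'West'}

# regions['Texas'] would raise KeyError; .map() yields NaN for it instead,
# and .fillna() replaces that NaN with a default label.
df['Region'] = df['State'].map(regions).fillna('Unknown')

print(df)
```

This also avoids a Python-level function call per row, which is in line with the map timing reported in the other answer.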