Updating DataFrame columns using a comprehension

Time: 2018-03-02 10:51:13

Tags: python

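I have been updating DataFrame columns with list comprehensions without any problems. Once I apply a filter to the DataFrame, however, things go wrong: even though the comprehension returns the correct values, the column is not updated. Below is a contrived example, purely to illustrate the problem.

I first set the Town column to the Region, where Region is populated. Then, where Town is still unpopulated, I try to find its value in the Address. The problem is that the second update statement has no effect.

Clearly my understanding of comprehensions is inadequate, so I would appreciate suggestions about what I am doing wrong. Thanks!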

import pandas as pd  # the only import this example actually uses

#create dataframe

data = [{'Address': '123 Fake st, someTown, Nebraska', 'Region':'nebraska', 'metric1':50,'Town':''},
    {'Address': '2345 Fake st, someTown, Nebraska', 'Region':'nebraska', 'metric1':50,'Town':''},
    {'Address': '43 Fake st, someTown, Nebraska', 'Region':'nebraska', 'metric1':50,'Town':''},
    {'Address': '1 Fake st, someTown, Nebraska', 'Region':'nebraska', 'metric1':50,'Town':''},
    {'Address': '43 Fake st, someTown, NOBraska', 'Region':'', 'metric1':50,'Town':''},
    {'Address': '6 Fake st, someTown, NOBraska', 'Region':'', 'metric1':50,'Town':''},
    {'Address': '45 Fake st, someTown, NOBraska', 'Region':'', 'metric1':50,'Town':''},]

dataset = pd.DataFrame(data)

#set Town column to the region.
dataset['Town'] = [r for r in dataset['Region']]

#if Town column is still blank, find the region in the Address, correcting for a known bad spelling
dataset[dataset['Town'] =='']['Town']  =  ['Nebraska' if sub.split(",")[2].strip() =='NOBraska' else sub.split(",")[2].strip() for sub in dataset[dataset['Town'] =='']['Address'].astype(str)]  

#RESULT: the rows where dataset['Town'] is empty are not updated

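3 Answers:

Answer 0 (score: 2)

The problem here is that by using the df[rows][cols] access pattern (chained indexing), you are not accessing the values of the original DataFrame but a copy of them.

You should indeed be getting a warning like this: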

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

This situation is described in detail here

In general, when assigning to a slice of a DataFrame, you should always use .iloc or .loc
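To make the difference concrete, here is a minimal sketch on a hypothetical three-row frame (the town names are made up for illustration and are not from the question). It assumes default pandas settings, where chained assignment only warns rather than raises; the warning is silenced so the snippet runs quietly.

```python
import warnings

import pandas as pd

# Hypothetical three-row frame, used only for this illustration.
df = pd.DataFrame({"Region": ["nebraska", "", ""], "Town": ["nebraska", "", ""]})
mask = df["Town"] == ""

# Chained indexing: df[mask] builds a new object, so the assignment
# lands on that temporary copy and df itself is left untouched.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the chained-assignment warning
    df[mask]["Town"] = ["Omaha", "Lincoln"]
town_after_chained = df["Town"].tolist()

# A single .loc call selects rows and column together and writes to df.
df.loc[mask, "Town"] = ["Omaha", "Lincoln"]
town_after_loc = df["Town"].tolist()
```

After the chained assignment the Town column still holds its original values; only the .loc call changes it.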

Here is an example of how to rewrite the assignment so that it actually modifies the DataFrame:

new_values = ['Nebraska' if sub.split(",")[2].strip() =='NOBraska'
              else sub.split(",")[2].strip()
              for sub in dataset[dataset['Town'] =='']['Address'].astype(str)]

# In this way I am getting the labels of the index, so that I can use .loc
empty_town_rows = dataset.index[dataset['Town'] =='']

dataset.loc[empty_town_rows, 'Town']  =  new_values

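Personally, I always prefer to use .loc / .iloc when modifying the values of a DataFrame, so I would rewrite the first assignment as well. This is not strictly necessary, though, because there is no view-versus-copy issue there.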

dataset.loc[:, 'Town'] = [r for r in dataset['Region']]
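As a quick check that the two spellings agree for a whole-column update, the following sketch compares them on a hypothetical two-row frame (df_a and df_b are illustrative names, not part of the question):

```python
import pandas as pd

# Two identical hypothetical frames, one per assignment style.
df_a = pd.DataFrame({"Region": ["nebraska", ""], "Town": ["", ""]})
df_b = df_a.copy()

# Plain column assignment: no row filter, so no intermediate copy is involved.
df_a["Town"] = df_a["Region"]

# The .loc spelling of the same whole-column update.
df_b.loc[:, "Town"] = [r for r in df_b["Region"]]

same_result = df_a["Town"].tolist() == df_b["Town"].tolist()
```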

Answer 1 (score: 0)

I suggest you use loc to update values in the DataFrame.

In your case, you should use

dataset.loc[dataset['Town'] =='', 'Town'] = ['Nebraska' if sub.split(",")[2].strip() =='NOBraska' else sub.split(",")[2].strip() for sub in dataset[dataset['Town'] =='']['Address'].astype(str)]

Personally, I would suggest doing it this way

updateTown = lambda row: row["Region"] if row["Region"] else row["Address"].split(",")[2].strip()
dataset['Town'] = dataset.apply(updateTown, axis=1)
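Note that the lambda as written returns 'NOBraska' unchanged for rows with an empty Region, because it does not apply the spelling correction from the question. One possible variant, sketched here with a hypothetical named function update_town and a two-row sample taken from the question's data, folds that correction in:

```python
import pandas as pd

def update_town(row):
    """Return Region when it is filled, else the third comma-separated Address field."""
    if row["Region"]:
        return row["Region"]
    town = row["Address"].split(",")[2].strip()
    # Correct the known misspelling from the question's sample data.
    return "Nebraska" if town == "NOBraska" else town

# Two rows in the shape of the question's data: one with Region, one without.
df = pd.DataFrame({
    "Address": ["123 Fake st, someTown, Nebraska", "43 Fake st, someTown, NOBraska"],
    "Region": ["nebraska", ""],
})
df["Town"] = df.apply(update_town, axis=1)
```

Because apply with axis=1 builds the whole column in one pass, no filtered assignment is needed and the copy problem does not arise.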

Answer 2 (score: 0)

@FLab describes the problem well.

However, your code can be improved further to make it more performant and readable:

def replacer(sub):
    x = sub.split(',')[2].strip()
    return 'Nebraska' if x == 'NOBraska' else x

dataset.loc[dataset['Town'] == '', 'Town']  =  \
dataset.loc[dataset['Town'] == '', 'Address'].astype(str).apply(replacer)
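If avoiding a Python-level function call per row matters, the same update can probably be expressed with pandas string methods alone. This is only a sketch of an alternative, not part of the answer above; from_address is a hypothetical name, and the two-row frame is cut down from the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "Address": ["123 Fake st, someTown, Nebraska", "43 Fake st, someTown, NOBraska"],
    "Region": ["nebraska", ""],
})
df["Town"] = df["Region"]

# Third comma-separated field of every address, stripped, with the misspelling fixed.
from_address = (
    df["Address"].astype(str)
    .str.split(",").str[2]
    .str.strip()
    .replace("NOBraska", "Nebraska")
)

# Keep Town where it is non-empty; otherwise take the value parsed from Address.
df["Town"] = df["Town"].where(df["Town"] != "", from_address)
```

Series.where keeps the existing value wherever the condition holds and takes the aligned value from from_address elsewhere, so the whole column is assigned at once without any filtered write.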