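I've been using list comprehensions to update DataFrame columns without any problems. But when I apply a filter to the DataFrame, the column is not updated, even though the comprehension returns the correct values. Below is a contrived example, purely to illustrate the problem.
First I set the Town column to Region, where Region is populated. Then, where Town is still empty, I try to extract its value from the Address. The problem is that the second update statement has no effect.
Clearly my understanding of comprehensions is lacking, so I'd appreciate any suggestions about what I'm doing wrong. Thanks!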
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import math
import pyodbc
#create dataframe
data = [{'Address': '123 Fake st, someTown, Nebraska', 'Region':'nebraska', 'metric1':50,'Town':''},
{'Address': '2345 Fake st, someTown, Nebraska', 'Region':'nebraska', 'metric1':50,'Town':''},
{'Address': '43 Fake st, someTown, Nebraska', 'Region':'nebraska', 'metric1':50,'Town':''},
{'Address': '1 Fake st, someTown, Nebraska', 'Region':'nebraska', 'metric1':50,'Town':''},
{'Address': '43 Fake st, someTown, NOBraska', 'Region':'', 'metric1':50,'Town':''},
{'Address': '6 Fake st, someTown, NOBraska', 'Region':'', 'metric1':50,'Town':''},
{'Address': '45 Fake st, someTown, NOBraska', 'Region':'', 'metric1':50,'Town':''},]
dataset = pd.DataFrame(data)
#set Town column to the region.
dataset['Town'] = [r for r in dataset['Region']]
#if Town column is still blank, find the region in the Address, correcting for a known bad spelling
dataset[dataset['Town'] =='']['Town'] = ['Nebraska' if sub.split(",")[2].strip() =='NOBraska' else sub.split(",")[2].strip() for sub in dataset[dataset['Town'] =='']['Address'].astype(str)]
#RESULT: the rows where dataset['Town'] is empty are not updated
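Answer 0 (score: 2)
The problem here is that by using the df[rows][cols] access pattern, you are not accessing the original DataFrame's values but a copy.
You should in fact be getting a warning like this: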
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
This situation is described in detail here.
In general, when assigning to a slice of a DataFrame, you should always use .iloc or .loc.
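To make the difference concrete, here is a minimal sketch on a two-row frame. The frame `df`, the `mask` variable and the two `town_after_*` names are made up for illustration and are not part of the question's code. The warning is silenced only so the demonstration stays quiet; the exact warning class depends on the pandas version.

```python
import warnings
import pandas as pd

# Hypothetical two-row frame, just for the demonstration.
df = pd.DataFrame({
    "Region": ["nebraska", ""],
    "Town": ["nebraska", ""],
})
mask = df["Town"] == ""

# Chained indexing: df[mask] builds a new object, so the assignment
# lands on that temporary object and df itself is left untouched.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the chained-assignment warning
    df[mask]["Town"] = ["Nebraska"]
town_after_chained = df["Town"].tolist()

# A single .loc call selects rows and column together and writes into df.
df.loc[mask, "Town"] = ["Nebraska"]
town_after_loc = df["Town"].tolist()

print(town_after_chained)
print(town_after_loc)
```

Only the second assignment changes `df`; the first one modifies an object that is discarded immediately.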
Here is an example of how to rewrite the assignment so that it actually modifies the DataFrame:
new_values = ['Nebraska' if sub.split(",")[2].strip() =='NOBraska'
else sub.split(",")[2].strip()
for sub in dataset[dataset['Town'] =='']['Address'].astype(str)]
# In this way I am getting the labels of the index, so that I can use .loc
empty_town_rows = dataset.index[dataset['Town'] =='']
dataset.loc[empty_town_rows, 'Town'] = new_values
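Personally, I always prefer to use .loc / .iloc when modifying the values of a DataFrame, so I would rewrite the first assignment as well. It is not strictly necessary there, though, because that assignment has no view-versus-copy problem.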
dataset.loc[:, 'Town'] = [r for r in dataset['Region']]
Answer 1 (score: 0)
I suggest you use loc to update values in the DataFrame.
In your case, you should use
dataset.loc[dataset['Town'] =='', 'Town'] = ['Nebraska' if sub.split(",")[2].strip() =='NOBraska' else sub.split(",")[2].strip() for sub in dataset[dataset['Town'] =='']['Address'].astype(str)]
Personally, I would suggest doing it this way instead:
updateTown = lambda row: row["Region"] if row["Region"] else row["Address"].split(",")[2].strip()
dataset['Town'] = dataset.apply(updateTown, axis=1)
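Note that the lambda above does not correct the known 'NOBraska' misspelling from the question. A possible variant that keeps the row-wise apply approach and adds the correction could look like the sketch below. The function name `pick_town` and the two-row `sample` frame are hypothetical and only there to keep the example self-contained.

```python
import pandas as pd

def pick_town(row):
    # Prefer the Region value when it is populated.
    if row["Region"]:
        return row["Region"]
    # Otherwise take the third comma-separated piece of the address...
    town = row["Address"].split(",")[2].strip()
    # ...and correct the known misspelling.
    return "Nebraska" if town == "NOBraska" else town

# Hypothetical sample rows modelled on the question's data.
sample = pd.DataFrame([
    {"Address": "123 Fake st, someTown, Nebraska", "Region": "nebraska", "Town": ""},
    {"Address": "43 Fake st, someTown, NOBraska", "Region": "", "Town": ""},
])
sample["Town"] = sample.apply(pick_town, axis=1)
print(sample["Town"].tolist())
```

Because the whole column is assigned in one step, there is no filtered intermediate object and the copy problem does not arise.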
Answer 2 (score: 0)
@FLab describes the problem well.
But your code can be improved further to make it more performant and readable:
def replacer(sub):
    # Take the third comma-separated field of the address.
    x = sub.split(',')[2].strip()
    # Correct the known misspelling.
    return 'Nebraska' if x == 'NOBraska' else x
dataset.loc[dataset['Town'] == '', 'Town'] = \
    dataset.loc[dataset['Town'] == '', 'Address'].astype(str).apply(replacer)
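As a further option, the same extraction can be sketched with the pandas `.str` accessor instead of a Python-level function. This is not from the answers above, just one way it might be written. The frame `addresses` and the names `needs_town` and `extracted` are invented for the example.

```python
import pandas as pd

# Hypothetical frame modelled on the question's data.
addresses = pd.DataFrame({
    "Address": ["1 Fake st, someTown, Nebraska", "6 Fake st, someTown, NOBraska"],
    "Town": ["", ""],
})
needs_town = addresses["Town"] == ""

extracted = (
    addresses.loc[needs_town, "Address"]
    .astype(str)
    .str.split(",")   # each address becomes a list of pieces
    .str[2]           # third piece of each list
    .str.strip()      # drop surrounding whitespace
    .replace("NOBraska", "Nebraska")  # fix the known misspelling
)
# The right-hand side is a Series, so .loc aligns it on the index labels.
addresses.loc[needs_town, "Town"] = extracted
print(addresses["Town"].tolist())
```

Assigning a Series through `.loc` matches rows by index label, so the values end up in the right rows even when the filter selects only part of the frame.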