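I have a dataframe (df_cluster) with 2 columns [customer ID, cluster]. There are about 13 clusters, and I'm trying to use apply() in Python to assign a name to each cluster. I've used the same function in the past and it worked fine, but now I get an "UnboundLocalError".
Please let me know if I'm doing something wrong. My understanding of apply() is that it passes the function along an axis (in this case, cluster_name is called for each row).
Here is the code: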
def cluster_name(df):
    if df['cluster'] == 1:
        value = 'A'
    elif df['cluster'] == 2:
        value = 'B'
    elif df['cluster'] == 3:
        value = 'C'
    elif df['cluster'] == 4:
        value = 'D'
    elif df['cluster'] == 5:
        value = 'E'
    elif df['cluster'] == 6:
        value = 'F'
    elif df['cluster'] == 7:
        value = 'G'
    return value

df_cluster['cluster_name'] = df_cluster.apply(cluster_name, axis = 1)
Error:
UnboundLocalError Traceback (most recent call last)
<ipython-input-16-b64f3fdc1260> in <module>
16 return value
17
---> 18 df_cluster['cluster_name'] = df_cluster.apply(cluster_name, axis = 1)
19 df_cluster['cluster_name'].value_counts()
/opt/cloudera/parcels/Anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6926 kwds=kwds,
6927 )
-> 6928 return op.get_result()
6929
6930 def applymap(self, func):
/opt/cloudera/parcels/Anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
184 return self.apply_raw()
185
--> 186 return self.apply_standard()
187
188 def apply_empty_result(self):
/opt/cloudera/parcels/Anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
290
291 # compute the result using the series generator
--> 292 self.apply_series_generator()
293
294 # wrap results
/opt/cloudera/parcels/Anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
319 try:
320 for i, v in enumerate(series_gen):
--> 321 results[i] = self.f(v)
322 keys.append(v.name)
323 except Exception as e:
<ipython-input-16-b64f3fdc1260> in cluster_name(df)
14 elif df['cluster'] == 7:
15 value = 'G'
---> 16 return value
17
18 df_cluster['cluster_name'] = df_cluster.apply(cluster_name, axis = 1)
UnboundLocalError: ("local variable 'value' referenced before assignment", 'occurred at index 0')
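Answer 0 (score: 0)
It looks like your question was already answered in the comments, so I'll suggest a more pandas-oriented approach. Using apply(axis=1) on a DataFrame is very slow and almost never necessary (it is the same as iterating over the rows of the dataframe), so a better approach is to use vectorized methods. The simplest is to define the cluster -> cluster_name mapping in a dictionary and use the map method. First, some test data: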
import pandas as pd

df = pd.DataFrame(
    {"cluster": [1,2,3,4,5,6,7]}
)
# repeat this dataframe 10000 times
df = pd.concat([df] * 10000)
The apply approach:
def mapping_func(row):
    if row['cluster'] == 1:
        value = 'A'
    elif row['cluster'] == 2:
        value = 'B'
    elif row['cluster'] == 3:
        value = 'C'
    elif row['cluster'] == 4:
        value = 'D'
    elif row['cluster'] == 5:
        value = 'E'
    elif row['cluster'] == 6:
        value = 'F'
    elif row['cluster'] == 7:
        value = 'G'
    else:
        # This is a "catch-all" in case none of the values in the column are 1-7
        value = "Z"
    return value
%timeit df.apply(mapping_func, axis=1)
# 1.32 s ± 91.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The .map approach:
mapping_dict = {
    1: "A",
    2: "B",
    3: "C",
    4: "D",
    5: "E",
    6: "F",
    7: "G"
}
# the `fillna` is our "catch-all" statement.
# essentially if `map` encounters a value not in the dictionary
# it will place a NaN there. So I fill those NaNs with "Z" to
# be consistent with the above example
%timeit df["cluster"].map(mapping_dict).fillna("Z")
# 4.87 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
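We can see that map with a dictionary is much faster than apply (about 4.87 ms versus 1.32 s on this data), and it also avoids the long if/elif chain.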
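To make the "catch-all" behaviour concrete, here is a minimal sketch on a tiny frame. The names df_demo and mapping_demo and the sample values are hypothetical, chosen only for illustration; cluster 9 is deliberately left out of the dictionary.

```python
import pandas as pd

# Hypothetical sample data: cluster 9 is deliberately missing from the mapping.
df_demo = pd.DataFrame({"cluster": [1, 2, 9]})
mapping_demo = {1: "A", 2: "B"}

# map() leaves NaN wherever the key is absent from the dictionary.
mapped = df_demo["cluster"].map(mapping_demo)
unmapped_count = int(mapped.isna().sum())

# fillna() then plays the role of the final `else` branch.
df_demo["cluster_name"] = mapped.fillna("Z")
names = df_demo["cluster_name"].tolist()
print(unmapped_count, names)
```

Counting the NaN values before filling them is a cheap way to notice that the dictionary does not cover every cluster.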
Answer 1 (score: 0)
Your function is missing an else:
def cluster_name(df):
    if df['cluster'] == 1:
        value = 'A'
    elif df['cluster'] == 2:
        value = 'B'
    elif df['cluster'] == 3:
        value = 'C'
    elif df['cluster'] == 4:
        value = 'D'
    elif df['cluster'] == 5:
        value = 'E'
    elif df['cluster'] == 6:
        value = 'F'
    elif df['cluster'] == 7:
        value = 'G'
    else:
        value = ...
    return value
Otherwise, if df['cluster'] is not one of the values {1, 2, ..., 7}, value is never set, and that causes the exception.
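The failure can be reproduced without pandas. The following is a minimal sketch, assuming a plain dict stands in for a dataframe row; the shortened function cluster_name_short and the sample rows are hypothetical.

```python
def cluster_name_short(row):
    # Same shape as the function in the question, shortened to two branches.
    if row["cluster"] == 1:
        value = "A"
    elif row["cluster"] == 2:
        value = "B"
    # No else branch: for any other cluster, `value` is never assigned.
    return value


known = cluster_name_short({"cluster": 1})

raised = False
try:
    cluster_name_short({"cluster": 13})
except UnboundLocalError as exc:
    # The exact message wording depends on the Python version.
    raised = True
    print(type(exc).__name__, exc)
```

With about 13 clusters and branches for only 1 to 7, any row in clusters 8 and above triggers the same error.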
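Answer 2 (score: 0)
if-else functions are overrated, and it is easy to miss a condition.
Instead, use string.ascii_uppercase to get all the letters, zip them with the values in 'cluster' to create a dict, and use .map to create the 'cluster_name' column.
That way a missed condition will not be a problem.
The "local variable 'value' referenced before assignment" error occurs when return value executes without any if-else condition having been met, which means value was never assigned in the function.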
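A minimal sketch of the zip idea, assuming the cluster ids are the consecutive integers 1 to 13 (the question only says there are about 13 clusters); the names cluster_ids and letter_map are hypothetical.

```python
import string

# Assumption: clusters are numbered 1..13.
cluster_ids = range(1, 14)

# zip() stops at the shorter input, so only the first 13 letters are used.
letter_map = dict(zip(cluster_ids, string.ascii_uppercase))
print(letter_map)

# With pandas this dict would be passed to map(), for example:
# df_cluster['cluster_name'] = df_cluster['cluster'].map(letter_map)
```

If the real ids are not consecutive, the dict could instead be built from the sorted unique values of the 'cluster' column.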