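My strategy is to build a "name" -> "frequency" mapping and then convert the frequency to a string. If a name is uncommon, it should be replaced with some descriptive string. I would like two zones/thresholds: "less_common" and "rare", or something similar.
Here is my current attempt. I split it across several lines just for debugging, FYI. Line 3 does not work. I am using conda with Python 3.6.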
tmp = df["name"].groupby(df["name"])
tmp = tmp.agg(['count'])
tmp['count'] = tmp["count"].apply(lambda x: "Uncommon" if tmp["count"] < 1000.0 else str(x) )
labelDict = tmp.to_dict()
#some code?
df[columnName].replace(labelDict, inplace=True)
pd.get_dummies(df, columns=['name'])
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
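The error most likely comes from the lambda: `tmp["count"] < 1000.0` compares the whole Series, and `if` cannot reduce that to a single True/False. A minimal sketch of a per-name comparison, assuming a hypothetical single-column frame and a threshold of 2 chosen only for illustration:

```python
import pandas as pd

# Hypothetical data: one "name" column with skewed frequencies.
df = pd.DataFrame({"name": list("aaaabbbccd")})

counts = df["name"].value_counts()

# Each `count` here is a scalar, so `count <= 2` is a plain bool.
# Comparing the whole Series inside `if` is what raises
# "The truth value of a Series is ambiguous".
labels = {
    name: "Uncommon" if count <= 2 else name
    for name, count in counts.items()
}

# Map every row to its own name or to the shared "Uncommon" label.
df["name_bucketed"] = df["name"].map(labels)
print(labels)
```

Building a name -> label dict and applying it with `map` keeps the original column intact, which makes the result easy to inspect before encoding.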
Some sample input (there are other columns as well): name = a,a,a,a,b,b,b,c,c,d
name | count
a | 4
b | 3
c | 2
d | 1
Let's say the threshold T is <= 2:
a->4, b->3, c->"Uncommon", d->"Uncommon"
Remap the dict to use the original name wherever the mapped value is still numeric:
a->"a", b->"b", c->"Uncommon", d->"Uncommon"
As one hot:
date | id | name_a | name_b | name_Uncommon
... | ...| 1 | 0 | 0
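I admit I found a related solution, but it is not clear how to modify it to fit my needs. The problem is that you cannot one-hot encode a "first" column with values {a,b,c,...} and then one-hot encode a "second" column that may also have values {a,b,c,...} while labelling the resulting columns by value alone. I would get a name collision. The related question is "Pandas One hot encoding: Bundling together less frequent categories".
Answer 0 (score: 6)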
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(
    name=np.random.choice(list('abcdefghij'), 1000,
                          p=np.arange(10, 0, -1) / 55)
))
threshold = 60
counts = df.name.value_counts()
a 197
b 166
c 139
d 119
f 107
e 105
g 72
h 53
i 27
j 15
Name: name, dtype: int64
repl = counts[counts <= threshold].index
print(pd.get_dummies(df.name.replace(repl, 'uncommon')))
a b c d e f g uncommon
0 0 0 1 0 0 0 0 0
1 0 0 1 0 0 0 0 0
2 0 0 1 0 0 0 0 0
3 0 0 1 0 0 0 0 0
4 0 0 1 0 0 0 0 0
5 1 0 0 0 0 0 0 0
6 0 0 0 0 0 0 1 0
7 0 0 0 0 0 1 0 0
8 0 0 0 0 0 1 0 0
9 0 0 0 0 0 1 0 0
10 0 0 0 0 0 0 0 1
11 0 0 0 0 0 0 1 0
12 0 0 0 0 0 0 1 0
13 0 0 0 0 0 0 0 1
14 0 0 0 0 1 0 0 0
15 1 0 0 0 0 0 0 0
16 1 0 0 0 0 0 0 0
17 0 1 0 0 0 0 0 0
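The name collision raised in the question can be avoided by bundling each column separately and letting `get_dummies` prefix the new columns with the source column name, which it does by default when `columns=` is given. A sketch, assuming a hypothetical helper `bundle_rare` and hypothetical columns `first` and `second` that share the same value set:

```python
import pandas as pd

def bundle_rare(series, threshold, label="uncommon"):
    """Replace values occurring `threshold` times or fewer with `label`."""
    counts = series.value_counts()
    rare_values = counts[counts <= threshold].index
    # Keep values that are not rare; everything else becomes `label`.
    return series.where(~series.isin(rare_values), label)

# Hypothetical frame: both columns draw from the same letters.
df = pd.DataFrame({
    "first": list("aaaabbbccd"),
    "second": list("abbbbccccd"),
})

# Bundle each column against its own frequency table.
bundled = pd.DataFrame(
    {col: bundle_rare(df[col], threshold=2) for col in df.columns}
)

# The default prefix is the column name, so "first_b" and "second_b" stay distinct.
dummies = pd.get_dummies(bundled, columns=["first", "second"])
print(sorted(dummies.columns))
```

Counting per column matters here: a value can be common in one column and rare in another, so a shared frequency table would bundle the wrong values.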
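Answer 1 (score: 1)
Here is the solution I came up with. Basically, I needed to understand what an index is and how to modify it. Filtering or mapping with a compound condition did not work and gave the ambiguous-truth-value error message. I create one index for values with frequency < T1 and another for frequency < T2. To get the values that are >= T2 and < T1, I just take the set difference. Then, almost magically, passing the index (a sequence of values) to replace swaps the target values for rare/uncommon, and get_dummies performs the one-hot encoding.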
import pandas as pd

def onehot2(df, threshold_uncommon, threshold_rare, column, prefix, normalize=False):
    # Frequency of each value (a proportion if normalize=True)
    frequencies = df[column].value_counts(sort=False, normalize=normalize)
    idx1 = frequencies[frequencies < threshold_uncommon].index
    idx2 = frequencies[frequencies < threshold_rare].index
    # Values below threshold_uncommon but not below threshold_rare
    idx1 = idx1.difference(idx2)
    tmp = df.copy()  # avoid modifying the caller's DataFrame
    if idx1.shape[0] > 0:
        tmp[column] = tmp[column].replace(idx1, 'uncommon')
    if idx2.shape[0] > 0:
        tmp[column] = tmp[column].replace(idx2, 'rare')
    return pd.get_dummies(tmp, columns=[column], prefix=prefix, dummy_na=True)
def onehot(df, threshold, column, prefix, normalize=False):
    # Frequency of each value (a proportion if normalize=True)
    frequencies = df[column].value_counts(sort=False, normalize=normalize)
    idx = frequencies[frequencies < threshold].index
    tmp = df.copy()  # avoid modifying the caller's DataFrame
    if idx.shape[0] > 0:
        tmp[column] = tmp[column].replace(idx, 'uncommon')
    return pd.get_dummies(tmp, columns=[column], prefix=prefix, dummy_na=True)
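As a quick check of the two-threshold idea on the sample names from the question, here is a sketch that uses `isin` and `mask` instead of `replace`. The helper name `bucket_two_levels` and the thresholds 3 and 2 are assumptions made for illustration only:

```python
import pandas as pd

def bucket_two_levels(series, threshold_uncommon, threshold_rare):
    """Label values with count < threshold_rare as 'rare' and values with
    threshold_rare <= count < threshold_uncommon as 'uncommon'."""
    counts = series.value_counts()
    rare = counts[counts < threshold_rare].index
    # Index.difference removes the rare values from the below-uncommon set.
    uncommon = counts[counts < threshold_uncommon].index.difference(rare)
    out = series.mask(series.isin(uncommon), "uncommon")
    return out.mask(series.isin(rare), "rare")

# Sample input from the question: a x4, b x3, c x2, d x1.
names = pd.Series(list("aaaabbbccd"), name="name")
bucketed = bucket_two_levels(names, threshold_uncommon=3, threshold_rare=2)

dummies = pd.get_dummies(bucketed, prefix="name")
print(sorted(dummies.columns))
```

With these thresholds, a and b keep their own columns, c falls into the uncommon band, and d falls into the rare band.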