I want to be able to take a list of dicts (records) in which some of the columns have a list of values as the cell value. Here is an example:
[{'fruit': 'apple', 'age': 27}, {'fruit':['apple', 'banana'], 'age': 32}]
How can I take this input and perform feature hashing on it (my dataset has thousands of columns)? Currently I am using one-hot encoding, but that seems to consume a lot of memory (more than I have on my system).
I tried using the dataset above and got an error:
x__ = h.transform(data)
Traceback (most recent call last):
File "<ipython-input-14-db4adc5ec623>", line 1, in <module>
x__ = h.transform(data)
File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 142, in transform
_hashing.transform(raw_X, self.n_features, self.dtype)
File "sklearn/feature_extraction/_hashing.pyx", line 52, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:2103)
TypeError: a float is required
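(The TypeError appears to come from FeatureHasher's dict input expecting string or numeric values only, so the list-valued cells trip it up. A minimal workaround sketch, assuming FeatureHasher(input_type='dict') and flattening list values into separate string features before hashing; the n_features value here is just a placeholder:)

from sklearn.feature_extraction import FeatureHasher

def flatten_record(d):
    # expand list-valued entries into separate key=value features, e.g.
    # {'fruit': ['apple', 'banana']} -> {'fruit=apple': 1, 'fruit=banana': 1}
    out = {}
    for k, v in d.items():
        if isinstance(v, list):
            for item in v:
                out['%s=%s' % (k, item)] = 1
        else:
            out[k] = v
    return out

data = [{'fruit': 'apple', 'age': 27}, {'fruit': ['apple', 'banana'], 'age': 32}]
h = FeatureHasher(n_features=1024, input_type='dict')  # n_features chosen arbitrarily here
x__ = h.transform(flatten_record(d) for d in data)     # scipy sparse matrix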
I also tried converting it to a DataFrame and passing that to the hasher:
x__ = h.transform(x_y_dataframe)
Traceback (most recent call last):
File "<ipython-input-15-109e7f8018f3>", line 1, in <module>
x__ = h.transform(x_y_dataframe)
File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 142, in transform
_hashing.transform(raw_X, self.n_features, self.dtype)
File "sklearn/feature_extraction/_hashing.pyx", line 46, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:1928)
File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 138, in <genexpr>
raw_X = (_iteritems(d) for d in raw_X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 15, in _iteritems
return d.iteritems() if hasattr(d, "iteritems") else d.items()
AttributeError: 'unicode' object has no attribute 'items'
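(The AttributeError seems to come from the fact that iterating over a DataFrame yields its column labels, so FeatureHasher ends up calling .items() on a unicode column name instead of a row dict; a quick check, assuming the frame was built from the example data:)

list(iter(x_y_dataframe))                        # ['age', 'fruit'] -- column labels, not row dicts
rows = x_y_dataframe.to_dict(orient='records')   # row dicts; list values still need flattening as above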
Any ideas how to achieve this with pandas or sklearn? Or maybe I could build the dummy variables a few thousand rows at a time?
Here is how I currently get the dummy variables with pandas:
def one_hot_encode(categorical_labels):
    res = []
    tmp = None
    for col in categorical_labels:
        # `x` is the full DataFrame of records; strip the list brackets from each
        # cell and build dummy columns from the comma-separated values
        v = x[col].astype(str).str.strip('[]').str.get_dummies(', ')  # can't set a prefix
        if len(res) == 2:
            # merge the frames accumulated so far to keep the list short
            tmp = pandas.concat(res, axis=1)
            del res
            res = [tmp, v]
            del tmp
            tmp = None
        else:
            res.append(v)
    result = pandas.concat(res, axis=1)
    return result
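On the "a few thousand rows at a time" idea: a rough sketch that builds the dummies chunk by chunk and lets concat align the differing columns afterwards (the chunk_size value and the single-column signature are assumptions):

import pandas

def one_hot_encode_chunked(frame, col, chunk_size=5000):
    pieces = []
    for start in range(0, len(frame), chunk_size):
        chunk = frame[col].iloc[start:start + chunk_size]
        # same bracket-stripping trick as above, applied to a slice of the rows
        pieces.append(chunk.astype(str).str.strip('[]').str.get_dummies(', '))
    # concat aligns the differing dummy columns across chunks; gaps become NaN -> 0
    return pandas.concat(pieces).fillna(0).astype(int)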
Answer 0 (score: 1)
Consider the following approach:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

lst = [{'fruit': 'apple', 'age': 27}, {'fruit': ['apple', 'banana'], 'age': 32}]
df = pd.DataFrame(lst)

vect = CountVectorizer()
# join list cells into one space-separated string so CountVectorizer can tokenize them
X = vect.fit_transform(df.fruit.map(lambda x: ' '.join(x) if isinstance(x, list) else x))
r = pd.DataFrame(X.A, columns=vect.get_feature_names(), index=df.index)
df.join(r)
Result:
In [66]: r
Out[66]:
   apple  banana
0      1       0
1      1       1

In [67]: df.join(r)
Out[67]:
   age            fruit  apple  banana
0   27            apple      1       0
1   32  [apple, banana]      1       1
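As a side note, since the question mentions thousands of such columns, the same trick can be looped over a column list and the sparse blocks stacked; a rough sketch (the helper name, the column prefixing and the example column list are assumptions):

from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

def vectorize_columns(frame, columns):
    mats, names = [], []
    for col in columns:
        vect = CountVectorizer()
        mats.append(vect.fit_transform(
            frame[col].map(lambda v: ' '.join(v) if isinstance(v, list) else str(v))))
        # prefix the feature names with the column so they stay distinguishable
        names.extend('%s_%s' % (col, f) for f in vect.get_feature_names())
    return sparse.hstack(mats).tocsr(), names

X_all, feature_names = vectorize_columns(df, ['fruit'])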
UPDATE: starting with Pandas 0.20.1 we can create a SparseDataFrame directly from a sparse matrix:
In [13]: r = pd.SparseDataFrame(X, columns=vect.get_feature_names(), index=df.index, default_fill_value=0)
In [14]: r
Out[14]:
   apple  banana
0      1       0
1      1       1

In [15]: r.memory_usage()
Out[15]:
Index     80
apple     16   # 2 * 8 bytes (np.int64)
banana     8   # 1 * 8 bytes (as there is only one `1` value)
dtype: int64

In [16]: r.dtypes
Out[16]:
apple     int64
banana    int64
dtype: object
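A note for newer environments: SparseDataFrame was removed in Pandas 1.0; the closest equivalent is the sparse accessor (a sketch, reusing the same X and vect as above):

r = pd.DataFrame.sparse.from_spmatrix(X, index=df.index, columns=vect.get_feature_names())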