Question

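My original dataset is a large list of JSON objects describing adverse reactions to drugs. Each JSON object can list several drugs, identified by rxcui id, that produced the adverse reaction.

I have taken the list of JSON objects, extracted the important data we need (for example, whether the person died, and the rxcui), and flattened them into JSON objects at most 2 levels deep. We end up with something like this: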

{
  "serious": 1,
  "drug": [
    "DrugA",
    "DrugB",
    "DrugC"
  ],
  "rxcui": [
    100,
    200,
    300
  ]
}

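I have to figure out how to convert this into an array that I can feed to an ML algorithm. So my idea was to use one-hot encoding.

That is why I'm using CountVectorizer, so that I can vectorize all of these sublists.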
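As a minimal sketch of that idea, assuming records shaped like the JSON object above: each sublist is joined into one space-separated string and passed to CountVectorizer, which returns a scipy sparse matrix. The `records` and `docs` names are hypothetical, and `token_pattern=r"\S+"` and `binary=True` are choices made for this illustration, not settings taken from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical records shaped like the flattened JSON objects above.
records = [
    {"serious": 1, "rxcui": [100, 200, 300]},
    {"serious": 0, "rxcui": [200]},
    {"serious": 1, "rxcui": [300, 400]},
]

# Join each sublist into one space-separated string so it can be tokenized.
docs = [" ".join(str(code) for code in rec["rxcui"]) for rec in records]

# token_pattern=r"\S+" treats every whitespace-delimited token as a feature;
# binary=True records presence (0/1) instead of counts.
vect = CountVectorizer(token_pattern=r"\S+", binary=True)
X = vect.fit_transform(docs)  # scipy sparse matrix, one row per record

# Column order of X: vocabulary terms sorted by their column index.
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)
print(X.toarray())
```

Keeping `X` as a scipy sparse matrix, rather than converting it to a regular DataFrame, is what keeps the one-hot encoding small.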

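I'm trying to concatenate several pandas dataframes (some are sparse dataframes, some are regular ones) that are one-hot encodings of some data. I have checked all the files (I also pickled them to disk) and none of them is larger than 81 MB. But when I start concatenating them, they blow up to more than 29 GB. How is that possible?

All of my dfs look like this: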

Label0  Label1  Label2  Label3...  Label999
1       1       0       0     ...  0
1       1       0       1     ...  1
.
.
.

I run the concat like this:

x = pandas.concat([x, drugcharacterization, occurcountry, reactionmeddrapt, reactionmeddraversionpt, reactionoutcome, rxcui],axis=1, copy=False)

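I can also easily fit all of the sub-dataframes I'm trying to concatenate in memory. What is the reason it blows up once I concatenate them?

Edit: Here is how I obtain the dataframes. As you can see, I could not create a sparse matrix for one of them; it gave me an error: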

raise ValueError("empty vocabulary; perhaps the documents only contain stop words")

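import pandas
from sklearn.feature_extraction.text import CountVectorizer

# z (the flattened data) and categorical_labels (its categorical column
# names) are defined earlier and are not shown here.
rr = pandas.DataFrame()
for col in categorical_labels:
    print(col)
    try:
        # Sparse one-hot encoding of the column via CountVectorizer.
        vect = CountVectorizer()
        X = vect.fit_transform(z[col].astype(str).map(lambda x: ' '.join(x) if isinstance(x, list) else x))
        r = pandas.SparseDataFrame(X, columns=vect.get_feature_names(), index=z.index, default_fill_value=0).add_prefix(col + ' = ')
        r.to_pickle(col + '_subarr.pkl')
    except Exception:
        # Fallback when the sparse path fails (e.g. "empty vocabulary"):
        # a regular (dense) dummy encoding.
        r = z[col].astype(str).str.strip('[]').str.get_dummies(', ').add_prefix(col + ' = ')
        r.to_pickle(col + '_subarr.pkl')

    rr = pandas.concat([rr, r], axis=1)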
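One possible cause of the "empty vocabulary" error (an assumption; the question does not show the offending column) is that CountVectorizer's default `token_pattern` only keeps tokens of two or more alphanumeric characters, so a column whose values are all single characters yields no tokens at all. The sketch below reproduces the error with hypothetical data and shows one way around it.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical column whose values are all a single character long.
docs = ["1", "2", "1"]

# The default token_pattern drops one-character tokens, so nothing is left.
try:
    CountVectorizer().fit_transform(docs)
    failed = False
except ValueError as exc:
    failed = True
    print(exc)

# Allowing one-character tokens gives the vectorizer something to count.
vect = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vect.fit_transform(docs)
print(X.toarray())
```

With a pattern like this the column could stay on the sparse path instead of falling back to the dense `get_dummies` encoding.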

Here are their indexes:

drugcharacterization.index
Out[13]: RangeIndex(start=0, stop=234372, step=1)

occurcountry.index
Out[14]: RangeIndex(start=0, stop=234372, step=1)

reactionmeddrapt.index
Out[15]: RangeIndex(start=0, stop=234372, step=1)

reactionmeddraversionpt.index
Out[16]: RangeIndex(start=0, stop=234372, step=1)

reactionoutcome.index
Out[17]: RangeIndex(start=0, stop=234372, step=1)

rxcui.index
Out[18]: RangeIndex(start=0, stop=234372, step=1)
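A back-of-the-envelope check helps put the 29 GB figure in context. The sketch below estimates how large a fully dense int64 frame with 234372 rows would be, and compares dense and sparse storage on a small random frame. The column count of 15000 is purely hypothetical (the question does not state the total number of columns), and `dense_gigabytes` is a helper invented for this illustration.

```python
import numpy as np
import pandas as pd

def dense_gigabytes(n_rows, n_cols, itemsize=8):
    """Size in GB (1e9 bytes) of a fully materialized n_rows x n_cols block."""
    return n_rows * n_cols * itemsize / 1e9

# 234372 rows comes from the RangeIndex output above; 15000 columns is a
# hypothetical total, chosen only to show the order of magnitude.
print(dense_gigabytes(234372, 15000))

# Small demonstration: the same mostly-zero data stored densely and sparsely.
rng = np.random.default_rng(0)
small = pd.DataFrame((rng.random((1000, 200)) < 0.01).astype(np.int64))
dense_bytes = small.memory_usage(deep=True).sum()
sparse_bytes = small.astype(pd.SparseDtype("int64", 0)).memory_usage(deep=True).sum()
print(dense_bytes, sparse_bytes)
```

If the concatenated result ends up dense, a column count in the low tens of thousands would be enough to reach tens of gigabytes, even though each sparse piece is small on disk.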

Answer 1

AFAIK, pd.concat() generates a new regular (non-sparse) DataFrame.

Consider the following example (I used Pandas 0.20.1):

Source DFs:

In [118]: df
Out[118]:
                                                text
0  With free-text, each letter is actually an ind...
1                As far as the computer is concerned
2  no individual letter or number has any relatio...

In [119]: another
Out[119]:
    a   b   c
0  10  23  87
1  12  45  32
2  14  76  89

Let's one-hot-encode the text:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(stop_words='english')

# one-hot-encoded
# for Pandas version < 0.20.1 use: vect.fit_transform(df.text).A
ohe = pd.SparseDataFrame(vect.fit_transform(df.text),
                         columns=vect.get_feature_names(),
                         index=df.index,
                         default_fill_value=0)

Result - a SparseDataFrame (pay attention to the memory usage):

In [127]: ohe
Out[127]:
   actually  computer  concerned  far  free  independent  individual  letter  number  object  relationship  text
0         1         0          0    0     1            1           0       1       0       1             0     1
1         0         1          1    1     0            0           0       0       0       0             0     0
2         0         0          0    0     0            0           1       2       2       0             1     0

In [128]: ohe.memory_usage()
Out[128]:
Index           80
actually         8
computer         8
concerned        8
far              8
free             8
independent      8
individual       8
letter          16
number           8
object           8
relationship     8
text             8
dtype: int64

Let's concatenate this SparseDataFrame with the source DF (a regular one):

In [129]: r = pd.concat([another, df, ohe], axis=1)

In [130]: r
Out[130]:
    a   b   c                                               text  actually  computer  concerned  far  free  independent  individual  \
0  10  23  87  With free-text, each letter is actually an ind...         1         0          0    0     1            1           0
1  12  45  32                As far as the computer is concerned         0         1          1    1     0            0           0
2  14  76  89  no individual letter or number has any relatio...         0         0          0    0     0            0           1

   letter  number  object  relationship  text
0       1       0       1             0     1
1       0       0       0             0     0
2       2       2       0             1     0

In [131]: r.memory_usage()
Out[131]:
Index           80
a               24
b               24
c               24
text            24
actually        24
computer        24
concerned       24
far             24
free            24
independent     24
individual      24
letter          24
number          24
object          24
relationship    24
text            24
dtype: int64

NOTE: pd.concat() has created a new regular DataFrame, so all "sparsed" columns got "inflated"...

For purely numeric SparseDataFrames or SparseArrays we can use scipy.sparse.hstack([...]):

In [149]: from scipy import sparse

In [150]: r = pd.SparseDataFrame(sparse.hstack([ohe, another]), columns=ohe.columns.append(another.columns))

In [151]: r.memory_usage()
Out[151]:
Index           80
actually         8
computer         8
concerned        8
far              8
free             8
independent      8
individual       8
letter          16
number           8
object           8
relationship     8
text             8
a               24
b               24
c               24
dtype: int64
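pd.SparseDataFrame was removed in pandas 1.0, so the calls in this answer only run on older versions. A rough equivalent for newer pandas, sketched under the assumption that all blocks are numeric, is to stack scipy sparse matrices with scipy.sparse.hstack and wrap the result with pd.DataFrame.sparse.from_spmatrix. The `ohe_matrix` values and the `w0`..`w2` column names below are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical one-hot block, standing in for the CountVectorizer output.
ohe_matrix = sparse.csr_matrix(np.array([[1, 0, 0],
                                         [0, 1, 0],
                                         [0, 0, 2]]))

# The regular numeric frame from the example.
another = pd.DataFrame({"a": [10, 12, 14],
                        "b": [23, 45, 76],
                        "c": [87, 32, 89]})

# Stack column-wise in scipy so nothing is densified along the way.
stacked = sparse.hstack(
    [ohe_matrix, sparse.csr_matrix(another.to_numpy())], format="csr"
)

columns = ["w0", "w1", "w2"] + list(another.columns)
r = pd.DataFrame.sparse.from_spmatrix(stacked, index=another.index, columns=columns)

print(r.dtypes)          # every column is stored with a sparse dtype
print(r.memory_usage())
```

Columns such as a, b and c contain no zeros, so they gain nothing from sparse storage; the saving comes from the one-hot columns.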

Answer 2

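I suspect that your dataframes do not share an index, so you are constructing a dataframe that is much larger than you expect.

For example, consider the following: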

df1 = pd.DataFrame({'x': [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({'y': [2, 4, 6]}, index=[3, 4, 5])

print(pd.concat([df1, df2], axis=1))
     x    y
0  1.0  NaN
1  2.0  NaN
2  3.0  NaN
3  NaN  2.0
4  NaN  4.0
5  NaN  6.0

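Here we concatenate two dataframes and the result is 4 times the size of each input (12 cells versus 3), because the indexes are not shared. With your 7 dataframes, if none of the indexes are shared, the concatenated result could be roughly 50 times the size of a single input (7 × 7 = 49 for equally sized frames).

Without more information I can't be sure what is happening in your case, but this is where I would start investigating.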
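One way to run that check is sketched below, reusing df1 and df2 from the example; `indexes_match` is a helper name invented for this illustration. If the indexes differ but the rows are known to be in the same order, resetting them before concatenating avoids the NaN padding. (The RangeIndex outputs added to the question suggest the indexes do match in that particular case.)

```python
import pandas as pd

def indexes_match(frames):
    """Return True if every frame has exactly the same index as the first."""
    first = frames[0].index
    return all(first.equals(f.index) for f in frames[1:])

df1 = pd.DataFrame({"x": [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({"y": [2, 4, 6]}, index=[3, 4, 5])

print(indexes_match([df1, df2]))

# Misaligned indexes: concat takes the union of the labels and pads with NaN.
misaligned = pd.concat([df1, df2], axis=1)
print(misaligned.shape)

# Only valid if row i of every frame describes the same record.
aligned = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1
)
print(aligned.shape)
```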

Python - Pandas dataframe concatenation balloons in memory

2 answers: