Question

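I'm new to Python and working on a data analysis project, and I need some help. I have a dataframe with 400,000 rows and the following columns: ID, Type 1 Category, Type 2 Category, Type 3 Category, Amount, Age, Fraud.

The category columns are columns of lists. Each list contains different terms, and I want to build a matrix that counts how many times each term occurs in that row (one column per term, holding its frequency).

So the goal is to create a dataframe that is a sparse matrix, with each unique category becoming a column. My dataset has more than 2,000 distinct categories; maybe that is why CountVectorizer is not a good fit for this?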
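To make the target concrete, here is a minimal standard-library sketch of the term-count matrix for the three example rows from the table in this question. The names `rows`, `vocabulary`, `column_of`, `triplets` and `to_dense` are illustrative only, and sorting the vocabulary is an assumption about column order. It stores only the non-zero cells as (row, column, count) triplets, which is the idea behind a sparse matrix.

```python
from collections import Counter

# Example rows copied from the question's table (Type 1 Category only).
rows = {
    "ID1": ["Lex1", "Lex2", "Lex1", "Lex4", "Lex2", "Lex1"],
    "ID2": ["Lex3", "Lex6", "Lex3", "Lex6", "Lex3", "Lex1", "Lex2"],
    "ID3": ["Lex7", "Lex3", "Lex2", "Lex3", "Lex3"],
}

# One column per distinct term; sorted order is an assumption.
vocabulary = sorted({term for terms in rows.values() for term in terms})
column_of = {term: j for j, term in enumerate(vocabulary)}

# Sparse (row, column, count) triplets: only non-zero cells are stored.
triplets = []
for i, terms in enumerate(rows.values()):
    for term, count in Counter(terms).items():
        triplets.append((i, column_of[term], count))


def to_dense(triplets, n_rows, n_cols):
    """Expand the triplets into a dense list of lists (small inputs only)."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for i, j, count in triplets:
        dense[i][j] = count
    return dense


dense = to_dense(triplets, len(rows), len(vocabulary))
for row_id, counts in zip(rows, dense):
    print(row_id, counts)
```

Each list is scanned once with `Counter`, so the work grows with the total number of terms rather than with rows times vocabulary size.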

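I tried two approaches, one using CountVectorizer and the other using for loops.

However, CountVectorizer crashes every time I run it, and the second approach is far too slow. So I would like to know whether either of these solutions can be improved.

I also split the dataframe into multiple chunks, but that still causes problems.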


Example:
+------+--------------------------------------------+---------+---------+
|  ID  |             Type 1 Category                | Amount  | Fraud   |
+------+--------------------------------------------+---------+---------+
| ID1  | [Lex1, Lex2, Lex1, Lex4, Lex2, Lex1]       |  110.0  |    0    |
| ID2  | [Lex3, Lex6, Lex3, Lex6, Lex3, Lex1, Lex2] |  12.5   |    1    |
| ID3  | [Lex7, Lex3, Lex2, Lex3, Lex3]             |  99.1   |    0    |
+------+--------------------------------------------+---------+---------+
# First approach: CountVectorizer
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

col = 'Type 1 Category'

# prior to this, I combined the entire dataframe based on ID
# this was from the old dataframe, where each row had a different occurrence of an ID
# and only one category per row
terms = df_old[col].unique()

countvec = CountVectorizer(vocabulary=terms)

# create bag of words (.toarray() builds a dense rows x terms array)
df = df.join(pd.DataFrame(countvec.fit_transform(df[col]).toarray(),
                          columns=countvec.get_feature_names(),
                          index=df.index))

# drop original column of lists
df = df.drop(col, axis=1)


# Second approach: split the dataframe into chunks using np.split

df_l3 = df_split[3]

output.index = df_l3.index
# Assign the columns.
output[['ID', col]] = df_l3[['ID', col]]

# this chunk's index starts at 114305
last = 114305 + int(df_l3.shape[0])

for i in range(114305, last):
    print(i)
    for word in words:
        # count occurrences of this term in row i's list (label-based lookup)
        output.loc[i, str(word)] = output[col][i].count(str(word))

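The CountVectorizer approach runs out of memory, and the second approach no longer counts the frequencies. It works for chunk 1, where the index starts at zero, but not for the other chunks.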
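One possible direction, offered as a sketch rather than a tested fix for the full dataset: a dense 400,000 x 2,000 array of 64-bit integers needs roughly 6.4 GB, so densifying the counts is a likely cause of the memory problem. The example below keeps the counts sparse and passes each list through unchanged with a callable `analyzer`. The helper name `identity`, the hard-coded `terms` list, and the sample index labels are assumptions made for illustration; it also assumes scikit-learn and pandas 0.25 or later (for `pd.DataFrame.sparse.from_spmatrix`).

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data from the question's table, with a non-zero-based index
# (assumed labels) to mimic a later chunk.
df = pd.DataFrame(
    {
        "ID": ["ID1", "ID2", "ID3"],
        "Type 1 Category": [
            ["Lex1", "Lex2", "Lex1", "Lex4", "Lex2", "Lex1"],
            ["Lex3", "Lex6", "Lex3", "Lex6", "Lex3", "Lex1", "Lex2"],
            ["Lex7", "Lex3", "Lex2", "Lex3", "Lex3"],
        ],
    },
    index=[114305, 114306, 114307],
)
col = "Type 1 Category"
terms = ["Lex1", "Lex2", "Lex3", "Lex4", "Lex6", "Lex7"]  # assumed vocabulary


def identity(tokens):
    """Each row is already a list of terms, so use it as-is."""
    return tokens


# A callable analyzer receives each raw row and returns its tokens.
countvec = CountVectorizer(analyzer=identity, vocabulary=terms)
X = countvec.fit_transform(df[col])  # SciPy sparse matrix, never densified

# Wrap the sparse matrix in a DataFrame with sparse columns, reusing the
# original index labels so the join lines up for any chunk.
counts = pd.DataFrame.sparse.from_spmatrix(X, index=df.index, columns=terms)
result = df.drop(columns=col).join(counts)
print(result)
```

Because the counts are joined on the original index labels, this does not depend on the index starting at zero. If the downstream model accepts SciPy sparse input, using `X` directly avoids building the wide DataFrame at all.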

Runtime crash when creating a term-frequency dataframe from a column of lists using CountVectorizer

0 answers: