I have a CSV file with the following format:
id,category
1x,restaurant
1x,café
2y,café
2y,indian restaurant
3z,italian restaurant
I want a sparse matrix whose rows are the ids and whose columns are the categories.
For example:
    restaurant  café  indian  italian
1x  1           1     0       0
2y  0           1     1       0
3z  0           0     0       1
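It may be necessary to create a mapping from the id and category strings to the integer keys of the matrix.
I need this matrix to compute cosine similarity with cosine_similarity, imported from sklearn.metrics.pairwise.
Thanks!
Edit:
I have written this code. mapping_categories.csv contains lines of the following form:
0,cat1
1,cat2
2,cat3
...
item_category_file_path points to the CSV of id,category.
This solution gives me a MemoryError.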
import csv
from collections import defaultdict

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Map each category name to its integer column index.
mapping_categories = {}
with open("mapping_categories.csv") as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        mapping_categories[row[1]] = int(row[0].rstrip())

item_category = defaultdict(list)
with open(item_category_file_path) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        if row['category'] in mapping_categories:
            item_category[row['business_id']].append(row['category'])

mapping_items = {}
item_number = 0
for item in item_category:
    mapping_items[item] = item_number
    item_number += 1

matrix_item_category = [0] * len(mapping_items)
for item in item_category:
    for category in item_category[item]:
        matrix_item_category[mapping_items[item]] = [0] * len(mapping_categories)
        matrix_item_category[mapping_items[item]][mapping_categories[category]] = 1

A_sparse = sparse.csr_matrix(matrix_item_category)
item_sim=cosine_similarity(A_sparse)
This second solution (the code after the traceback) gives me this error:
  File "/home/fily1212/.local/lib/python3.6/site-packages/scipy/sparse/compressed.py", line 79, in __init__
    self._set_self(self.__class__(coo_matrix(arg1, dtype=dtype)))
  File "/home/fily1212/.local/lib/python3.6/site-packages/scipy/sparse/compressed.py", line 32, in __init__
    arg1 = arg1.asformat(self.format)
  File "/home/fily1212/.local/lib/python3.6/site-packages/scipy/sparse/base.py", line 287, in asformat
    return getattr(self, 'to' + format)()
  File "/home/fily1212/.local/lib/python3.6/site-packages/scipy/sparse/coo.py", line 342, in tocsr
    data = np.empty_like(self.data, dtype=upcast(self.dtype))
  File "/home/fily1212/.local/lib/python3.6/site-packages/scipy/sparse/sputils.py", line 51, in upcast
    raise TypeError('no supported conversion for types: %r' % (args,))
TypeError: no supported conversion for types: (dtype('O'),)
-
mapping_categories = {}
with open("mapping_categories.csv") as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        mapping_categories[row[1]] = int(row[0].rstrip())

item_category = defaultdict(list)
with open(item_category_file_path) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        if row['category'] in mapping_categories:
            item_category[row['business_id']].append(row['category'])

mapping_items = {}
matrix_item_category = {}
item_number = 0
for item in item_category:
    mapping_items[item] = item_number
    for category in item_category[item]:
        matrix_item_category[item_number] = [0] * len(mapping_categories)
        matrix_item_category[item_number][mapping_categories[category]] = 1
    item_number += 1

A_sparse = sparse.csr_matrix(matrix_item_category)
item_sim=cosine_similarity(A_sparse)
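
A note on the TypeError: the traceback is consistent with matrix_item_category being a dict in this version. NumPy does not read a dict as a sequence of rows, so the conversion most likely ends up with dtype('O') (object), which is what the error message reports. A minimal sketch of that behaviour, using a made-up two-row dict (the values are illustrative, not taken from the real data):

```python
import numpy as np

# Hypothetical miniature of the dict built in the second attempt:
# row number -> one-hot list.
matrix_item_category = {0: [1, 1, 0, 0], 1: [0, 1, 1, 0]}

# NumPy wraps the whole dict in a zero-dimensional array of dtype object
# instead of treating its values as rows.
as_array = np.asarray(matrix_item_category)
print(as_array.dtype, as_array.shape)

# Turning the dict into a list of rows, ordered by row number,
# gives an ordinary numeric 2-D array.
rows = [matrix_item_category[i] for i in range(len(matrix_item_category))]
as_rows = np.asarray(rows)
print(as_rows.dtype.kind, as_rows.shape)
```

Converting to a list of rows would avoid the object dtype, but it still builds the full dense table in memory first, so it probably does not help with the MemoryError from the first attempt.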
Answer 0 (score: 0)
I'm not sure why you call the matrix "sparse". As described and displayed, it is dense.
Read the file into a pandas DataFrame:
import pandas as pd
df = pd.read_csv("your_file_name", sep=",")
Add a dummy variable to the DataFrame:
df['dummy'] = 1
Set a new index based on id and category, use unstack to pivot the matrix from "tall" to "wide", then fill the missing values with 0:
mtx = df.set_index(['id','category']).unstack().fillna(0).astype(int)
Finally, fix the column headers:
mtx.columns = mtx.columns.levels[1]
mtx
#category  café  indian restaurant  italian restaurant  restaurant
#id
#1x           1                  0                   0           1
#2y           1                  1                   0           0
#3z           0                  0                   1           0
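
If the real file has many ids and categories, the DataFrame above is stored densely and may hit the same MemoryError as in the question. One possible alternative, sketched here under assumptions, is to skip the dense table and build a SciPy CSR matrix straight from integer codes. The toy DataFrame stands in for the result of pd.read_csv, and the names row_codes, col_codes, ids and categories are illustrative only. pd.factorize supplies the string-to-int mapping the question asks about, and dense_output=False asks cosine_similarity to return a sparse result.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Toy data in the question's id,category format; in practice this would
# come from pd.read_csv on the real file.
df = pd.DataFrame({
    "id": ["1x", "1x", "2y", "2y", "3z"],
    "category": ["restaurant", "café", "café",
                 "indian restaurant", "italian restaurant"],
})

# Repeated (id, category) pairs would be summed by the constructor below,
# so drop them first to keep the entries at 0/1.
df = df.drop_duplicates()

# factorize gives every distinct string an integer code (in order of first
# appearance) and returns the unique values, i.e. the int -> string mapping.
row_codes, ids = pd.factorize(df["id"])
col_codes, categories = pd.factorize(df["category"])

# Build the matrix from (row, column) coordinates; only nonzeros are stored.
data = np.ones(len(df))
A_sparse = sparse.csr_matrix(
    (data, (row_codes, col_codes)),
    shape=(len(ids), len(categories)),
)

# Keep the similarity matrix sparse instead of a dense n_items x n_items array.
item_sim = cosine_similarity(A_sparse, dense_output=False)

print(list(ids))
print(list(categories))
print(A_sparse.toarray())
```

Row i of A_sparse corresponds to ids[i] and column j to categories[j]. Whether the sparse similarity matrix actually fits in memory depends on how many item pairs share at least one category.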