CSV with strings to sparse matrix

Time: 2018-07-23 20:12:51

Tags: python csv matrix scipy

I have a CSV file in the following format:

id,category
1x,restaurant
1x,café
2y,café
2y,indian restaurant
3z,italian restaurant

I would like a sparse matrix whose rows are the ids and whose columns are the categories.

For example:

     restaurant  café  indian restaurant  italian restaurant
1x   1           1     0                  0
2y   0           1     1                  0
3z   0           0     0                  1

It may be necessary to create a mapping from the id and category strings to integer indices into the matrix.
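A minimal sketch of such a mapping, using only the standard library. The function name build_index_maps and the inline CSV_TEXT sample are illustrative assumptions; a real run would open the CSV file instead. Each distinct id receives a row index and each distinct category a column index, in order of first appearance.

```python
import csv
import io

# Sample rows copied from the question; a real run would open the CSV file instead.
CSV_TEXT = """id,category
1x,restaurant
1x,café
2y,café
2y,indian restaurant
3z,italian restaurant
"""


def build_index_maps(lines):
    """Map ids to row indices and categories to column indices,
    and collect the (row, col) coordinates of the non-zero cells."""
    row_of = {}
    col_of = {}
    coords = []
    for record in csv.DictReader(lines):
        # setdefault hands out the next free integer the first time a key is seen
        r = row_of.setdefault(record["id"], len(row_of))
        c = col_of.setdefault(record["category"], len(col_of))
        coords.append((r, c))
    return row_of, col_of, coords


row_of, col_of, coords = build_index_maps(io.StringIO(CSV_TEXT))
```

Only the coordinates of the non-zero cells are stored, so memory grows with the number of CSV rows rather than with the number of ids times the number of categories.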

I need this matrix to compute cosine similarity with cosine_similarity, imported from sklearn.metrics.pairwise.
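For reference, cosine_similarity accepts SciPy sparse input directly. The sketch below hard-codes the coordinates of the five non-zero cells of the example matrix (an assumption made to keep the snippet short) and asks for a sparse result through dense_output=False.

```python
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Coordinates of the non-zero cells in the example matrix (rows 1x, 2y, 3z).
rows = [0, 0, 1, 1, 2]
cols = [0, 1, 1, 2, 3]
data = [1.0, 1.0, 1.0, 1.0, 1.0]

A_sparse = sparse.csr_matrix((data, (rows, cols)), shape=(3, 4))

# dense_output=False keeps the similarity matrix sparse as well
item_sim = cosine_similarity(A_sparse, dense_output=False)
```

Here 1x and 2y share one of their two categories, so their similarity is 0.5, while 3z shares none and scores 0 against both.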

Thanks!

Edit:

I have written this code. mapping_categories.csv contains rows of the form:

0,cat1
1,cat2
2,cat3
...

item_category_file_path points to the CSV of id,category rows.

This solution gives me a MemoryError.

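import csv
from collections import defaultdict

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

mapping_categories = {}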
with open("mapping_categories.csv") as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        mapping_categories[row[1]] = int(row[0].rstrip())

item_category = defaultdict(list)

with open(item_category_file_path) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        if row['category'] in mapping_categories:
            item_category[row['business_id']].append(row['category'])
mapping_items = {}

item_number = 0
for item in item_category:
    mapping_items[item] = item_number
    item_number += 1

matrix_item_category = [0] * len(mapping_items)
for item in item_category:
    # Allocate the row once per item so earlier categories are not overwritten
    row = [0] * len(mapping_categories)
    for category in item_category[item]:
        row[mapping_categories[category]] = 1
    matrix_item_category[mapping_items[item]] = row

A_sparse = sparse.csr_matrix(matrix_item_category)

item_sim = cosine_similarity(A_sparse)

This second solution, listed after the traceback, gives me this error:

  File "/home/fily1212/.local/lib/python3.6/site-packages/scipy/sparse/compressed.py", line 79, in __init__
    self._set_self(self.__class__(coo_matrix(arg1, dtype=dtype)))
  File "/home/fily1212/.local/lib/python3.6/site-packages/scipy/sparse/compressed.py", line 32, in __init__
    arg1 = arg1.asformat(self.format)
  File "/home/fily1212/.local/lib/python3.6/site-packages/scipy/sparse/base.py", line 287, in asformat
    return getattr(self, 'to' + format)()
  File "/home/fily1212/.local/lib/python3.6/site-packages/scipy/sparse/coo.py", line 342, in tocsr
    data = np.empty_like(self.data, dtype=upcast(self.dtype))
  File "/home/fily1212/.local/lib/python3.6/site-packages/scipy/sparse/sputils.py", line 51, in upcast
    raise TypeError('no supported conversion for types: %r' % (args,))
TypeError: no supported conversion for types: (dtype('O'),)

-

mapping_categories = {}
with open("mapping_categories.csv") as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        mapping_categories[row[1]] = int(row[0].rstrip())

item_category = defaultdict(list)

with open(item_category_file_path) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        if row['category'] in mapping_categories:
            item_category[row['business_id']].append(row['category'])

mapping_items = {}

matrix_item_category = {}
item_number = 0
for item in item_category:
    mapping_items[item] = item_number
    # Allocate the row once per item so earlier categories are not overwritten
    matrix_item_category[item_number] = [0] * len(mapping_categories)
    for category in item_category[item]:
        matrix_item_category[item_number][mapping_categories[category]] = 1
    item_number += 1

A_sparse = sparse.csr_matrix(matrix_item_category)

item_sim = cosine_similarity(A_sparse)
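A note on the TypeError in the traceback above, offered as a likely explanation rather than a confirmed diagnosis: in this attempt matrix_item_category is a dict, and NumPy wraps a dict as a single object-dtype value instead of a 2-D numeric array, which matches the dtype('O') in the message. The small sketch below uses a made-up three-row dict to show the difference.

```python
import numpy as np

# A dict of rows, shaped like matrix_item_category in the second attempt.
rows_as_dict = {0: [1, 1, 0, 0], 1: [0, 1, 1, 0], 2: [0, 0, 0, 1]}

as_array = np.asarray(rows_as_dict)
# The dict becomes one opaque Python object, not a 3x4 numeric array.
dict_dtype = as_array.dtype
dict_shape = as_array.shape

# Passing the rows as a list of lists gives a proper numeric array.
rows_as_list = [rows_as_dict[i] for i in range(len(rows_as_dict))]
as_matrix = np.asarray(rows_as_list)
```

A list of lists avoids the TypeError but is still a dense structure, so for large inputs it would probably run into the same MemoryError as the first attempt.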

1 answer:

Answer 0 (score: 0)

I'm not sure why you call the matrix "sparse". As described and shown, it is dense.

Read the file into a pandas DataFrame:

import pandas as pd
df = pd.read_csv("your_file_name", sep=",")

Add a dummy variable to the DataFrame:

df['dummy'] = 1

Declare a new index based on id and category, use unstack to reshape the matrix from "tall" to "square", then fill in the missing values:

mtx = df.set_index(['id','category']).unstack().fillna(0).astype(int)

Finally, fix the column headers:

mtx.columns = mtx.columns.levels[1]
mtx
#category  café indian restaurant italian restaurant restaurant
#id                                                            
#1x           1                 0                  0          1
#2y           1                 1                  0          0
#3z           0                 0                  1          0
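If the dense frame is too large for memory, the same 0/1 matrix can probably be built directly in SciPy's sparse format. This is a sketch under assumptions, not part of the steps above: it reads the sample data from an inline string and uses pd.factorize to turn the id and category strings into integer codes.

```python
import io

import numpy as np
import pandas as pd
from scipy import sparse

# Inline sample data (an assumption); a real run would pass a file path to read_csv.
csv_text = """id,category
1x,restaurant
1x,café
2y,café
2y,indian restaurant
3z,italian restaurant
"""
df = pd.read_csv(io.StringIO(csv_text))

# factorize maps each distinct string to an integer code, in order of first appearance
row_codes, row_labels = pd.factorize(df["id"])
col_codes, col_labels = pd.factorize(df["category"])

# One stored value per CSV row; every other cell is an implicit zero
A_sparse = sparse.csr_matrix(
    (np.ones(len(df)), (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels)),
)
```

row_labels and col_labels translate matrix indices back to the original strings. Note that a repeated (id, category) pair would be summed to a value above 1, so duplicates may need dropping first.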