Question

Given a TSV file like this:

doc_id/query_id 1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99  100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
1000001 0   0   0   1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
1000002 0   0   0   0   0   0   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

The first row is the header row: doc_id/query_id is the header of the first column, and the remaining headers are the 150 integers in [1, 150].

Each value row consists of the ID in the first column, followed by zeros or non-zero values in the other columns.

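The goal is to extract the pairs of ID and column name for the non-zero values. For example, given the two rows of data above, the expected output is:

1000001 4
1000001 9
1000002 7
1000002 8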

There are 800,000 rows in the data, so I tried using SFrame and avoiding pandas:

import turicreate as tc
from tqdm import tqdm

df = tc.SFrame('data.tsv')

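with open('ground_truth.non-zeros.tsv', 'w') as fout:
    for i in tqdm(range(len(df))):
        row = df[i]  # fetch the row once instead of once per column
        for j in range(1, 151):
            if row[str(j)]:
                # write the ID and the column name of each non-zero value
                print(row['doc_id/query_id'], j, sep='\t', file=fout)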

Is there a simpler way to extract the non-zero values together with their row IDs?

Pandas solutions or other dataframe solutions are also appreciated! Please state the limitations, if known =)

Answer 1

Here is a simple approach using stack and query:

(df.set_index('doc_id/query_id')
   .stack()
   .to_frame('tmp')
   .query('tmp == 1')
   .index
   .values)

array([(1000001, '4'), (1000001, '9'), (1000002, '7'), (1000002, '8')],
      dtype=object)

This approach favors elegance over performance.
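Stacking produces one entry per cell, which for the file in the question is roughly 800,000 × 150 values. One way to keep the same idea within memory is to read the file in chunks. The following is only a minimal sketch, assuming the file is tab-separated; the helper name `nonzero_pairs`, the chunk size, and the sample data are illustrative and not part of the answer above:

```python
import io
import pandas as pd

def nonzero_pairs(source, chunksize=100_000):
    """Yield (row_id, column_name) for every non-zero cell, chunk by chunk."""
    reader = pd.read_csv(source, sep="\t", index_col=0, chunksize=chunksize)
    for chunk in reader:
        stacked = chunk.stack()  # one entry per (row_id, column_name) cell
        for row_id, col in stacked[stacked != 0].index:
            yield row_id, col

# Tiny in-memory stand-in for the real file (hypothetical sample data).
sample = io.StringIO(
    "doc_id/query_id\t1\t2\t3\t4\n"
    "1000001\t0\t0\t0\t1\n"
    "1000002\t1\t0\t1\t0\n"
)
pairs = list(nonzero_pairs(sample, chunksize=1))
print(pairs)
```

Only one chunk is stacked at a time, so peak memory depends on `chunksize` rather than on the length of the file.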

You can also start from NumPy, which is aimed at the best performance.

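import numpy as np

filename = 'data.tsv'  # path to the TSV file from the question

arr = np.loadtxt(filename, skiprows=1, usecols=np.r_[1:151], dtype=int)
index = np.loadtxt(filename, skiprows=1, usecols=[0], dtype=int)

r, c = np.where(arr)  # positions of the non-zero cells
np.column_stack([index[r], c+1])  # c is 0-based, column names start at 1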

array([[1000001,       4],
       [1000001,       9],
       [1000002,       7],
       [1000002,       8]])
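To get these pairs onto disk, as the loop in the question does, the stacked array can be passed to `np.savetxt`. This is a minimal sketch that uses small stand-in arrays in place of the loaded file; the output file name is borrowed from the question, and placing it in the temporary directory is just an assumption for the example:

```python
import os
import tempfile
import numpy as np

# Small stand-ins for `index` and `arr` from the snippet above (hypothetical data).
index = np.array([1000001, 1000002])
arr = np.array([[0, 0, 0, 1],
                [1, 0, 1, 0]])

r, c = np.nonzero(arr)                      # row/column positions of non-zero cells
pairs = np.column_stack([index[r], c + 1])  # column names are 1-based

# One "doc_id<TAB>query_id" line per non-zero cell.
out_path = os.path.join(tempfile.gettempdir(), "ground_truth.non-zeros.tsv")
np.savetxt(out_path, pairs, fmt="%d", delimiter="\t")
```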

Answer 2

Here is a NumPy-based approach, which I think should speed the whole process up a bit:

import numpy as np

t, v = np.where(df.iloc[:, 1:] == 1)  # row and column positions of the ones
list(zip(df['doc_id/query_id'].iloc[t], df.columns[v + 1]))
Out[135]: [(1000001, '4'), (1000001, '9'), (1000002, '7'), (1000002, '8')]
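Note that the comparison `== 1` only finds cells equal to one. If the data can hold other non-zero values, a variant based on `np.nonzero` may be closer to what the question asks. A minimal sketch on a small hypothetical frame (the sample values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one cell holds 2 rather than 1.
df = pd.DataFrame({
    "doc_id/query_id": [1000001, 1000002],
    "1": [0, 2],
    "2": [1, 0],
})

values = df.iloc[:, 1:].to_numpy()
t, v = np.nonzero(values)                   # positions of all non-zero cells
ids = df["doc_id/query_id"].to_numpy()[t].tolist()
pairs = list(zip(ids, df.columns[v + 1]))   # +1 skips the ID column
print(pairs)
```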

Answer 3

A non-pandas answer: you can iterate over the file and grab the columns as needed:

results = []

with open('yourfile.csv') as fh:
    headers = next(fh).split()
    for line in fh:
        _id, *line = line.split()
        non_zero = [{_id: header} for header, val in zip(headers[1:], line) if val!="0"]
        results.extend(non_zero)

# Where you now have the option to throw it into whatever data structure you want
results

[{'1000001': '4'}, {'1000001': '9'}, {'1000002': '7'}, {'1000002': '8'}]

This way, even though you do pay for the list.extend operations, you don't load the whole file into memory; you only grab what you need.
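If even the `results` list is too large to hold, the same loop can be turned into a generator that writes each pair as soon as it is found. This is a sketch under the same whitespace-splitting assumption as the code above; the name `iter_nonzero` and the in-memory buffers are illustrative stand-ins for real files:

```python
import io

def iter_nonzero(lines):
    """Yield (row_id, column_name) for each non-zero cell, one line at a time."""
    lines = iter(lines)
    headers = next(lines).split()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _id, *values = line.split()
        for header, val in zip(headers[1:], values):
            if val != "0":
                yield _id, header

# Hypothetical sample input; a real run would pass an open file handle instead.
sample = io.StringIO(
    "doc_id/query_id 1 2 3 4\n"
    "1000001 0 0 0 1\n"
    "1000002 1 0 1 0\n"
)
out = io.StringIO()  # stands in for open('ground_truth.non-zeros.tsv', 'w')
for _id, header in iter_nonzero(sample):
    out.write(f"{_id}\t{header}\n")
```

Nothing is accumulated between lines, so memory use stays flat regardless of the number of rows.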

How do I extract rows with non-zero column values?
