Question

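I have 300 small CSV files (10-100 MB each). I want to treat each file independently. For each row of a file, I want to count how many times that row's value in a given column occurs in the next 20,000 rows, and use that count as the row's label. I am thinking of processing the files concurrently to speed things up. Is this something I can do with Spark, or should I try something else? Most uses of Spark seem to treat the data as one large dataset, so I am not sure my use case is feasible.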

input: directory of files d

concurrently for each file in d:
  for each row r in file:
    count where r['a'] == x['a'] for x in next 20,000 rows
    add this count as a column 'label' in row r

output: same files, but with extra column 'label'
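
To make the intent of the pseudocode above concrete, here is a minimal sketch in plain Python (standard library only, not Spark). It is one possible reading, not a definitive implementation. Several things are assumptions: the names `label_values`, `label_file`, and `label_directory` are made up for illustration, each file is assumed to have a header row and to fit in memory, and labeled copies are written to a separate output directory rather than overwriting the inputs. Each row's label counts matches in the rows strictly after it, as the pseudocode states.

```python
import csv
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from pathlib import Path


def label_values(values, window):
    """For each position i, count how many of the next `window` values equal values[i]."""
    n = len(values)
    labels = [0] * n
    counts = Counter()  # values of the rows currently inside the look-ahead window
    # Walk backwards so the window for row i is already known when we reach it.
    for i in range(n - 1, -1, -1):
        labels[i] = counts[values[i]]
        counts[values[i]] += 1  # row i is inside the window of row i - 1
        if i + window < n:
            counts[values[i + window]] -= 1  # row i + window is out of reach for row i - 1
    return labels


def label_file(src, out_dir, column="a", window=20_000):
    """Write a copy of one CSV file with an extra 'label' column (hypothetical helper)."""
    src = Path(src)
    with src.open(newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)  # assumption: the whole file fits in memory
        fieldnames = list(reader.fieldnames or [])
    labels = label_values([row[column] for row in rows], window)
    dst = Path(out_dir) / src.name
    with dst.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames + ["label"])
        writer.writeheader()
        for row, label in zip(rows, labels):
            row["label"] = label
            writer.writerow(row)
    return dst


def label_directory(in_dir, out_dir, column="a", window=20_000, workers=None):
    """Label every *.csv file in in_dir, one file per worker process."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(Path(in_dir).glob("*.csv"))
    job = partial(label_file, out_dir=out_dir, column=column, window=window)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job, files))
```

Because the files are independent, each one is handed to its own worker process and no data is shared between them. When run as a script, the call to `label_directory` should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.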

Example (looking ahead 2 rows instead of 20,000):

Input: 
file1: [('a':'pen', 'b':'apple'), ('a':'bike', 'b':'apple'), ('a':'pen', 'b':'bike')] 
file2: [('a':'chair', 'b':'apple'), ('a':'chair', 'b':'pen'), ('a':'chair', 'b':'pen')]


Output: 
file1: [('a':'pen', 'b':'apple', 'label':1), ('a':'bike', 'b':'apple', 'label':0), ('a':'pen', 'b':'bike', 'label':0)]
file2: [('a':'chair', 'b':'apple', 'label':2), ('a':'chair', 'b':'pen', 'label':1), ('a':'chair', 'b':'pen', 'label':0)]

Labeling files independently and concurrently in Spark

0 answers: