Question

I have a file like this:

0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

I want the first item of each line to be the key and the remaining items to be the value, as an array. My code does not work:

mRDD = rRDD.map(lambda line: (line[0], (np.array(int(line))))).collect()

The output I want:

(3, (1,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))

(4, (1,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))

My latest attempt:

import os.path
import numpy as np
baseDir = os.path.join('data')
inputPath = os.path.join('mydata', 'matriz_reglas_test.csv')
fileName = os.path.join(baseDir, inputPath)

reglasRDD = (sc.textFile(fileName, 8)
               .cache()
            )
regRDD = reglasRDD.map(lambda line: line.split('\n'))
print regRDD.take(5)

movRDD = regRDD.map(lambda line: (line[0], (int(x) for x in line[1:] if x))).collect()
print movRDD.take(5)

Error:

PicklingError: Can't pickle <type 'generator'>: attribute lookup __builtin__.generator failed

Any help is appreciated.

Answer 1

I finally found a solution:

    import os.path
    import re  # needed by tokenize() below
    import numpy as np
    baseDir = os.path.join('data')
    inputPath = os.path.join('mydata', 'matriz_reglas_test.csv')
    fileName = os.path.join(baseDir, inputPath)
    split_regex = r'\W+'

    def tokenize(string):
        """ An implementation of input string tokenization
        Args:
            string (str): input string
        Returns:
            list: a list of tokens
        """
        s = re.split(split_regex, string)
        return [int(word) for word in s if word]


    reglasRDD = (sc.textFile(fileName, 8)
                   .map(tokenize)
                   .cache()
                )

    movRDD = reglasRDD.map(lambda line: (line[0], (line[1:])))
    print movRDD.take(5)

Output:

[(0, [1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]),
 (1, [1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]),
 (2, [1,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]),
 (3, [1,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]),
 (4, [1,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0])]

Thanks!
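A note on why the attempt in the question fails, as far as I can tell: `sc.textFile` already yields one string per line, so `line.split('\n')` only wraps each line in a one-element list, and the value in the tuple is a generator expression, which cannot be pickled when Spark serializes the records. The sketch below reproduces both points in plain Python, without a SparkContext. `parse_line` is a hypothetical helper name, not part of the code above.

```python
import pickle
import re

SPLIT_REGEX = r'\W+'  # same pattern as in the answer above


def parse_line(line):
    """Turn '3,1,0,1' into (3, [1, 0, 1]). Hypothetical helper for illustration."""
    tokens = [int(tok) for tok in re.split(SPLIT_REGEX, line) if tok]
    return tokens[0], tokens[1:]


sample = "3,1,1,1,0,0,0,1"
key, values = parse_line(sample)
print(key, values)

# A list value can be pickled, which Spark needs in order to ship records.
pickle.dumps((key, values))

# A generator value cannot: this mirrors the PicklingError from the question.
try:
    pickle.dumps((key, (int(x) for x in sample.split(",")[1:])))
except (TypeError, pickle.PicklingError) as exc:
    print("cannot pickle generator:", exc)
```

With a SparkContext available, the same function could presumably be passed straight to `map`, for example `sc.textFile(fileName).map(parse_line)`. Note that `\W+` also splits on a minus sign, so this only suits non-negative integers.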

Answer 2

I'm not sure about the rRDD.map().collect() part, but you can easily read the file with np.genfromtxt() and build the mapping with a dict comprehension.

import numpy as np
data_array = np.genfromtxt('data.csv', delimiter=',', dtype=int)  # dtype=int instead of the default float
data_dict = {first:rest for first, *rest in data_array}

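The for loop iterates over the rows of the array (one per line of the file). Unpacking assigns the first element to first and the rest of the row to rest. Note that this starred unpacking is new in Python 3! If you use Python 2, you can change the dict comprehension slightly: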

data_dict = {row[0]:row[1:] for row in data_array}
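To make the idea concrete, here is a minimal, self-contained sketch that reads two shortened rows from an in-memory string instead of a file. The sample text and the conversion to plain Python ints and tuples are my own assumptions for illustration; the slicing form is used so it does not depend on Python 3 unpacking.

```python
import io

import numpy as np

# Two shortened rows standing in for the real file (assumed: plain ASCII commas).
text = u"3,1,1,0\n4,1,0,1\n"

# dtype=int keeps keys and values as integers instead of the default floats.
data_array = np.genfromtxt(io.StringIO(text), delimiter=',', dtype=int)

# First column becomes the key, the remaining columns become the value.
data_dict = {int(row[0]): tuple(int(v) for v in row[1:]) for row in data_array}
print(data_dict)
```

Keep in mind that this loads the whole file into memory on one machine, so it is an alternative to the Spark approach rather than a distributed solution. Also, a file with a single row comes back from genfromtxt as a 1-D array, so the loop above would need adjusting for that case.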

Answer 3

The following (non-optimized) code might put you on the right track:

with open("tmp.txt", "r") as f:
    for line in f:
        # Split on commas first so that multi-digit keys are handled too
        fields = [int(x) for x in line.strip().split(",") if x]
        if not fields:
            continue  # skip blank lines
        first = fields[0]
        rest = tuple(fields[1:])
        tup = (first, rest)
        print(tup)
