Python MapReduce: converting text to an array

Time: 2015-07-07 18:41:20

Tags: python arrays mapreduce apache-spark

I have a file like this:

  

0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

I want the first item of each line to be the key and the remaining items to be the value, as an array. My code does not work:

mRDD = rRDD.map(lambda line: (line[0], (np.array(int(line))))).collect()

The output I want:

(3, (1,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))

(4, (1,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))

My latest attempt:

import os.path
import numpy as np
baseDir = os.path.join('data')
inputPath = os.path.join('mydata', 'matriz_reglas_test.csv')
fileName = os.path.join(baseDir, inputPath)

reglasRDD = (sc.textFile(fileName, 8)
               .cache()
            )
regRDD = reglasRDD.map(lambda line: line.split('\n'))
print regRDD.take(5)

movRDD = regRDD.map(lambda line: (line[0], (int(x) for x in line[1:] if x))).collect()
print movRDD.take(5)

Error:

PicklingError: Can't pickle <type 'generator'>: attribute lookup __builtin__.generator failed

Any help is appreciated.

3 Answers:

Answer 0: (score: 1)

I finally found a solution:

    import os.path
    import re  # needed by tokenize() below
    import numpy as np
    baseDir = os.path.join('data')
    inputPath = os.path.join('mydata', 'matriz_reglas_test.csv')
    fileName = os.path.join(baseDir, inputPath)    
    split_regex = r'\W+'

    def tokenize(string):
        """ An implementation of input string tokenization
        Args:
            string (str): input string
        Returns:
            list: a list of tokens
        """
        s = re.split(split_regex, string)
        return [int(word) for word in s if word]


    reglasRDD = (sc.textFile(fileName, 8)
                   .map(tokenize)
                   .cache()
                )

    movRDD = reglasRDD.map(lambda line: (line[0], (line[1:])))
    print movRDD.take(5)

Output:

  

[(0,[1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]),
 (1,[1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]),
 (2,[1,1,0,0,0,0,0,0,0]),
 (3,[1,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]),
 (4,[1,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0])]

Thanks!
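
The PicklingError in the question most likely comes from the lambda returning a generator expression, which cannot be pickled when Spark serializes the results; building a concrete list or tuple avoids that. The sketch below reproduces the same parse-then-key logic in plain Python 3 so it can be run without a cluster. The names `parse_line` and `to_key_value` and the sample lines are illustrative assumptions, not part of the original code; in PySpark the same two functions could presumably be passed to `map` on the RDD.

```python
import re

# Split on runs of non-word characters (commas, stray spaces),
# mirroring the r'\W+' pattern used in the answer above.
SPLIT_REGEX = re.compile(r'\W+')


def parse_line(line):
    """Turn one comma-separated text line into a list of ints."""
    return [int(token) for token in SPLIT_REGEX.split(line) if token]


def to_key_value(fields):
    """Use the first field as the key and the remaining fields as the value.

    The value is a tuple (a concrete, picklable container), not a generator.
    """
    return (fields[0], tuple(fields[1:]))


# Hypothetical sample lines standing in for the contents of the CSV file.
sample_lines = [
    "3,1,1,1,0,0,0,1,0",
    "4,1,1,1,0,0,0,1,0, 1",
]

pairs = [to_key_value(parse_line(line)) for line in sample_lines]
print(pairs)
```

If a NumPy array is wanted as the value, `np.array(fields[1:])` could replace the tuple, since NumPy arrays can be pickled.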

Answer 1: (score: 0)

I'm not sure about the rRDD.map().collect() part, but you can easily read the file with np.genfromtxt() and do the mapping with a dictionary comprehension.

import numpy as np

data_array = np.genfromtxt('data.csv', delimiter=',')
data_dict = {first:rest for first, *rest in data_array}

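The for loop iterates over the rows of the array (one row per line of the file). Unpacking assigns the first element to first and the rest of the row to rest. Note that this extended unpacking syntax is new in Python 3! If you are using Python 2, you can change the dict comprehension slightly: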

data_dict = {row[0]:row[1:] for row in data_array}
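
To make the difference between the two spellings concrete, here is a small sketch that uses plain Python 3 lists rather than a NumPy array; the `rows` data is made up for illustration. Both comprehensions should build the same mapping here, with the one difference that extended unpacking always produces a list for `rest`.

```python
# Hypothetical rows: the first item is the key, the remaining items the value.
rows = [
    [3, 1, 1, 1, 0],
    [4, 1, 1, 0, 1],
]

# Python 3 only: extended iterable unpacking in the loop target.
unpacked = {first: rest for first, *rest in rows}

# Works on Python 2 and Python 3: plain indexing and slicing.
sliced = {row[0]: row[1:] for row in rows}

print(unpacked)
print(sliced)
```

With an array read by np.genfromtxt() and no dtype argument, the values are floats, so the keys would come out as 0.0, 1.0 and so on; passing dtype=int is one way to keep integers.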

Answer 2: (score: 0)

The following (unoptimized) code may put you on the right track:

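with open("tmp.txt", "r") as f:
    for line in f:
        # Split on commas first so that multi-digit keys are read whole
        fields = [int(x) for x in line.strip().split(",") if x.strip()]
        first = fields[0]         # key: the first item on the line
        rest = tuple(fields[1:])  # value: the remaining items
        tup = (first, rest)
        print(tup)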