I have a file like this:
0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
I want the first item of each line to be the key and the remaining items to be the value, as an array. My code does not work:
mRDD = rRDD.map(lambda line: (line[0], (np.array(int(line))))).collect()
The output I want:
(3, (1,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
(4, (1,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
My latest attempt:
import os.path
import numpy as np
baseDir = os.path.join('data')
inputPath = os.path.join('mydata', 'matriz_reglas_test.csv')
fileName = os.path.join(baseDir, inputPath)
reglasRDD = (sc.textFile(fileName, 8)
.cache()
)
regRDD = reglasRDD.map(lambda line: line.split('\n'))
print regRDD.take(5)
movRDD = regRDD.map(lambda line: (line[0], (int(x) for x in line[1:] if x))).collect()
print movRDD.take(5)
Error:
PicklingError: Can't pickle <type 'generator'>: attribute lookup __builtin__.generator failed
Any help is appreciated.
Answer 0 (score: 1)
I finally have a solution:
import os.path
import re
import numpy as np

baseDir = os.path.join('data')
inputPath = os.path.join('mydata', 'matriz_reglas_test.csv')
fileName = os.path.join(baseDir, inputPath)

split_regex = r'\W+'

def tokenize(string):
    """ An implementation of input string tokenization
    Args:
        string (str): input string
    Returns:
        list: a list of integer tokens
    """
    s = re.split(split_regex, string)
    # Drop the empty strings that re.split can produce at the ends of a line
    return [int(word) for word in s if word]

# sc is the SparkContext provided by the PySpark shell or notebook
reglasRDD = (sc.textFile(fileName, 8)
             .map(tokenize)
             .cache()
             )
# The first token becomes the key, the remaining tokens the value
movRDD = reglasRDD.map(lambda line: (line[0], line[1:]))
print(movRDD.take(5))
Output:
[(0,[1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]),(1,[1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]),(2,[1,1,0,0,0,0,0,0,0]),(3,[1,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]),(4,[1,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0])]
Thanks!
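For anyone hitting the same PicklingError: it most likely comes from the lambda in the question returning a generator expression, since PySpark pickles records when it ships them back to the driver and generators cannot be pickled. The sketch below reproduces that with the standard pickle module only, without Spark. The helper to_key_value and the shortened sample_line are made up for illustration and are not part of the original code.

```python
import pickle
import re

def tokenize(line):
    """Split a line on non-word characters and convert the tokens to ints."""
    return [int(tok) for tok in re.split(r'\W+', line) if tok]

def to_key_value(tokens):
    """Hypothetical helper: first token is the key, the rest is a list value."""
    return (tokens[0], tokens[1:])

sample_line = "3,1,1,1,0,0,0,1"  # hypothetical shortened row
pair = to_key_value(tokenize(sample_line))

# A tuple holding a list round-trips through pickle without trouble.
restored = pickle.loads(pickle.dumps(pair))

# A tuple holding a generator expression does not: pickle refuses it.
try:
    pickle.dumps((3, (int(x) for x in ["1", "1"])))
    generator_pickled = True
except (TypeError, pickle.PicklingError):
    generator_pickled = False
```

So building the value with a list comprehension (or slicing a list, as in the solution above) instead of a generator expression should avoid the error.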
Answer 1 (score: 0)
I'm not sure about the rRDD.map().collect() part, but you can easily read the file with np.genfromtxt() and do the mapping with a dictionary comprehension.
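import numpy as np
# dtype=int keeps keys and values as integers instead of the default floats
data_array = np.genfromtxt('data.csv', delimiter=',', dtype=int)
data_dict = {first: rest for first, *rest in data_array}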
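The for loop iterates over the rows of the array (one row per line of the file). Unpacking assigns the first element to first and the rest of the row to rest. Note that this starred unpacking is new in Python 3! If you use Python 2, you can change the dict comprehension slightly: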
data_dict = {row[0]:row[1:] for row in data_array}
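To see that the two comprehensions build the same mapping, here is a small sketch on plain Python lists, so it runs without NumPy or a data file. The rows list is a made-up sample, not the real data.

```python
# Hypothetical sample rows standing in for the parsed file
rows = [[3, 1, 1, 0], [4, 1, 0, 1]]

# Python 3 only: starred unpacking collects the remaining items into a list
data_dict = {first: rest for first, *rest in rows}

# Works in Python 2 and 3: plain indexing and slicing
data_dict_sliced = {row[0]: row[1:] for row in rows}

print(data_dict == data_dict_sliced)
```

One difference to keep in mind with a NumPy array: starred unpacking gives rest as a Python list, while row[1:] stays a NumPy array.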
Answer 2 (score: 0)
The following (unoptimized) code may put you on the right track:
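with open("tmp.txt", "r") as f:
    for line in f:
        # Split on commas and drop empty or whitespace-only fields
        fields = [x for x in line.strip().split(",") if x.strip()]
        if not fields:
            continue  # skip blank lines
        # Taking the whole first field also works for multi-digit keys
        first = int(fields[0])
        rest = tuple(int(x) for x in fields[1:])
        tup = (first, rest)
        print(tup)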
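A possible variation on the same idea uses the standard csv module to do the splitting. This is only a sketch: the in-memory sample string stands in for the real file and its rows are shortened, made-up data.

```python
import csv
import io

# Hypothetical in-memory sample; with a real file, pass the open file object instead
sample = io.StringIO("3,1,1,1,0\n4,1,1,0,1\n")

pairs = []
for fields in csv.reader(sample):
    # Skip empty fields, then convert everything to int
    values = [int(x) for x in fields if x.strip()]
    if values:
        pairs.append((values[0], tuple(values[1:])))

print(pairs)
```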