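I want to use map-reduce to do matrix multiplication in Python with Hadoop. The goal is to compute A * B. The output should be in a format similar to the input.
The input consists of the two matrices A and B in the following format: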
A,0,0,0.0
A,0,1,1.0
...
A,1,3,8.0
A,1,4,9.0
B,0,0,0.0
B,0,1,1.0
...
B,4,0,12.0
B,4,1,13.0
A,0,0,0.0 means that the entry of A at index (0,0) has the value 0.0; the same holds for B.
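To make the format concrete, here is a minimal sketch of parsing one input line into typed fields. The helper name `parse_entry` is hypothetical and not part of the original code.

```python
def parse_entry(line):
    """Parse 'A,1,3,8.0' into (matrix name, row, column, value)."""
    name, row, col, value = line.strip().split(",")
    return name, int(row), int(col), float(value)
```

For example, `parse_entry("A,1,3,8.0")` gives the matrix name `"A"`, row 1, column 3 and value 8.0.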
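Here is my map function:
import sys

# Assumption: the two command-line arguments (as in "map.py 2 3") are the
# number of rows of A and the number of columns of B.
num_rows_a = int(sys.argv[1])
num_cols_b = int(sys.argv[2])

# Input comes from STDIN, one matrix entry per line
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # Split line into array of entry data
    entry = line.split(",")
    # Set row, column, and value for this entry
    row = int(entry[1])
    col = int(entry[2])
    value = float(entry[3])
    # If this is an entry in matrix A...
    if entry[0] == "A":
        # A[row][col] contributes to C[row][k] for every column k of B
        for k in range(num_cols_b):
            print('{},{}\tA,{},{}'.format(row, k, col, value))
    # Otherwise, if this is an entry in matrix B...
    else:
        # B[row][col] contributes to C[i][col] for every row i of A
        for i in range(num_rows_a):
            print('{},{}\tB,{},{}'.format(i, col, row, value))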
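The key-value pairs a mapper needs to generate follow from the definition of the product. As a reminder, assuming A is m x n and B is n x p (the symbols m and p are introduced here for illustration; n is the shared dimension the reducer skeleton reads from its argument):

```latex
C_{ik} = \sum_{j=0}^{n-1} A_{ij}\, B_{jk}, \qquad 0 \le i < m,\quad 0 \le k < p
```

So an entry A_{ij} is needed by every output cell C_{ik} in row i (one per column k of B), and an entry B_{jk} is needed by every output cell C_{ik} in column k (one per row i of A). Keying each emitted pair by (i, k) and carrying j in the value lets the reducer match up the factors.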
I would like to know how to write the reduce function. This is the skeleton I am going to use:
import sys
import string
import numpy

#number of columns of A/rows of B
n = int(sys.argv[1])

#Create data structures to hold the current row/column values (if needed; your code goes here)
currentkey = None

# input comes from STDIN (stream data that goes to the program)
for line in sys.stdin:
    #Remove leading and trailing whitespace
    line = line.strip()
    #Get key/value
    key, value = line.split('\t',1)
    #Parse key/value input (your code goes here)
    #If we are still on the same key...
    if key==currentkey:
        #Process key/value pair (your code goes here)
        pass
    #Otherwise, if this is a new key...
    else:
        #If this is a new key and not the first key we've seen
        if currentkey:
            #compute/output result to STDOUT (your code goes here)
            pass
        currentkey = key
        #Process input for new key (your code goes here)
#Compute/output result for the last key (your code goes here)
To run the two functions, I test them on a small test dataset with the following command:
cat smalltest.txt | python src/map.py 2 3 | sort -n | python src/reduce.py 5
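The mapper produces output, which sort -n then sorts by key, so that the reducer can do the matrix computation. My confusion is about how to write the reducer function.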
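One possible shape for the reducer, as a sketch rather than a definitive implementation. It assumes the mapper emits lines of the form `i,k<TAB>M,j,v` (output cell as key; matrix name, shared index j and value as payload) and that equal keys arrive on consecutive lines after sorting. The function names and the `C` label on the output are hypothetical choices, and `itertools.groupby` stands in for the manual `currentkey` bookkeeping of the skeleton.

```python
from itertools import groupby


def parse(line):
    """Split 'i,k<TAB>M,j,v' into (key, matrix name, j, value)."""
    key, payload = line.rstrip("\n").split("\t", 1)
    matrix, j, value = payload.split(",")
    return key, matrix, int(j), float(value)


def reduce_lines(lines):
    """Yield one 'C,i,k,value' line per run of consecutive equal keys."""
    records = (parse(line) for line in lines if line.strip())
    for key, group in groupby(records, key=lambda rec: rec[0]):
        a_row, b_col = {}, {}  # j -> A[i][j] and j -> B[j][k]
        for _, matrix, j, value in group:
            (a_row if matrix == "A" else b_col)[j] = value
        # Inner product over the shared index j; a missing B entry counts as 0.
        total = sum((v * b_col.get(j, 0.0) for j, v in a_row.items()), 0.0)
        yield "C,{},{}".format(key, total)


# In reduce.py this could be driven from STDIN with:
#     import sys
#     for out in reduce_lines(sys.stdin):
#         print(out)
```

Because the factors are matched through the index j carried in each value, this version does not need the dimension n that the skeleton reads from `sys.argv[1]`. It only relies on all lines for one key being adjacent, not on any particular order within a key.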
Answer 0 (score: 0)
I'm not sure why you need reduce here. My numpy approach (with some string/list/zip gymnastics):
import numpy as np
strin = '''A,0,0,0.0
A,0,1,1.0
A,1,0,8.0
A,1,1,9.0
B,0,0,0.0
B,0,1,1.0
B,1,0,12.0
B,1,1,13.0'''.split()
lines = [*map(lambda x: x.split(","),strin)]
linesT = [*zip(*lines)]
linesT
[('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
('0', '0', '1', '1', '0', '0', '1', '1'),
('0', '1', '0', '1', '0', '1', '0', '1'),
('0.0', '1.0', '8.0', '9.0', '0.0', '1.0', '12.0', '13.0')]
Now we can get the dims and the data for arrays A and B:
lastA = linesT[0].index("B") - 1
rowsA, colsA = int(linesT[1][lastA]) + 1, int(linesT[2][lastA]) + 1
datA = [*map(float, linesT[3][0:lastA + 1])]
A = np.array(datA).reshape((rowsA, colsA))
A
Out[50]:
array([[ 0., 1.],
       [ 8., 9.]])
firstB = lastA + 1
rowsB, colsB = int(linesT[1][-1]) + 1, int(linesT[2][-1]) + 1
datB = [*map(float, linesT[3][firstB::])]
B = np.array(datB).reshape((rowsB, colsB))
B
Out[51]:
array([[ 0., 1.],
       [ 12., 13.]])
A @ B
Out[52]:
array([[ 12., 13.],
       [ 108., 125.]])
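The reshape above relies on the entries being listed in row-major order with all of A before B. A variant that does not depend on line order could place each value by its own indices. This is a sketch; the function name `build_matrices` is hypothetical, and it assumes every line is labelled either A or B and that both matrices have at least one entry.

```python
import numpy as np


def build_matrices(lines):
    """Build dense arrays A and B from 'name,row,col,value' lines in any order."""
    entries = {"A": [], "B": []}
    for line in lines:
        name, row, col, value = line.strip().split(",")
        entries[name].append((int(row), int(col), float(value)))
    matrices = {}
    for name, triples in entries.items():
        rows, cols, values = zip(*triples)
        # Shape is inferred from the largest index seen; unlisted cells stay 0.
        dense = np.zeros((max(rows) + 1, max(cols) + 1))
        dense[list(rows), list(cols)] = values
        matrices[name] = dense
    return matrices["A"], matrices["B"]
```

With the `strin` list from above, `A, B = build_matrices(strin)` followed by `A @ B` should give the same product.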
Answer 1 (score: 0)
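Well, I'll be straightforward:
lines = [*map(lambda x: x.split(","),strin)]
is the reduced approach. If the lambda function itself is not even part of the input, syntax included, it is as if the string were not there. To be honest, you should be grateful for the reduction: this code (not to be too harsh) is messy, so I don't understand why you are complaining about automatic reduction.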