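So I'm currently working on a project that involves a combined dataset of stock data from two periods of the year.
I need to carry out a MapReduce analysis that gives at least 3 interesting insights into the merged dataset. The data is stored as CSV (UTF-8) files.
The way I was taught in a module at university is to run a mapper and then a reducer, both written in Python.
I tried running the same code I used for another project, but to no avail. I was wondering whether anyone could help fix this, or whether there are alternative ways to do this in Python.
I've left the code for the mapper and the reducer below, along with the headers in the CSV file and whether each column is a char, integer or double.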
Mapper code
#!/usr/bin/env python
import sys

# Mapper to return local top 10 records by Opening Stock Value
# Data source
# Microsoft End Of Year Stock: https://www.nasdaq.com/symbol/msft/historical
# Microsoft Start of Year: https://finance.yahoo.com/quote/MSFT/history?period1=1514764800&period2=1522537200&interval=1d&filter=history&frequency=1d
# Data header: Date(char) High(double) Low(double) Open(double) Close(double) Date(char) High(double) Low(double) Open(double) Close(double)
# Initialise a list to store the top N records as a collection of opening stock values (open, record)
myList = []
n = 10  # Number of top N records

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split data values into list
    data = line.split("\t")
    # convert Open (read in as a string) to a float
    try:
        Open = float(data[4])
    except ValueError:
        # ignore/discard this line
        continue
    # add (Open, record) tuple to list
    myList.append((Open, line))
    # sort list in reverse order
    myList.sort(reverse=True)
    # keep only first N records
    if len(myList) > n:
        myList = myList[:n]

# Print top N records
for (k, v) in myList:
    print(v)
Reducer code
#!/usr/bin/env python
import sys

# Reducer to return overall top N records
# Data source
# Microsoft End Of Year Stock: https://www.nasdaq.com/symbol/msft/historical
# Microsoft Start of Year: https://finance.yahoo.com/quote/MSFT/history?period1=1514764800&period2=1522537200&interval=1d&filter=history&frequency=1d
# Data header: Date(char) High(double) Low(double) Open(double) Close(double) Date(char) High(double) Low(double) Open(double) Close(double)
# Initialise a list to store the top N records as a collection of opening stock values (open, record)
myList = []
n = 10  # Number of top N records

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split data values into list
    data = line.split("\t")
    # convert Open (read in as a string) to a float
    try:
        Open = float(data[4])
    except ValueError:
        # ignore/discard this line
        continue
    # add (Open, record) tuple to list
    myList.append((Open, line))
    # sort list in reverse order
    myList.sort(reverse=True)
    # keep only first N records
    if len(myList) > n:
        myList = myList[:n]

# Print top N records
for (k, v) in myList:
    print(v)
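
For comparison, here is a minimal sketch of the same top-N logic for comma-separated input. Both scripts above split each line on tabs, while the data is described as CSV. The sketch assumes the fields are comma-delimited and that Open is the fourth field (index 3), going by the header comment; the scripts above read index 4. The names `top_n_by_open`, `OPEN_COLUMN` and `TOP_N` are hypothetical and only used for illustration.

```python
import csv
import heapq

OPEN_COLUMN = 3  # assumption: Date, High, Low, Open, Close, ... so Open is index 3
TOP_N = 10       # number of records to keep


def top_n_by_open(lines, n=TOP_N, column=OPEN_COLUMN):
    """Return the n raw lines with the largest numeric value in `column`."""
    candidates = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Parse a single CSV line; csv.reader also copes with quoted fields
        fields = next(csv.reader([line]))
        try:
            value = float(fields[column])
        except (ValueError, IndexError):
            # Header row, short row, or non-numeric value: skip it
            continue
        candidates.append((value, line))
    # nlargest returns the pairs sorted from largest to smallest value
    return [line for _, line in heapq.nlargest(n, candidates)]
```

Since the mapper and the reducer apply the same logic, each script could import `sys` and reduce its loop to `for record in top_n_by_open(sys.stdin): print(record)`. Catching `IndexError` alongside `ValueError` means a row with too few fields is skipped instead of stopping the script.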