Top 10 records in Python using a Mapper and Reducer

Time: 2018-12-18 22:03:24

Tags: python mapreduce data-analysis mapper

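I'm currently working on a project that involves a combined stock dataset covering two periods of one year.

I need to perform a MapReduce analysis that provides at least 3 interesting insights about the merged dataset. The data is stored as a CSV (UTF-8) file.

The approach I was taught in a university module is to run a mapper written in Python and then run a reducer.

I tried running the same code I used for another project, but to no avail. Can anyone help me fix this, or is there an alternative way to do this in Python?

I've included the mapper and reducer code below, along with the header of the CSV file and the type of each column (char, integer or double).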
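To make the column layout concrete, here is a minimal sketch of reading the Open value from one row. It assumes the rows are comma-delimited and follow the order given in the data header (Date, High, Low, Open, Close), so Open sits at index 3. The function name `parse_open` and the sample row are hypothetical and are not taken from the actual dataset.

```python
import csv
import io

OPEN_INDEX = 3  # assumption: Date=0, High=1, Low=2, Open=3, Close=4


def parse_open(line, index=OPEN_INDEX):
    """Return the Open value of one CSV line as a float, or None if unusable."""
    try:
        # csv.reader handles quoted fields that a plain split(",") would break
        row = next(csv.reader(io.StringIO(line)))
        return float(row[index])
    except (StopIteration, IndexError, ValueError):
        # empty line, too few columns, or a non-numeric value such as the header
        return None


sample = "2018-03-29,92.29,88.40,90.18,91.27"  # made-up row for illustration
print(parse_open(sample))
print(parse_open("Date,High,Low,Open,Close"))
```

Returning `None` for unusable lines lets the caller skip the header row and malformed records without a separate check.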

Mapper code

#!/usr/bin/env python
import sys

# Mapper to return local top 10 records by Opening Stock Value
# Data source
# Microsoft End Of Year Stock: https://www.nasdaq.com/symbol/msft/historical 
# Microsoft Start of Year: https://finance.yahoo.com/quote/MSFT/history?period1=1514764800&period2=1522537200&interval=1d&filter=history&frequency=1d

# Data header: Date(char) High(double) Low(double) Open(double) Close(double) Date(char) High(double) Low(double) Open(double) Close(double)

# Initialise a list to store the top N records as a collection of opening stock values (open, record)


myList = []
n = 10  # Number of top N records

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split data values into list (the data is stored as CSV, so split on commas)
    data = line.split(",")

    # convert Open (column index 3 per the data header) from string to float
    try:
        Open = float(data[3])
    except (ValueError, IndexError):
        # ignore/discard this line (header, non-numeric value, or too few columns)
        continue

    # add (Open, record) tuple to list
    myList.append( (Open, line) )
    # sort list in reverse order
    myList.sort(reverse=True)

    # keep only first N records
    if len(myList) > n:
        myList = myList[:n]

# Print top N records
for (k,v) in myList:
    print(v)
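Re-sorting the whole list after every input line works, but it does more work than necessary. As an alternative, here is a minimal sketch that uses `heapq.nlargest` from the standard library to keep only the top N records. It assumes comma-delimited rows with Open at index 3, as in the data header. The function name `top_n_lines` and the sample rows are hypothetical.

```python
import heapq


def top_n_lines(lines, n=10, index=3, sep=","):
    """Return the n lines with the largest float value in column `index`."""

    def keyed(source):
        for line in source:
            line = line.strip()
            try:
                # yield (value, record) so nlargest orders by the value first
                yield float(line.split(sep)[index]), line
            except (ValueError, IndexError):
                # skip the header and malformed or short lines
                continue

    return [line for _, line in heapq.nlargest(n, keyed(lines))]


# made-up rows for illustration: Date,High,Low,Open,Close
rows = [
    "2018-01-02,0,0,86.1,0",
    "Date,High,Low,Open,Close",
    "2018-01-03,0,0,88.4,0",
    "2018-01-04,0,0,87.0,0",
]
for record in top_n_lines(rows, n=2):
    print(record)
```

In a streaming script the same function could be fed `sys.stdin` in place of `rows`, with each returned line printed to standard output.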

Reducer code

#!/usr/bin/env python
import sys

# Reducer to return overall top N records
# Data source
# Microsoft End Of Year Stock: https://www.nasdaq.com/symbol/msft/historical 
# Microsoft Start of Year: https://finance.yahoo.com/quote/MSFT/history?period1=1514764800&period2=1522537200&interval=1d&filter=history&frequency=1d

# Data header: Date(char) High(double) Low(double) Open(double) Close(double) Date(char) High(double) Low(double) Open(double) Close(double)

# Initialise a list to store the top N records as a collection of opening stock values (open, record)


myList = []
n = 10  # Number of top N records

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split data values into list (the data is stored as CSV, so split on commas)
    data = line.split(",")

    # convert Open (column index 3 per the data header) from string to float
    try:
        Open = float(data[3])
    except (ValueError, IndexError):
        # ignore/discard this line (header, non-numeric value, or too few columns)
        continue

    # add (Open, record) tuple to list
    myList.append( (Open, line) )
    # sort list in reverse order
    myList.sort(reverse=True)

    # keep only first N records
    if len(myList) > n:
        myList = myList[:n]

# Print top N records
for (k,v) in myList:
    print(v)
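One way to check the mapper and reducer logic without a cluster is to simulate the two stages in plain Python: split the input into chunks, take each chunk's local top N (the map step), then take the top N of the combined results (the reduce step). The sketch below assumes comma-delimited rows with Open at index 3. The helper names `open_value`, `top_n` and `simulate`, and the generated rows, are hypothetical.

```python
import heapq


def open_value(line, index=3, sep=","):
    """Parse the Open column of one line, or return None if it is unusable."""
    try:
        return float(line.strip().split(sep)[index])
    except (ValueError, IndexError):
        return None


def top_n(lines, n):
    """Return the n lines with the largest Open value, highest first."""
    scored = ((open_value(line), line) for line in lines)
    valid = ((value, line) for value, line in scored if value is not None)
    return [line for _, line in heapq.nlargest(n, valid)]


def simulate(lines, n=10, num_splits=3):
    """Run a local top N per split (map), then a global top N (reduce)."""
    splits = [lines[i::num_splits] for i in range(num_splits)]
    mapped = [line for split in splits for line in top_n(split, n)]
    return top_n(mapped, n)


# made-up rows for illustration: Date,High,Low,Open,Close
rows = [f"2018-01-{day:02d},0,0,{80 + day}.0,0" for day in range(1, 29)]
result = simulate(rows, n=10)

# the two-stage result should match a single pass over all rows
assert result == top_n(rows, 10)
print(result[0])
```

The two-stage result matches the single pass because any record in the global top N must also be in the top N of whichever split contains it.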

0 answers:

No answers