How can I identify hot and cold days using MapReduce?

Time: 2018-03-26 11:24:17

Tags: python python-3.x hadoop mapreduce

My data looks like this:

20130101  12.8   9.6
20130102  10.1   3.8
20130103   7.0  -2.2
20130104  11.8  -3.7
20130105   8.6  -1.1
20130106  10.5   1.9
20130107  13.4  -0.1
20130108  16.2   1.4
20130109  17.8  12.4
20130110  20.0  16.2
20130111  15.4  5.0

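I want to identify the dates on which the maximum temperature is above 40 (a hot day) or the minimum temperature is below 10 (a cold day). To do this, I run the following code: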

current_date = None
current_temp = None
for line in data.strip().split('\n'):
    Mapper_data = ["%s\o%s\o%s" % (line.split('  ')[0], line.split('  ')[1], line.split('  ')[2])]
    for line in Mapper_data:
        line = line.strip()
        date, max_temp, min_temp = line.rsplit('\o', 2)
        try:
            max_temp = float(max_temp)
            min_temp = float(min_temp)
        except ValueError:
            continue
        if current_date == date:
            if max_temp > 40:
                current_temp = 'Hot day'
            if min_temp < 10:
                current_temp = 'Cold day'
        else:
            if current_date:
                print('%s\t%s' % (current_date, current_temp))
            if max_temp > 40:
                current_temp = 'Hot day'
            if min_temp < 10:
                current_temp = 'Cold day'
            current_date = date
if current_date == date:
    print('%s\t%s' % (current_date, current_temp))

I get the following result:

20130101    Cold day
20130102    Cold day
20130103    Cold day
20130104    Cold day
20130105    Cold day
20130106    Cold day
20130107    Cold day
20130108    Cold day
20130109    Cold day
20130110    Cold day
20130111    Cold day

But the result I need is:

20130101    Cold day
20130102    Cold day
20130103    Cold day
20130104    Cold day
20130105    Cold day
20130106    Cold day
20130107    Cold day
20130108    Cold day
20130111    Cold day

because 20130109 and 20130110 are neither cold nor hot.

If you have any idea how I can change my code to get that last result, please help.

1 answer:

Answer 0: (score: 0)

If you want a Hadoop-compatible Python script, it needs to read from STDIN.

Here is an example that runs locally:

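import sys

# Hadoop Streaming feeds input records to the script on standard input.
for line in sys.stdin:
    condition = None  # reset for every record
    try:
        # Expect three whitespace-separated fields: date, max temp, min temp.
        current_date, max_temp, min_temp = line.split()
        f_min_temp = float(min_temp)
        f_max_temp = float(max_temp)
    except ValueError:
        continue  # skip blank or malformed lines

    if f_max_temp > 40:
        condition = 'Hot day'
    if f_min_temp < 10:
        condition = 'Cold day'

    # Days that are neither hot nor cold produce no output.
    if condition:
        print('%s\t%s' % (current_date, condition))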
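
The code in the question labels every date because current_temp is never cleared between records, so the 'Cold day' set on an earlier date carries over to 20130109 and 20130110. The sketch below illustrates the per-record reset on a few rows taken from the question's data, without needing STDIN or Hadoop. The function name classify_day and the variable names sample and results are hypothetical, introduced only for this illustration.

```python
def classify_day(line):
    """Return (date, label) for a hot or cold day, or None otherwise."""
    try:
        # Three whitespace-separated fields: date, max temp, min temp.
        date, max_temp, min_temp = line.split()
        max_temp = float(max_temp)
        min_temp = float(min_temp)
    except ValueError:
        return None  # blank or malformed line

    label = None  # starts empty for every record, so nothing carries over
    if max_temp > 40:
        label = 'Hot day'
    if min_temp < 10:
        label = 'Cold day'
    return (date, label) if label else None


# A few rows copied from the question's data.
sample = """\
20130108  16.2   1.4
20130109  17.8  12.4
20130110  20.0  16.2
20130111  15.4  5.0
"""

# Keep only the days that received a label.
results = [r for r in map(classify_day, sample.splitlines()) if r]
for date, label in results:
    print('%s\t%s' % (date, label))
```

With these four rows, only 20130108 and 20130111 are printed, both as 'Cold day'. Note that, as in the answer's script, a day that is both above 40 and below 10 would be reported as cold, because the second check overwrites the first.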

To run it in Hadoop, see Hadoop Streaming.