How to find the number of events per minute within a time range using PySpark and Scala

Time: 2018-04-30 00:17:59

Tags: python scala pyspark

The match lasts 130 minutes. How can I use the tweet IDs to count the number of tweets in each minute?

Expected result:

minutes  count of tweets
1        100
2        34
3        56
4        234
5        2310
6        345
7        56
8        55
9        12
10       245

1 answer:

Answer 0 (score: 0)

Assuming the tweet IDs are unique, and using PySpark with a raw RDD:

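from pyspark import SparkContext

# Reuse the running context (e.g. in the pyspark shell) or create one.
sc = SparkContext.getOrCreate()

# Each tuple is assumed to be (tweet_id, tweet_minute, start_minute, end_minute).
rdd = sc.parallelize([(1001, 145678, 145600, 145730),
                      (1002, 145678, 145600, 145730),
                      (1005, 145680, 145600, 145730),
                      (12278, 145687, 145600, 145730),
                      (765558, 145688, 145600, 145730),
                      (724323, 145689, 145600, 145730),
                      (875857, 145688, 145600, 145730),
                      (79375, 145685, 145600, 145730),
                      (84666, 145686, 145600, 145730),
                      (335556, 145687, 145600, 145730),
                      (29990, 145688, 145600, 145730),
                      (56, 145689, 145600, 145730),
                      (968867, 145690, 145600, 145730),
                      (8452, 145691, 145600, 145730),
                      (1334, 145679, 145600, 145730)])

# Keep tweets inside the window, key each one by its minute offset from the
# start, then count the records per key. The value 0 is only a placeholder.
result_dict = rdd.filter(lambda x: x[2] <= x[1] <= x[3]) \
    .map(lambda x: (x[1] - x[2], 0)) \
    .countByKey()

print("minutes count of tweets")
for minute, count in sorted(result_dict.items()):
    print("{0}\t{1}".format(minute, count))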

Result:

minutes count of tweets
78  2
79  1
80  1
85  1
86  1
87  2
88  3
89  2
90  1
91  1
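
The result above lists only the minutes that contain at least one tweet, whereas the expected result in the question has a row for every minute. As a rough sketch of the same filter, key-by-offset and count logic in plain Python (no Spark), the following also fills the empty minutes with 0. The function name tweets_per_minute and the tuple layout (tweet_id, tweet_minute, start_minute, end_minute) are assumptions made for illustration, as is the idea that all rows share one window.

```python
from collections import Counter

# Same sample rows as in the answer. Assumed layout:
# (tweet_id, tweet_minute, start_minute, end_minute)
rows = [
    (1001, 145678, 145600, 145730), (1002, 145678, 145600, 145730),
    (1005, 145680, 145600, 145730), (12278, 145687, 145600, 145730),
    (765558, 145688, 145600, 145730), (724323, 145689, 145600, 145730),
    (875857, 145688, 145600, 145730), (79375, 145685, 145600, 145730),
    (84666, 145686, 145600, 145730), (335556, 145687, 145600, 145730),
    (29990, 145688, 145600, 145730), (56, 145689, 145600, 145730),
    (968867, 145690, 145600, 145730), (8452, 145691, 145600, 145730),
    (1334, 145679, 145600, 145730),
]


def tweets_per_minute(rows):
    """Count tweets per minute offset, including minutes with no tweets."""
    if not rows:
        return {}
    # Count only tweets that fall inside their own [start, end] window.
    counts = Counter(
        tweet - start
        for _tweet_id, tweet, start, end in rows
        if start <= tweet <= end
    )
    # Assumption: every row shares the window of the first row.
    start, end = rows[0][2], rows[0][3]
    return {minute: counts.get(minute, 0) for minute in range(end - start + 1)}


per_minute = tweets_per_minute(rows)

print("minutes count of tweets")
for minute, count in sorted(per_minute.items()):
    print("{0}\t{1}".format(minute, count))
```

In Spark itself the same zero-filling could be done on the driver after countByKey(), since the returned dictionary is small (one entry per minute).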