I have a CSV file with timestamps and headers, and I want to be able to search it by a specific timestamp, or by row, in pySpark.
textfile = sc.textFile("data.csv")
header = textfile.first()
(textfile
    .filter(lambda line: line != header)
    .map(lambda line: (line.split(',')[1], line.split(',')[2]))
    .distinct()
    .max())
I tried using Spark SQL, but I couldn't figure it out.
Sample input:
Time [-] B1-1 EW AC [m/s^2] B1-2 NS AC [m/s^2] B2-1 EW AC [m/s^2] B2-2 NS AC [m/s^2] B3-1 EW AC [m/s^2] B3-2 NS AC [m/s^2] B4-1 EW AC [m/s^2] B4-2 NS AC [m/s^2] B5-1 EW AC [m/s^2] B5-2 NS AC [m/s^2] B6-1 EW AC [m/s^2]
15:14.1 0.07521612 -0.019558864 -0.004072318 0.057055011 0.033445455 0.10515116 -0.005318701 -0.10593631 -0.06616208 0.067418374 0.007425771
15:14.1 0.012684621 -0.025686748 -0.029669747 -0.015677277 -0.06540639 0.043687206 0.056057423 -0.005557867 -0.026925504 0.1059664 0.031872407
15:14.1 -0.054526106 0.016956611 0.001579062 0.044119116 -0.078679785 -0.1983114 0.096496433 0.02442093 0.020333124 0.025292056 0.022027005
15:14.1 -0.0030546 0.05305237 -0.023935258 0.002741382 0.073090985 -0.16384798 -0.009033349 0.17119914 0.003653608 -0.13548735 0.020024549
15:14.1 -0.034533042 0.077983625 0.018616311 -0.006082441 0.055625994 -0.002599431 -0.084086135 0.021557786 -0.008736889 -0.077502668 -0.076927647
15:14.1 0.056924593 0.037019137 0.044213742 -0.051229578 0.027507361 0.15999076 -0.015196289 -0.1391993 0.06187306 -0.057252757 -0.045555849
15:14.2 0.043737678 0.030471534 -0.038146816 0.024072761 0.003667648 0.27830678 0.040861133 0.010863103 -0.021127386 0.061481655 0.028952161
15:14.2 -0.008159212 -0.050701946 -0.060087472 0.014820596 -0.015980465 -0.034882683 0.09480796 -0.088252187 -0.022715911 0.053105187 0.067666292
15:14.2 -0.046869188 -0.073618554 0.038146816 0.00522576 -0.080775581 -0.13810523 0.05647954 -0.070147015 -0.030420261 0.066605121 0.034709219
15:14.2 -0.043891497 -0.070764467 0.006898009 0.020303361 -0.007422621 -0.049221478 -0.010299707 0.02526303 -0.030102555 -0.1053158 0.019607371
15:14.2 0.030550764 -0.040460825 -0.049532689 -0.031611562 0.068462759 0.030606201 -0.039510351 -0.063578628 0.040110264 -0.049770862 -0.029285904
15:14.2 0.028849226 0.063713208 0.042967115 -0.011136864 -0.015543842 0.038823754 -0.028788526 -0.047915548 0.11072022 0.066605121 -0.047224563
15:14.2 0.062029205 0.096451215 0.051527292 0.042834092 0.007859246 -0.027922917 -0.010721826 -0.049599752 -0.000555984 0.002683723 -0.055734996
15:14.2 -0.003905369 -0.016620837 -0.053605005 0.035295293 -0.012574793 -0.22321562 0.03503589 -0.035620872 -0.087845452 0.033668526 0.075425804
15:14.2 -0.016241515 -0.095359951 -0.080365956 0.045832481 0.00829587 -0.04678975 0.087463088 -0.019536743 -0.032405917 0.10035498 0.10804913
15:14.2 -0.058354565 0.030471534 0.019447397 -0.053799622 -0.050910447 0.18087006 0.098944724 -0.026105132 0.035106409 -0.10767422 0.021693261
15:14.2 0.005027703 0.008730136 0.060835447 0.021074373 0.017726965 -0.015261174 -0.022203466 0.00884206 -0.047496907 -0.010816217 -0.041884683
15:14.2 0.05862613 0.058760535 0.004072318 0.006853455 0.05606262 -0.13558966 -0.07539048 0.080336437 0.005639265 -0.006831295 -0.061825797
The expected output is just the maximum value of a single row.
It keeps telling me that sqlContext.createDataFrame() cannot accept data in unicode.
I'm new to all of this, so I would really appreciate any help.
Thanks
Answer 0 (score: 1)
Using numpy:
import re
import numpy as np

(textfile
    .filter(lambda line: line != header)              # drop the header row
    .map(lambda line: np.fromstring(                  # parse the numeric fields
        re.split(r"\s+", line, maxsplit=1)[1],        # strip off the timestamp column
        sep="\t").max()))                             # per-row maximum
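Since the question also asks to search by a specific timestamp, the same pipeline can keep the first field as a key. This is a minimal sketch of my own on top of the answer above: the tab separator, the name row_max, and the lookup call are assumptions, not part of the original answer.

# Sketch: key each row's maximum by its timestamp, then look one up.
# Assumes `textfile`, `header`, `re`, and `np` as defined above, and
# tab-separated fields (an assumption based on the sample input).
row_max = (textfile
    .filter(lambda line: line != header)
    .map(lambda line: re.split(r"\s+", line, maxsplit=1))
    .map(lambda parts: (parts[0], np.fromstring(parts[1], sep="\t").max())))

row_max.lookup("15:14.1")   # per-row maxima for every row stamped 15:14.1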
Using plain Python:
(textfile
    .filter(lambda line: line != header)                         # drop the header row
    .map(lambda line: max(float(x) for x in line.split()[1:])))  # skip the timestamp, take the row max
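If you would rather stay with Spark SQL (the createDataFrame() error in the question usually means raw unicode lines were passed in instead of rows), here is a minimal DataFrame sketch, assuming Spark 2.x with a SparkSession named spark and a genuinely tab-delimited file; none of this is from the original answer.

from pyspark.sql import functions as F

# Sketch: let the CSV reader parse the header row and infer numeric types,
# then take greatest() across all value columns.
df = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("delimiter", "\t")
    .csv("data.csv"))

time_col, value_cols = df.columns[0], df.columns[1:]
row_max = df.select(df[time_col],
                    F.greatest(*[df[c] for c in value_cols]).alias("row_max"))

row_max.filter(df[time_col] == "15:14.1").show()   # search by timestamp

Bracket indexing (df[c]) is used instead of bare column names because the headers in the sample contain spaces and brackets, which plain string references may not resolve cleanly.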