pyspark filter condition

Posted: 2019-06-25 05:56:08

Tags: apache-spark pyspark cloudera

I want to return the rows whose third column is greater than 7. Using `filter` in pyspark, performing the "<" comparison returns an empty array. When I change it to ">", it returns every row:

>>> s1 = sc.textFile("/user/ontoS/25.txt").map(lambda line: line.split('\t')).filter(lambda line:(int( float( line[2] <= 7))))
>>> s1.take(5)
[]

The following code returns all rows:

>>> s1 = sc.textFile("/user/ontoS/25.txt").map(lambda line: line.split('\t')).filter(lambda line:(float( line[2] >= 7)))
>>> s1.collect()
[[u'WEO Country Code', u'Country', u'importsGoodsVolumePercentChange'], 
[u'178', u'Ireland', u'-5.492'], [u'181', u'Malta', u'-1.949'], 
[u'146', u'Switzerland', u'-1.528'], [u'124', u'Belgium', u'-0.752'],
[u'137', u'Luxembourg', u'0.602'], [u'158', u'Japan', u'2.289'],
[u'436', u'Israel', u'2.401'], [u'122', u'Austria', u'3.1'],
[u'939', u'Estonia', u'3.562'], [u'156', u'Canada', u'3.876'], 
[u'144', u'Sweden', u'4.019'], [u'142', u'Norway', u'4.067'], 
[u'112', u'United Kingdom', u'4.079'], [u'936', u'Slovak Republic', u'4.141'], 
[u'111', u'United States', u'4.607'], [u'172', u'Finland', u'4.883'], 
[u'184', u'Spain', u'5.047'], [u'136', u'Italy', u'5.086'],
[u'128', u'Denmark', u'5.37'], [u'132', u'France', u'5.473'],
[u'138', u'Netherlands', u'5.72'], [u'935', u'Czech Republic', u'5.908'], 
[u'528', u'Taiwan Province of China', u'5.959'], [u'134', u'Germany', u'6.023'], 
[u'174', u'Greece', u'6.782'], [u'532', u'Hong Kong SAR', u'6.953'], 
[u'542', u'Korea', u'7.424'], [u'196', u'New Zealand', u'7.583'], 

What I expect is that "> 7" returns the rows whose third column is greater than 7, and "< 7" returns those whose third column is less than 7.
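No answer was posted, but the likely culprit is the placement of the parentheses in the filter lambda: `float(line[2] <= 7)` compares the *string* `line[2]` to `7` first and then converts the resulting boolean, instead of parsing the string and comparing the number. In Python 2, comparing a unicode string to an int never raises; numbers always sort below strings, so `line[2] <= 7` is always `False` (filter keeps nothing) and `line[2] >= 7` is always `True` (filter keeps everything), which matches the behavior in the question exactly. A minimal sketch of the corrected logic in plain Python, using sample rows copied from the question's output (the `is_number` helper is a hypothetical guard added here to skip the non-numeric header row):

```python
# Sample rows copied from the question's output.
rows = [
    [u'WEO Country Code', u'Country', u'importsGoodsVolumePercentChange'],
    [u'174', u'Greece', u'6.782'],
    [u'542', u'Korea', u'7.424'],
    [u'196', u'New Zealand', u'7.583'],
]

def is_number(s):
    """Hypothetical guard: True if s parses as a float (skips the header)."""
    try:
        float(s)
        return True
    except ValueError:
        return False

# Parse first, compare second: float(line[2]) > 7, not float(line[2] > 7).
greater_than_7 = [r for r in rows if is_number(r[2]) and float(r[2]) > 7]
# → [[u'542', u'Korea', u'7.424'], [u'196', u'New Zealand', u'7.583']]
```

In the Spark pipeline from the question, the same lambda would slot straight into `filter`, e.g. `.filter(lambda line: is_number(line[2]) and float(line[2]) > 7)`.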

0 Answers:

There are no answers yet.