I have a JSON file that looks like this:
{
	"cols": [
		"id",
		"value"
	],
	"data": [
		[
			1,
			"0.10259593440335"
		],
		[
			2,
			"0.0061205302736639"
		],
		[
			3,
			"-0.36367513456978"
		],
		[
			4,
			"0.080167833573921"
		],
		.
		.
		.
This is the code I use to read the data with PySpark:
import sys
sys.path.insert(0, '.')

from pyspark import SparkContext, SparkConf


def strip(line: str):
    # Lines inside the inner arrays are indented with three tabs.
    # id lines end with a comma and are unquoted; value lines are the
    # last element of their array, so they are quoted but have no comma.
    if line[-1] == ',':
        return float(line[3:-1])  # drop 3 leading tabs and trailing comma
    else:
        return float(line[4:-1])  # drop 3 tabs + opening quote, and closing quote


if __name__ == "__main__":
    conf = SparkConf().setAppName("airports").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    json = sc.textFile("dataMay-31-2017.json")
    jsonCol = json.filter(lambda line: '\t\t\t' in line)
    jsonCol = jsonCol.map(strip)
After the last map operation, I have an RDD containing the following elements:
[1.0, 0.10259593440335, 2.0, 0.0061205302736639, 3.0, -0.36367513456978, 4.0, 0.080167833573921,...
Now I want to perform an operation that gives me an RDD of 2-tuples:
[(1.0, 0.10259593440335), (2.0, 0.0061205302736639), (3.0, -0.36367513456978), (4.0, 0.080167833573921),...
What is the correct way to do this?
Answer 0 (score: 1):
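One way to obtain the desired RDD of 2-tuples (a minimal sketch, not necessarily what the original answer proposed) is to skip the fragile tab-based line parsing entirely: parse the file as JSON in the driver, since each row of `"data"` is already an `[id, value]` pair, and then parallelize the pairs. The inline document below is a hypothetical four-row stand-in for `dataMay-31-2017.json`:

```python
import json

# Sample document with the same shape as dataMay-31-2017.json
# (truncated to four rows for illustration).
doc = json.loads("""
{
    "cols": ["id", "value"],
    "data": [
        [1, "0.10259593440335"],
        [2, "0.0061205302736639"],
        [3, "-0.36367513456978"],
        [4, "0.080167833573921"]
    ]
}
""")

# Each row of "data" is already an [id, value] pair; convert both to float.
pairs = [(float(i), float(v)) for i, v in doc["data"]]
# pairs == [(1.0, 0.10259593440335), (2.0, 0.0061205302736639), ...]

# With a SparkContext available, this list becomes an RDD of 2-tuples:
#   rdd = sc.parallelize(pairs)
```

Alternatively, if you want to keep the existing flat RDD, consecutive elements can be paired in Spark by attaching positions with `zipWithIndex` and grouping on `index // 2`, though that relies on the RDD preserving the file's element order.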