Filter out data from a pyspark.RDD

Date: 2019-12-31 02:59:04

Tags: scala apache pyspark bigdata rdd

I have mapped my data in RDD format:

# Split each line once, then keep columns 0-14, 17, 18 and 21.
def select_cols(line):
    f = line.split(",")
    return (f[0], f[1], f[2], f[3], f[4], f[5], f[6], f[7], f[8], f[9],
            f[10], f[11], f[12], f[13], f[14], f[17], f[18], f[21])

crimesMapped = crimesOnly.map(select_cols)

crimesMapped.take(1)

Output:

[('11034701',
  'JA366925',
  '01/01/2001 11:00:00 AM',
  '016XX E 86TH PL',
  '1153',
  'DECEPTIVE PRACTICE',
  'FINANCIAL IDENTITY THEFT OVER $ 300',
  'RESIDENCE',
  'false',
  'false',
  '0412',
  '004',
  '8',
  '45',
  '11',
  '2001',
  '08/05/2017 03:50:08 PM',
  '')
]
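
(As an aside: splitting on raw commas breaks as soon as a field contains an embedded comma, which quoted CSV fields can. A sketch of a more robust parse using Python's standard csv module, assuming crimesOnly holds raw CSV lines:)

import csv

# Indices of the columns kept above (0-14, 17, 18 and 21).
KEEP = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 17, 18, 21]

def parse_line(line):
    # csv.reader honours quoting, so a quoted field with a comma stays one field,
    # where a plain line.split(",") would split it apart.
    fields = next(csv.reader([line]))
    return tuple(fields[i] for i in KEEP)

crimesMapped = crimesOnly.map(parse_line)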

The data I want is here:

s = crimesMapped.take(1)
print(s)
print("-------------------------------------------------------------------------------------------------------------------------------")
print(s[0][11])

Output:

[('11034701', 'JA366925', '01/01/2001 11:00:00 AM', '016XX E 86TH PL', '1153', 'DECEPTIVE PRACTICE', 'FINANCIAL IDENTITY THEFT OVER $ 300', 'RESIDENCE', 'false', 'false', '0412', '004', '8', '45', '11', '2001', '08/05/2017 03:50:08 PM', '')]
-------------------------------------------------------------------------------------------------------------------------------
004

I want to filter the data so that it gives me only the value at column 11 of each record. How do I do that?

crimesMapped.filter(lambda x: x[][11]) --??
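
Selecting one field from every record is a projection, not a predicate, so map is the transformation to use here; filter only keeps or drops whole records based on a boolean. A minimal sketch, assuming crimesMapped holds the 18-element tuples shown above:

# Keep only the element at index 11 of each tuple ('004' in the sample record).
districts = crimesMapped.map(lambda x: x[11])
districts.take(3)

If you also need to drop records based on that column, filter takes a boolean predicate and can be chained before the map, e.g. crimesMapped.filter(lambda x: x[11] == '004').map(lambda x: x[11]).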
