Pyspark: from an RDD of token lists to an RDD with one token per row

Asked: 2018-05-04 01:46:40

Tags: python list apache-spark pyspark

I have a list of lists containing tokens, for example:

mylist = [['hello'],
          ['cat'],
          ['dog'],
          ['hey'],
          ['dog'],
          ['I', 'need', 'coffee'],
          ['dance'],
          ['dream', 'job']]

myRDD = sc.parallelize(mylist)

I'm struggling to find the operation that produces an RDD in which each row is a single token. The output I want is:

[['hello'], ['cat'], ['dog'], ['hey'], ['dog'],
 ['I'], ['need'], ['coffee'], ['dance'], ['dream'], ['job']]

What is the correct syntax to do this? Thanks.

1 answer:

Answer 0 (score: 2)

Just flatMap:

myRDD.flatMap(lambda xs: ([x] for x in xs))
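As a sketch of why this works: flatMap applies the function to every element of the RDD and flattens the results, so each token from every sublist becomes its own one-element row. The plain-Python equivalent below illustrates the transformation; the commented PySpark lines assume the sc SparkContext from the question.

```python
mylist = [['hello'], ['cat'], ['dog'], ['hey'], ['dog'],
          ['I', 'need', 'coffee'], ['dance'], ['dream', 'job']]

# What flatMap does to the RDD's elements, expressed as a
# plain-Python comprehension: emit [x] for every token x in
# every sublist, flattened into a single list of rows.
flattened = [[x] for xs in mylist for x in xs]
# flattened == [['hello'], ['cat'], ['dog'], ['hey'], ['dog'],
#               ['I'], ['need'], ['coffee'], ['dance'], ['dream'], ['job']]

# In PySpark (assuming an existing SparkContext named sc):
# myRDD = sc.parallelize(mylist)
# tokens = myRDD.flatMap(lambda xs: ([x] for x in xs))
# tokens.collect() returns the same list as `flattened` above.
```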