I am running a Spark Structured Streaming example on Spark 3.0.0, using Twitter data. I have pushed the Twitter data into Kafka, and a single record looks like this:
2020-07-21 10:48:19 | 1265200268284588034 | RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,… | Hyderabad, India
Here each field is separated by '|'. The fields are:
Tweet Time
User ID
Tweet Text
Location
Now, reading this message in Spark, I get a dataframe like this:
key | value
-----+-------------------------
| 2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India
Based on this answer, I added the following code block to my application:
import pyspark.sql.functions

split_col = pyspark.sql.functions.split(df['value'], '|')
df = df.withColumn("Tweet Time", split_col.getItem(0))
df = df.withColumn("User ID", split_col.getItem(1))
df = df.withColumn("Tweet Text", split_col.getItem(2))
df = df.withColumn("Location", split_col.getItem(3))
df = df.drop("key")
But it gives me output like this:
A | B | C | D | E |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+---------+--------+-----+
2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|2 | 0 | 2 | 0 |
But I want output like this:
Tweet Time | User ID | Tweet text | Location |
-----------------------+-------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
2020-07-21 10:48:19 | 1265200268284588034 | RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,… | Hyderabad, India |
Answer 0: (score: 1)
Escape the pipe, because split accepts a pattern: a string representing a regular expression, and that string should be a Java regular expression. An unescaped '|' is the alternation metacharacter, so it matches the empty string at every position. Use "\\|" or '[|]' to split on a literal pipe.
from pyspark.sql.functions import split

split_col = split(df.value, '\\|')
df = df.withColumn("Tweet Time", split_col.getItem(0))\
       .withColumn("User ID", split_col.getItem(1))\
       .withColumn("Tweet Text", split_col.getItem(2))\
       .withColumn("Location", split_col.getItem(3))\
       .drop("key")
Output:
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+
|value |Tweet Time |User ID |Tweet Text |Location |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+
|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|
|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+
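The escaping issue can be reproduced outside Spark with Python's own re module, since '|' is an alternation metacharacter there too (a minimal sketch with a shortened sample record; the tweet text is truncated for readability):

```python
import re

# A record in the same shape as the Kafka value (tweet text shortened).
record = "2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: ...|Hyderabad, India"

# Unescaped, '|' matches the empty string at every position, so the
# split degenerates into single characters (Python 3.7+; Spark's
# Java-regex split misbehaves the same way).
broken = re.split("|", record)
print(broken[:5])

# Escaped, '\|' matches a literal pipe and yields the four fields.
fields = re.split(r"\|", record)
print(fields)
```

The same reasoning applies in Spark: pass '\\|' (or '[|]') so the Java regex engine treats the pipe as a literal delimiter rather than alternation.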