I have a DataFrame with a text column.
I want to remove everything after the third word. How can I do this with PySpark or Spark SQL?
Answer 0 (score: 1)
You can use a regular expression to extract the first three words.
from pyspark.sql.functions import col, regexp_extract
df.select(regexp_extract(col("product"), "([^\\s]+\\s+){0,2}[^\\s]+", 0))\
    .show(truncate=False)
+--------------------------------------------------+
|regexp_extract(product, ([^\s]+\s+){0,2}[^\s]+, 0)|
+--------------------------------------------------+
|HI Celebrate Cake                                 |
|GO Choc Celebrat                                  |
|BI Chocolate Buttercream                          |
|Graduation Cake 28                                |
|Slab Image Cake                                   |
|Slab Celebration Cake                             |
|Grain Bread                                       |
+--------------------------------------------------+
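To see why the pattern keeps at most three words, it can be tried outside Spark with Python's built-in `re` module, which handles this pattern the same way for simple inputs like these. This is only a minimal sketch: the helper name `first_three_words` and the sample strings are hypothetical and are not taken from the original data.

```python
import re

# Same pattern as in the answer: up to two "word + whitespace" groups,
# followed by one final word.
PATTERN = re.compile(r"([^\s]+\s+){0,2}[^\s]+")


def first_three_words(text: str) -> str:
    """Return the first three whitespace-separated words of text (or fewer)."""
    match = PATTERN.search(text)
    # Like regexp_extract, fall back to an empty string when nothing matches.
    return match.group(0) if match else ""


# Hypothetical inputs, for illustration only.
samples = ["Slab Celebration Cake Large White", "Grain Bread", ""]
results = [first_three_words(s) for s in samples]
print(results)
```

Because the repeated group is quantified with `{0,2}`, values with only one or two words are returned unchanged instead of failing to match.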
Answer 1 (score: 1)
I found a workaround:
from pyspark.sql import functions as sf
df_test = spark.sql("select * from brand_cleanup")
# Split the description on spaces, keep the first three words, and rejoin them.
# slice + concat_ws (Spark 2.4+) also handles rows with fewer than three words;
# concat over getItem(0), getItem(1), getItem(2) would return null for those rows.
split_col = sf.split(df_test.item_eng_desc, ' ')
df_split = df_test.withColumn('item_desc_clean', sf.concat_ws(' ', sf.slice(split_col, 1, 3)))