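I have a DataFrame in which I need to search for the value of one column (a StringType) inside another column (an ArrayType). From the array column, I want to keep only the values from the first occurrence of the first column's value through the last element of the array.
An example is shown below:
The input DF is as follows: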
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E101, E102, E103]
Name3|E103|[E101, E102, E103, E104, E105]
The output DF should look like this:
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E102, E103]
Name3|E103|[E103, E104, E105]
Answer 0: (score: 3)
Since Spark 2.4, you can use the slice and array_position functions.
Just adapt it to your DataFrame's column names. Hope this helps.
Answer 1: (score: 0)
I think this is what you want; I implemented it on dummy data:
import pyspark.sql.types as T
import pyspark.sql.functions as F
df = sqlContext.createDataFrame([['E101',["E101", "E102", "E103", "E104", "E105"]]],["eid", "mapped_eid"])
df.persist()
df.show(truncate = False)
+----+------------------------------+
|eid |mapped_eid                    |
+----+------------------------------+
|E101|[E101, E102, E103, E104, E105]|
+----+------------------------------+
@F.udf(returnType=T.ArrayType(T.StringType()))
def find_element(element, temp_list):
    # Keep every item from the first occurrence of `element` to the end of the list.
    found = False
    res = []
    for item in temp_list:
        # Start collecting at the first match (the match itself is included).
        if not found and item == element:
            found = True
        if found:
            res.append(item)
    return res
df.withColumn(
    "res_col",
    find_element(F.col("eid"), F.col("mapped_eid"))
).show(truncate = False)
+----+------------------------------+------------------------------+
|eid |mapped_eid                    |res_col                       |
+----+------------------------------+------------------------------+
|E101|[E101, E102, E103, E104, E105]|[E101, E102, E103, E104, E105]|
+----+------------------------------+------------------------------+
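For what it's worth, the loop can probably be condensed with list.index, and the logic can then be checked in plain Python before it is wrapped in a UDF. This is only a sketch: the helper name slice_from_first_match is hypothetical, and returning None for a missing array or a missing element is an assumed convention that the answer above does not cover.

```python
def slice_from_first_match(element, items):
    """Return items from the first occurrence of element through the end.

    Returns None when items is None or element is absent (an assumed
    convention, not something the original answer specifies).
    """
    if items is None or element not in items:
        return None
    # list.index gives the position of the first occurrence.
    return items[items.index(element):]


sample = ["E101", "E102", "E103", "E104", "E105"]
from_first = slice_from_first_match("E101", sample)
from_middle = slice_from_first_match("E103", sample)
missing = slice_from_first_match("E999", sample)
```

It should be possible to register it the same way as find_element, for example with F.udf(slice_from_first_match, T.ArrayType(T.StringType())).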
Let me know if this works for you.