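I have a DataFrame column that receives string input like the one shown below. For each run of identical characters, startIndex is the index where the run begins, endIndex is the index where the run ends, and flag is the character itself.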
+---+------------------+
| id| Values |
+---+------------------+
|01 | AABBBAA |
|02 | SSSAAAA |
+---+------------------+
Now I want to convert the string in each row into a list of dictionaries, as shown below:
+---+--------------------+
| id| Values |
+---+--------------------+
|01 | [{"startIndex":0, |
| | "endIndex" : 1, |
| | "flag" : A }, |
| | {"startIndex":2, |
| | "endIndex" : 4, |
| | "flag" : B }, |
| | {"startIndex":5, |
| | "endIndex" : 6, |
| | "flag" : A }] |
|02 | [{"startIndex":0, |
| | "endIndex" : 2, |
| | "flag" : S }, |
| | {"startIndex":3, |
| | "endIndex" : 6, |
| | "flag" : A }] |
+---+--------------------+
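I have pseudocode that builds the dictionaries, but I am not sure how to apply it to all rows at once without using a loop. The loop approach also has a problem: only the dictionary built for the last row ends up overwriting every row.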
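import re
x = "aaabbbbccaa"
# each match is a tuple: (the whole run, the repeated character)
xs = re.findall(r"((.)\2*)", x)
print(xs)
start = 0
output = ''
for item in xs:
    # a run of length n that begins at `start` ends at start + n - 1
    end = start + (len(item[0]) - 1)
    startIndex = start
    endIndex = end
    qualityFlag = item[1]
    print(startIndex, endIndex, qualityFlag)
    # the next run begins right after this one ends
    start = end + 1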
Answer 0 (score: 1)
Wrap the logic in a udf() and use to_json() to convert the array of structs into a string:
from pyspark.sql.functions import udf, to_json
import re

df = spark.createDataFrame(
    [
        ('01', 'AABBBAA'),
        ('02', 'SSSAAAA'),
    ],
    ['id', 'Values'],
)

# argument `x` is the StringType() value of the column passed to the udf
# return `row` as a list of dicts, one per run of identical characters
@udf('array<struct<startIndex:long,endIndex:long,flag:string>>')
def set_fields(x):
    row = []
    for m in re.finditer(r'(.)\1*', x):
        row.append({
            'startIndex': m.start(),
            'endIndex': m.end() - 1,
            'flag': m.group(1),
        })
    return row

df.select('id', to_json(set_fields('Values')).alias('Values')).show(truncate=False)
+---+----------------------------------------------------------------------------------------------------------------------------+
|id |Values |
+---+----------------------------------------------------------------------------------------------------------------------------+
|01 |[{"startIndex":0,"endIndex":1,"flag":"A"},{"startIndex":2,"endIndex":4,"flag":"B"},{"startIndex":5,"endIndex":6,"flag":"A"}]|
|02 |[{"startIndex":0,"endIndex":2,"flag":"S"},{"startIndex":3,"endIndex":6,"flag":"A"}] |
+---+----------------------------------------------------------------------------------------------------------------------------+
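If it helps, the run-splitting logic can be checked in plain Python before it is wrapped in a UDF. The sketch below is only an illustration: `runs_to_dicts` is a hypothetical helper name (not part of PySpark), and the `None` check is an assumption about how null rows should be treated, since `re.finditer` raises a `TypeError` when given `None`.

```python
import json
import re


def runs_to_dicts(x):
    """Hypothetical helper: same regex logic as the UDF body, without Spark."""
    if x is None:
        # assumption: a null input should stay null rather than raise
        return None
    return [
        {"startIndex": m.start(), "endIndex": m.end() - 1, "flag": m.group(1)}
        for m in re.finditer(r"(.)\1*", x)
    ]


for value in ["AABBBAA", "SSSAAAA"]:
    # compact separators mimic the spacing of the JSON strings shown above
    print(json.dumps(runs_to_dicts(value), separators=(",", ":")))
```

For the two sample strings this should print the same JSON arrays as the Values column in the output above. An empty string yields an empty list, because the regex finds no runs.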