如何将pyspark中列的每一行的输入字符串转换为字典

时间:2019-09-24 17:07:57

标签: python-3.x pyspark pyspark-sql pyspark-dataframes

我有一个数据帧的列值,我在其中接收到如下所示的字符串输入,其中startIndex是每个字符的开头索引,end index是该字符在字符串中出现的结尾,flag是字符本身

    +---+------------------+
    | id|    Values        |
    +---+------------------+
    |01 |  AABBBAA         |
    |02 |  SSSAAAA         |
    +---+------------------+

现在,我想将字符串转换为每行的字典,如下所示:

    +---+--------------------+
    | id|    Values          |
    +---+--------------------+
    |01 |  [{"startIndex":0, |
    |   |    "endIndex" : 1, | 
    |   |    "flag" : A },   |
    |   |   {"startIndex":2, |
    |   |    "endIndex" : 4, |
    |   |    "flag" : B },   |
    |   |   {"startIndex":5, |
    |   |    "endIndex" : 6, |
    |   |    "flag" : A }]   |
    |02 |  [{"startIndex":0, |
    |   |    "endIndex" : 2, |
    |   |    "flag" : S },   |
    |   |   {"startIndex":3, |
    |   |    "endIndex" : 6, |
    |   |    "flag" : A }]   |
    +---+--------------------+-

我有伪代码来构架字典,但不确定如何应用它 一次使用所有行,而无需使用循环。还有这样的问题 方法是只有最后一帧字典被所有行覆盖


        import re
        x = "aaabbbbccaa"
        xs = re.findall(r"((.)\2*)", x)
        print(xs)
        start = 0
        output = '' 
        for item in xs:
            end = start + (len(item[0])-1)
            startIndex = start
            endIndex = end
            qualityFlag = item[1]
            print(startIndex, endIndex, qualityFlag)
            start = end+

1 个答案:

答案 0 :(得分:1)

使用 udf() 包装代码逻辑,并使用 to_json() 将结构数组转换为字符串:

from pyspark.sql.functions import udf, to_json
import re

df = spark.createDataFrame([
      ('01', 'AABBBAA')
    , ('02', 'SSSAAAA')
  ] , ['id', 'Values']
)

# argument `x` is a StringType() over the udf function
# return `row` as a list of dicts
@udf('array<struct<startIndex:long,endIndex:long,flag:string>>')
def set_fields(x):
    row = []
    for m in re.finditer(r'(.)\1*', x):
        row.append({
            'startIndex': m.start()
          , 'endIndex': m.end()-1
          , 'flag': m.group(1)
        })
    return row

df.select('id', to_json(set_fields('Values')).alias('Values')).show(truncate=False)
+---+----------------------------------------------------------------------------------------------------------------------------+
|id |Values                                                                                                                      |
+---+----------------------------------------------------------------------------------------------------------------------------+
|01 |[{"startIndex":0,"endIndex":1,"flag":"A"},{"startIndex":2,"endIndex":4,"flag":"B"},{"startIndex":5,"endIndex":6,"flag":"A"}]|
|02 |[{"startIndex":0,"endIndex":2,"flag":"S"},{"startIndex":3,"endIndex":6,"flag":"A"}]                                         |
+---+----------------------------------------------------------------------------------------------------------------------------+