Is there a way to get only a few fields from nested records in a Beam pipeline?

Time: 2019-06-12 13:08:48

Tags: python apache-beam

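I am reading an Avro file whose nested schema contains too many fields, for example employeeId, empName, empPersonalInfo.Address.city, and so on. I want to write a ParDo function that picks only a few fields (employeeId, empPersonalInfo.Address.city) from each pipeline record.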

The schema of the Avro file is:
{
     "namespace"    : "studentjoin.avro",
     "type"         : "record",
     "name"         : "student",
     "fields"       : [
      {"name": "personalInfo",
       "type": { "type" : "array", "items": { 
           "type" : "record",                                
               "name" : "studentinfo",
           "fields": [
                 {"name": "studentId", "type": "int"},
                 {"name": "studentName",  "type": ["string", "null"]},
                 {"name": "studentAddress", "type": {
                    "type" : "array", "items" : {
                        "type": "record", "name" : "addressInfo", 
                        "fields":
                         [
                            {"name" : "streetName", "type": ["string", "null"] },
                            {"name": "city", "type": ["string","null"]}
                         ] }}},

                 {"name": "studentBranch", "type": ["string", "null"]}
                 ]
        } }
    }

    ]
}

When there are no nested fields, the following works perfectly:

fields_of_interest = (p | 'Projected' >> beam.Map( 
          lambda row: {f: row[f] for f in selected_fileld_names}))

The Java SDK has a built-in function for nested fields that first brings all nested fields to a single level. Is the same thing possible in Python?

1 Answer:

Answer 0 (score: 0)

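pl = (pl | "Extract" >> beam.Map(lambda x: (
    x["student"]["personalInfo"][0]["studentInfo"]["studentId"],
    x["student"]["personalInfo"][0]["studentInfo"]["studentAddress"][0]["addressInfo"])))

You cannot simply flatten the dictionary, because it contains lists (the fields declared with "type": "array" in the schema), so there is more than one way to flatten it. What if there are several addresses (with several city names)? Should the first one be returned, or all of them? The implementation above returns only the first element.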
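
To make the "first or all" choice explicit, here is a minimal sketch in plain Python. It assumes each decoded Avro record arrives as a dict keyed by the schema's field names (so the student list sits under "personalInfo" and each address list under "studentAddress"); that shape is an assumption and may differ from the keys used in the snippet above. The function name extract_id_and_cities, the first_only flag, and the sample record are hypothetical, introduced only for illustration.

```python
def extract_id_and_cities(record, first_only=False):
    """Yield (studentId, city) pairs from one decoded Avro record.

    Assumption: the record is a plain dict keyed by the schema's field
    names, i.e. record["personalInfo"] is a list of student dicts, each
    holding a "studentAddress" list of address dicts.
    """
    for student in record.get("personalInfo") or []:
        addresses = student.get("studentAddress") or []
        if first_only:
            # Keep only the first address, mirroring the [0] indexing above.
            addresses = addresses[:1]
        for address in addresses:
            # "city" is a nullable field in the schema, so it may be None.
            yield student["studentId"], address.get("city")


# Hypothetical sample record shaped like the schema in the question.
sample = {
    "personalInfo": [
        {
            "studentId": 1,
            "studentName": "A",
            "studentAddress": [
                {"streetName": "S1", "city": "Pune"},
                {"streetName": "S2", "city": "Delhi"},
            ],
            "studentBranch": "CS",
        }
    ]
}

all_pairs = list(extract_id_and_cities(sample))
first_pair = list(extract_id_and_cities(sample, first_only=True))
```

Because the function yields zero or more pairs per record, it would probably be wired into the pipeline with beam.FlatMap(extract_id_and_cities) rather than beam.Map. Passing first_only=True as an extra keyword argument to beam.FlatMap should reproduce the "first element only" behavior.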