Question

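I want to extract data from a file name (it contains some information) and write that data to the csvfile_info file without using a loop. I am new to pyspark. Could someone help me with the code and tell me how to proceed? This is what I tried: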

Code:

c = os.path.join("-------")

input_file = sc.textFile(fileDir)
file1= input_file.split('_')
csvfile_info= open(c,'a')
details= file1.map(lambda p:
    name=p[0], 
    id=p[1],
    from_date=p[2],
    to_date=p[3],
    TimestampWithExtension=p[4]\
    file_timestamp=TimestampWithExtension.split('.')[0]\
    info = '{0},{1},{2},{3},{4},{5} \n'.\
    format(name,id,from_date,to_date,file_timestamp,input_file)\
    csvfile_info.write(info)
    )

Answer 1

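Don't try to write data from inside the map() function. Instead, map each record to the corresponding string, then dump the resulting RDD to a file. Try this: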

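input_file = sc.textFile(fileDir)  # returns an RDD of strings, one per line

def map_record_to_string(x):
    # x is a single record (a string), so split() is valid here
    p = x.split('_')
    name = p[0]
    id = p[1]
    from_date = p[2]
    to_date = p[3]
    TimestampWithExtension = p[4]

    # drop the file extension from the last field
    file_timestamp = TimestampWithExtension.split('.')[0]
    # fileDir (the path string) is used instead of input_file, because an
    # RDD cannot be referenced from inside a function passed to map()
    info = '{0},{1},{2},{3},{4},{5}'.format(
        name,
        id,
        from_date,
        to_date,
        file_timestamp,
        fileDir
    )
    return info

details = input_file.map(map_record_to_string)  # returns a different RDD
details.saveAsTextFile("path/to/output")  # writes one record per line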

Note: I have not tested this code, but this is one approach you can take.

Explanation

From the docs, input_file = sc.textFile(fileDir) returns an RDD of strings containing the contents of the file.

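Every operation you want to perform applies to the contents of the RDD, that is, to the elements of the file. Calling split() on the RDD itself makes no sense, because split() is a string method. What you want is to call split(), and the other operations, on each record in the RDD (each line of the file). That is exactly what map() does.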
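The per-record logic can be tried on a plain string before Spark is involved at all. The following is a minimal sketch, not the original poster's code: the function name parse_file_name, the field-count check, and the sample file name are assumptions made for illustration.

```python
def parse_file_name(record):
    """Turn 'name_id_from_to_timestamp.ext' into one CSV line."""
    parts = record.split('_')  # split() works here because record is a str
    if len(parts) != 5:
        # Fail loudly on records that do not match the expected layout
        raise ValueError(
            "expected 5 underscore-separated fields, got %d: %r"
            % (len(parts), record)
        )
    name, file_id, from_date, to_date, timestamp_with_ext = parts
    # Keep only the part of the last field that precedes the extension
    file_timestamp = timestamp_with_ext.split('.')[0]
    return ','.join([name, file_id, from_date, to_date, file_timestamp])


# Hypothetical sample record, used only to exercise the function
sample = "report_42_20200101_20200131_1580000000.csv"
csv_line = parse_file_name(sample)
print(csv_line)
```

Once the function behaves as expected on single strings, it can be passed to map() on the RDD, for example input_file.map(parse_file_name).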

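An RDD is like an iterable, but you do not operate on it with a traditional loop. It is an abstraction that allows parallelization. From the user's point of view, map(f) applies the function f to every element of the RDD, just as if it were done in a loop. Functionally, calling input_file.map(f) is equivalent to the following: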

# let rdd_as_list be a list of strings containing the contents of the file
map_output = []
for record in rdd_as_list:
    map_output.append(f(record))

Or, equivalently:

# let rdd_as_list be a list of strings containing the contents of the file
map_output = list(map(f, rdd_as_list))  # list() is needed in Python 3, where map() returns an iterator

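Calling map() on an RDD returns a new RDD whose contents are the results of applying the function. In this case, details is a new RDD containing the lines of input_file after they have been processed by map_record_to_string.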

If it makes things easier to understand, you can also write the map() step as details = input_file.map(lambda x: map_record_to_string(x)).
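That the two spellings produce the same result can be checked on a plain Python list, which stands in for the RDD here. This is only an analogy using the built-in map(); the names to_upper and records are hypothetical.

```python
def to_upper(record):
    # Stand-in for any per-record function such as map_record_to_string
    return record.upper()


records = ["a_1", "b_2"]  # hypothetical records standing in for RDD elements

# Passing the function directly
by_name = list(map(to_upper, records))
# Wrapping the same function in a lambda
by_lambda = list(map(lambda x: to_upper(x), records))

print(by_name == by_lambda)
```

The lambda adds nothing except an extra layer of function call, so passing the named function directly is usually the tidier choice.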

Operating inside the map function in pyspark
