I am trying to convert a binary file to ASCII values and store them in a DataFrame. The conversion to ASCII works fine, but when I try to convert to a Spark DataFrame, every field comes back as null, and I am not sure what is missing.
The returned df is supposed to be a pandas DataFrame, but it shows up as a list.
The binary file contains 2 fixed-size records of 16 bytes each. The input values are as follows:
01 01 02 0D FF E3 33 52 14 75 26 58 87 7F FF FF 01 01 02 0D FF E3 33 52 14 75 26 58 87 7F FF FF
Please help me resolve this. The code and output are below.
%spark2.pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql.types import *
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
import binascii
import pandas as pd
import numpy as np
import datetime
from string import printable
recordsize = 16
chunkcount = 100
chunksize = recordsize * chunkcount
sparkSchema = StructType([
    StructField("Field1", IntegerType(), True),
    StructField("Field2", StringType(), True),
    StructField("Field3", StringType(), True),
    StructField("Field4", StringType(), True)
])
dt = np.dtype([
    ('Field1', 'b'),
    ('Field2', np.void, 4),
    ('Field3', np.void, 3),
    ('Field4', np.void, 8)
])
StartTime = datetime.datetime.now()
print ("Start Time: " + str(StartTime))
inputfile = "/user/maria_dev/binfiles1"
def decodeRecord(data):
    x = np.frombuffer(data[1], dtype=dt)
    newx = x.byteswap().newbyteorder()
    df = pd.DataFrame(newx)
    st = set(printable)
    df[['Field2', 'Field3', 'Field4']] = df[['Field2', 'Field3', 'Field4']].applymap(
        lambda x: binascii.hexlify(x).decode('utf-8').rstrip('f'))
    return df
conf = SparkConf().setAppName("BinaryReader").setMaster("local")
sqlContext = SQLContext (sc)
rdd = sc.binaryFiles(inputfile).map(decodeRecord).collect()
print (type(rdd))
print (rdd)
df = sqlContext.createDataFrame(rdd, sparkSchema)
print ("Number of records in DataFrame: " + str(df.count()))
df.show()
The output is as follows:
Start Time: 2018-12-12 20:11:55.141848
<type 'list'>
[ Field1 Field2 Field3 Field4
0 1 01020d e33352 14752658877
1 1 01020d e33352 14752658877]
Number of records in DataFrame: 1
+------+------+------+------+
|Field1|Field2|Field3|Field4|
+------+------+------+------+
| null| null| null| null|
+------+------+------+------+
Answer 0 (score: 0):
Your decodeRecord() function returns a pandas DataFrame, so the resulting PipelinedRDD contains a single element, which is the complete pandas DataFrame. Therefore, you have to take that first element and convert it into a Spark DataFrame.
Here is the modified code:
rdd = sc.binaryFiles(inputfile).map(decodeRecord)
panda_df = rdd.first()
print (type(rdd))
print (type(panda_df))
df = sqlContext.createDataFrame(panda_df)
print ("Number of records in DataFrame: " + str(df.count()))
df.show()
Output:
Start Time: 2018-12-15 17:43:21.241421
<class 'pyspark.rdd.PipelinedRDD'>
<class 'pandas.core.frame.DataFrame'>
Number of records in DataFrame: 4
+------+--------+------+----------------+
|Field1| Field2|Field3| Field4|
+------+--------+------+----------------+
| 48|31303130|323044|4646453333333532|
| 49|34373532|363538|3837374646464646|
| 48|31303130|323044|4646453333333532|
| 49|34373532|363538|3837374646464646|
+------+--------+------+----------------+
There are other possible improvements to your code as well, for example using rdd.flatMap(), or calling decodeRecord() directly to get the pandas DataFrame and converting it to a Spark DataFrame without going through rdd.map(). Just a few suggestions; a sketch of the flatMap() idea follows below.
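Here is a minimal, untested sketch of that flatMap() variant, assuming the same dt dtype, sparkSchema, inputfile and sqlContext defined in the question; the helper name decodeRecordRows is made up for this example. Instead of returning a whole pandas DataFrame per file, the decoder emits one plain Python tuple per 16-byte record, so the resulting RDD matches sparkSchema directly:

def decodeRecordRows(data):
    # data is a (path, file bytes) pair produced by sc.binaryFiles()
    x = np.frombuffer(data[1], dtype=dt)
    rows = []
    for rec in x:
        rows.append((
            int(rec['Field1']),  # signed byte -> plain Python int for IntegerType
            binascii.hexlify(rec['Field2'].tobytes()).decode('utf-8').rstrip('f'),
            binascii.hexlify(rec['Field3'].tobytes()).decode('utf-8').rstrip('f'),
            binascii.hexlify(rec['Field4'].tobytes()).decode('utf-8').rstrip('f')
        ))
    return rows

# One RDD element per record (not per file), so the schema can be applied directly.
rdd = sc.binaryFiles(inputfile).flatMap(decodeRecordRows)
df = sqlContext.createDataFrame(rdd, sparkSchema)
df.show()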