Question

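I am reading an xls file and converting it to a csv file using pyspark in Databricks. My input data in the xls file is in string format: 101101114501700. But after converting it to CSV with pandas and writing it to the datalake folder, the value shows up as 101101114501700.0. My code is below. Please help me understand why I am getting a decimal part in the data.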

import os
import time

import pandas as pd

for file in os.listdir("/path/to/file"):
    if file.endswith(".xls"):
        filepath = os.path.join("/path/to/file", file)
        filepath_pd = pd.ExcelFile(filepath)
        names = filepath_pd.sheet_names
        df = pd.concat([filepath_pd.parse(name) for name in names])
        df1 = df.to_csv("/path/to/file" + file.split('.')[0] + ".csv", sep=',', encoding='utf-8', index=False)
        print(time.strftime("%Y%m%d-%H%M%S") + ": XLS files converted to CSV and moved to folder")

Answer 1

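Your problem has nothing to do with Spark or PySpark. It comes from pandas.

pandas automatically interprets and infers the data type of each column. Because every value in the column is numeric, pandas treats the column as a float dtype, and the float values are then written to the CSV with a trailing .0.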
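The effect can be reproduced without an Excel file. The sketch below is only an illustration: the column name `id` and the missing value are assumptions (a missing cell is one common reason an otherwise integer column ends up as float), not details taken from the question.

import io

import pandas as pd

# Hypothetical data: one identifier plus one missing value in the same column.
# The missing value forces pandas to store the column as float64.
df = pd.DataFrame({"id": [101101114501700, None]})
print(df["id"].dtype)

# Write the frame to an in-memory CSV to inspect what to_csv produces.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

The printed CSV contains the identifier with a decimal part, which matches the symptom described in the question.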

To avoid this, the pandas.ExcelFile.parse method accepts a parameter named converters, which you can use to tell pandas how to convert specific columns:

# if you want one specific column as string
df = pd.concat([filepath_pd.parse(name, converters={'column_name': str}) for name in names])

OR

# if you want all columns as string (except the date columns)
# and you have multiple sheets that do not share the same columns:
# build the converters per sheet, then merge all sheets into one dataframe
def get_converters(excel_file, sheet_name, dt_cols):
    cols = excel_file.parse(sheet_name).columns
    converters = {col: str for col in cols if col not in dt_cols}
    for col in dt_cols:
        converters[col] = pd.to_datetime
    return converters

df = pd.concat([filepath_pd.parse(name, converters=get_converters(filepath_pd, name, ['date_column'])) for name in names]).reset_index(drop=True)

OR

# if you want all columns as string (except the date columns)
# and all your sheets have the same columns
cols = filepath_pd.parse().columns
dt_cols = ['date_column']
converters = {col: str for col in cols if col not in dt_cols}
for col in dt_cols:
    converters[col] = pd.to_datetime
df = pd.concat([filepath_pd.parse(name, converters=converters) for name in names]).reset_index(drop=True)

Answer 2

I think the field is automatically parsed as float when the Excel file is read. I would correct it afterwards:

df['column_name'] = df['column_name'].astype(int)

If your column contains nulls, it cannot be converted to integer, so you need to fill the nulls first:

df['column_name'] = df['column_name'].fillna(0).astype(int)

Then you can concatenate and save the data the same way you already do.
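If replacing missing values with 0 is not acceptable, a possible alternative is the pandas nullable integer dtype "Int64" (capital I), which keeps missing values as <NA> instead of forcing a float column. This is a minimal sketch, not part of the original answer; the sample data and the extra column `other` are made up for illustration.

import io

import numpy as np
import pandas as pd

# Hypothetical frame that mimics a column parsed as float with a missing value.
df = pd.DataFrame({
    "column_name": [101101114501700.0, np.nan],
    "other": ["a", "b"],
})

# Convert to the nullable integer dtype; NaN becomes <NA> rather than 0.
df["column_name"] = df["column_name"].astype("Int64")
print(df["column_name"].dtype)

# Write to an in-memory CSV: integers have no decimal part, missing values stay empty.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

The conversion only works when every non-missing value is a whole number, which is the case for identifiers like the one in the question.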
