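In this project I convert a csv file to an xls file and a txt file to an xls file. The goal is to compare the two xls files and print any differences to a third Excel file.
However, the printed differences include every entry containing an integer greater than 999, because integers in the csv file I converted are treated as strings rather than integers. Because of the comma in the converted csv Excel file, the value 1,200 (in the xls file converted from csv) is treated as different from 1200 (in the xls file converted from txt).
My question is: is there a way to convert integers that were interpreted as strings back into integers? Failing that, is there a way to remove all commas from my xls file? I have tried the usual dataframe.replace approach, but it did not work.
Here is my code: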
#import required libraries
import datetime
import xlrd
import pandas as pd
#build the time_handle timestamp string used to name the outputted excel files
time_handle = datetime.datetime.now().strftime("%Y%m%d_%H%M")
#identify XM1 file paths (for both csv origin and excel destination)
XM1_csv = r"filepath"
XM1_excel = r"filepath" + time_handle + ".xlsx"
#identify XM2 file paths (for both txt origin and excel destination)
XM2_txt = r"filepath"
XM2_excel = r"filepath" + time_handle + ".xlsx"
#remove commas from XM1 excel - failed attempts
#XM1_excel = [col.replace(',', '') for col in XM1_excel]
#XM1_excel = XM1_excel.replace(",", "")
#for line in XM1_excel:
#XM1_excel.write(line.replace(",", ""))
#remove commas from XM1 CSV - failed attempts
#XM1_csv = [col.replace(',', '') for col in XM1_csv]
#XM1_csv = XM1_csv.replace(",", "")
#for line in XM1_csv:
#XM1_excel.write(line.replace(",", ""))
#convert the csv XM1 file to an excel file, in the same folder
pd.read_csv(XM1_csv).to_excel(XM1_excel)
#convert the txt XM2 file to an excel file in the same folder
pd.read_csv(XM2_txt, sep="|").to_excel(XM2_excel)
#confirm XM1 filepath
filepath_XM1 = XM1_excel
#confirm XM2 filepath
filepath_XM2 = XM2_excel
#read relevant columns from the excel files
df1 = pd.read_excel(filepath_XM2, sheet_name="Sheet1", usecols="H,J,M,U")
df2 = pd.read_excel(filepath_XM1, sheet_name="Sheet1", usecols="C,E,G,K")
#remove all commas from XM1 - failed attempts
#df2 = [col.replace(',', '') for col in df2]
#df2 = df2.replace(",", "")
#for line in df2:
#df2.write(line.replace(",", ""))
#merge the columns from both excel files into one column each respectively
df4 = df1["Exchange Code"] + df1["Product Type"] + df1["Product Description"] + df1["Quantity"].apply(str)
df5 = df2["Exchange"] + df2["Product Type"] + df2["Product Description"] + df2["Quantity"].apply(str)
#concatenate both columns from each excel file, to make one big column containing all the data
df = pd.concat([df4, df5])
#remove all whitespace from each row of the column of data
df=df.str.strip()
df=["".join(x.split()) for x in df]
#convert the data to a dataframe from a series
df = pd.DataFrame({'Value': df})
#remove any duplicates
df.drop_duplicates(subset=None, keep=False, inplace=True)
#print to the console just as a visual aid
print(df)
#output_path = r"filepath"
#print the erroneous entries to an excel file
df.to_excel("XM1_XM2Comparison" + time_handle + ".xls")
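Also, I realize the XM1 and XM2 file names are a little confusing with respect to df1 and df2, but I simply renamed my files. It makes sense in terms of the files and where they are used in the code!
Thanks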
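Answer 0 (score: 1)
On the read side of the dataframe you can try a parameter called converters, which lets you specify how each column is converted. For example:
df = pd.read_excel(file, sheet_name=YOUR_SHEET_HERE, converters={'FIELD_NAME': str})
converters is available in both read_csv and read_excel.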
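As a rough illustration of how converters could address the comma problem from the question, the sketch below strips the thousands separator and converts each cell to an int while reading. The sample data, the column names, and the to_int helper are all made up for the example and are not taken from the asker's files.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real csv file: quantities above 999
# are quoted and contain a thousands separator.
csv_text = 'Exchange,Quantity\nCME,"1,200"\nICE,"950"\n'


def to_int(value):
    """Hypothetical helper: drop thousands separators, then convert to int."""
    return int(value.replace(",", ""))


# The converter receives the raw cell text of the "Quantity" column.
df = pd.read_csv(io.StringIO(csv_text), converters={"Quantity": to_int})
print(df["Quantity"].tolist())
```

A converter like this runs once per cell, so it also gives you a place to handle blanks or other odd values if the real file contains them.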
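Answer 1 (score: 0)
I actually solved this with a simple fix, posted here for future reference. When reading the csv with pd.read_csv, I added the thousands parameter, so it looks like this: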
pd.read_csv(XM1_csv, thousands=",").to_excel(XM1_excel)
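To see the effect of thousands in isolation, here is a small sketch that reads the same text twice, once without and once with the parameter. The sample data and column names are invented for the example; only pd.read_csv and its thousands parameter come from the answer.

```python
import io

import pandas as pd

# Hypothetical sample: one quantity carries a thousands separator.
csv_text = 'Exchange,Quantity\nCME,"1,200"\nICE,950\n'

# Without thousands, "1,200" cannot be parsed as a number,
# so the whole column is read as text.
raw = pd.read_csv(io.StringIO(csv_text))

# With thousands=",", the separator is dropped and the column is numeric.
parsed = pd.read_csv(io.StringIO(csv_text), thousands=",")

print(raw["Quantity"].tolist())
print(parsed["Quantity"].tolist())
```

Once both inputs are parsed as numbers, applying str to the quantity column gives the same text for 1200 on both sides, so the concatenated comparison strings should match.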