我尝试将excel文件列表更改为csvs,然后将它们加载到pandas数据帧中,但我不确定如何在脚本中转换它们。 Csvkit和xlsx2csv似乎适用于从命令行执行此操作,但是当我尝试启动类似的子进程时
for filename in sorted_files:
file = subprocess.Popen("in2csv filename", stdout=subprocess.PIPE)
print file.stdout
dataframe = pd.read_csv(file)
我收到了错误
IOError:预期的文件路径名或类文件对象,获得类型 当格式为"已修复"
时,schema不得为null是否可以从子进程获取输出并将其传递给数据帧?任何帮助非常感谢!
答案 0 :(得分:0)
尽管提出问题已经很久了,但我遇到了同样的问题,这是在python脚本中实现的方式:
只能使用sheetid参数执行Xlsx2csv。为了获取工作表名称和ID,使用了get_sheet_details。
csvfrmxlsx为父目录下csv文件夹中的每个工作表创建csv文件。
import pandas as pd
from pathlib import Path
def get_sheet_details(filename):
import xmltodict
import shutil
import zipfile
sheets = []
# Make a temporary directory with the file name
directory_to_extract_to = (filename.with_suffix(''))
directory_to_extract_to.mkdir(parents=True, exist_ok=True)
# Extract the xlsx file as it is just a zip file
zip_ref = zipfile.ZipFile(filename, 'r')
zip_ref.extractall(directory_to_extract_to)
zip_ref.close()
# Open the workbook.xml which is very light and only has meta data, get sheets from it
path_to_workbook = directory_to_extract_to / 'xl' / 'workbook.xml'
with open(path_to_workbook, 'r') as f:
xml = f.read()
dictionary = xmltodict.parse(xml)
for sheet in dictionary['workbook']['sheets']['sheet']:
sheet_details = {
'id': sheet['@sheetId'], # can be sheetId for some versions
'name': sheet['@name'] # can be name
}
sheets.append(sheet_details)
# Delete the extracted files directory
shutil.rmtree(directory_to_extract_to)
return sheets
def csvfrmxlsx(xlsxfl, df): # create csv files in csv folder on parent directory
from xlsx2csv import Xlsx2csv
(xlsxfl.parent / 'csv').mkdir(parents=True, exist_ok=True)
for index, row in df.iterrows():
shnum = row['id']
shnph = xlsxfl.parent / 'csv' / Path(row['name'] + '.csv') # path for converted csv file
Xlsx2csv(str(xlsxfl), outputencoding="utf-8").convert(str(shnph), sheetid=int(shnum))
return
pthfnc = 'c:/xlsx/'
wrkfl = 'my.xlsx'
xls_file = Path(pthfnc + wrkfl)
sheetsdic = get_sheet_details(xls_file) # dictionary with sheet names and ids without opening xlsx file
df = pd.DataFrame.from_dict(sheetsdic)
csvfrmxlsx(xls_file, df) # df with sheets to be converted