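I'm using Python to read a series of CSV files obtained by a web scraper (there are thousands of them, so editing them by hand is not an option). The data looks like this: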
"Client: Secret Client"
"G/L Account: (#-#-#) Secret Type of Account"
"Process Date: MM/DD/YYYY"
"Export Date: MM/DD/YYYY"
"Unit Name ","Description","Pay. Type ","Amount","Tran. Date "
"last, first","some note (dates with commas like 17 Aug, 2018 could be here)","Credit Card ","$AMNT.CHANGE","Date and Timestamp"
"Total","","","$AMNT.CHANGE","
If you look closely, you'll see a trailing comma at the very end, followed by a stray ". The code I'm trying to use is here:
import os
import pandas as pd
import csv

def read_temp(file):
    tmp = pd.read_csv(file, header=None, error_bad_lines=False, quotechar='"', skiprows=5, quoting=csv.QUOTE_ALL, skipinitialspace=True, skipfooter=1)
    gl = pd.read_csv(file, header=None, error_bad_lines=False, quotechar='"', skiprows=1, nrows=1, quoting=csv.QUOTE_ALL, skipinitialspace=True)
    proc_date = pd.read_csv(file, header=None, error_bad_lines=False, quotechar='"', skiprows=2, nrows=1, quoting=csv.QUOTE_ALL, skipinitialspace=True)
    cols = ['NAME', 'DESCRIPTION', 'PAY_TYP', 'AMOUNT', 'TRAN_DATE']
    tmp.columns = cols
    # print(tmp.columns)
    # print(file)
    tmp['G/L_ACCOUNT'] = gl[0][0].split(':')[1]
    tmp['PROCESS_DATE'] = proc_date[0][0].split(':')[1]
    for col in tmp.columns:
        tmp[col] = tmp[col].str.strip('"')
    return tmp

master = "C:\\path\\to\\master\\"
want = []
flag = 0
for direc in os.listdir(master):
    for file in os.listdir(master + direc):
        temp = read_temp(master + direc + '\\' + file)
        want.append(temp)
df = pd.concat(want)
The error is:
',' expected after '"'
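I was thinking I could use a CSV reader and a regular expression (which I have no experience with) to pre-read each line and find everything enclosed in double quotes. Then I could somehow alter it, or simply remove the trailing comma and double quote. Any ideas would be appreciated!
Answer 0 (score: 1)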
A quick test with the csv module does not fail:
import csv
data = """"Client: Secret Client"
"G/L Account: (#-#-#) Secret Type of Account"
"Process Date: MM/DD/YYYY"
"Export Date: MM/DD/YYYY"
"Unit Name ","Description","Pay. Type ","Amount","Tran. Date "
"last, first","some note (dates with commas like 17 Aug, 2018 could be here)","Credit Card ","$AMNT.CHANGE","Date and Timestamp"
"Total","","","$AMNT.CHANGE","
"""
reader = csv.reader(data.split("\n"), delimiter=',', quotechar='"')
for row in reader:
    print(', '.join(row))
but it is also "confused" by the last, incomplete element:
Client: Secret Client
G/L Account: (#-#-#) Secret Type of Account
Process Date: MM/DD/YYYY
Export Date: MM/DD/YYYY
Unit Name , Description, Pay. Type , Amount, Tran. Date
last, first, some note (dates with commas like 17 Aug, 2018 could be here), Credit Card , $AMNT.CHANGE, Date and Timestamp
Total, , , $AMNT.CHANGE,
You can, however, remove the offending characters from the data, for example with rfind and slicing:
pos = data.rfind(',"', -5)
if pos != -1:
    data = data.strip()[:pos]
print(data[-15:])
This should print ,"$AMNT.CHANGE".
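It searches for ," within the last 5 characters of the string. If it is found, rfind returns its position, which is then used to remove the corresponding characters (or rather, to return the string without them).
strip() just removes any surrounding newline (introduced here by embedding the data in a triple-quoted string literal).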
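Alternatively, if the problem is always those two extra characters, you can slice them off with a negative index, e.g. data[:-2] (once strip() has removed the trailing newline).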
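As a minimal sketch of that slicing variant, assuming the stray fragment is always exactly ," at the very end of the text: drop_trailing_fragment is a hypothetical helper name, not something from the question. It checks with endswith first so that well-formed files are left alone.

```python
def drop_trailing_fragment(text: str) -> str:
    """Remove a dangling ',"' from the end of text, if present."""
    stripped = text.rstrip()  # drop trailing newlines/whitespace first
    if stripped.endswith(',"'):
        return stripped[:-2]  # slice off the comma and the stray quote
    return stripped


sample = '"Total","","","$AMNT.CHANGE","\n'
print(drop_trailing_fragment(sample))
```

One caveat: a well-formed file whose last field is a quoted lone comma would also end in ," and get truncated, so this is only safe if that cannot occur in your data.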
A regular expression isn't needed, but
import re
data = re.sub(r',"?$', "", data, count=1)
would do the job, and it also works when there is only a trailing comma.
You can play with this on regex101.com, which explains what the expression does.
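To check the regular expression end to end, here is a small sketch that cleans a shortened copy of the sample data and feeds it to csv.reader. The names TRAILING_FRAGMENT and clean_export are made up for illustration, and the text is right-stripped first so that $ anchors on the real end of the data.

```python
import csv
import re

# Matches a trailing comma, optionally followed by a stray double quote,
# at the very end of the text (trailing whitespace is removed beforehand).
TRAILING_FRAGMENT = re.compile(r',"?$')


def clean_export(text: str) -> str:
    """Return text without the dangling ',' or ',"' at its end."""
    return TRAILING_FRAGMENT.sub("", text.rstrip(), count=1)


raw = (
    '"Unit Name ","Description","Pay. Type ","Amount","Tran. Date "\n'
    '"last, first","some note","Credit Card ","$AMNT.CHANGE","Date and Timestamp"\n'
    '"Total","","","$AMNT.CHANGE","\n'
)

rows = list(csv.reader(clean_export(raw).splitlines()))
for row in rows:
    print(len(row), row)
```

After cleaning, the Total row simply has four fields instead of a fifth, unterminated one.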
Now pandas should have no trouble parsing the data.
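Putting it together, a rough sketch of how the cleaned text could be handed to pandas through io.StringIO. It assumes every file has the layout shown in the question (four metadata lines, a header row, data rows, then a Total row); parse_export and COLUMNS are illustrative names, and engine="python" is passed explicitly because skipfooter is not supported by the default C parser.

```python
import io
import re

import pandas as pd

COLUMNS = ["NAME", "DESCRIPTION", "PAY_TYP", "AMOUNT", "TRAN_DATE"]


def parse_export(text: str) -> pd.DataFrame:
    """Parse one scraped export that may end with a dangling ',"'."""
    cleaned = re.sub(r',"?$', "", text.rstrip(), count=1)
    lines = cleaned.splitlines()

    # Lines 2 and 3 hold 'G/L Account: ...' and 'Process Date: ...'.
    gl_account = lines[1].strip('"').split(":", 1)[1].strip()
    process_date = lines[2].strip('"').split(":", 1)[1].strip()

    # Skip the 4 metadata lines plus the header row, and drop the Total row.
    frame = pd.read_csv(
        io.StringIO(cleaned),
        header=None,
        names=COLUMNS,
        skiprows=5,
        skipfooter=1,
        engine="python",
        skipinitialspace=True,
    )
    frame["G/L_ACCOUNT"] = gl_account
    frame["PROCESS_DATE"] = process_date
    return frame


SAMPLE = '''"Client: Secret Client"
"G/L Account: (#-#-#) Secret Type of Account"
"Process Date: MM/DD/YYYY"
"Export Date: MM/DD/YYYY"
"Unit Name ","Description","Pay. Type ","Amount","Tran. Date "
"last, first","some note (dates with commas like 17 Aug, 2018 could be here)","Credit Card ","$AMNT.CHANGE","Date and Timestamp"
"Total","","","$AMNT.CHANGE","
'''

frame = parse_export(SAMPLE)
print(frame)
```

In the loop from the question, you would read each file into a string (for example with open(path).read()), pass it to parse_export, and concatenate the resulting frames as before.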