Question

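I'm using Python to read a series of CSVs obtained with a web scraper (there are thousands of them, so editing by hand is not an option). The data looks like this: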

"Client: Secret Client"
"G/L Account: (#-#-#) Secret Type of Account"
"Process Date: MM/DD/YYYY"
"Export Date: MM/DD/YYYY"
"Unit Name ","Description","Pay. Type ","Amount","Tran. Date "
"last, first","some note (dates with commas like 17 Aug, 2018 could be here)","Credit Card ","$AMNT.CHANGE","Date and Timestamp"
"Total","","","$AMNT.CHANGE","

If you look closely, you'll see a final comma followed by a rogue ". The code I'm trying to use is here:

import os
import pandas as pd
import csv

def read_temp(file):
    tmp = pd.read_csv(file, header=None, error_bad_lines=False, quotechar='"', skiprows=5, quoting=csv.QUOTE_ALL,skipinitialspace=True, skipfooter=1)
    gl = pd.read_csv(file, header=None, error_bad_lines=False, quotechar='"', skiprows=1, nrows=1, quoting=csv.QUOTE_ALL,skipinitialspace=True)
    proc_date = pd.read_csv(file, header=None, error_bad_lines=False, quotechar='"', skiprows=2, nrows=1, quoting=csv.QUOTE_ALL,skipinitialspace=True)
    cols = ['NAME', 'DESCRIPTION', 'PAY_TYP', 'AMOUNT', 'TRAN_DATE']
    tmp.columns = cols
    # print(tmp.columns)
    # print(file)
    tmp['G/L_ACCOUNT'] = gl[0][0].split(':')[1]
    tmp['PROCESS_DATE'] = proc_date[0][0].split(':')[1]
    for col in tmp.columns:
        tmp[col] = tmp[col].str.strip('"')
    return tmp
master = "C:\\path\\to\\master\\"
want=[]
flag = 0
for direc in os.listdir(master):
    for file in os.listdir(master+direc):
        temp = read_temp(master+direc+'\\'+file)
        want.append(temp)

df = pd.concat(want)

The error is:

',' expected after '"'

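I think that if I could use the CSV reader and a regular expression (which I have no experience with) to pre-read each line and find everything enclosed in double quotes, I could then somehow change it, or remove the trailing comma and double quote. Any ideas would be greatly appreciated!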

Answer 1

A quick test with the csv module does not fail:

import csv

data = """"Client: Secret Client"
"G/L Account: (#-#-#) Secret Type of Account"
"Process Date: MM/DD/YYYY"
"Export Date: MM/DD/YYYY"
"Unit Name ","Description","Pay. Type ","Amount","Tran. Date "
"last, first","some note (dates with commas like 17 Aug, 2018 could be here)","Credit Card ","$AMNT.CHANGE","Date and Timestamp"
"Total","","","$AMNT.CHANGE","
"""

reader = csv.reader(data.split("\n"), delimiter=',', quotechar='"')
for row in reader:
    print(', '.join(row))

but it is also "confused" by the last, incomplete element:

Client: Secret Client
G/L Account: (#-#-#) Secret Type of Account
Process Date: MM/DD/YYYY
Export Date: MM/DD/YYYY
Unit Name , Description, Pay. Type , Amount, Tran. Date 
last, first, some note (dates with commas like 17 Aug, 2018 could be here), Credit Card , $AMNT.CHANGE, Date and Timestamp
Total, , , $AMNT.CHANGE,
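With thousands of files it may help to find out which ones are affected before cleaning anything. Here is a minimal sketch, assuming the csv module's strict mode is acceptable for your data. The helper name looks_malformed is hypothetical, and it flags any csv.Error, not only the unterminated quote shown above:

```python
import csv

def looks_malformed(lines):
    """Return True if the csv module rejects the input in strict mode."""
    try:
        # strict=True makes the reader raise csv.Error on bad input
        # (for example an opening quote that is never closed)
        # instead of silently producing an empty trailing field.
        for _ in csv.reader(lines, strict=True):
            pass
    except csv.Error:
        return True
    return False

print(looks_malformed(['"a","b"', '"Total","$1.00","']))  # dangling quote
print(looks_malformed(['"a","b"', '"Total","$1.00"']))    # well formed
```

The first call should report True and the second False, so the check could be used to log the offending file names.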

But you can remove the offending characters from the data, for example with rfind and slicing:

pos = data.rfind(',"', -5)  # look only at the last 5 characters
if pos != -1:
    data = data.strip()[:pos]  # keep everything before the dangling ,"
print(data[-15:])

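This should print ,"$AMNT.CHANGE". It searches the last 5 characters of the string for ,". If the pair is found, rfind returns its position, which is then used to remove those characters (or rather, to build a new string without them).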

strip() just removes any newline (introduced here by embedding the data in a """ string literal).

Alternatively, if the problem is always those two extra characters, you can slice them off with a negative index, e.g. data[:-2]
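Blind slicing would also cut two characters off a file that happens to be well formed, so a guarded version may be safer. This is only a sketch, and the helper name strip_dangling_quote is made up for illustration:

```python
def strip_dangling_quote(text):
    """Drop a trailing ," (an opening quote that is never closed), if present."""
    stripped = text.rstrip()          # ignore trailing newlines and spaces
    if stripped.endswith(',"'):
        return stripped[:-2]          # same effect as data[:-2], but only when needed
    return stripped

print(strip_dangling_quote('"Total","","","$1.00","\n'))
print(strip_dangling_quote('"Total","","","$1.00"\n'))
```

Both calls should give the same line without the stray characters, because the second input has nothing to remove.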

A regular expression is not needed, but

import re
# count is passed by keyword; passing it positionally is deprecated since Python 3.13
data = re.sub(r',"?$', "", data, count=1)

would do it too, and it also works when there is only a trailing , left. You can play with this on regex101.com, which explains what the expression does.
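To see what the expression does without leaving Python, here is a small sketch that applies the same pattern to three possible endings. The sample strings and the helper name drop_trailing_junk are invented for illustration:

```python
import re

# A comma, an optional double quote, then the end of the string
TRAILING_JUNK = re.compile(r',"?$')

def drop_trailing_junk(text):
    """Remove a trailing , or ," from text; leave anything else alone."""
    return TRAILING_JUNK.sub("", text, count=1)

samples = [
    '"Total","$1.00","',   # comma plus dangling quote
    '"Total","$1.00",',    # comma only
    '"Total","$1.00"',     # already clean
]
for sample in samples:
    print(repr(drop_trailing_junk(sample)))
```

All three should come out as the same clean line. One caveat: a file whose last line legitimately ends in an unquoted empty field (a bare trailing comma) would lose that field too, which may or may not matter for your data.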

After that, pandas should parse the data without any problems.
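Since the real data lives in files rather than in a string literal, the cleanup has to happen between reading the file and parsing it. A rough sketch, assuming the files are UTF-8 text and small enough to read into memory (cleaned_buffer is a hypothetical helper, not part of any library):

```python
import io
import re

def cleaned_buffer(path, encoding="utf-8"):
    """Read a file and return a StringIO with a trailing , or ," removed."""
    with open(path, encoding=encoding) as handle:
        text = handle.read()
    # rstrip() first, so that $ sits right after the dangling characters
    text = re.sub(r',"?$', "", text.rstrip(), count=1)
    return io.StringIO(text)
```

pd.read_csv accepts file-like objects, so something like pd.read_csv(cleaned_buffer(path), skiprows=4) should then work. The skiprows=4 is my assumption based on the four metadata lines in the sample; also note that the Total row ends up with one field fewer than the header.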

Reading a CSV with a trailing comma on the last line
