我想阅读一个带逗号数字的* .csv文件。
例如,
FILE.CSV
Date, Time, Open, High, Low, Close, Volume
2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201 # The last value is 1201, not 201
2016/11/09,12:09:00,'4361,'4362,'4353,'4355,1,117 # The last value is 1117, not 117
2016/11/09,12:08:00,'4364,'4374,'4359,'4360,10,175 # The last value is 10175, not 175
2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590
2016/11/09,12:06:00,'4359,'4372,'4358,'4369,420
2016/11/09,12:05:00,'4365,'4367,'4356,'4359,542
2016/11/09,12:04:00,'4379,'1380,'4360,'4365,1,697 # The last value is 1697, not 697
2016/11/09,12:03:00,'4394,'4396,'4376,'4381,1,272 # The last value is 1272, not 272
2016/11/09,12:02:00,'4391,'4399,'4390,'4393,524
...
2014/07/10,12:05:00,'10195,'10300,'10155,'10290,219,271 # The last value is 219271, not 271
2014/07/09,12:04:00,'10345,'10360,'10185,'10194,235,711 # The last value is 235711, not 711
2014/07/08,12:03:00,'10339,'10420,'10301,'10348,232,050 # The last value is 242050, not 050
它实际上有7列,但最后一列的值有时会有逗号,pandas会将它们作为额外的列。
我的问题是,如果有任何方法我可以让pandas只占用前6个逗号,并在读取列时忽略其余的逗号,或者是否有任何方法可以在第6个逗号后删除逗号(I'对不起,但我想不出有任何功能。)
感谢您阅读此内容:)
答案 0 :(得分:1)
解决问题的另一种方法。
import re
import pandas as pd
l1 =[]
with open('/home/yusuf/Desktop/c1') as f:
headers = map(lambda x: x.strip(), f.readline().strip('\n').split(','))
for a in f.readlines():
b = re.findall("(.*?),(.*?),'(.*?),'(.*?),'(.*?),'(.*?),(.*)",a)
l1.append(list(b[0]))
df = pd.DataFrame(data=l1, columns=headers)
df['Volume'] = df['Volume'].apply(lambda x: x.replace(",",""))
df
输出:
正则表达式演示:
https://regex101.com/r/o1zxtO/2
答案 1 :(得分:1)
我很确定pandas
无法处理,但您可以轻松修复最终的列。 Python中的一种方法
with open('yourfile.csv') as csv, open('newcsv.csv','w') as result:
for line in csv:
columns = line.split(',')
if len(columns) > COLUMNAMOUNT:
columns[COLUMNAMOUNT-1] += ''.join(columns[COLUMNAMOUNT:])
result.write(','.join(columns[COLUMNAMOUNT-1]))
现在您可以将新的csv加载到pandas中。其他解决方案可以是AWK甚至shell脚本。
答案 2 :(得分:1)
您可以在Python中完成所有操作,而无需将数据保存到新文件中。我们的想法是清理数据并为pandas添加类似字典的格式以抓取它并将其转换为数据帧。以下应该是一个不错的起点:
from collections import defaultdict
from collections import OrderedDict
import pandas as pd
# Import the data
data = open('prices.csv').readlines()
# Split on the first 6 commas
data = [x.strip().replace("'","").split(",",6) for x in data]
# Get the headers
headers = [x.strip() for x in data[0]]
# Get the remaining of the data
remainings = [list(map(lambda y: y.replace(",",""), x)) for x in data[1:]]
# Create a dictionary-like container
output = defaultdict(list)
# Loop through the data and save the rows accordingly
for n, header in enumerate(headers):
for row in remainings:
output[header].append(row[n])
# Save it in an ordered dictionary to maintain the order of columns
output = OrderedDict((k,output.get(k)) for k in headers)
# Convert your raw data into a pandas dataframe
df = pd.DataFrame(output)
# Print it
print(df)
这会产生:
Date Time Open High Low Close Volume
0 2016/11/09 12:10:00 4355 4358 4346 4351 1201
1 2016/11/09 12:09:00 4361 4362 4353 4355 1117
2 2016/11/09 12:08:00 4364 4374 4359 4360 10175
3 2016/11/09 12:07:00 4371 4376 4360 4365 590
4 2016/11/09 12:06:00 4359 4372 4358 4369 420
5 2016/11/09 12:05:00 4365 4367 4356 4359 542
6 2016/11/09 12:04:00 4379 1380 4360 4365 1697
7 2016/11/09 12:03:00 4394 4396 4376 4381 1272
8 2016/11/09 12:02:00 4391 4399 4390 4393 524
起始文件(prices.csv
)如下:
Date, Time, Open, High, Low, Close, Volume
2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201
2016/11/09,12:09:00,'4361,'4362,'4353,'4355,1,117
2016/11/09,12:08:00,'4364,'4374,'4359,'4360,10,175
2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590
2016/11/09,12:06:00,'4359,'4372,'4358,'4369,420
2016/11/09,12:05:00,'4365,'4367,'4356,'4359,542
2016/11/09,12:04:00,'4379,'1380,'4360,'4365,1,697
2016/11/09,12:03:00,'4394,'4396,'4376,'4381,1,272
2016/11/09,12:02:00,'4391,'4399,'4390,'4393,524
我希望这会有所帮助。
答案 3 :(得分:0)
我猜大熊猫无法处理它,所以我会用Perl进行预处理以生成新的cvs并对其进行处理。
在这种情况下使用Perl split可以帮助您
perl -pne '$_ = join("|", split(/,/, $_, 7) )' < input.csv > output.csv
然后你可以在输出文件中使用通常的read_cvs,并将分隔符作为|