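The error is caused by 'Food, Beverage & Tobacco': the value contains an extra comma, so pandas cannot read the CSV file. It raises the error
Error tokenizing data. C error: Expected 3 fields in line 29, saw 4
How can I elegantly remove the extra comma in the CSV file for the GICS industry group 'Food, Beverage & Tobacco' (for instance, with a condition on whether the text next to the comma is 'Food')?
Here is my code: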
#!/usr/bin/env python2.7
print "hello from python 2"
import pandas as pd
from lxml import html
import requests
import urllib2
import os

url = 'http://www.asx.com.au/asx/research/ASXListedCompanies.csv'
response = urllib2.urlopen(url)
html = response.read()
#html = html.replace('"','')
with open('asxtest.csv', 'wb') as f:
    f.write(html)
with open("asxtest.csv",'r') as f:
    with open("asx.csv",'w') as f1:
        f.next()#skip header line
        f.next()#skip 2nd line
        for line in f:
            if line.count(',')>2:
                line[2] = 'Food Beverage & Tobacco'
            f1.write(line)
os.remove('asxtest.csv')
df_api = pd.read_csv('asx.csv')
df_api.rename(columns={'Company name': 'Company', 'ASX code': 'Stock','GICS industry group': 'Industry'}, inplace=True)
Answer 0 (score: 2)
The file at the URL in the post contains extra commas in some entries of the GICS industry group column. The first one occurs on line 31 of the file:
ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco
Normally, the third item should be enclosed in quotes to escape the comma, for example:
ABUNDANT PRODUCE LIMITED,ABT,"Food, Beverage & Tobacco"
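To illustrate why the quotes matter, here is a minimal sketch that parses both variants of the sample line with Python's standard csv module (used here instead of pandas only to keep the example small). The variable names are illustrative, not part of the original post.

```python
import csv
import io

# The same record, unquoted (as in the downloaded file) and properly quoted.
unquoted = 'ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco\n'
quoted = 'ABUNDANT PRODUCE LIMITED,ABT,"Food, Beverage & Tobacco"\n'

# csv.reader splits on every comma that is not inside double quotes.
unquoted_fields = next(csv.reader(io.StringIO(unquoted)))
quoted_fields = next(csv.reader(io.StringIO(quoted)))

print(len(unquoted_fields))  # 4 fields: the extra comma splits the last column
print(len(quoted_fields))    # 3 fields: the quotes protect the comma
print(quoted_fields[2])      # Food, Beverage & Tobacco
```

The unquoted line yields four fields, which matches the "Expected 3 fields ... saw 4" message in the question.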
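In this case, because the first two columns appear to be clean, you can merge any additional text into the third field. Once the cleaning is done, load the result into a data frame.
You can do this with a generator that pulls in and cleans one line at a time. The pd.DataFrame constructor will read in the data and create the data frame.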
import pandas as pd

def merge_last(file_name, skip_lines=0):
    with open(file_name, 'r') as fp:
        for i, line in enumerate(fp):
            # skip the leading lines that come before the header row
            if i < skip_lines:
                continue
            # x and y are the first two fields; z collects everything after them
            x, y, *z = line.strip().split(',')
            yield (x, y, ','.join(z))

# create a generator to clean the lines, skipping the first 2
gen = merge_last('ASXListedCompanies.csv', 2)
# get the column names
header = next(gen)
# create the data frame
df = pd.DataFrame(gen, columns=header)
df.head()
Returns:
          Company name  ASX code                 GICS industry group
0          MOQ LIMITED       MOQ                 Software & Services
1       1-PAGE LIMITED       1PG                 Software & Services
2  1300 SMILES LIMITED       ONT    Health Care Equipment & Services
3    1ST GROUP LIMITED       1ST    Health Care Equipment & Services
4         333D LIMITED       T3D  Commercial & Professional Services
The rows with the extra commas are preserved:
df.loc[27:30]
# returns:
                           Company name  ASX code       GICS industry group
27             ABUNDANT PRODUCE LIMITED       ABT  Food, Beverage & Tobacco
28                  ACACIA COAL LIMITED       AJC                    Energy
29  ACADEMIES AUSTRALASIA GROUP LIMITED       AKG         Consumer Services
30         ACCELERATE RESOURCES LIMITED       AX8                Class Pend
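As a variation on the same idea, the merge into the third field can also be sketched with the `maxsplit` argument of `str.split`, which stops splitting after a given number of commas. The helper name `split_three` is hypothetical and only serves this illustration.

```python
def split_three(line):
    """Split a line into at most three fields, keeping extra commas in the last one."""
    # maxsplit=2 stops after the second comma, so any later commas stay in field 3
    return line.rstrip('\r\n').split(',', 2)

row = split_three('ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco\n')
print(row)
```

Note that a line with fewer than two commas would return fewer than three fields, so this sketch assumes every data line has at least three columns.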
Here is a more general generator that merges everything after a given number of columns:
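def merge_last(file_name, merge_after_col=2, skip_lines=0):
    with open(file_name, 'r') as fp:
        for i, line in enumerate(fp):
            # skip the leading lines that come before the header row
            if i < skip_lines:
                continue
            spl = line.strip().split(',')
            # keep the first columns as they are and re-join the rest into one field
            yield (*spl[:merge_after_col], ','.join(spl[merge_after_col:]))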
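If the goal is to keep the `pd.read_csv` call from the question, one possible approach is to write the cleaned rows back out with the standard csv module, which quotes any field that contains a comma. This is only a sketch: the in-memory sample stands in for the downloaded file, and the helper name `requote` is hypothetical.

```python
import csv
import io

def requote(lines, merge_after_col=2):
    """Yield rows whose trailing fields are merged into a single column."""
    for line in lines:
        spl = line.rstrip('\r\n').split(',')
        yield spl[:merge_after_col] + [','.join(spl[merge_after_col:])]

# A small in-memory sample; a real script would open the downloaded file instead.
raw = io.StringIO(
    'Company name,ASX code,GICS industry group\n'
    'ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco\n'
    'ACACIA COAL LIMITED,AJC,Energy\n'
)

cleaned = io.StringIO()
# csv.writer defaults to QUOTE_MINIMAL, so only fields containing a comma get quoted.
csv.writer(cleaned, lineterminator='\n').writerows(requote(raw))
print(cleaned.getvalue())
```

The resulting text has exactly three fields per line, so a standard CSV parser should be able to read it without the tokenizing error.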