I have a .csv file in which some column values contain commas. Here is a sample:
Header: ID Value Content Date
1 34 "market, business" 12/20/2013
2 15 "market, business", yesterday, metric 11/21/2014
3 18 "market," business and yesterday 10/20/2014
4 19 yesterday, today, 11/22/2014
That is the layout of the .csv file. If I open it in Sublime Text, it looks like this:
1, 34, "market, business", 12/20/2013
2, 15, "market, business", "yesterday, metric, 11/21/2014
3, 18, "market," business and yesterday, 10/20/2014
4, 19, yesterday, today, 11/22/2014
But what I want to get after running it through the Python csv reader is:
[1, 34, "market, business", 12/20/2013]
[2, 15, "market, business" "yesterday metric, 11/21/2014]
[3, 18, "market," business and yesterday, 10/20/2014]
[4, 19, yesterday today, 11/22/2014]
This is just sample data from what I have. The "Content" column is the headache, because the csv module uses "," as the delimiter. I am using:
reader = csv.reader(f, skipinitialspace=True)
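This works for the first row, where the whole string sits inside one pair of double quotes. It does not work for the second and third rows, where there are commas outside the quotes (single or double).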
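To make the symptom concrete, here is a minimal sketch that feeds two of the sample rows to `csv.reader` with the same `skipinitialspace=True` option. The `lines` list is a stand-in for the real file and is only an assumption for illustration. The fully quoted row comes back as four fields, while the row with unquoted commas in Content is split into five.

```python
import csv

# Two rows copied from the sample; a list of strings stands in for the file.
lines = [
    '1, 34, "market, business", 12/20/2013',
    '4, 19, yesterday, today, 11/22/2014',
]

rows = list(csv.reader(lines, skipinitialspace=True))
for row in rows:
    # Print the field count next to the parsed fields.
    print(len(row), row)
```

The reader has no way to know that the comma between "yesterday" and "today" belongs to the Content field, because nothing in the row marks it as different from a delimiter.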
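How can I solve this? Right now I am only using the standard csv module in Python. Is pandas able to handle this?
Thanks.
Update: I think what I want is a way to tell apart commas that appear in different positions...... What I pasted here may look unreasonable now, because I cannot find anything in the csv module that distinguishes the delimiter "," from a "," inside a field. Even Excel can't......
Any ideas?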
Answer 0 (score: 1)
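If we can assume that only the Content column contains commas (that is, the ID, Value, and Date fields never do),
then your data can be parsed like this: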
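data = []
with open('data') as f:
    for line in f:
        # Split off the first two fields (ID, Value); the rest stays in parts[2].
        parts = line.split(',', 2)
        # Split the date off the right; everything in between is Content.
        parts[2:4] = parts[2].rsplit(',', 1)
        parts[:2] = map(int, parts[:2])        # ID and Value as integers
        parts[2:] = map(str.strip, parts[2:])  # trim spaces and the newline
        data.append(parts)

for row in data:
    print(row)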
which yields
[1, 34, '"market, business"', '12/20/2013']
[2, 15, '"market, business", "yesterday, metric', '11/21/2014']
[3, 18, '"market," business and yesterday', '10/20/2014']
[4, 19, 'yesterday, today', '11/22/2014']
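The same left-split/right-split idea can be wrapped in a small helper. The sketch below is only an illustration: the function name `split_row` and its parameters `n_left` and `n_right` are hypothetical, and it still assumes that the first `n_left` and last `n_right` fields never contain commas. It also raises an error on rows with too few commas instead of failing later.

```python
def split_row(line, n_left=2, n_right=1):
    """Split a line on commas, keeping commas inside the single middle field.

    Assumes the first n_left and the last n_right fields contain no commas.
    """
    # Peel n_left fields off the left; the remainder is the last element.
    head = line.rstrip('\n').split(',', n_left)
    if len(head) <= n_left:
        raise ValueError('too few commas in line: %r' % line)
    # Peel n_right fields off the right of the remainder.
    tail = head[-1].rsplit(',', n_right)
    if len(tail) <= n_right:
        raise ValueError('too few commas in line: %r' % line)
    # Whatever is left in the middle is the free-form field.
    return [field.strip() for field in head[:-1] + tail]


row = split_row('4, 19, yesterday, today, 11/22/2014')
print(row)
```

Unlike the loop above, this helper returns every field as a string, so the ID and Value columns would still need converting with `int` if numbers are wanted.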
Then you can build a DataFrame like this:
import pandas as pd
df = pd.DataFrame(data, columns=['Id','Value','Content','Date'])
print(df)
which yields
   Id  Value                                 Content        Date
0   1     34                      "market, business"  12/20/2013
1   2     15  "market, business", "yesterday, metric  11/21/2014
2   3     18        "market," business and yesterday  10/20/2014
3   4     19                        yesterday, today  11/22/2014
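As a variation on the parsing step, under the same assumption that only Content may contain commas, a regular expression can do the split and skip lines that do not fit the pattern (such as a header row). This is a sketch, not part of the approach above: the names `ROW_RE` and `parse_line` are made up for illustration, and the pattern additionally assumes that ID and Value are plain non-negative integers.

```python
import re

# Two integer fields, then free-form content, then a last field without commas.
ROW_RE = re.compile(r'^\s*(\d+)\s*,\s*(\d+)\s*,\s*(.*?)\s*,\s*([^,]*?)\s*$')


def parse_line(line):
    """Return [id, value, content, date], or None if the line does not match."""
    match = ROW_RE.match(line)
    if match is None:
        return None
    id_, value, content, date = match.groups()
    return [int(id_), int(value), content, date]


sample = [
    'ID, Value, Content, Date',  # header: no leading digits, so it is skipped
    '3, 18, "market," business and yesterday, 10/20/2014',
    '4, 19, yesterday, today, 11/22/2014',
]
parsed = [p for p in map(parse_line, sample) if p is not None]
for row in parsed:
    print(row)
```

The resulting list of rows has the same shape as `data` above, so it can be passed to `pd.DataFrame` with the same `columns` argument.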