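I want to read a series of tab-delimited files in a Python script. For some reason, when I import the files, all of my text columns come back as NaN.
Sample of the input file: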
Blah Blah
Blah Blah
Blah Blah
Blah Blah
Blah Blah
Blah Blah
Blah Blah
Period: Oct 28 2013 - Apr 27 2014
Note:
Brand Variant Industry Major Category Market Media Type Parent Company Product Category Report Period (multiple) PCC Sub Group Subsidiary Units $$$ (000)
3 LADIES HAND-DIPPED CANDIES CANDY CONFECT., SNACKS & SOFT DRINKS CONFECTIONERY & SNACKS Columbus Combo Local Newspaper COTTAGE FOOD PRODUCTION OPERATION CANDY 11/18/13 - 11/24/13 F211 CANDY & GUM COTTAGE FOOD PRODUCTION OPERATION 1 0.286
3 MUSKETEERS CANDY BAR CONFECT., SNACKS & SOFT DRINKS CONFECTIONERY & SNACKS Atlanta Combo Spot Radio MARS INC CANDY BAR 11/04/13 - 11/10/13 F211 CANDY & GUM MARS SNACKFOOD US LLC 22 1.403
Here is my Python snippet (3.3):
df = read_csv(csvFile, delimiter='\t', header=[9])
print(df)
It outputs the following:
Brand Variant \
3 LADIES HAND-DIPPED CANDIES CANDY NaN
3 MUSKETEERS CANDY BAR NaN
Industry \
3 LADIES HAND-DIPPED CANDIES CANDY NaN
3 MUSKETEERS CANDY BAR NaN
Major Category \
3 LADIES HAND-DIPPED CANDIES CANDY NaN
3 MUSKETEERS CANDY BAR NaN
Market \
3 LADIES HAND-DIPPED CANDIES CANDY NaN
3 MUSKETEERS CANDY BAR NaN
Media Type \
3 LADIES HAND-DIPPED CANDIES CANDY NaN
3 MUSKETEERS CANDY BAR NaN
Parent Company \
3 LADIES HAND-DIPPED CANDIES CANDY NaN
3 MUSKETEERS CANDY BAR NaN
Product Category \
3 LADIES HAND-DIPPED CANDIES CANDY NaN
3 MUSKETEERS CANDY BAR NaN
Report Period (multiple) \
3 LADIES HAND-DIPPED CANDIES CANDY NaN
3 MUSKETEERS CANDY BAR NaN
PCC Sub Group \
3 LADIES HAND-DIPPED CANDIES CANDY NaN
3 MUSKETEERS CANDY BAR NaN
Subsidiary \
3 LADIES HAND-DIPPED CANDIES CANDY NaN
3 MUSKETEERS CANDY BAR NaN
Units $$$ (000)
3 LADIES HAND-DIPPED CANDIES CANDY NaN NaN
3 MUSKETEERS CANDY BAR NaN NaN
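I noticed that my first column seems to be set as the DataFrame's index, but index_col=False only produces a ValueError because it expects a column number. I also tried setting dtype to str, with no luck. Finally, on another file that is comma-separated, I was able to get rows back with the text data intact. I'm at a loss as to what to do...
One thing I did notice is that the separator between fields looks more like a tab followed by a space.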
Answer 0 (score: 1)
If you want to ignore the first few "Blah Blah" lines, use skiprows= instead of header=. Try this:
import pandas as pd
df = pd.read_csv(csvFile, sep='\t', skiprows=9, index_col=False)
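As a minimal sketch of the skiprows approach, the snippet below builds a small in-memory stand-in for the report and reads it with the same arguments. The reduced set of column names and the values are illustrative assumptions, not the asker's real file, and the fields here are joined with real tabs.

```python
import io

import pandas as pd

# Hypothetical stand-in for the report: nine preamble lines, then a
# tab-separated header row and two data rows.
preamble = ["Blah Blah"] * 7 + ["Period: Oct 28 2013 - Apr 27 2014", "Note:"]
rows = [
    "Brand Variant\tMarket\tUnits",
    "3 MUSKETEERS CANDY BAR\tAtlanta Combo\t22",
    "3 LADIES HAND-DIPPED CANDIES CANDY\tColumbus Combo\t1",
]
sample = "\n".join(preamble + rows) + "\n"

# skiprows=9 discards the preamble, so the next line becomes the header.
df = pd.read_csv(io.StringIO(sample), sep="\t", skiprows=9, index_col=False)
print(df)
```

If the preamble length varies from file to file, the value passed to skiprows would need to be determined per file rather than hard-coded.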
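I suspect the reason that "the first column seems to be set as the DataFrame's index" is that your file has trailing delimiters. If that is the case, index_col=False should help. See "Handling of trailing delimiters" in the read_csv documentation.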
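To illustrate the trailing-delimiter effect, here is a small sketch on made-up data (the column names and values are assumptions for demonstration only). Each data row ends with an extra tab, so it has one more field than the header.

```python
import io

import pandas as pd

# Hypothetical data: the header has three fields, but every data row
# ends with an extra tab, so each row has four.
sample = (
    "Brand\tUnits\tDollars\n"
    "MARS\t22\t1.403\t\n"
    "LADIES\t1\t0.286\t\n"
)

# Default: pandas sees one more field per row than header names and
# uses the first field as the index, shifting the values one column left.
shifted = pd.read_csv(io.StringIO(sample), sep="\t")

# index_col=False: no column is used as the index and the empty
# trailing field is dropped, so values line up with their headers.
aligned = pd.read_csv(io.StringIO(sample), sep="\t", index_col=False)

print(shifted)
print(aligned)
```

In the default read, the brand names end up in the index and the last column is entirely NaN, which resembles the symptom described in the question.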
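Since I don't have your input file, and your copy-pasted text has evidently lost its tabs (every separator in the text is now a space), I can't test this. But please let us know how it goes.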