I want to use pandas to read in a text file like this:
Flag Rootname "Target Name" RA DEC PropID "PI Name" Detector Segment LP Grating Cenwave FPPOS Exptime Nevents "Mean Flux" "Median Flux" Date "Target Description"
1 lcjw02hwq RXJ2043.1+0324 310.7761535644531 3.4143054485321045 13840 Fox FUV BOTH 2 G130M 1291 2 2935.199951171875 1472553.0 4.113247049008454e-15 3.6400732688485204e-15 2014-10-23 "ISM;ABSORPTION LINE SYSTEM;HIGH VELOCITY CLOUD"
1 lcjw02ikq RXJ2043.1+0324 310.7761535644531 3.4143054485321045 13840 Fox FUV BOTH 2 G130M 1291 4 1375.199951171875 769373.0 4.134839189383387e-15 3.562496062308341e-15 2014-10-23 "ISM;ABSORPTION LINE SYSTEM;HIGH VELOCITY CLOUD"
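Each new term or value is separated by a space, but the problem is that some of them are wrapped in quotes and contain spaces themselves. For example, "Target Name" should be the name of a single column, and "ISM;ABSORPTION LINE SYSTEM;HIGH VELOCITY CLOUD" is a single value (although I don't need the quotes to be stored with them).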
I tried the following code, passing quotechar as an argument, but it doesn't work.
df = pd.read_csv(path, sep='\s', quotechar='"')
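I also saw this related question, but the suggestion there to add skipinitialspace=True as an argument didn't help either. When I call df.head(), I can still see that it is splitting "Target Name" and similar headers into two separate column names. Is there a way to fix this?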
Answer 0 (score: 1)
Try changing sep='\s' to sep=' ':
import pandas as pd

# A literal single-space separator lets the parser honor quotechar.
df = pd.read_csv('<your file>', sep=' ', quotechar='"')
print(df)
Prints:
   Flag   Rootname  ...        Date                              Target Description
0     1  lcjw02hwq  ...  2014-10-23  ISM;ABSORPTION LINE SYSTEM;HIGH VELOCITY CLOUD
1     1  lcjw02ikq  ...  2014-10-23  ISM;ABSORPTION LINE SYSTEM;HIGH VELOCITY CLOUD

[2 rows x 19 columns]
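As for why the change helps: per the pandas documentation, a separator longer than one character (other than '\s+') is interpreted as a regular expression and forces the Python parsing engine, and regex delimiters are prone to ignoring quoted data. A self-contained way to check the behavior without a file on disk is sketched below; the in-memory `sample` string and the use of `io.StringIO` are stand-ins for the real file and are assumptions made for illustration.

```python
import io

import pandas as pd

# In-memory copy of the sample from the question (stands in for the real file).
sample = '''Flag Rootname "Target Name" RA DEC PropID "PI Name" Detector Segment LP Grating Cenwave FPPOS Exptime Nevents "Mean Flux" "Median Flux" Date "Target Description"
1 lcjw02hwq RXJ2043.1+0324 310.7761535644531 3.4143054485321045 13840 Fox FUV BOTH 2 G130M 1291 2 2935.199951171875 1472553.0 4.113247049008454e-15 3.6400732688485204e-15 2014-10-23 "ISM;ABSORPTION LINE SYSTEM;HIGH VELOCITY CLOUD"
1 lcjw02ikq RXJ2043.1+0324 310.7761535644531 3.4143054485321045 13840 Fox FUV BOTH 2 G130M 1291 4 1375.199951171875 769373.0 4.134839189383387e-15 3.562496062308341e-15 2014-10-23 "ISM;ABSORPTION LINE SYSTEM;HIGH VELOCITY CLOUD"
'''

# A literal single space as the separator, with double quotes as the quote character.
df = pd.read_csv(io.StringIO(sample), sep=' ', quotechar='"')

# Quoted headers and values stay in one piece, and the quotes are not stored.
print(df.columns.tolist())
print(df.loc[0, "Target Description"])
```

Note that a single-space separator assumes fields are separated by exactly one space, as in the sample; runs of several spaces would yield empty fields.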
df.to_csv() then produces the following (screenshot from LibreOffice):