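I have CSV files with no header row. Every row has two columns that are always present (the time and the data name); the number of remaining columns depends on the data type.
I have successfully imported "regular" CSV files, with a consistent number of columns, into pandas. That works very well, but I haven't found anything in the documentation that handles this situation.
Here is a snippet of the CSV file in question: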
1573081480.942000, /eeg, 843.3333, 854.61536, 851.79486, 849.3773, 863.0769
1573081480.942000, /eeg, 844.1392, 857.4359, 849.3773, 861.8681, 890.07324
1573081480.943000, /eeg, 853.8095, 853.8095, 850.989, 866.30035, 854.61536
1573081480.944000, /eeg, 855.42126, 855.0183, 846.1539, 852.1978, 846.9597
1573081480.947000, /eeg, 844.1392, 853.8095, 846.55676, 842.52747, 873.5531
1573081480.947000, /eeg, 848.97437, 853.00366, 851.79486, 853.00366, 856.2271
1573081480.948000, /eeg, 859.0476, 852.6007, 850.18317, 863.8828, 826.0073
1573081480.950000, /eeg, 859.0476, 851.79486, 853.00366, 866.30035, 819.5604
1573081480.950000, /eeg, 851.79486, 852.1978, 846.9597, 854.61536, 859.45056
1573081480.951000, /eeg, 856.63007, 853.00366, 846.55676, 840.9158, 854.21246
1573081480.960000, /elements/alpha_absolute, 0.48463312
1573081480.960000, /elements/beta_absolute, 0.061746284
1573081480.961000, /elements/gamma_absolute, 0.7263172
1573081480.961000, /elements/theta_absolute, 0.7263172
1573081480.961000, /elements/delta_absolute, 0.7263172
The result I need looks like this:
time, eeg_0, eeg_1, eeg_2, eeg_3, delta, theta, alpha, beta, gamma
1573081480.942000, 844.1392, 857.4359, 849.3773, 861.8681,,,,,
1573081480.947000, 844.1392, 853.8095, 846.55676, 842.52747, 873.5531,,,,,
1573081480.947000, 848.97437, 853.00366, 851.79486, 853.00366, 856.2271,,,,,
1573081480.948000, 859.0476, 852.6007, 850.18317, 863.8828, 826.0073,,,,,
1573081480.960000,,,,,,,0.48463312,,
1573081480.960000,,,,,,,,0.061746284,
1573081480.961000,,,,,0.7263172,,,,
1573081480.961000,,,,,0.52961296,,,
1573081480.962000,,,,,,,,-0.26484978
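As you can see, the number of values varies with the kind of data stored.
I would like the import to be as simple and efficient as it is for "regular" CSV files.
This is what I am hoping to avoid, because it is very verbose and inefficient: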
import csv
import pandas as pd

d = {
    'time': [0.],
    'eeg0': [0.], 'eeg1': [0.], 'eeg2': [0.], 'eeg3': [0.], 'eeg4': [0.],
    'delta_absolute': [0.], 'theta_absolute': [0], 'alpha_absolute': [0], 'beta_absolute': [0], 'gamma_absolute': [0],
    'acc0': [0], 'acc1': [0], 'acc2': [0], 'gyro0': [0], 'gyro1': [0], 'gyro2': [0],
    'concentration': [0], 'mellow': [0]
}
df_new_data = pd.DataFrame(data=d)
with open(fname, newline='') as csvfile:
    csv_data = list(csv.reader(csvfile, delimiter=','))
row_count = len(csv_data)
# Note: DataFrame.append was removed in pandas 2.0, so this loop needs an older pandas.
for row in csv_data:
    # The names keep the blank that follows each comma in the file.
    if row[1] == ' /muse/acc':
        df_new_data = df_new_data.append({'acc0': row[2], 'acc1': row[3], 'acc2': row[4]}, ignore_index=True)
    if row[1] == ' /muse/gyro':
        df_new_data = df_new_data.append({'gyro0': row[2], 'gyro1': row[3], 'gyro2': row[4]}, ignore_index=True)
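Edit:
I found that if the first line of the CSV file has fewer fields than a later line, read_csv() fails on that later line. The CSV sample above works, but this one does not: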
1573081480.960000, /elements/alpha_absolute, 0.48463312
1573081480.960000, /elements/beta_absolute, 0.061746284
1573081480.961000, /elements/gamma_absolute, 0.7263172
1573081480.961000, /elements/theta_absolute, 0.7263172
1573081480.961000, /elements/delta_absolute, 0.7263172
1573081480.942000, /eeg, 843.3333, 854.61536, 851.79486, 849.3773, 863.0769
1573081480.942000, /eeg, 844.1392, 857.4359, 849.3773, 861.8681, 890.07324
1573081480.943000, /eeg, 853.8095, 853.8095, 850.989, 866.30035, 854.61536
1573081480.944000, /eeg, 855.42126, 855.0183, 846.1539, 852.1978, 846.9597
1573081480.947000, /eeg, 844.1392, 853.8095, 846.55676, 842.52747, 873.5531
1573081480.947000, /eeg, 848.97437, 853.00366, 851.79486, 853.00366, 856.2271
1573081480.948000, /eeg, 859.0476, 852.6007, 850.18317, 863.8828, 826.0073
1573081480.950000, /eeg, 859.0476, 851.79486, 853.00366, 866.30035, 819.5604
1573081480.950000, /eeg, 851.79486, 852.1978, 846.9597, 854.61536, 859.45056
1573081480.951000, /eeg, 856.63007, 853.00366, 846.55676, 840.9158, 854.21246
pandas raises this error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 6, saw 7
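The failure can be sketched in a few lines. By default read_csv takes the column count from the first row; a commonly used workaround is to pass enough column names up front via `names`, so that shorter rows are padded with NaN. This is only a minimal sketch: the inline `SAMPLE`, the constant `MAX_FIELDS`, and the helper `read_ragged` are illustrative assumptions, not part of the original file or code.

```python
import io
import pandas as pd

# Illustrative sample: the first row is shorter than the second one.
SAMPLE = (
    "1573081480.960000, /elements/alpha_absolute, 0.48463312\n"
    "1573081480.942000, /eeg, 843.3333, 854.61536, 851.79486, 849.3773, 863.0769\n"
)

MAX_FIELDS = 7  # assumption: the widest row in the file has 7 fields


def read_ragged(text, max_fields=MAX_FIELDS):
    """Read ragged CSV text; rows with fewer fields are padded with NaN."""
    return pd.read_csv(
        io.StringIO(text),
        header=None,
        names=list(range(max_fields)),  # fixes the column count up front
        skipinitialspace=True,          # drops the blank after each comma
    )


df = read_ragged(SAMPLE)
```

With the column count fixed this way, the order of short and long rows in the file should no longer matter.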
Thanks!
Answer 0 (score: 0)
You can use Miller (https://github.com/johnkerl/miller) to normalize the CSV and produce one that imports without errors:
mlr --csv --implicit-csv-header unsparsify \
then rename 1,one,2,two \
then reshape -r "[0-9]" -o item,value \
then filter -x -S '$value==""' \
then put '$item=fmtnum(($item-2),"%03d");$item=$two."_".$item' \
then cut -x -f two then sort -f item -n one \
then reshape -s item,value \
then unsparsify input.csv >output.csv
You will get a CSV like this, which you can import:
one /eeg_001 /eeg_002 /eeg_003 /eeg_004 /eeg_005 /elements/alpha_absolute_001 /elements/beta_absolute_001 /elements/delta_absolute_001 /elements/gamma_absolute_001 /elements/theta_absolute_001
1573081480.942000 844.1392 857.4359 849.3773 861.8681 890.07324 - - - - -
1573081480.943000 853.8095 853.8095 850.989 866.30035 854.61536 - - - - -
1573081480.944000 855.42126 855.0183 846.1539 852.1978 846.9597 - - - - -
1573081480.947000 848.97437 853.00366 851.79486 853.00366 856.2271 - - - - -
1573081480.948000 859.0476 852.6007 850.18317 863.8828 826.0073 - - - - -
1573081480.950000 851.79486 852.1978 846.9597 854.61536 859.45056 - - - - -
1573081480.951000 856.63007 853.00366 846.55676 840.9158 854.21246 - - - - -
1573081480.960000 - - - - - 0.48463312 0.061746284 - - -
1573081480.961000 - - - - - - - 0.7263172 0.7263172 0.7263172
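Answer 1 (score: 0)
It is not entirely clear what you want. It is good that you provided sample output, but it would be much easier if it were the actual expected output for the input you posted.
As far as I can tell, the simplest approach is to loop over each data type, find how many columns it uses, build one frame per type, and finally concatenate them. Something like this: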
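import pandas

# Using pandas: read each line as a single column and count the separators
# (the number of fields is the number of commas plus one):
max_number_of_columns = pandas.read_csv('test.txt', sep='|', header=None)[0].str.count(',').max() + 1
# or just hardcoded:
max_number_of_columns = 10
# skipinitialspace drops the blank after each comma, so datatype reads '/eeg' rather than ' /eeg'
base = pandas.read_csv('test.txt', header=None, names=list(range(max_number_of_columns)), skipinitialspace=True)
base.columns = ['time', 'datatype'] + list(base.columns[2:])
results = [base.iloc[:, :2]]
for datatype in base['datatype'].unique():
    # Keep only the value columns this data type actually uses
    group = base[base['datatype'] == datatype].iloc[:, 2:].dropna(how='all', axis=1)
    group.columns = [f"{datatype}_{x}" for x in range(len(group.columns))]
    results.append(group)
final = pandas.concat(results, axis=1)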
Edit: fixed the case where the first line has fewer columns than later lines.
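Another possible route toward the wide layout the question asks for is to parse each row with the standard csv module into a dict and build the frame once at the end, rather than appending row by row. The sketch below is only an illustration: the helper name `rows_to_records` and the column naming scheme (`eeg_0`, `alpha_absolute`, and so on) are assumptions, not something taken from the question or from either answer.

```python
import csv
import io


def rows_to_records(lines):
    """Turn ragged 'time, name, v0, v1, ...' rows into one dict per row."""
    records = []
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) < 3:
            continue  # skip blank or incomplete rows
        time, name, values = float(row[0]), row[1], row[2:]
        # '/elements/alpha_absolute' -> 'alpha_absolute', '/eeg' -> 'eeg'
        label = name.rsplit("/", 1)[-1]
        record = {"time": time}
        if len(values) == 1:
            record[label] = float(values[0])
        else:
            # Multi-valued types get one numbered column per value.
            for i, value in enumerate(values):
                record[f"{label}_{i}"] = float(value)
        records.append(record)
    return records


# Illustrative input, shortened from the sample in the question.
sample = io.StringIO(
    "1573081480.960000, /elements/alpha_absolute, 0.48463312\n"
    "1573081480.942000, /eeg, 843.3333, 854.61536\n"
)
records = rows_to_records(sample)
```

Passing `records` to `pandas.DataFrame(records)` should then give one column per key, with NaN wherever a row has no value for that column.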