I have cleaned a .csv file that has four columns; all of the data sits in the first column:
"Plot Title: 10862077 ",,,
"# ""Date Time"," GMT-04:00"" ""Temp", �C (LGR S/N: 10862077," SEN S/N: 10862077)"" Coupler Detached (LGR S/N: 10862077) Coupler Attached (LGR S/N: 10862077) Host Connected (LGR S/N: 10862077) Stopped (LGR S/N: 10862077) End Of File (LGR S/N: 10862077)"
"1 9/8/2016 15:47 23.256 ",,,
"2 9/8/2016 15:47 Logged ",,,
"3 9/8/2016 15:52 Logged Logged ",,,
"4 9/8/2016 15:53 Logged ",,,
"5 9/8/2016 16:02 22.681 ",,,
Above is the original file; below is how I am outputting the data, as a text file, delimited by '\n':
('#\t"Date Time',)
('1\t9/8/2016 15:47\t23.256\t\t\t\t\t',)
('2\t9/8/2016 15:47\t\tLogged\t\t\t\t',)
('3\t9/8/2016 15:52\t\t\tLogged\tLogged\t\t',)
('4\t9/8/2016 15:53\t\tLogged\t\t\t\t',)
('5\t9/8/2016 16:02\t22.681\t\t\t\t\t',)
The desired output would look like this, in .csv form:
(Date, Time, Temperature)
(9/8/2016, 15:47, 23.256)
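Background
I am new to Python (started learning in June 2017) and I am helping a friend clean up this data for research. The data comes from a temperature sensor somewhere in the ocean. I would really appreciate some help getting to the finish line.
I have searched for approaches, although my lack of exposure to and experience with Python clearly shows in this project.
My initial approach to getting the desired output was to write an if statement that replaced a predefined \t or \t\t with (,) and removed the multiple \t and "Logged" entries. I have since removed those attempts from my code and moved on to built-in functions (.replace, .rstrip, and .split) as a solution, to no avail.
My code
Disclaimer: I plan to tidy up (make it more pythonic) once I am out of the testing phase. Below is what I have written so far; commented-out code is either a failed attempt or a note to myself: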
import pandas as pd

# Open data file and name it:
    # Read file with PANDAS csv reader
    # Make data into a DataFrame with PANDAS
    # Close file (handled by the with block)
# Open file to write and name it:
    # Iterate rows into tuples (for performance per docs), remove added name/index
    # Strip out trailing, empty columns after C:1
    # Write to new text file with '\n'
    # Close file (handled by the with block)

with open('BAD_data.csv', 'r') as csvfile:
    reader = pd.read_csv(csvfile)
    data_frm = pd.DataFrame(reader)

with open('improved_data.txt', 'w') as imp_writeDat:
    for row in data_frm.itertuples(index=False, name=None):
        clean_row = str(row[:1])
        imp_writeDat.write(clean_row + '\n')

with open('improved_data.txt', 'r') as imp_readDat:
    data2 = imp_readDat.read()
    print(data2.rstrip('\t'))
    # print data3.replace('\t\t\t\t\t', '')
    # print imp_readDat.replace(' ', ',')
    # print imp_readDat.replace('\t\t\tLogged\t\t\t', '')
    # print imp_readDat.replace('\t\tLogged\t\t\t\t', '')
    # print imp_readDat.replace('\t\t\tLogged\t\t\t', '')
    # print imp_readDat.replace('\t\t\t\tLogged\t\t', '')
    # print imp_readDat.replace('\t\t\t\t\tLogged\tLogged', '')
The commented-out code above made no difference in the output.
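A likely explanation, offered as a minimal sketch rather than a confirmed diagnosis: `str(row[:1])` writes the repr of a one-element tuple, and a repr escapes each tab as the two characters backslash and `t`. The text file therefore contains no real tab characters for `.rstrip('\t')` or `.replace('\t...', '')` to match. The sample row is copied from the output shown in the question; the variable names are illustrative only.

```python
# One row as pandas would hand it over: a 1-tuple holding a tab-separated string.
row = ('1\t9/8/2016 15:47\t23.256\t\t\t\t\t',)

# str() of a tuple uses repr() for its items, so tabs become backslash + 't'.
as_text = str(row[:1])
has_real_tab = '\t' in as_text        # is there an actual tab character?
has_backslash_t = '\\t' in as_text    # is there a literal backslash followed by 't'?

# Working on the string itself (row[0]) keeps the real tabs, so rstrip can act.
cell = row[0]
cleaned = cell.rstrip('\t')

print(as_text)
print(has_real_tab, has_backslash_t)
print(repr(cleaned))
```

Under that assumption, writing `row[0]` instead of `str(row[:1])` would put real tabs in the file, and the string methods would then have something to match.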
Answer 0: (score: 1)
Use:
import numpy as np
import pandas as pd

df = pd.read_csv('BAD_data.csv',
                 encoding='ISO-8859-1',  # omit if not necessary
                 sep='[\t,]',            # regex separator: split on a tab or a comma
                 header=[0, 1],          # read the first 2 rows into a MultiIndex
                 engine='python',        # needed for a regex separator
                 dtype=str)              # read all values as strings so floats are not altered

# remove " in the first column
df.iloc[:, 0] = df.iloc[:, 0].str.strip('"')
# replace 'nan' strings with NaN
df = df.replace('nan', np.nan)
# remove " and whitespace in the column names
a = df.columns.get_level_values(0).str.strip('" ')
a = np.where(a.str.startswith('Unnamed'), np.nan, a)
b = df.columns.get_level_values(1).str.strip('" ')
df.columns = [a, b]
# print(df.head())
# write to csv
df.to_csv('Good_data.csv')
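To get from there to the (Date, Time, Temperature) layout asked for in the question, one option is to split each record by hand. The following is a rough sketch, not part of the original answer. It assumes the first column is tab-separated exactly as in the question's output, with the timestamp in the second field and the reading in the third. The helper name `parse_rows` and the in-memory sample are hypothetical.

```python
import csv
import io


def parse_rows(lines):
    """Yield (date, time, temperature) for lines that carry a numeric reading."""
    for line in lines:
        # Drop surrounding whitespace (including trailing tabs) and stray quotes.
        fields = line.strip().strip('"').split('\t')
        if len(fields) < 3:
            continue  # header fragments have no reading field
        date_time, temp = fields[1].strip(), fields[2].strip()
        try:
            float(temp)
        except ValueError:
            continue  # "Logged" event rows leave the reading field empty
        date, _, time = date_time.partition(' ')
        yield date, time, temp


# Sample lines copied from the question's output.
sample = [
    '#\t"Date Time',
    '1\t9/8/2016 15:47\t23.256\t\t\t\t\t',
    '2\t9/8/2016 15:47\t\tLogged\t\t\t\t',
    '5\t9/8/2016 16:02\t22.681\t\t\t\t\t',
]

rows = list(parse_rows(sample))

# Write to an in-memory buffer here; a real file opened with newline='' works the same way.
out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')
writer.writerow(['Date', 'Time', 'Temperature'])
writer.writerows(rows)
csv_text = out.getvalue()
print(csv_text)
```

The temperature is kept as a string so that the value is written back unchanged; `float()` is used only to decide whether a row holds a reading.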