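I am reading a datetime column from a CSV file that contains randomly scattered blocks of non-datetime text (5 rows at a time, sometimes several blocks in a row). See the excerpt from the data file below: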
Date,Time,Count,Fault,Battery
12/22/2015,05:24.0,39615.0,0.0,6.42
12/22/2015,05:25.0,39616.0,0.0,6.42
12/22/2015,05:26.0,39617.0,0.0,6.42
12/22/2015,05:27.0,39618.0,0.0,6.42
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
12/22/2015,19:57.0,39619.0,0.0,6.42
12/22/2015,19:58.0,39620.0,0.0,6.42
12/22/2015,19:59.0,39621.0,0.0,6.42
12/22/2015,20:00.0,39622.0,0.0,6.42
12/22/2015,20:01.0,39623.0,0.0,6.42
12/22/2015,20:02.0,39624.0,0.0,6.42
I can read the data from the clipboard into a DataFrame like this:
df = pd.read_clipboard(sep=',')
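I am looking for a way to clean the non-date strings out of the 'Date' column before converting it to a datetime index. I have tried setting the column as the index and then filtering on the known text values, as follows: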
df.index=df['Date']
df = df[~df.index.get_loc('RMR')]
df = df[~df.index.get_loc('Default Site')]
df = df[~df.index.get_loc('X2CMBasicOpticsBurst')]
df = df[~df.index.get_loc('Sonde STSO3275')]
df = df.dropna()
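After that I can parse the date and time together and use the date parsing tools to get a proper datetime index. However, the content of the text fields may change, and this approach seems very limited and un-Pythonic.
So I am looking for a better, more flexible and dynamic way to skip these non-date fields automatically, ideally without needing to know the details of their content (for example, skipping a block of 4 rows whenever a blank row is encountered).
Thanks in advance.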
Answer 0 (score: 0)
Well, you can use to_datetime:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
Elements that are not datetimes are converted to NaT, and you can then drop them:
df = df.dropna()
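As a minimal, self-contained sketch of the same idea: the snippet below uses a shortened copy of the sample data from the question, and the names `raw` and `clean` are made up for illustration. It assumes the dates always follow the `%m/%d/%Y` pattern seen in the sample, and it drops rows based on the 'Date' column only, so a missing value in another column would not remove an otherwise valid row.

```python
import io

import pandas as pd

# Shortened copy of the sample data from the question (illustrative only).
raw = """Date,Time,Count,Fault,Battery
12/22/2015,05:24.0,39615.0,0.0,6.42
12/22/2015,05:25.0,39616.0,0.0,6.42
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
12/22/2015,19:57.0,39619.0,0.0,6.42
"""

df = pd.read_csv(io.StringIO(raw))

# Anything that does not match the assumed date format becomes NaT.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y", errors="coerce")

# Drop only the rows whose Date failed to parse, then renumber the rows.
clean = df.dropna(subset=["Date"]).reset_index(drop=True)
print(clean)
```

Because the filter relies on the parse result rather than on the text values, it should keep working if the wording of the text blocks changes.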
Answer 1 (score: 0)
I think you can use read_csv and dropna together with to_datetime:
import pandas as pd
import io
temp=u"""Date,Time,Count,Fault,Battery
12/22/2015,05:24.0,39615.0,0.0,6.42
12/22/2015,05:25.0,39616.0,0.0,6.42
12/22/2015,05:26.0,39617.0,0.0,6.42
12/22/2015,05:27.0,39618.0,0.0,6.42
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
12/22/2015,19:57.0,39619.0,0.0,6.42
12/22/2015,19:58.0,39620.0,0.0,6.42
12/22/2015,19:59.0,39621.0,0.0,6.42
12/22/2015,20:00.0,39622.0,0.0,6.42
12/22/2015,20:01.0,39623.0,0.0,6.42
12/22/2015,20:02.0,39624.0,0.0,6.42"""
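# after testing, replace io.StringIO(temp) with the filename
# note: the nested-list form of parse_dates is deprecated in recent pandas releases
df = pd.read_csv(io.StringIO(temp), parse_dates=[['Date','Time']])
# rows from the text blocks have no numeric fields, so dropna removes them
df = df.dropna()
df['Date_Time'] = pd.to_datetime(df.Date_Time, format="%m/%d/%Y %H:%M.%S")
print(df)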
             Date_Time    Count  Fault  Battery
0  2015-12-22 05:24:00  39615.0    0.0     6.42
1  2015-12-22 05:25:00  39616.0    0.0     6.42
2  2015-12-22 05:26:00  39617.0    0.0     6.42
3  2015-12-22 05:27:00  39618.0    0.0     6.42
14 2015-12-22 19:57:00  39619.0    0.0     6.42
15 2015-12-22 19:58:00  39620.0    0.0     6.42
16 2015-12-22 19:59:00  39621.0    0.0     6.42
17 2015-12-22 20:00:00  39622.0    0.0     6.42
18 2015-12-22 20:01:00  39623.0    0.0     6.42
19 2015-12-22 20:02:00  39624.0    0.0     6.42
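If the nested-list form of parse_dates is not available in your pandas version, a possible alternative is to read the columns as plain text, join 'Date' and 'Time' yourself, and let to_datetime coerce the failures. This is only a sketch on a shortened copy of the sample data: the names `raw`, `stamp` and `parsed` are made up for illustration, and the format string is the one used above, which assumes the 'Time' values really are hours, minutes and seconds.

```python
import io

import pandas as pd

# Shortened copy of the sample data from the question (illustrative only).
raw = """Date,Time,Count,Fault,Battery
12/22/2015,05:24.0,39615.0,0.0,6.42
12/22/2015,05:25.0,39616.0,0.0,6.42
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
12/22/2015,19:57.0,39619.0,0.0,6.42
"""

df = pd.read_csv(io.StringIO(raw))

# Join the two text columns; rows from the text blocks end up as NaN here
# because their Time field is empty.
stamp = df["Date"] + " " + df["Time"]

# Same format string as above; anything that does not match becomes NaT.
parsed = pd.to_datetime(stamp, format="%m/%d/%Y %H:%M.%S", errors="coerce")
df.index = pd.DatetimeIndex(parsed, name="Date_Time")

# Keep rows with a valid timestamp and drop the now-redundant text columns.
df = df[df.index.notna()].drop(columns=["Date", "Time"])
print(df)
```

This leaves the timestamps as a DatetimeIndex rather than as a regular column, which is what the question was ultimately after.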