Cleaning duration data in a pandas DataFrame

Time: 2018-09-17 15:16:43

Tags: python pandas

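I am using Python 2.7.14 and pandas to parse a dataset in *.csv format. I am interested in a statistical analysis of durations, grouped by certain attributes. The problem is that whenever I try to group the data and apply a median, I get this error: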

TypeError: invalid literal for float(): 00:00:11

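The durations are given as strings in HH:MM:SS format. I tried converting them to timedelta or datetime, but I get this error every time. I was able to locate the first line in the csv file that raises the error, modify its duration value, and observe that the error message updates accordingly (i.e. 00:00:11 becomes whatever I wrote). The trouble is that the source csv has thousands of lines. Deleting the offending line moves the error to another line, and not to the one immediately below it.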

I am stuck: I cannot run the analysis automatically, and I cannot find a way to identify the rows of the wrong type and drop them.
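One way to identify rows whose Duration is not a well-formed HH:MM:SS string is a boolean mask built with a regular expression. This is only a minimal sketch: the sample frame below is made up for illustration (the real data would come from log.csv), and the strict two-digit pattern is an assumption about what counts as "valid".

```python
import pandas as pd

# Hypothetical sample data; in practice df would come from read_csv('log.csv', ...).
df = pd.DataFrame({
    "Duration": ["00:03:10", "00:00:34", "Duration", None, "0:00:11", "00:02:30"],
    "Host": ["hostA", "hostB", "Host", "hostA", "hostB", "hostB"],
})

# True only where the value is a string of the exact form HH:MM:SS.
# na=False makes missing or non-string values count as invalid.
is_valid = df["Duration"].str.match(r"^\d{2}:\d{2}:\d{2}$", na=False)

bad_rows = df[~is_valid]   # rows to inspect by hand
clean = df[is_valid]       # rows that are safe to convert

print(bad_rows)
```

The index of bad_rows shows where the offending records sit in the frame, which can help trace them back to the source files.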

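I did some research and found that this is a common topic, but every solution offered is specific to one particular problem, and none of them fits my needs.

Here is the closest I have got so far: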

import pandas
import numpy

df = pandas.read_csv('log.csv',sep=",", quotechar="\"", names=["attrA","attrB","attrC","attrD","attrE","attrF","attrG","skip1","attrH","attrI","attrJ","attrL","attrM","attrN","attrO","attrP","attrQ","Duration","User","Host"], encoding="utf-8-sig")
group_host = df.groupby('Host')
for host, value in group_host['Duration']:
    print((host, value.median()))

Here is a sample of the dataset (sensitive data has been masked):

"xxx","xxx","xxx","xxx","xxx","xxx","xxx","","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","00:03:10","userA","hostA"
"xxx","xxx","xxx","xxx","xxx","xxx","xxx","","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","00:00:34","userA","hostB"
"xxx","xxx","xxx","xxx","xxx","xxx","xxx","","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","00:00:54","userA","hostA"
"xxx","xxx","xxx","xxx","xxx","xxx","xxx","","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","00:02:30","userA","hostB"
"xxx","xxx","xxx","xxx","xxx","xxx","xxx","","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","00:03:04","userA","hostA"
"xxx","xxx","xxx","xxx","xxx","xxx","xxx","","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","00:00:44","userA","hostA"
"xxx","xxx","xxx","xxx","xxx","xxx","xxx","","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","00:01:09","userA","hostB"
"xxx","xxx","xxx","xxx","xxx","xxx","xxx","","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","00:00:42","userA","hostA"
"xxx","xxx","xxx","xxx","xxx","xxx","xxx","","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","00:00:11","userA","hostB"

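1) How can I clean the input data?
   1a) I tried adding df.dropna(subset=['Duration']) and/or pandas.to_timedelta(df.Duration, errors='coerce'), but without any luck.
2) Alternatively, how can I identify the offending lines by hand and remove them?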
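A point worth checking for 1a): by default neither df.dropna(...) nor pandas.to_timedelta(...) modifies df in place; both return new objects, so the result has to be assigned back. Whether that is what went wrong here is not confirmed by the post. The sketch below, on made-up sample data, coerces unparseable values to NaT, lists the rows that failed, and computes the per-host median in seconds. The names parsed, bad_rows, and Seconds are illustrative only.

```python
import pandas as pd

# Hypothetical sample data standing in for the real log.csv contents.
df = pd.DataFrame({
    "Duration": ["00:03:10", "00:00:34", "00:00:54",
                 "not-a-duration", "00:02:30", None],
    "Host": ["hostA", "hostB", "hostA", "hostA", "hostB", "hostB"],
})

# to_timedelta returns a NEW Series; unparseable values become NaT.
parsed = pd.to_timedelta(df["Duration"], errors="coerce")

# Rows that had some value but could not be parsed as a duration.
bad_rows = df[parsed.isna() & df["Duration"].notna()]

# Store the durations as plain seconds, then drop rows without a value.
df["Seconds"] = parsed.dt.total_seconds()
clean = df.dropna(subset=["Seconds"])

# Median duration per host, in seconds.
medians = clean.groupby("Host")["Seconds"].median()
print(medians)
```

Working in seconds keeps the groupby on an ordinary float column; pd.to_timedelta(medians, unit="s") converts the result back to timedeltas if needed.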

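Note: the dataset was aggregated from several different files by an automated script. The faulty lines are neither at the beginning nor at the end of the files.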

0 answers:

No answers