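I have a CSV that I want to read into a pandas DataFrame and analyze. One column is named 'Date' and is easy to convert to a datetime type.
However, that column does not contain the time associated with each row. For some unknown reason, the time is embedded in a string in another column, the 'Comments' column. An example entry in the 'Comments' column looks like this:
'Passnumber:123 19-05-2016 21:58 Transactie:123A12 Term:AABBC'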
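I want to extract the time that appears just before the word "Transactie", which in this case is 21:58. Can this be done in pandas, or do I need a more general regular-expression package?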
Answer 0 (score: 3)
You can use the vectorized pandas string methods available through pd.Series.str. For example,
In[1]: df = pd.DataFrame({"Date": ["20160519", "20160519"],
"Datum": ['Passnumber:123 19-05-2016 21:58 Transactie:123A12 Term:AABBC',
'Passnumber:123 19-05-2016 22:58 Transactie:123A12 Term:AABBC']})
In[2]: df.Datum.astype(str).str.split(pat=' ', expand=True)[2]
Out[2]:
0 21:58
1 22:58
Name: 2, dtype: object
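Once the time has been split out, it can be joined with the 'Date' column to build a full timestamp, which is what the question is ultimately after. The following is a minimal sketch, assuming 'Date' holds strings in YYYYMMDD form as in the example above. The column names 'Time' and 'Timestamp' are illustrative and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["20160519", "20160519"],
    "Datum": [
        "Passnumber:123 19-05-2016 21:58 Transactie:123A12 Term:AABBC",
        "Passnumber:123 19-05-2016 22:58 Transactie:123A12 Term:AABBC",
    ],
})

# The time is the third space-separated token (same split as in the answer).
df["Time"] = df["Datum"].str.split(pat=" ", expand=True)[2]

# Assumed formats: 'Date' is YYYYMMDD and 'Time' is HH:MM.
df["Timestamp"] = pd.to_datetime(df["Date"] + " " + df["Time"],
                                 format="%Y%m%d %H:%M")

print(df[["Date", "Time", "Timestamp"]])
```

Passing an explicit format makes the parse fail loudly if a row does not match the assumed layout, rather than silently guessing.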
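Answer 1 (score: 1)
You can take advantage of the fact that pandas lets you apply any function along a column! I often find myself writing .apply(lambda x: function(x)).
The pandas documentation has a relevant example.
In your case, you could do the following: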
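def datum_to_datetime(row):
    # The time is the third-from-last whitespace-separated token in 'Datum'.
    time = row['Datum'].split()[-3]
    return time

# axis=1 passes each row to the function; the default (axis=0) passes columns.
df.apply(datum_to_datetime, axis=1)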
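For comparison, the same token can be picked out without a Python-level function call per row. This is a minimal sketch, assuming every 'Datum' entry has the five-token layout shown in the question. The variable names by_apply and by_str are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Datum": [
        "Passnumber:123 19-05-2016 21:58 Transactie:123A12 Term:AABBC",
        "Passnumber:123 19-05-2016 22:58 Transactie:123A12 Term:AABBC",
    ],
})

# Row-wise version: axis=1 hands each row to the lambda.
by_apply = df.apply(lambda row: row["Datum"].split()[-3], axis=1)

# String-accessor version: split on whitespace, then index each token list.
by_str = df["Datum"].str.split().str[-3]

# Both approaches should select the same tokens.
print(by_apply.tolist() == by_str.tolist())
```

Counting from the end ([-3]) only works while the two trailing fields are always present, so this depends on the layout assumption above.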
Answer 2 (score: 1)
You can use str.extract, or str.split on arbitrary whitespace (\s+):
import pandas as pd
df = pd.DataFrame({'Datum': ['Passnumber:123 19-05-2016 21:58 Transactie:123A12 Term:AABBC',
'Passnumber:123 19-05-2016 21:58 Transactie:123A12 Term:AABBC']})
print (df)
Datum
0 Passnumber:123 19-05-2016 21:58 Transactie:123...
1 Passnumber:123 19-05-2016 21:58 Transactie:123...
df['Time'] = df.Datum.str.extract(r'([0-2]\d:[0-5]\d)', expand=True)
print (df)
Datum Time
0 Passnumber:123 19-05-2016 21:58 Transactie:123... 21:58
1 Passnumber:123 19-05-2016 21:58 Transactie:123... 21:58
print (df.Datum.str.split(r'\s+', expand=True)[2])
0 21:58
1 21:58
Name: 2, dtype: object
Test the regex.
The extract method appears to be the fastest:
In [408]: %timeit (df.Datum.str.extract(r'([0-2]\d:[0-5]\d)', expand=True))
The slowest run took 4.96 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 347 µs per loop
In [409]: %timeit (df.Datum.str.split(r'\s+', expand=True)[2])
The slowest run took 4.63 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 675 µs per loop
In [410]: %timeit (df.Datum.astype(str).str.split(pat=' ', expand=True)[2])
The slowest run took 4.73 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 727 µs per loop
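The pattern used above returns the first substring that looks like HH:MM, wherever it occurs. Since the question asks specifically for the time that precedes the word "Transactie", a stricter variant can anchor on that word. This is a minimal sketch: the pattern is an assumption about the data (two digits, a colon, two digits, then whitespace and "Transactie") and has not been checked against the real CSV. The third row is a made-up entry that shows what happens when nothing matches.

```python
import pandas as pd

df = pd.DataFrame({
    "Datum": [
        "Passnumber:123 19-05-2016 21:58 Transactie:123A12 Term:AABBC",
        "Passnumber:123 19-05-2016 22:58 Transactie:123A12 Term:AABBC",
        "Passnumber:456 no time in this entry",  # hypothetical malformed row
    ],
})

# Capture HH:MM only when it is followed by whitespace and 'Transactie'.
pattern = r"(\d{2}:\d{2})\s+Transactie"

# expand=False returns a Series because the pattern has a single capture group.
df["Time"] = df["Datum"].str.extract(pattern, expand=False)

# Rows that do not match the pattern come back as missing values.
print(df["Time"])
print(df["Time"].isna().sum())
```

Checking the count of missing values after extraction is a cheap way to find rows that do not follow the expected layout.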