Question

I have a CSV that I want to read into a pandas DataFrame and analyze. One column is named 'Date' and converts easily to a datetime type (for example with pd.to_datetime).

However, that column does not contain the time associated with the row. The time is (for some unknown reason) embedded in a string in another column, which serves as a kind of 'comment' column. A sample entry in the 'comment' column looks like the following string:

'Passnumber:123 19-05-2016 21:58 Transactie:123A12 Term:AABBC'

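I want to pull out the time that appears just before the word "Transactie", which in this case is 21:58. Can this be done in pandas, or do I need a more general regular-expression package?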

Answer 1

You can use the vectorized pandas string-manipulation methods available through pd.Series.str. For example,

In[1]: df = pd.DataFrame({"Date": ["20160519", "20160519"], 
"Datum": ['Passnumber:123 19-05-2016 21:58 Transactie:123A12 Term:AABBC', 
          'Passnumber:123 19-05-2016 22:58 Transactie:123A12 Term:AABBC']})

In[2]: df.Datum.astype(str).str.split(pat=' ', expand=True)[2]
Out[2]: 
0    21:58
1    22:58
Name: 2, dtype: object
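
Since the frame above also carries a Date column, the extracted time can be joined with it to build a full timestamp. This is a minimal sketch, assuming the Date strings are in YYYYMMDD form as in the example; the column name Timestamp is purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["20160519", "20160519"],
    "Datum": [
        "Passnumber:123 19-05-2016 21:58 Transactie:123A12 Term:AABBC",
        "Passnumber:123 19-05-2016 22:58 Transactie:123A12 Term:AABBC",
    ],
})

# The third space-separated token is the time, as in the split example above
time_part = df["Datum"].astype(str).str.split(pat=" ", expand=True)[2]

# "Timestamp" is a hypothetical column name; the format assumes YYYYMMDD dates
df["Timestamp"] = pd.to_datetime(df["Date"] + " " + time_part,
                                 format="%Y%m%d %H:%M")
print(df[["Date", "Timestamp"]])
```

Passing an explicit format avoids any ambiguity in how the concatenated string is parsed.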

Answer 2

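You can take advantage of the fact that pandas lets you apply any function along a column! I find myself doing .apply(lambda x: function(x)) quite often. Here is a relevant example from the pandas documentation.

In your case, you could do the following: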

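def datum_to_datetime(row):
    # The time is the third whitespace-separated token counting from the end
    time = row['Datum'].split()[-3]

    return time

# axis=1 passes each row to the function; the default (axis=0) passes columns
df.apply(datum_to_datetime, axis=1)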
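
Because only one column is involved, the same idea also works by calling apply on the Datum column itself, which sidesteps having to pass axis=1 to DataFrame.apply. This is a sketch, assuming every entry has the same five-token layout as the sample string; extract_time is a made-up helper name.

```python
import pandas as pd

df = pd.DataFrame({"Datum": [
    "Passnumber:123 19-05-2016 21:58 Transactie:123A12 Term:AABBC",
    "Passnumber:123 19-05-2016 22:58 Transactie:123A12 Term:AABBC",
]})

def extract_time(text):
    # Hypothetical helper: return the third token from the end,
    # which is the HH:MM time in the sample layout
    return text.split()[-3]

# Series.apply calls the function once per cell of the column
df["Time"] = df["Datum"].apply(extract_time)
print(df["Time"])
```

Indexing from the end breaks if a row has extra trailing fields, so this relies on the layout being consistent.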

Answer 3

You can use str.extract, or str.split on arbitrary whitespace (\s+):

import pandas as pd

df = pd.DataFrame({'Datum': ['Passnumber:123 19-05-2016 21:58 Transactie:123A12 Term:AABBC',
                            'Passnumber:123 19-05-2016 21:58 Transactie:123A12 Term:AABBC']})

print (df)
                                               Datum
0  Passnumber:123 19-05-2016 21:58 Transactie:123...
1  Passnumber:123 19-05-2016 21:58 Transactie:123...

df['Time'] = df.Datum.str.extract(r'([0-2]\d:[0-5]\d)', expand=True)

print (df)
                                               Datum   Time
0  Passnumber:123 19-05-2016 21:58 Transactie:123...  21:58
1  Passnumber:123 19-05-2016 21:58 Transactie:123...  21:58

print (df.Datum.str.split(r'\s+', expand=True)[2])
0    21:58
1    21:58
Name: 2, dtype: object

Test the regex.
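
The pattern above matches the first HH:MM-looking substring anywhere in the string. If the time should be tied to its position right before "Transactie", as the question describes, the pattern can be anchored on that word. This is a sketch, assuming only whitespace separates the time from "Transactie".

```python
import pandas as pd

df = pd.DataFrame({"Datum": [
    "Passnumber:123 19-05-2016 21:58 Transactie:123A12 Term:AABBC",
    "Passnumber:123 19-05-2016 22:58 Transactie:123A12 Term:AABBC",
]})

# Capture HH:MM only when it is followed by whitespace and the word "Transactie"
pattern = r"(\d{2}:\d{2})\s+Transactie"

# expand=False returns a Series because the pattern has a single capture group
df["Time"] = df["Datum"].str.extract(pattern, expand=False)
print(df["Time"])
```

Rows where the pattern does not match come back as NaN, which makes malformed entries easy to spot.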

The extract method seems to be the fastest:

In [408]: %timeit (df.Datum.str.extract(r'([0-2]\d:[0-5]\d)', expand=True))
The slowest run took 4.96 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 347 µs per loop

In [409]: %timeit (df.Datum.str.split(r'\s+', expand=True)[2])
The slowest run took 4.63 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 675 µs per loop

In [410]: %timeit (df.Datum.astype(str).str.split(pat=' ', expand=True)[2])
The slowest run took 4.73 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 727 µs per loop
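
These timings appear to have been taken on the small example frame, where fixed overhead can dominate. Below is a sketch for repeating the comparison on a larger frame with the standard timeit module. The row count of 10,000 is an arbitrary assumption, and no particular result is claimed.

```python
import timeit

import pandas as pd

row = "Passnumber:123 19-05-2016 21:58 Transactie:123A12 Term:AABBC"
big = pd.DataFrame({"Datum": [row] * 10_000})  # arbitrary size for illustration

# The three approaches timed above, wrapped so timeit can call them
candidates = {
    "extract": lambda: big.Datum.str.extract(r"([0-2]\d:[0-5]\d)", expand=True),
    "split on \\s+": lambda: big.Datum.str.split(r"\s+", expand=True)[2],
    "split on ' '": lambda: big.Datum.astype(str).str.split(pat=" ", expand=True)[2],
}

for name, func in candidates.items():
    # Best of 3 repeats, 10 calls each, reported per call in milliseconds
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{name}: {best * 1e3:.2f} ms per call")
```

The ranking may differ with pandas version and data size, so it is worth measuring on data that resembles the real CSV.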

Grabbing a timestamp from a CSV using pandas
