Create pandas columns using a regular expression?

Time: 2014-06-07 15:12:46

Tags: python pandas

I have a large column of strings. I want to use a regular expression to break each string's components out into their own columns:

In [35]:

import re
import pandas as pd

In [36]:

data = {'raw': ['Baker 1 2009-11-17       1223.0',
                'Baker 1 2010-06-24       1122.7',
                'Baker 2 2009-07-24       2819.0',
                'Baker 2 2010-08-25       2971.6',
                'Baker 1 2011-01-05       1410.0',
                'Baker 2 2010-09-04       4671.6']}
df = pd.DataFrame(data, columns = ['raw'])
df

Out[36]:
     raw
0    Baker 1 2009-11-17 1223.0
1    Baker 1 2010-06-24 1122.7
2    Baker 2 2009-07-24 2819.0
3    Baker 2 2010-08-25 2971.6
4    Baker 1 2011-01-05 1410.0
5    Baker 2 2010-09-04 4671.6

This is what I want it to look like:

Out[41]:
     name    value   date          score
0    Baker   1       2009-11-17    1223.0
1    Baker   1       2010-06-24    1122.7
2    Baker   2       2009-07-24    2819.0
3    Baker   2       2010-08-25    2971.6
4    Baker   1       2011-01-05    1410.0
5    Baker   2       2010-09-04    4671.6

I have tried data.str.contains(), but I can't seem to get it to work. Any help would be appreciated.

2 answers:

Answer 0 (score: 3)

Based on this answer: Pandas DataFrame - how do I split a column

In [122]: pd.DataFrame(df['raw'].str.split().tolist(), columns=['name','value','date','score'])
Out[122]: 
    name value        date   score
0  Baker     1  2009-11-17  1223.0
1  Baker     1  2010-06-24  1122.7
2  Baker     2  2009-07-24  2819.0
3  Baker     2  2010-08-25  2971.6
4  Baker     1  2011-01-05  1410.0
5  Baker     2  2010-09-04  4671.6

[6 rows x 4 columns]
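
As a variation on the split above, here is a minimal sketch that assumes a pandas version where `Series.str.split` accepts `expand=True`, so the list round trip is not needed. The two-row frame is a shortened copy of the question's data, and the final cast is an addition that the answer above does not perform.

```python
import pandas as pd

# Shortened copy of the question's data (two rows only, for illustration).
df = pd.DataFrame({'raw': ['Baker 1 2009-11-17       1223.0',
                           'Baker 2 2010-09-04       4671.6']})

# With no separator given, split() breaks on runs of whitespace;
# expand=True returns one column per token instead of a list per row.
parts = df['raw'].str.split(expand=True)
parts.columns = ['name', 'value', 'date', 'score']

# Every column is still a string at this point, so cast the numeric ones.
parts = parts.astype({'value': 'int64', 'score': 'float64'})
print(parts)
```

Note that the date column is left as strings here.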

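Answer 1 (score: 1)

Is it a requirement to use a regex? A regex is overly complicated here, because you have structured data that read_csv can parse easily. That said, in addition to @chrisb's answer, there are a couple of other ways to do this: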

StringIO + read_csv

In [45]: data
Out[45]:
{'raw': ['Baker 1 2009-11-17       1223.0',
  'Baker 1 2010-06-24       1122.7',
  'Baker 2 2009-07-24       2819.0',
  'Baker 2 2010-08-25       2971.6',
  'Baker 1 2011-01-05       1410.0',
  'Baker 2 2010-09-04       4671.6']}

In [46]: text = '\n'.join(data['raw'])

In [47]: print(text)
Baker 1 2009-11-17       1223.0
Baker 1 2010-06-24       1122.7
Baker 2 2009-07-24       2819.0
Baker 2 2010-08-25       2971.6
Baker 1 2011-01-05       1410.0
Baker 2 2010-09-04       4671.6

In [48]: from io import StringIO  # Python 3; on Python 2 this was "from StringIO import StringIO"

In [49]: df = pd.read_csv(StringIO(text), sep=r'\s+', parse_dates=[2], names=['name', 'value', 'date', 'score'])

In [50]: df
Out[50]:
    name  value       date   score
0  Baker      1 2009-11-17  1223.0
1  Baker      1 2010-06-24  1122.7
2  Baker      2 2009-07-24  2819.0
3  Baker      2 2010-08-25  2971.6
4  Baker      1 2011-01-05  1410.0
5  Baker      2 2010-09-04  4671.6

In [51]: df.dtypes
Out[51]:
name             object
value             int64
date     datetime64[ns]
score           float64
dtype: object

This lets you supply the column names and have the dtypes inferred. I would prefer this over the other approaches.

Series.str.extract()

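Note: you probably shouldn't use the score regex to match arbitrary floating-point numbers (it does not match negative numbers, for example): have a look at tokenize.Floatnumber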
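
To illustrate the note, the score pattern used in this answer, `\d*\.\d*`, requires a decimal point and has no sign, so it rejects "-12.5" and "1223" but accepts a lone ".". The sketch below shows one possible stricter pattern. `FLOAT` and `is_float_token` are hypothetical names, not part of the original answer or of pandas, and the pattern is only an approximation of what `tokenize.Floatnumber` covers.

```python
import re
import pandas as pd

# Hypothetical pattern (an assumption, not from the original answer):
# optional sign, digits with an optional fraction or a bare fraction,
# then an optional exponent.
FLOAT = r'[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?'

def is_float_token(token):
    """Return True if the whole token looks like a decimal number."""
    return re.fullmatch(FLOAT, token) is not None

# Made-up rows with a negative score and an integer-looking score.
s = pd.Series(['Baker 1 2009-11-17       -1223.0',
               'Baker 2 2010-09-04       4671'])

# Anchor the number at the end of the string so only the last field matches.
score = s.str.extract(r'(?P<score>' + FLOAT + r')\s*$')
print(score)
```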

In [29]: df
Out[29]:
                               raw
0  Baker 1 2009-11-17       1223.0
1  Baker 1 2010-06-24       1122.7
2  Baker 2 2009-07-24       2819.0
3  Baker 2 2010-08-25       2971.6
4  Baker 1 2011-01-05       1410.0
5  Baker 2 2010-09-04       4671.6

In [30]: raw = df.raw.str.extract(r'(?P<name>[a-zA-Z]+)\s+(?P<value>\d+)\s+(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<score>\d*\.\d*)')

In [31]: raw
Out[31]:
    name value        date   score
0  Baker     1  2009-11-17  1223.0
1  Baker     1  2010-06-24  1122.7
2  Baker     2  2009-07-24  2819.0
3  Baker     2  2010-08-25  2971.6
4  Baker     1  2011-01-05  1410.0
5  Baker     2  2010-09-04  4671.6

In [32]: raw.dtypes
Out[32]:
name     object
value    object
date     object
score    object
dtype: object

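In [33]: r = raw.astype({'value': 'int64', 'score': 'float64'})  # originally raw.convert_objects(convert_numeric=True), which current pandas no longer provides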

In [34]: r
Out[34]:
    name  value        date   score
0  Baker      1  2009-11-17  1223.0
1  Baker      1  2010-06-24  1122.7
2  Baker      2  2009-07-24  2819.0
3  Baker      2  2010-08-25  2971.6
4  Baker      1  2011-01-05  1410.0
5  Baker      2  2010-09-04  4671.6

In [35]: r.dtypes
Out[35]:
name      object
value      int64
date      object
score    float64
dtype: object

Note: this does not convert the date column. Use pandas.to_datetime for that.
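
A minimal sketch of that last conversion step, assuming the extracted frame holds only strings. The two-row frame `r` is rebuilt by hand here so the snippet runs on its own, and the explicit `format` argument is an assumption based on the ISO-style dates in the question.

```python
import pandas as pd

# Stand-in for the frame returned by str.extract(): all columns are strings.
r = pd.DataFrame({'name': ['Baker', 'Baker'],
                  'value': ['1', '2'],
                  'date': ['2009-11-17', '2010-09-04'],
                  'score': ['1223.0', '4671.6']})

# Convert the numeric columns, then parse the dates explicitly.
r['value'] = pd.to_numeric(r['value'])
r['score'] = pd.to_numeric(r['score'])
r['date'] = pd.to_datetime(r['date'], format='%Y-%m-%d')

print(r.dtypes)
```

Once the column has a datetime dtype, accessors such as `r['date'].dt.year` become available.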