I have a large column of strings. I want to use a regular expression to split the components of each string into their own columns:
In [35]:
import re
import pandas as pd
In [36]:
data = {'raw': ['Baker 1 2009-11-17 1223.0',
'Baker 1 2010-06-24 1122.7',
'Baker 2 2009-07-24 2819.0',
'Baker 2 2010-08-25 2971.6',
'Baker 1 2011-01-05 1410.0',
'Baker 2 2010-09-04 4671.6']}
df = pd.DataFrame(data, columns = ['raw'])
df
Out[36]:
raw
0 Baker 1 2009-11-17 1223.0
1 Baker 1 2010-06-24 1122.7
2 Baker 2 2009-07-24 2819.0
3 Baker 2 2010-08-25 2971.6
4 Baker 1 2011-01-05 1410.0
5 Baker 2 2010-09-04 4671.6
This is what I want it to look like:
Out[41]:
name value date score
0 Baker 1 2009-11-17 1223.0
1 Baker 1 2010-06-24 1122.7
2 Baker 2 2009-07-24 2819.0
3 Baker 2 2010-08-25 2971.6
4 Baker 1 2011-01-05 1410.0
5 Baker 2 2010-09-04 4671.6
I have tried data.str.contains() but I can't seem to get it to work. Any help would be appreciated.
Answer 0 (score: 3)
Based on this answer: Pandas DataFrame - how do I split a column
In [122]: pd.DataFrame(df['raw'].str.split().tolist(), columns=['name','value','date','score'])
Out[122]:
name value date score
0 Baker 1 2009-11-17 1223.0
1 Baker 1 2010-06-24 1122.7
2 Baker 2 2009-07-24 2819.0
3 Baker 2 2010-08-25 2971.6
4 Baker 1 2011-01-05 1410.0
5 Baker 2 2010-09-04 4671.6
[6 rows x 4 columns]
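As a variation on the above, here is a minimal sketch that asks str.split to build the columns directly via expand=True, so the intermediate list is not needed. The two-row frame and the variable name parts are only for illustration, and note that every resulting column still holds strings.

```python
import pandas as pd

# Small illustrative frame using two of the rows from the question
df = pd.DataFrame({'raw': ['Baker 1 2009-11-17 1223.0',
                           'Baker 2 2010-08-25 2971.6']})

# expand=True returns a DataFrame with one column per whitespace-separated token
parts = df['raw'].str.split(expand=True)
parts.columns = ['name', 'value', 'date', 'score']

# All four columns still contain strings at this point; no numeric or
# date conversion has happened yet
print(parts)
```

This only works cleanly when every row has the same number of tokens; rows with fewer tokens would be padded with missing values.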
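Answer 1 (score: 1)
Is it a requirement to use a regular expression? A regex is overkill here, because the structured data you have is easily parsed by read_csv. That said, besides @chrisb's answer, there are a couple of other ways to do this:
StringIO + read_csv:
In [45]: data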
Out[45]:
{'raw': ['Baker 1 2009-11-17 1223.0',
'Baker 1 2010-06-24 1122.7',
'Baker 2 2009-07-24 2819.0',
'Baker 2 2010-08-25 2971.6',
'Baker 1 2011-01-05 1410.0',
'Baker 2 2010-09-04 4671.6']}
In [46]: text = '\n'.join(data['raw'])
In [47]: print(text)
Baker 1 2009-11-17 1223.0
Baker 1 2010-06-24 1122.7
Baker 2 2009-07-24 2819.0
Baker 2 2010-08-25 2971.6
Baker 1 2011-01-05 1410.0
Baker 2 2010-09-04 4671.6
In [48]: from io import StringIO
In [49]: df = pd.read_csv(StringIO(text), sep=r'\s+', parse_dates=[2], names=['name', 'value', 'date', 'score'])
In [50]: df
Out[50]:
name value date score
0 Baker 1 2009-11-17 1223.0
1 Baker 1 2010-06-24 1122.7
2 Baker 2 2009-07-24 2819.0
3 Baker 2 2010-08-25 2971.6
4 Baker 1 2011-01-05 1410.0
5 Baker 2 2010-09-04 4671.6
In [51]: df.dtypes
Out[51]:
name object
value int64
date datetime64[ns]
score float64
dtype: object
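This lets you supply the column names and have the dtypes inferred. I would prefer this over the other approaches.
Series.str.extract()
Note: you probably shouldn't use the score regex to match arbitrary floating-point numbers (for example, it does not handle negative numbers); take a look at tokenize.Floatnumber.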
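To illustrate the caveat about the score pattern, here is a rough sketch of a more permissive float pattern used with str.extract. The FLOAT pattern and the sample strings (including the negative and exponent values) are my own assumptions for demonstration, not part of the question's data, and the pattern is not meant to be a complete float grammar.

```python
import pandas as pd

# Hypothetical pattern (not from the answer): optional sign, digits with an
# optional fraction (or a bare fraction), and an optional exponent
FLOAT = r'[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?'

# Made-up rows whose last token is a negative number and an exponent form
s = pd.Series(['Baker 1 2009-11-17 -1223.0', 'Baker 2 2010-08-25 2.5e3'])

# The trailing $ anchors the match to the last token of each string
extracted = s.str.extract(r'\s(?P<score>' + FLOAT + r')$', expand=True)
score = extracted['score'].astype(float)
print(score.tolist())
```

The simpler \d*\.\d* pattern would drop the minus sign in the first row and stop before the exponent in the second.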
In [29]: df
Out[29]:
raw
0 Baker 1 2009-11-17 1223.0
1 Baker 1 2010-06-24 1122.7
2 Baker 2 2009-07-24 2819.0
3 Baker 2 2010-08-25 2971.6
4 Baker 1 2011-01-05 1410.0
5 Baker 2 2010-09-04 4671.6
In [30]: raw = df.raw.str.extract(r'(?P<name>[a-zA-Z]+)\s+(?P<value>\d+)\s+(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<score>\d*\.\d*)')
In [31]: raw
Out[31]:
name value date score
0 Baker 1 2009-11-17 1223.0
1 Baker 1 2010-06-24 1122.7
2 Baker 2 2009-07-24 2819.0
3 Baker 2 2010-08-25 2971.6
4 Baker 1 2011-01-05 1410.0
5 Baker 2 2010-09-04 4671.6
In [32]: raw.dtypes
Out[32]:
name object
value object
date object
score object
dtype: object
In [33]: r = raw.assign(value=pd.to_numeric(raw['value']), score=pd.to_numeric(raw['score']))
In [34]: r
Out[34]:
name value date score
0 Baker 1 2009-11-17 1223.0
1 Baker 1 2010-06-24 1122.7
2 Baker 2 2009-07-24 2819.0
3 Baker 2 2010-08-25 2971.6
4 Baker 1 2011-01-05 1410.0
5 Baker 2 2010-09-04 4671.6
In [35]: r.dtypes
Out[35]:
name object
value int64
date object
score float64
dtype: object
Note: this does not convert the date column. Use pandas.to_datetime for that.
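For completeness, a small sketch of spelling out the conversions column by column with pd.to_numeric and pd.to_datetime. The frame r below is a hand-built stand-in for the extracted result (all strings), and the explicit format string is an assumption based on the dates shown in the question.

```python
import pandas as pd

# Stand-in for the extracted frame: every column starts out as strings
r = pd.DataFrame({'name': ['Baker', 'Baker'],
                  'value': ['1', '2'],
                  'date': ['2009-11-17', '2010-08-25'],
                  'score': ['1223.0', '2971.6']})

# Convert the numeric columns, then parse the ISO-style dates
r['value'] = pd.to_numeric(r['value'])
r['score'] = pd.to_numeric(r['score'])
r['date'] = pd.to_datetime(r['date'], format='%Y-%m-%d')

print(r.dtypes)
```

With the date column converted, things like r['date'].dt.year or sorting by date behave as expected.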