pandas: extracting a multi-character regular expression

Time: 2016-07-27 15:31:24

Tags: python string pandas

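I want to extract an expression every time it occurs in the elements of a pandas DataFrame, collecting the matches as an array, but I get an error whenever I use a multi-character expression. Why am I getting this error, and how can I make the extraction work as expected?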

MWE

import pandas as pd

wiki = ["In theoretical computer the like operations.",
    "The a filter.",
    "In the.",
    "the dog is the one",
    "See below for details."
]
wiki

x = pd.DataFrame(wiki, columns = ['wiki'])
x

Error with a multi-character expression

x.wiki.str.extractall('(the)')

## x.wiki.str.extractall('(the)')
## Traceback (most recent call last):
## 
##   File "<ipython-input-7-ca5d102219f3>", line 1, in <module>
##     x.wiki.str.extractall('(the)')
## 
##   File "C:\WinPython-64bit-3.5.2.1Qt5\python-3.5.2.amd64\lib\site-packages\pandas\core\strings.py", line 1621, in extractall
##     return str_extractall(self._orig, pat, flags=flags)
## 
##   File "C:\WinPython-64bit-3.5.2.1Qt5\python-3.5.2.amd64\lib\site-packages\pandas\core\strings.py", line 716, in str_extractall
##     result = DataFrame(match_list, index, columns)
## 
##   File "C:\WinPython-64bit-3.5.2.1Qt5\python-3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 263, in __init__
##     arrays, columns = _to_arrays(data, columns, dtype=dtype)
## 
##   File "C:\WinPython-64bit-3.5.2.1Qt5\python-3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 5352, in _to_arrays
##     dtype=dtype)
## 
##   File "C:\WinPython-64bit-3.5.2.1Qt5\python-3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 5431, in _list_to_arrays
##     coerce_float=coerce_float)
## 
##   File "C:\WinPython-64bit-3.5.2.1Qt5\python-3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 5489, in _convert_object_array
##     'columns' % (len(columns), len(content)))
## 
## AssertionError: 1 columns passed, passed data had 3 columns

A single-character expression works as expected

x.wiki.str.extractall('(t)')

## x.wiki.str.extractall('(t)')
## Out[8]: 
##          0
##   match   
## 0 0      t
##   1      t
##   2      t
##   3      t
##   4      t
## 1 0      t
## 2 0      t
## 3 0      t
##   1      t
## 4 0      t

I expect this:

  match   
0 0      the
  1      the
2 0      the
3 0      the
  1      the

1 answer:

Answer 0 (score: 1)

The extractall() method has a bug that should be fixed in pandas 0.18.2, which should be released soon. So either be patient, or take a small risk and use the 0.18.2 beta version ... ;)
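
In the meantime, here is a minimal sketch of one possible reading of the error, plus a workaround. The explanation is a guess based on the traceback, not something confirmed against the pandas source: with a single capture group, `re.findall` returns plain strings rather than tuples, and a 3-character string treated as a row of values has 3 items, which would match "1 columns passed, passed data had 3 columns" and would also explain why `'(t)'` works. The hand-built `fallback` frame and the names `found`, `widths`, `rows`, `fallback` and `fixed` are only illustrative.

```python
import re
import pandas as pd

wiki = ["In theoretical computer the like operations.",
        "The a filter.",
        "In the.",
        "the dog is the one",
        "See below for details."]
x = pd.DataFrame(wiki, columns=["wiki"])

# With exactly one capture group, re.findall yields strings, not tuples.
found = re.findall(r"(the)", wiki[0])

# Assumption: if each match string is treated as a row, "the" has 3 items
# while "t" has 1, which would fit the AssertionError in the question.
widths = [len(list(m)) for m in found]

# Workaround that does not rely on extractall: collect
# (row label, match number, matched text) by hand.
rows = [(i, j, m)
        for i, text in zip(x.index, x["wiki"])
        for j, m in enumerate(re.findall(r"the", text))]
fallback = (pd.DataFrame(rows, columns=["row", "match", 0])
              .set_index(["row", "match"]))

# On a pandas version that includes the fix, this call should give the
# same matches; on an affected version it raises the AssertionError above.
fixed = x["wiki"].str.extractall(r"(the)")
```

Printing `fallback` should give the layout asked for in the question: rows 0 and 3 with two matches each and row 2 with one. Note that "The a filter." does not match, because the pattern is case sensitive.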