Question

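I'm sure this is a really simple regexp question, but: I'm trying to use str.match in pandas to match a non-ASCII character (the times symbol, U+00D7). I expect the first match call to match the first row of the DataFrame, the second match call to match the last row, and the third match call to match both the first and last rows. However, the first call matches, but the second and third calls do not. Where am I going wrong?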

The DataFrame looks like this (with the times symbol replaced by x; it actually prints as ?):

  Column
0  2x 32
1     42
2  64 x2

pandas 0.20.3, Python 2.7.13, OS X.

#!/usr/bin/env python

import pandas as pd
import re

html = '<table><thead><tr><th>Column</th></tr></thead><tbody><tr><td>2&#215; 32</td></tr><tr><td>42</td></tr><tr><td>64 &#215;2</td></tr></tbody></table>'
df = pd.read_html(html)[0]
print df
print df[df['Column'].str.match(ur'^[2-9]\u00d7', re.UNICODE, na=False)]
print df[df['Column'].str.match(ur'\u00d7[2-9]$', re.UNICODE, na=False)]
print df[df['Column'].str.match(ur'\u00d7', re.UNICODE, na=False)]

The output I see (again with ? replaced by x):

  Column
0  2x 32
Empty DataFrame
Columns: [Column]
Index: []
Empty DataFrame
Columns: [Column]
Index: []

Answer 1

Use contains():

df.Column.str.contains(r'^[2-9]\u00d7')
0     True
1    False
2    False
Name: Column, dtype: bool

df.Column.str.contains(r'\u00d7[2-9]$')
0    False
1    False
2     True
Name: Column, dtype: bool

df.Column.str.contains(r'\u00d7')
0     True
1    False
2     True
Name: Column, dtype: bool

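Explanation: contains() uses re.search(), while match() uses re.match() (docs). Because re.match() only matches at the beginning of the string (docs), only your first case, where the pattern matches at the start (using ^), works. In fact, with match() you don't even need the ^ in that case: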

df.Column.str.match(r'[2-9]\u00d7')
0     True
1    False
2    False
Name: Column, dtype: bool
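
To see the difference without pandas, here is a minimal sketch that applies re.match() and re.search() directly to the three cell values. It assumes Python 3 (the question uses Python 2.7), and the names TIMES, rows, and matched_rows are made up for this illustration:

```python
import re

TIMES = "\u00d7"  # the multiplication sign from the question

# The three cell values from the question's DataFrame (hypothetical stand-in).
rows = ["2" + TIMES + " 32", "42", "64 " + TIMES + "2"]


def matched_rows(func, pattern):
    """Return the indices of the rows for which func(pattern, row) finds a match."""
    return [i for i, s in enumerate(rows) if func(pattern, s)]


# re.match only tries to match at position 0 of each string.
match_start = matched_rows(re.match, r"^[2-9]\u00d7")
match_end = matched_rows(re.match, r"\u00d7[2-9]$")

# re.search scans the whole string for the first place the pattern fits.
search_end = matched_rows(re.search, r"\u00d7[2-9]$")
search_any = matched_rows(re.search, r"\u00d7")

print(match_start, match_end, search_end, search_any)
```

The second re.match() call finds nothing because no row starts with the times symbol, which is the same reason the second and third str.match calls in the question come back empty.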
