Question

I am trying to remove all punctuation from a dataframe, except for the characters '<' and '>'.

I tried:

def non_punct(df):
    df['C'] = df['C'].str.replace('[^\w\s]' | ~(<) | ~(>),' ')
    return df

Output:

    File "<ipython-input-292-ac8369672f62>", line 3
        df['Description'] = df['Description'].str.replace('[^\w\s]' | ~(<) | ~(>),' ')
                                                                ^
SyntaxError: invalid syntax

My dataframe:

       A          B                                    C
  French      house               Phone. <phone_numbers>
 English      house               email - <adresse_mail>
  French  apartment                      my name is Liam
  French      house                        Hello George!
 English  apartment   Ethan, my phone is <phone_numbers>

Desired output:

       A          B                                    C
  French      house               Phone <phone_numbers>
 English      house               email  <adresse_mail>
  French  apartment                     my name is Liam
  French      house                        Hello George 
 English  apartment   Ethan my phone is <phone_numbers>

Answer 1

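Here is one way to get the result with re.sub. I also think your regular expression is off; it should be [^\w\s^<^>]|_. This matches everything that is not a letter, digit, whitespace, < or >. You have to match the underscore explicitly because \w includes the underscore, so the negated class skips it. Note that the extra ^ characters inside the brackets are literal, so a caret is left alone as well; [^\w\s<>]|_ is the tidier spelling if that is not wanted.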

>>> import re
>>> re.sub(r'[^\w\s^<^>]|_', ' ', 'asdf.,:;/\\><a b_?!"§$%&a')
'asdf      ><a b        a'

For comparison:

>>> re.sub(r'[^\w\s] | ~(<) | ~(>)', ' ', 'asdf.,:;/\\><a b_?!"§$%&a')
'asdf.,:;/\\><a b_?!"§$%&a'

>>> re.sub(r'[^\w\s^<^>]', ' ', 'asdf.,:;/\\><a b_?!"§$%&a')
'asdf      ><a b_       a'

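Edit: Your SyntaxError comes from misplaced quotes: it should be '[^\w\s] | ~(<) | ~(>)' rather than '[^\w\s]' | ~(<) | ~(>). Fixing the quotes only removes the syntax error; as the comparison shows, that pattern still does not remove the punctuation.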

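Edit 2: As @Brad Solomon pointed out, pd.Series.str.replace works well with regular expressions, so using [^\w\s^<^>]|_ as the pattern in your statement should do the trick. I have not tested it. @marin: if you happen to try it, please give me feedback so that I can update the post if needed.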
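A minimal sketch of what Edit 2 describes, applied to column C from the question. It makes a few assumptions that are not in the original posts: the `column` parameter is a hypothetical convenience, `regex=True` is passed explicitly so the pattern is treated as a regular expression whatever the installed pandas default is, and the `|_` branch is dropped because the desired output keeps the underscores in `<phone_numbers>`.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['French', 'English', 'French', 'French', 'English'],
    'B': ['house', 'house', 'apartment', 'house', 'apartment'],
    'C': [
        'Phone. <phone_numbers>',
        'email - <adresse_mail>',
        'my name is Liam',
        'Hello George!',
        'Ethan, my phone is <phone_numbers>',
    ],
})


def non_punct(df, column='C'):
    """Return a copy of df with punctuation in `column` replaced by a space.

    Letters, digits, underscores, whitespace, '<' and '>' are kept.
    `column` is a hypothetical parameter added for this sketch.
    """
    out = df.copy()
    # Negated character class: anything that is NOT \w, \s, '<' or '>'.
    out[column] = out[column].str.replace(r'[^\w\s<>]', ' ', regex=True)
    return out


cleaned = non_punct(df)
print(cleaned['C'].tolist())
```

Each removed character becomes a single space, so 'Phone. <phone_numbers>' ends up with two spaces before the tag. Passing '' instead of ' ' as the replacement would delete the characters outright.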

Answer 2

Here is one way using string.punctuation:

>>> import re
>>> import string

>>> import pandas as pd

>>> df = pd.DataFrame({
...     'a': ['abc', 'de.$&$*f(@)<', '<g>hij<k>'],
...     'b': [1234, 5678, 91011],
...     'c': ['me <me@gmail.com>', '123 West-End Lane', '<<xyz>>']
... })

>>> punc = string.punctuation.replace('<', '').replace('>', '')

>>> pat = re.compile(f'[{re.escape(punc)}]')  # escape so ], ^, \ and - stay literal inside the set
>>> df.replace(pat, '')
           a      b                 c
0        abc   1234   me <megmailcom>
1       def<   5678  123 WestEnd Lane
2  <g>hij<k>  91011           <<xyz>>

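You should double-check that this constant contains what you want. The documentation describes it as:

String of ASCII characters which are considered punctuation characters in the C locale.

Value: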

>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> string.punctuation.replace('<', '').replace('>', '')
'!"#$%&\'()*+,-./:;=?@[\\]^_`{|}~'
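One thing worth checking is that the constant is ASCII-only. The sketch below is an illustration rather than part of the original answer: it compares a character set built from string.punctuation with a negated class such as [^\w\s<>] on a made-up sample string containing non-ASCII punctuation.

```python
import re
import string

# ASCII punctuation minus the two characters we want to keep.
punc = string.punctuation.replace('<', '').replace('>', '')
ascii_only = re.compile(f'[{re.escape(punc)}]')

# Made-up sample with ASCII ('!') and non-ASCII ('§', curly quotes) punctuation.
sample = 'Hi! a§b “c” <d>'

cleaned_ascii = ascii_only.sub('', sample)
cleaned_unicode = re.sub(r'[^\w\s<>]', '', sample)

print(cleaned_ascii)    # Hi a§b “c” <d>
print(cleaned_unicode)  # Hi ab c <d>
```

The string.punctuation approach leaves '§' and the curly quotes in place, while the negated class removes them. Which behavior is right depends on the data.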

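Notes:

This solution uses an f-string (Python 3.6+).
It encloses those literal characters in a character set so that any one of them matches.
Note the difference between df.replace() and df[my_column_name].str.replace(). The signature of pd.DataFrame.replace() at the time of writing is DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad'), where to_replace can be a regular expression.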
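To illustrate the last note, here is a small sketch contrasting the two calls on the example frame. It passes the pattern as a string with regex=True rather than as a compiled pattern; that choice is an assumption made for this sketch, not something the answer prescribes.

```python
import re
import string

import pandas as pd

df = pd.DataFrame({
    'a': ['abc', 'de.$&$*f(@)<', '<g>hij<k>'],
    'b': [1234, 5678, 91011],
    'c': ['me <me@gmail.com>', '123 West-End Lane', '<<xyz>>'],
})

punc = string.punctuation.replace('<', '').replace('>', '')
pattern = f'[{re.escape(punc)}]'

# DataFrame.replace: applies the regex to every string cell in the frame.
whole_frame = df.replace(pattern, '', regex=True)

# Series.str.replace: applies the regex to a single string column.
one_column = df['c'].str.replace(pattern, '', regex=True)

print(whole_frame)
print(one_column.tolist())
```

The first call cleans columns a and c in one go and leaves the integer column b alone; the second touches only column c and returns a Series.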

Answer 3

In a single line (apart from cat), this would be:

import
