Question

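I have a dataset containing tweets from Twitter. Some of them include user mentions such as @thisisauser. I am trying to remove that text while also performing other cleaning steps.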

def clean_text(row, options):

    if options['lowercase']:
        row = row.lower()

    if options['decode_html']:
        txt = BeautifulSoup(row, 'lxml')
        row = txt.get_text()

    if options['remove_url']:
        row = row.replace('http\S+|www.\S+', '')

    if options['remove_mentions']:
        row = row.replace('@[A-Za-z0-9]+', '')

    return row

clean_config = {
    'remove_url': True,
    'remove_mentions': True,
    'decode_utf8': True,
    'lowercase': True
    }

df['tweet'] = df['tweet'].apply(clean_text, args=(clean_config,))

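However, when I run the code above, all the Twitter mentions are still in the text. I verified with an online regex tool that the regex works correctly, so the problem should be in the pandas code.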

Answer 1

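You are misusing the replace method on strings: it does not accept a regular expression, only a fixed string (see the documentation at https://docs.python.org/2/library/stdtypes.html#str.replace for more information).

The correct way to do what you need is to use the re module, for example: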

>>> import re
>>> re.sub(r"@[A-Za-z0-9]+", "", "@thisisauser text")
' text'
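Applied to the function from the question, the same idea might look like the sketch below. It is not the asker's exact code: the BeautifulSoup step is left out so the example has no third-party dependency, options.get() is used because the posted clean_config has no 'decode_html' key, and the names URL_RE and MENTION_RE are introduced here for illustration.

```python
import re

# Hypothetical helper names; the patterns are the ones from the question,
# with the dot in "www." escaped so it matches a literal dot.
URL_RE = re.compile(r'http\S+|www\.\S+')
MENTION_RE = re.compile(r'@[A-Za-z0-9]+')

def clean_text(row, options):
    # options.get() returns None for a missing key instead of raising KeyError
    if options.get('lowercase'):
        row = row.lower()
    if options.get('remove_url'):
        # re-based substitution instead of str.replace, which is literal only
        row = URL_RE.sub('', row)
    if options.get('remove_mentions'):
        row = MENTION_RE.sub('', row)
    return row

clean_config = {'remove_url': True, 'remove_mentions': True, 'lowercase': True}
cleaned = clean_text('Hi @thisisauser see http://example.com now', clean_config)
print(repr(cleaned))  # 'hi  see  now'
```

Note that the removed tokens leave their surrounding spaces behind, and that the pattern from the question does not match underscores, so a handle such as @this_is_a_user would only be partly removed.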

Answer 2

The problem lies in how you are using the replace method, not in pandas.

Look at the REPL output:

>>> my_str ="@thisisause"
>>> my_str.replace('@[A-Za-z0-9]+', '')
'@thisisause'

replace does not support regular expressions. Instead, use Python's regular expression library, as the other answer describes:

>>> import re
>>> my_str
'hello @username hi'
>>> re.sub("@[A-Za-z0-9]+","",my_str)
'hello  hi'
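If the goal is only to strip mentions from a whole column, pandas also offers a vectorized alternative to apply: Series.str.replace, which does accept a regex when regex=True is passed. This is a minimal sketch, assuming a DataFrame with a 'tweet' column as in the question; the sample rows are made up for illustration.

```python
import pandas as pd

# Sample data invented for this example
df = pd.DataFrame({'tweet': ['hello @username hi', '@thisisauser text']})

# Unlike the built-in str.replace, the pandas .str accessor can take a regex.
# regex=True is passed explicitly because the default is False in pandas 2.0+.
df['tweet'] = df['tweet'].str.replace(r'@[A-Za-z0-9]+', '', regex=True)

print(df['tweet'].tolist())  # ['hello  hi', ' text']
```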

Remove Twitter mentions from a pandas column
