Question

I have a list of issues like the one below. I want to remove all special characters and numbers from it, and I also want to tokenize it and remove stop words:

    issue=[[hi iam !@going $%^ to uk&*(us \\r\\ntomorrow {morning} by 
            the_way two-three!~`` [problems]:are there;]
           [happy"journey" (and) \\r\\n\\rbring 576 chachos?>]]

I tried the code below, but I did not get the desired output:

import re
ab=re.sub('[^A-Za-z0-9]+', '', issue)
bc=re.split(r's, ab)

I would like to see output like this:

issue_output=[['hi','going','uk','us','tomorrow','morning',
                'way','two','three','problems' ]
              [ 'happy','journey','bring','chachos']]

Answer 1

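The code you posted has two obvious problems. First, your input list issue is not well-formed, so it cannot be parsed. The answer may change depending on how you actually want it formatted, but in general this leads to the second problem: you are calling re.sub on a list, and re.sub expects a string. What you want is to run the substitution on each element of the list. You can use a list comprehension for this: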

issue_output = [re.sub(r'[^A-Za-z0-9]+', ' ', item) for item in issue]
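To make the difference concrete, here is a minimal sketch. The list flat_issue is a hypothetical flat list of strings invented for illustration; it is not the data from the question.

```python
import re

pattern = r'[^A-Za-z0-9]+'

# Hypothetical flat list of strings, used only to illustrate the point.
flat_issue = ['hi iam !@going', 'happy"journey"']

# Passing the list itself fails, because re.sub expects a string.
try:
    re.sub(pattern, ' ', flat_issue)
    raised = False
except TypeError:
    raised = True

# Applying re.sub to each element works.
per_item = [re.sub(pattern, ' ', item) for item in flat_issue]

print(raised)
print(per_item)
```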

Since the question does not contain a valid Python list, I will assume the values in the list based on my best guess.

issue = [
          ['hi iam !@going $%^ to uk&*(us \\r\\ntomorrow {morning} by the_way two-three!~`` [problems]:are there;'], 
          ['happy"journey" (and) \\r\\n\\rbring 576 chachos?>']
      ]

In this case you have a list of lists of strings, so the list comprehension needs to be adjusted accordingly.

cleaned_issue = [[re.sub(r'[^A-Za-z0-9]+', ' ', item) for item in inner_list] for inner_list in issue]

This returns a list of lists, each containing a string:

[['hi iam going to uk us r ntomorrow morning by the way two three problems are there '], ['happy journey and r n rbring 576 chachos ']]

If you want the individual words in a list, just call split() on the result of the substitution.

tokenized_issue = [[re.sub(r'[^A-Za-z0-9]+', ' ', item).split() for item in inner_list][0] for inner_list in issue]

This gives the following result:

[['hi', 'iam', 'going', 'to', 'uk', 'us', 'r', 'ntomorrow', 'morning', 'by', 'the', 'way', 'two', 'three', 'problems', 'are', 'there'], ['happy', 'journey', 'and', 'r', 'n', 'rbring', '576', 'chachos']]
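The result above still contains the number, the leftovers of the \\r and \\n sequences, and the stop words. One possible way to get closer to the output requested in the question is sketched below. It rests on several assumptions: STOP_WORDS is a tiny hand-picked set chosen only to match this example (a real project would more likely load a fuller list from an NLP library), clean_and_tokenize is a hypothetical helper name, and the \\r and \\n in the data are assumed to be literal backslash sequences rather than real control characters.

```python
import re

# Hypothetical, hand-picked stop-word set for illustration only.
STOP_WORDS = {'iam', 'to', 'by', 'the', 'are', 'there', 'and'}

def clean_and_tokenize(text, stop_words=STOP_WORDS):
    # Assumption: the data holds literal backslash-r / backslash-n sequences;
    # replace them first so they do not leave stray 'r' or 'n' letters behind.
    text = re.sub(r'\\[rn]', ' ', text)
    # Replace every run of non-letters (digits, punctuation, underscores) with a space.
    text = re.sub(r'[^A-Za-z]+', ' ', text)
    # Tokenize on whitespace and drop the stop words.
    return [word for word in text.lower().split() if word not in stop_words]

issue = [
    ['hi iam !@going $%^ to uk&*(us \\r\\ntomorrow {morning} by the_way two-three!~`` [problems]:are there;'],
    ['happy"journey" (and) \\r\\n\\rbring 576 chachos?>'],
]

# Flatten the tokens of every string in each inner list into one word list.
issue_output = [
    [word for item in inner_list for word in clean_and_tokenize(item)]
    for inner_list in issue
]
print(issue_output)
```

Note that the \\[rn] pattern would also eat a literal backslash followed by any word starting with r or n, so it is only appropriate if the data really has this shape.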
