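How to remove every word that contains non-alphabetic characters

Asked: 2017-09-29 09:44:36

Tags: python python-2.7 python-3.x grammar

I need to write a Python script that removes every word containing non-alphabetic characters from a text file, in order to test Zipf's law. For example: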

asdf@gmail.com said: I've taken 2 reports to the boss

taken reports to the boss

How can I do this?

8 answers:

Answer 0 (score: 5)

Using a regular expression that matches only letters (and underscores), you can do this:

import re

s = "asdf@gmail.com said: I've taken 2 reports to the boss"
# s = open('text.txt').read()

tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
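
One thing to keep in mind: in Python 3, `\w` in a str pattern is Unicode-aware, so `[^\W\d]` accepts underscores and non-ASCII letters as well. If only the ASCII letters a-z should count, a stricter pattern may be a better fit. The following is a minimal sketch, assuming Python 3.4 or later for `re.fullmatch`; the names `ASCII_WORD` and `keep_ascii_words` are made up for illustration and are not part of the answer above.

```python
import re

# Hypothetical helper names, chosen for this illustration only.
ASCII_WORD = re.compile(r'[A-Za-z]+')

def keep_ascii_words(text):
    """Return the whitespace-separated tokens made only of ASCII letters."""
    return [t for t in text.split() if ASCII_WORD.fullmatch(t)]

s = "asdf@gmail.com said: I've taken 2 reports to the boss"
print(keep_ascii_words(s))
# ['taken', 'reports', 'to', 'the', 'boss']

# Tokens with an underscore or an accented letter are rejected here,
# whereas the [^\W\d]*$ pattern would keep them.
print(keep_ascii_words("snake_case café plain"))
# ['plain']
```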

Answer 1 (score: 2)

Try this:

sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
words = [word for word in sentence.split() if word.isalpha()]
# ['taken', 'reports', 'to', 'the', 'boss']

result = ' '.join(words)
# taken reports to the boss

Answer 2 (score: 2)

You can use split() and isalpha() to get a list of the words that contain only alphabetic characters and have at least one character.

>>> sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
>>> alpha_words = [word for word in sentence.split() if word.isalpha()]
>>> print(alpha_words)
['taken', 'reports', 'to', 'the', 'boss']

Then you can use join() to turn the list back into a single string:

>>> alpha_only_string = " ".join(alpha_words)
>>> print(alpha_only_string)
taken reports to the boss
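
Since the question's goal is to test Zipf's law, the filtered words would presumably be counted next. Here is a rough sketch of that step using `collections.Counter`. The function name `alpha_word_counts`, the sample sentence, and the decision to lowercase every word are assumptions made for illustration; they do not come from the answer.

```python
from collections import Counter

def alpha_word_counts(text):
    """Count the purely alphabetic words in text, ignoring case."""
    words = [w.lower() for w in text.split() if w.isalpha()]
    return Counter(words)

sample = "the boss said the reports go to the boss"
counts = alpha_word_counts(sample)

# most_common() lists (word, count) pairs from most to least frequent,
# which gives the rank/frequency pairs a Zipf plot needs.
for rank, (word, count) in enumerate(counts.most_common(2), start=1):
    print(rank, word, count)
# 1 the 3
# 2 boss 2
```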

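Answer 3 (score: 2)

The nltk package is built for working with text and provides various functions you can use to "tokenize" text into words.

You can use either RegexpTokenizer, or word_tokenize with a slight adjustment.

The simplest option is RegexpTokenizer: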

import nltk

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = nltk.RegexpTokenizer(r'\w+').tokenize(text)

This returns:

`['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']`
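
For this particular pattern, the standard library's `re.findall` should produce the same tokens without installing nltk. The sketch below also drops tokens that are not purely alphabetic, such as '2'. Note that this approach splits words at punctuation instead of discarding them, so 'asdf', 'gmail' and 'com' survive, which is not quite what the question asks for. The variable names are illustrative.

```python
import re

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

# \w+ grabs each run of letters, digits and underscores
tokens = re.findall(r'\w+', text)

# keep only the tokens that consist entirely of letters
alpha_tokens = [t for t in tokens if t.isalpha()]
print(alpha_tokens)
```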

Or you can use the slightly smarter word_tokenize, which splits most contractions, such as didn't, into did and n't:

import re
import nltk
nltk.download('punkt')  # You only have to do this once

def contains_letters(phrase):
    return bool(re.search('[a-zA-Z]', phrase))

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = [word for word in nltk.word_tokenize(text) if contains_letters(word)]

This returns:

['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to', 'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things']

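Answer 4 (score: 0)

This may help:

string = "asdf@gmail.com said: I've taken 2 reports to the boss"
array = string.split(' ')
result = []
for word in array:
    # keep only the words made up entirely of letters
    if word.isalpha():
        result.append(word)
string = ' '.join(result)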

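Answer 5 (score: 0)

You can use a regular expression, or Python's built-in string methods such as isalpha().

An example using isalpha():

with open('file path') as f:
    # note: readline() reads only the first line of the file
    line = f.readline()
    for word in line.split():
        if word.isalpha():
            print(word + ' ', end='')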
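
The snippet above looks at a single line only. To cover a whole file, one option is a small generator that walks every line. This is a sketch rather than the answerer's method; the name `alpha_words` is hypothetical, and `io.StringIO` stands in for a real file so the example runs on its own.

```python
import io

def alpha_words(lines):
    """Yield each purely alphabetic word from an iterable of text lines."""
    for line in lines:
        for word in line.split():
            if word.isalpha():
                yield word

# io.StringIO stands in for a real file here; with a file on disk you
# could pass the object returned by open(path, encoding='utf-8') instead.
fake_file = io.StringIO("asdf@gmail.com said: I've taken 2 reports\nto the boss\n")
print(' '.join(alpha_words(fake_file)))
# taken reports to the boss
```

Because the function is a generator, it never holds the whole file in memory, which may matter for the large corpora typically used in word-frequency studies.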

Answer 6 (score: 0)

str.join() plus a list comprehension gives you a one-line solution:

sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
' '.join([i for i in sentence.split() if i.isalpha()])
#'taken reports to the boss'

Answer 7 (score: 0)

I ended up writing my own function for this, because the regular expression and isalpha() did not work for the test cases I had.

letters = set('abcdefghijklmnopqrstuvwxyz')

def only_letters(word):
    for char in word.lower():
        if char not in letters:
            return False
    return True

# only 'asdf' is valid here
hard_words = ['ís', 'る', '<|endoftext|>', 'asdf']

print([x for x in hard_words if only_letters(x)])
# prints ['asdf']
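
The reason isalpha() fails on such cases is that, in Python 3, it returns True for any Unicode letter, not just a-z. A shorter alternative that should behave much like the function above combines it with `str.isascii()`. This is only a sketch: it assumes Python 3.7 or later (where `str.isascii()` was added), and the name `only_ascii_letters` is made up for illustration. One difference to be aware of is that it rejects the empty string, while only_letters('') returns True.

```python
def only_ascii_letters(word):
    """True if word is non-empty and made only of ASCII letters."""
    return word.isascii() and word.isalpha()

words = ['ís', 'る', 'asdf', 'Boss', '']

# isalpha() alone accepts the accented and Japanese words
print([w.isalpha() for w in words])
# [True, True, True, True, False]

print([w for w in words if only_ascii_letters(w)])
# ['asdf', 'Boss']
```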