I'm fairly new to NLP, so please bear with me. I have a complete list of the text of Trump's tweets since he took office, and I'm tokenizing the text to analyze the content.

I'm using the TweetTokenizer from the nltk library in Python, and I'm trying to tokenize everything except numbers and punctuation. The problem is that my code removes all of the tokens except one.

I've tried using the .isalpha() method, but it doesn't work; I thought it was supposed to be True only for strings composed of letters.
# Create a corpus from the tweets
text = non_re['text']

# Make all text lowercase
low_txt = [l.lower() for l in text]

# Iteratively tokenize the tweets
TokTweet = TweetTokenizer()
tokens = [TokTweet.tokenize(t) for t in low_txt if t.isalpha()]
My output is just one token. If I remove the if t.isalpha() condition, I get all of the tokens, including numbers and punctuation, which suggests that isalpha() is the culprit behind the over-filtering.
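To illustrate: isalpha() is True only when every character in the string is a letter, so a whole tweet, which contains spaces and punctuation, fails the check and gets skipped before it is ever tokenized. A quick sanity check with a made-up string:

print("make america great again!".isalpha())  # False: spaces and "!" are not letters
print("make".isalpha())                       # True: every character is a letter

So presumably the one token I do get comes from a tweet that happens to be a single word of letters.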
What I'd like is a way to get the tokens from the tweet text without the punctuation and numbers. Thanks for your help!
Answer 0 (score: 1)
Try something like this:
import string
import re
from nltk.tokenize import TweetTokenizer

tweet = "first think another Disney movie, might good, it's kids movie. watch it, can't help enjoy it. ages love movie. first saw movie 10 8 years later still love it! Danny Glover superb could play"

def clean_text(text):
    # Remove numbers
    text_nonum = re.sub(r'\d+', '', text)
    # Remove punctuation and convert characters to lower case
    text_nopunct = "".join([char.lower() for char in text_nonum if char not in string.punctuation])
    # Collapse runs of whitespace into a single space and strip leading/trailing whitespace
    text_no_doublespace = re.sub(r'\s+', ' ', text_nopunct).strip()
    return text_no_doublespace

cleaned_tweet = clean_text(tweet)
tt = TweetTokenizer()
print(tt.tokenize(cleaned_tweet))
Output:
['first', 'think', 'another', 'disney', 'movie', 'might', 'good', 'its', 'kids', 'movie', 'watch', 'it', 'cant', 'help', 'enjoy', 'it', 'ages', 'love', 'movie', 'first', 'saw', 'movie', 'years', 'later', 'still', 'love', 'it', 'danny', 'glover', 'superb', 'could', 'play']
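Applied to the full list of tweets from the question, the same function can be mapped over each element (a sketch; low_txt is the lowercased list from the question's code):

tokens = [tt.tokenize(clean_text(t)) for t in low_txt]

Note that stripping punctuation before tokenizing collapses contractions, so "can't" becomes "cant", as in the output above. If you would rather drop those tokens entirely, tokenize first and keep only the purely alphabetic tokens:

tokens_alpha = [[tok for tok in tt.tokenize(t) if tok.isalpha()] for t in low_txt]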
Answer 1 (score: 0)
# Function for removing punctuation from text; it also returns the total number of punctuation marks removed.
# Input: the existing file name and the new file name as strings, e.g. 'existingFileName.txt' and 'newFileName.txt'
# Return: the punctuation-free file opened in read mode, and the punctuation count
def removePunctuation(tokenizeSampleText, newFileName):
    from nltk.tokenize import word_tokenize
    import string

    existingFile = open(tokenizeSampleText, 'r')
    read_existingFile = existingFile.read()
    tokenize_existingFile = word_tokenize(read_existingFile)

    puncRemovedFile = open(newFileName, 'w+')

    stringPun = list(string.punctuation)
    count_pun = 0
    for word in tokenize_existingFile:
        if word in stringPun:
            count_pun += 1
        else:
            # Keep the token, followed by a space, in the output file
            puncRemovedFile.write(word + ' ')

    existingFile.close()
    puncRemovedFile.close()
    return open(newFileName, 'r'), count_pun

punRemoved, punCount = removePunctuation('Macbeth.txt', 'Macbeth-punctuationRemoved.txt')
print(f'Total Punctuation : {punCount}')
# Show the punctuation-free text
print(punRemoved.read())
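This approach works on files; for the tweet list from the question, the same tokenize-and-count idea can be applied in memory without the intermediate file (a sketch, reusing low_txt from the question; it assumes the NLTK 'punkt' data has been downloaded via nltk.download('punkt')):

from nltk.tokenize import word_tokenize
import string

punct = set(string.punctuation)
count_pun = 0
tokens = []
for t in low_txt:
    kept = []
    for word in word_tokenize(t):
        if word in punct:
            count_pun += 1       # count punctuation tokens instead of keeping them
        elif not word.isdigit():
            kept.append(word)    # keep everything else except pure numbers
    tokens.append(kept)
print(f'Total punctuation removed: {count_pun}')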