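I'm trying to do some text cleaning in Python for sentiment analysis. Rather than lumping all the text together and then splitting it apart, I want to clean the text one sentence at a time. To do this, I use a for loop in my function:
[In] import string
[In] import pandas as pd
[In] data = pd.read_csv('twitter_AC.csv')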
[In] data.head()
0 We're #hiring! Click to apply: Vaccine Special...
1 Can you recommend anyone for this #job? Vaccin...
2 We're #hiring! Read about our latest #job open...
3 We're #hiring! Read about our latest #job open...
4 We're #hiring! Read about our latest #job open...
Name: text, dtype: object
[In] def text_process(text):
         '''
         Takes in a string of text, then performs the following:
         1. Remove all punctuation
         2. Remove all stopwords
         3. Return the cleaned text as a list of sentences
         '''
         for i in text:
             nopunc = [word for word in i if word not in string.punctuation]
             nopunc = ''.join(nopunc)
         return [nopunc.lower()]
[In] text_process(data)
[Out] ['were hiring read about our latest job opening here immunization rn httpstcopxczq5zrhr healthcare fairfax va careerarc']
The problem is that the loop returns only one sentence from my DataFrame.
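I can't figure out why the function doesn't output all the rows in my DataFrame. I also don't understand why the single row it returns is not the first row.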
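Answer 0 (score: 0)
You are iterating over every element in the collection, but you overwrite the nopunc variable on each iteration, so the function only returns the last row it visited.
Try this: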
def text_process(text):
    '''
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Return the cleaned text as a list of sentences
    '''
    nopuncList = []
    for i in text:
        nopunc = [word for word in i if word not in string.punctuation]
        nopunc = ''.join(nopunc)
        # Collect every cleaned row instead of keeping only the last one
        nopuncList.append(nopunc.lower())
    return nopuncList
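To see the difference concretely, here is a small self-contained sketch. The sample rows and the names last_only and all_rows are made up for illustration; they are not from the original data or code.

```python
import string

# Hypothetical sample rows standing in for the tweet column.
rows = ["We're #hiring! Click to apply.",
        "Can you recommend anyone for this #job?"]

def last_only(texts):
    # Overwriting pattern: nopunc is reassigned on every pass,
    # so only the final row is left when the loop ends.
    for row in texts:
        nopunc = ''.join(ch for ch in row if ch not in string.punctuation)
    return [nopunc.lower()]

def all_rows(texts):
    # Accumulating pattern: each cleaned row is appended to a list.
    cleaned = []
    for row in texts:
        nopunc = ''.join(ch for ch in row if ch not in string.punctuation)
        cleaned.append(nopunc.lower())
    return cleaned

print(last_only(rows))  # one element: the last row, cleaned
print(all_rows(rows))   # one element per input row
```

This also explains why the row you got back was not the first one: the loop runs to the end, and the variable simply holds whatever was assigned last.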
Answer 1 (score: 0)
Your loop is a bit off. I'd suggest using a list comprehension instead, something like:
def text_process(text):
    return [''.join(word for word in i if word not in string.punctuation).lower()
            for i in text]
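A possible variation, if the character-by-character check feels slow on a large column, is to delete punctuation with str.translate. This is only a sketch: the name text_process_translate and the sample strings are my own, not part of the question.

```python
import string

# Translation table that deletes every character in string.punctuation.
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def text_process_translate(texts):
    # Same result shape as the comprehension: one cleaned,
    # lowercased string per input row.
    return [row.translate(_STRIP_PUNCT).lower() for row in texts]

# Hypothetical sample rows for illustration.
sample = ["We're #hiring!", "Read about our latest #job opening here."]
print(text_process_translate(sample))
```

Both versions return one string per row, so either can be passed the text column directly.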
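Answer 2 (score: 0)
You are overwriting the nopunc variable again and again, so your function returns only the last row. Try starting with an empty list and appending the result to it on each iteration.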
def text_process(text):
    '''
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Return the cleaned text as a list of sentences
    '''
    result = list()
    for i in text:
        nopunc = [word for word in i if word not in string.punctuation]
        nopunc = ''.join(nopunc)
        result.append(nopunc.lower())
    return result
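Note that the docstring lists stopword removal, but none of the versions in this thread actually does it. A rough sketch of that step could look like the following. The STOPWORDS set here is a tiny hand-written placeholder, and text_process_with_stopwords is a hypothetical name; in practice you would probably swap in a fuller list.

```python
import string

# Hypothetical, deliberately tiny stopword set for illustration only.
STOPWORDS = {"a", "an", "the", "to", "for", "our", "this", "about", "you", "can"}

def text_process_with_stopwords(texts, stopwords=STOPWORDS):
    cleaned = []
    for row in texts:
        # Strip punctuation and lowercase, as in the versions above.
        nopunc = ''.join(ch for ch in row if ch not in string.punctuation).lower()
        # Drop any whitespace-separated word found in the stopword set.
        kept = [word for word in nopunc.split() if word not in stopwords]
        cleaned.append(' '.join(kept))
    return cleaned

# Hypothetical sample rows for illustration.
sample = ["Click to apply!", "Read about our latest #job opening."]
print(text_process_with_stopwords(sample))
```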
Hope this helps.