I'm trying to analyze tweets with a specific hashtag over a period of time. The tweets are stored in 22 separate CSV files, because I had to break them down by week to avoid random timeouts from the connection. I then tried to tokenize the tweets with the following code (using pandas and spaCy):
import re

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

for i in range(1, 23):  # Loop thru all 22 csv files
    filename = "RHOBH_ep" + str(i) + ".csv"
    data = pd.read_csv(filename)
    data['tweet_tokens'] = 'NaN'  # Create a column for tokenized tweets
    for j in range(len(data['tweet'])):  # Loop thru the tweets
        doc = nlp(data['tweet'][j])  # Tokenize the tweet
        # Get the character spans of the hashtags
        indexes = [m.span() for m in re.finditer(r'#\w+', data['tweet'][j], flags=re.IGNORECASE)]
        # Merge each hashtag and the word back together into one token
        with doc.retokenize() as retokenizer:
            for start, end in indexes:
                retokenizer.merge(doc.char_span(start, end))
        tokens = [token.text for token in doc]
        # Save the tokenized tweet (using .at to avoid chained-assignment issues)
        data.at[j, 'tweet_tokens'] = tokens
        print('Finished: ' + filename + ' | ' + str(j + 1) + ' / ' + str(len(data['tweet'])))
    new_filename = "RHOBH_ep" + str(i) + "_processed.csv"
    data.to_csv(new_filename)  # Save to a new csv file
    print('>>>> ' + new_filename + ' IS CREATED. >>>>')
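In case it helps, the hashtag-finding step on its own behaves as below. The sample tweet is made up for illustration; this only demonstrates the regex that produces the character spans fed into the merge step:

```python
import re

# Hypothetical sample tweet, just to show what the regex step returns
tweet = "Loving this episode #RHOBH #BravoTV so much"

# Character spans of each hashtag in the tweet
spans = [m.span() for m in re.finditer(r'#\w+', tweet, flags=re.IGNORECASE)]
print(spans)  # [(20, 26), (27, 35)]
```

Each (start, end) pair is then used to merge the `#` token and the following word back into a single token.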
My problem is that almost every one of my new csv files has a few rows whose formatting is completely off, and I'm not sure how that happens. Here is a screenshot of one of the files:

Could someone please help me?