I'm trying to analyze tweets with a specific hashtag over a period of time. The tweets are stored in 22 separate CSV files, because I had to break them down by week to avoid random timeouts from the connection. I then tried to tokenize the tweets with the following code (using pandas and spaCy):
import re

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

for i in range(1, 23):  # Loop thru all 22 csv files
    filename = "RHOBH_ep" + str(i) + ".csv"
    data = pd.read_csv(filename)
    data['tweet_tokens'] = 'NaN'  # Create a column for tokenized tweets
    for j in range(len(data['tweet'])):  # Loop thru the tweets
        doc = nlp(data['tweet'][j])  # Tokenize the tweet
        # Get the character spans of the hashtags
        indexes = [m.span() for m in re.finditer(r'#\w+', data['tweet'][j], flags=re.IGNORECASE)]
        # Merge each hashtag and the word back together into one token
        with doc.retokenize() as retokenizer:
            for start, end in indexes:
                retokenizer.merge(doc.char_span(start, end))
        tokens = [token.text for token in doc]
        # Save the tokenized tweet (using .at to avoid chained-assignment issues)
        data.at[j, 'tweet_tokens'] = tokens
        print('Finished: ' + filename + ' | ' + str(j + 1) + ' / ' + str(len(data['tweet'])))
    new_filename = "RHOBH_ep" + str(i) + "_processed.csv"
    data.to_csv(new_filename)  # Save to a new csv file
    print('>>>> ' + new_filename + ' IS CREATED. >>>>')
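In case it helps, the hashtag-finding step on its own behaves as below. The sample tweet is made up for illustration; this only demonstrates the regex that produces the character spans fed into the merge step:

```python
import re

# Hypothetical sample tweet, just to show what the regex step returns
tweet = "Loving this episode #RHOBH #BravoTV so much"

# Character spans of each hashtag in the tweet
spans = [m.span() for m in re.finditer(r'#\w+', tweet, flags=re.IGNORECASE)]
print(spans)  # [(20, 26), (27, 35)]
```

Each (start, end) pair is then used to merge the `#` token and the following word back into a single token.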
My problem is that almost every one of my new csv files has a few rows whose formatting is completely off, and I'm not sure how that happens. Here is a screenshot of one of the files:

Could someone please help me?