Hi, I hope someone can help me. I am trying to put together a dataset of health-care tweets from several news channels (e.g. BBC, CNN, dailyhealth, foxnewshealth, gdnhealthcare, goodhealth, KaiserHealthNews, latimeshealth, msnhealthnews, NBChealth, nprhealth, nytimeshealth, reuters_health, usnewshealth, wsjhealth).
The dataset is delimited by |, and this symbol appears twice before the tweet text. For example, here is a sample record from the dataset:
585978391360221184|Thu Apr 09 01:31:50 +0000 2015|Breast cancer risk test devised http://bbc.in/1CimpJF
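Just to illustrate the format, splitting that sample record on | gives three fields (the variable names below are only my own labels for them, and maxsplit=2 is just a guard in case a tweet itself contains a |):

import re
sample = "585978391360221184|Thu Apr 09 01:31:50 +0000 2015|Breast cancer risk test devised http://bbc.in/1CimpJF"
tweet_id, timestamp, text = re.split(r'\|', sample, maxsplit=2)  # id, timestamp, tweet text
print(text)  # Breast cancer risk test devised http://bbc.in/1CimpJF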
Using a regular expression I can split each line like the one above on |, but I want to drop the first two fields and keep only the tweet text to use for clustering. I found code that separates off the first two fields:
import re
x = "585978391360221184|Thu Apr 09 01:31:50 +0000 2015|Breast cancer risk test devised http://bbc.in/1CimpJF"
d = re.split(r'\|', x).pop(-1)
print(d)
It gives the output I need:
Breast cancer risk test devised http://bbc.in/1CimpJF
However, when I apply it to the whole dataset it produces the output below, a collection of tweets from the news-agency files (apparently just one tweet per file):
["C. diff 'manslaughter' inquiry call ", 'Health Canada to stop sales of small magnets ', "Robin Roberts' cancer diagnosis ", 'Americans die sooner and are sicker than those in other high-income countries. Does this worry you? ', 'Clinton Kelly’s fresh and
#fruity take on #holiday dishes #HappyThanksgiving', '"The biggest challenge facing my department, but also the NHS as a whole, is the lack of money." ', 'RT @MSNHealth: The Mediterranean? The Volumetrics? Or maybe the DASH? U.S. News’ Best Overall Diet Plans of 2011: ', "Health law's promise of coverage not resonating with Miami's uninsured. ", 'O.B. Ultra tampons are coming back, and the company apologizes with a song ', 'Mental Illness Affects Women, Men Differently, Study Finds: ', "Why it's so hard to get the flu vaccine supply right ", 'Infection Risk Prompts New York City To Regulate Ritual Circumcision ', 'The Doctor’s World: Link to Ethical Scandals Tarnishes Prestigious Parran Award', 'New York lawmakers announce measures to confront heroin epidemic ', "RT @leonardkl: Are you getting #healthinsurance for the first time beginning Jan. 1? I'd love to interview you for a @usnews story! Let me …\n", 'For Desperate Family in India, a Ray of Hope From Alabama ']
P.S. (note that each tweet has a shortened URL after it, which I couldn't include in this post).
Here is the code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
from sklearn.metrics import adjusted_rand_score
import numpy as np
import glob
import os
file_list = glob.glob(os.path.join(os.getcwd(), "E:/Health-News-Tweets/Health-Tweets", "*.txt"))
corpus = []
labels = ["bbchealth","cbchealth","cnnhealth","everydayhealth","foxnewshealth","gdnhealthcare","goodhealth","KaiserHealthNews","latimeshealth"
,"msnhealthnews","NBChealth","nprhealth","nytimeshealth","reuters_health","usnewshealth","wsjhealth"]
for file_path in file_list:
with open(file_path,'r') as f_input:
data = f_input.read()
x = re.split('\|', data).pop(-1)
corpus.append(x)
print(corpus)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
true_k = 16
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=300, n_init=10,random_state=3425)
model.fit(X)
Y = vectorizer.transform(["An abundance of online info can turn us into e-hypochondriacs. Or, worse, lead us to neglect getting the care we need"])
prediction = model.predict(Y)
#print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
#print("terms",terms)
for i in range(true_k):
    #print("Cluster %d:" % i)
    if (prediction == i):
        print("The predicted cluster", labels[i])
        for ind in order_centroids[i, :10]:
            print(' %s' % terms[ind])
        #print
#print(prediction)
# for ind in order_centroids[i, :10]:
#print(' %s' % terms[ind]),
# print
My question is: how do I strip off the first two fields (just to be clear, I have 16 health news channels, and these are the labels for the clusters), and how do I apply that to all 16 files?
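What I think I need is something like the loop below, which reads each file line by line instead of as one big string, so that every tweet's text (without the id and timestamp) ends up in the corpus. This is only my guess at the per-line version, so I'm not sure it is the right way to do it:

for file_path in file_list:
    with open(file_path, 'r') as f_input:
        for line in f_input:  # one tweet per line
            parts = re.split(r'\|', line.strip(), maxsplit=2)  # id | timestamp | tweet text
            if len(parts) == 3:
                corpus.append(parts[-1])  # keep only the tweet text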