Using regular expressions to split a dataset in Python

Date: 2018-05-14 23:35:45

Tags: python regex twitter dataset

Hi, I hope someone can help me. I am trying to put together a dataset of health-care tweets from several news channels (e.g. BBC, cnn, dailyhealth, foxnewshealth, gdnhealthcare, goodhealth, KaiserHealthNews, latimeshealth, msnhealthnews, NBChealth, nprhealth, nytimeshealth, reuters_health, usnewshealth, wsjhealth).

The dataset is delimited by |, and this symbol appears twice before the tweet itself. Here is a sample line from the dataset:

585978391360221184|Thu Apr 09 01:31:50 +0000 2015|Breast cancer risk test devised http://bbc.in/1CimpJF

Using a regular expression I am able to split each line on |, but I want to drop the first two fields and keep only the tweet text so I can use it for clustering. I found a piece of code that separates out the first two fields:

import re

x = "585978391360221184|Thu Apr 09 01:31:50 +0000 2015|Breast cancer risk test devised http://bbc.in/1CimpJF"
d = re.split(r'\|', x).pop(-1)
print(d)

It gives me exactly the output I need:

Breast cancer risk test devised http://bbc.in/1CimpJF
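
(In case it helps, this is my understanding of what the split returns on that sample line before pop(-1) takes the last element:)

import re

line = "585978391360221184|Thu Apr 09 01:31:50 +0000 2015|Breast cancer risk test devised http://bbc.in/1CimpJF"
fields = re.split(r'\|', line)
# fields == ['585978391360221184',
#            'Thu Apr 09 01:31:50 +0000 2015',
#            'Breast cancer risk test devised http://bbc.in/1CimpJF']
print(fields[-1])   # prints only the tweet text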

However, when I apply it to the whole dataset, it produces the output below; it is a collection of tweets taken from the news agencies' files:

["C. diff 'manslaughter' inquiry call  ", 'Health Canada to stop sales of small magnets ', "Robin Roberts' cancer diagnosis ", 'Americans die sooner and are sicker than those in other high-income countries. Does this worry you? ', 'Clinton Kelly’s fresh and
#fruity take on #holiday dishes   #HappyThanksgiving', '"The biggest challenge facing my department, but also the NHS as a whole, is the lack of money." ', 'RT @MSNHealth: The Mediterranean? The Volumetrics? Or maybe the DASH? U.S. News’ Best Overall Diet Plans of 2011: ', "Health law's promise of coverage not resonating with Miami's uninsured. ", 'O.B. Ultra tampons are coming back, and the company apologizes with a song ', 'Mental Illness Affects Women, Men Differently, Study Finds: ', "Why it's so hard to get the flu vaccine supply right ", 'Infection Risk Prompts New York City To Regulate Ritual Circumcision ', 'The Doctor’s World: Link to Ethical Scandals Tarnishes Prestigious Parran Award', 'New York lawmakers announce measures to confront heroin epidemic ', "RT @leonardkl: Are you getting #healthinsurance for the first time beginning Jan. 1? I'd love to interview you for a @usnews story! Let me …\n", 'For Desperate Family in India, a Ray of Hope From Alabama ']

PS: note that the URL after each tweet is shortened, but I could not post them here.

Here is the code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
from sklearn.metrics import adjusted_rand_score
import numpy as np

import glob
import os


file_list = glob.glob(os.path.join(os.getcwd(), "E:/Health-News-Tweets/Health-Tweets", "*.txt"))

corpus = []
labels = ["bbchealth","cbchealth","cnnhealth","everydayhealth","foxnewshealth","gdnhealthcare","goodhealth","KaiserHealthNews","latimeshealth"
          ,"msnhealthnews","NBChealth","nprhealth","nytimeshealth","reuters_health","usnewshealth","wsjhealth"]

for file_path in file_list:
    with open(file_path,'r') as f_input:
        data = f_input.read()
        x = re.split(r'\|', data).pop(-1)
        corpus.append(x)

print(corpus)

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

true_k = 16
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=300, n_init=10,random_state=3425)
model.fit(X)

Y = vectorizer.transform(["An abundance of online info can turn us into e-hypochondriacs. Or, worse, lead us to neglect getting the care we need"])
prediction = model.predict(Y)

#print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
#print("terms",terms)
for i in range(true_k):
    #print("Cluster %d:" % i)
    if(prediction == i):
        print("The predicted cluster",labels[i])
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    #print
#print(prediction)

My question is: how do I strip off the first two fields (just to be clear, I have 16 health news channels, and these are the labels for the clusters), and how do I apply this to all 16 files?
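
To make the goal more concrete, here is a rough sketch of the line-by-line version I have in mind (the folder path is the one from my code above, and taking the channel label from the file name is only my own guess; I have not verified this is the right approach):

import glob
import os
import re

# folder that holds the 16 *.txt files, one file per news channel (my local path)
file_list = glob.glob(os.path.join("E:/Health-News-Tweets/Health-Tweets", "*.txt"))

corpus = []        # one entry per tweet, text only
tweet_labels = []  # channel name (taken from the file name) for every tweet

for file_path in file_list:
    channel = os.path.splitext(os.path.basename(file_path))[0]   # e.g. "bbchealth"
    with open(file_path, 'r') as f_input:
        for line in f_input:
            line = line.strip()
            if not line:
                continue
            # split only on the first two '|' so a '|' inside the tweet text is preserved
            parts = re.split(r'\|', line, maxsplit=2)
            if len(parts) == 3:
                corpus.append(parts[2])       # keep just the tweet text
                tweet_labels.append(channel)

print(len(corpus), "tweets collected")

If that is roughly the right idea, corpus could then go straight into TfidfVectorizer as in the code above.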

0 Answers:

No answers yet.