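How to convert a list of tuples from a text file into columns

Time: 2017-12-13 17:09:56

Tags: python python-2.7 csv nlp topic-modeling

I have a text file that contains a list of tuples. I want to convert this list into columns.

The file contains the following data: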

[(0, u'0.025*"minimalism" + 0.018*"diwali" + 0.018*"sunday" + 0.018*"minimalistics" + 0.018*"plant" + 0.010*"thought" + 0.010*"take" + 0.010*"httpstcog21yvu1vyo" + 0.010*"time" + 0.010*"cause"'), 
 (1, u'0.029*"panshet" + 0.022*"im" + 0.015*"video" + 0.015*"project" + 0.015*"shade" + 0.015*"nature" + 0.015*"motionphotography\u2026" + 0.015*"motionjpeg" + 0.015*"trip" + 0.015*"lake"'),
 (2, u'0.013*"light" + 0.013*"take" + 0.013*"minimalist" + 0.013*"unm4sk" + 0.013*"first" + 0.013*"minimalism\u2026" + 0.013*"minimal" + 0.013*"possible" + 0.013*"quick" + 0.013*"story"')]

I want the output in the following format:

topic 0         topic 1     topic 2
minimalism      panshet     light
diwali          im          take
sunday          video       minimalist
minimalistics   project     unm4sk
plant           shade       first

Edit 1

with open('LDA.txt') as f:
    lis = [x.split() for x in f]

cols=[x for x in zip(*lis)]
for x in cols:
    print(x)

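2 answers:

Answer 0: (score: 2)

Your first mistake is the way you load the data from your text file. (Is a text file even the best way to save this data? If you are saving Python objects, it is better to use pickle.)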
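A minimal sketch of the pickle route, assuming you still have the topic list as a Python object when you save it. The two-topic list and the file name `topics.pkl` are made up for illustration.

```python
import os
import pickle
import tempfile

# Made-up stand-in for the topic list; the real one would come from the LDA model.
topics = [(0, u'0.025*"minimalism" + 0.018*"diwali"'),
          (1, u'0.029*"panshet" + 0.022*"im"')]

# Hypothetical file name, placed in a temporary directory for this demo.
path = os.path.join(tempfile.mkdtemp(), 'topics.pkl')

# Pickle files must be opened in binary mode.
with open(path, 'wb') as f:
    pickle.dump(topics, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# The list comes back as a list of tuples, with no string parsing needed.
print(restored == topics)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.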

In any case, the fix is simple. When reading the file, call ast.literal_eval:

import ast

with open('LDA.txt') as f:
    data = ast.literal_eval(f.read())
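To see what this returns without needing LDA.txt on disk, here is a small sketch that parses an inline string instead. The string is a shortened, made-up stand-in for the file contents.

```python
import ast

# Shortened stand-in for the text stored in LDA.txt.
text = '''[(0, u'0.025*"minimalism" + 0.018*"diwali"'),
 (1, u'0.029*"panshet" + 0.022*"im"')]'''

# literal_eval only accepts Python literals (lists, tuples, strings, numbers, ...),
# so unlike eval() it will not run arbitrary expressions found in the file.
data = ast.literal_eval(text)

for topic_id, description in data:
    print(topic_id, description)
```

The result is a real list of `(int, str)` tuples, which is why the loop can unpack each item directly.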

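Now comes the part you have been waiting for. You can easily extract the words with re.findall. For each tuple in data, extract all the words and store them in a dictionary. Then pass the dictionary to the pd.DataFrame constructor.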

import re
import pandas as pd

d = {}
for i, y in data:
    d['topic {}'.format(i)] = re.findall('"(.*?)"', y) 

df = pd.DataFrame(d)

df 
              topic 0             topic 1      topic 2
0          minimalism             panshet        light
1              diwali                  im         take
2              sunday               video   minimalist
3       minimalistics             project       unm4sk
4               plant               shade        first
5             thought              nature  minimalism…
6                take  motionphotography…      minimal
7  httpstcog21yvu1vyo          motionjpeg     possible
8                time                trip        quick
9               cause                lake        story
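If you also want to keep the weights, the same re.findall approach can capture them with a second group. This is a sketch on one shortened topic string, not part of the answer's original code.

```python
import re

# One topic string in the format shown in the question (shortened).
topic = u'0.025*"minimalism" + 0.018*"diwali" + 0.018*"sunday"'

# Group 1 captures the weight before the '*', group 2 the word between the quotes.
# The non-greedy .*? stops at the first closing quote, so each word is matched separately.
pairs = re.findall(r'([\d.]+)\*"(.*?)"', topic)

# Convert the weights from strings to floats and put the word first.
weighted = [(word, float(weight)) for weight, word in pairs]
print(weighted)
```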

If you want another way to tabulate the data (without using a DataFrame), see here (second answer).
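One possible way to do it with only the standard library is sketched below. It is not necessarily the approach used in the linked answer, and it uses a shortened copy of the question's data.

```python
import re

# Shortened copy of the data from the question.
data = [(0, u'0.025*"minimalism" + 0.018*"diwali" + 0.018*"sunday"'),
        (1, u'0.029*"panshet" + 0.022*"im" + 0.015*"video"'),
        (2, u'0.013*"light" + 0.013*"take" + 0.013*"minimalist"')]

headers = ['topic {}'.format(i) for i, _ in data]
columns = [re.findall(r'"(.*?)"', s) for _, s in data]

# Pad every cell to the widest entry (plus two spaces) so the columns line up.
cells = headers + [word for col in columns for word in col]
width = max(len(cell) for cell in cells) + 2

lines = [''.join(h.ljust(width) for h in headers)]
# zip(*columns) turns the per-topic word lists into rows.
for row in zip(*columns):
    lines.append(''.join(word.ljust(width) for word in row))

table = '\n'.join(line.rstrip() for line in lines)
print(table)
```

Note that zip stops at the shortest column, so topics with fewer words would truncate the table.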

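Answer 1: (score: 0)

I think the output looks like the __str__ format of the LDA model output from gensim.

Instead of printing the topics, saving the strings, and then post-processing them: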

from gensim import corpora, models

documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
model.print_topics(3)

[OUT]:

[(51, '0.083*"response" + 0.083*"time" + 0.083*"graph" + 0.083*"trees" + 0.083*"eps" + 0.083*"computer" + 0.083*"survey" + 0.083*"interface" + 0.083*"user" + 0.083*"human"'), (48, '0.083*"response" + 0.083*"time" + 0.083*"graph" + 0.083*"trees" + 0.083*"eps" + 0.083*"computer" + 0.083*"survey" + 0.083*"interface" + 0.083*"user" + 0.083*"human"'), (42, '0.083*"response" + 0.083*"time" + 0.083*"graph" + 0.083*"trees" + 0.083*"eps" + 0.083*"computer" + 0.083*"survey" + 0.083*"interface" + 0.083*"user" + 0.083*"human"')]

you should use models.LdaModel.top_topics():

model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
top3_topics = model.top_topics(corpus)[:3]
for topic, topic_score in top3_topics:
    word_scores, words = zip(*topic)
    top10_words = words[:10]
    print(top10_words)

[OUT]:

('time', 'response', 'user', 'computer', 'human', 'interface', 'system', 'survey', 'eps', 'trees')
('survey', 'minors', 'graph', 'computer', 'human', 'interface', 'user', 'system', 'time', 'response')
('computer', 'human', 'interface', 'user', 'system', 'time', 'survey', 'response', 'eps', 'trees')
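The `zip(*topic)` line is what separates the scores from the words. A small sketch with made-up numbers, assuming each topic is a list of (score, word) pairs as the loop above implies:

```python
# Made-up (score, word) pairs standing in for one topic from top_topics().
topic = [(0.083, 'time'), (0.081, 'response'), (0.079, 'user')]

# zip(*topic) regroups the pairs: all scores in one tuple, all words in another.
word_scores, words = zip(*topic)

print(word_scores)
print(words)
```

Slicing `words[:10]` afterwards simply keeps the first ten words in the order the topic lists them.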

If you want to put them in a pandas.DataFrame:

>>> import pandas as pd
>>> 
>>> top10_words_per_topic = []
>>> for topic, topic_score in top3_topics:
...     word_scores, words = zip(*topic)
...     top10_words_per_topic.append(words[:10])
... 


>>> df = pd.DataFrame(top10_words_per_topic).transpose()
>>> df.rename(columns={0:'Topic0', 1:'Topic1', 2:'Topic2'})
      Topic0     Topic1     Topic2
0       time     survey   computer
1   response     minors      human
2       user      graph  interface
3   computer   computer       user
4      human      human     system
5  interface  interface       time
6     system       user     survey
7     survey     system   response
8        eps       time        eps
9      trees   response      trees