gensim / nltk.corpus.reader: same data/settings, different output

Date: 2018-06-29 09:31:12

Tags: python nltk gensim topic-modeling

I am still a beginner when it comes to Python/gensim, but I find it very useful and am using it for my thesis on topics in Gothic novels.

My approach with gensim is to preprocess the texts, load them into a corpus with nltk.corpus.reader.PlaintextCorpusReader, and then apply gensim's topic models to that corpus.
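In outline, the pipeline looks roughly like this (a stripped-down sketch of the full code further down, using the same paths and encoding):

from nltk.corpus.reader import PlaintextCorpusReader
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess

reader = PlaintextCorpusReader("C:/Users/maart/Documents/Thesis/Data/", r".*\.txt", encoding="ISO-8859-1")
docs = [simple_preprocess(" ".join(reader.words(fid))) for fid in reader.fileids()]   # one token list per file
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=100, random_state=100)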

My problem is that when I load the texts directly from the Data folder,

gothiccorpus = nltk.corpus.reader.PlaintextCorpusReader(r"C:/Users/maart/Documents/Thesis/Data/", r".*\.txt", encoding = "ISO-8859-1")

I get the output I want, but when I use a pandas dataframe to select texts of certain categories, write those texts to another folder that looks exactly the same (see twodatadirectories.jpg), and then load that folder into nltk.corpus.reader, the output is very different:

gothiccorpus = nltk.corpus.reader.PlaintextCorpusReader(r"C:/Users/maart/Documents/Thesis/n." + handle + "-all/", r".*\.txt", encoding = "ISO-8859-1")

What is going on, and how can I get the preselect-and-copy-to-a-new-folder setup to produce the same results as mapping directly to the Data folder?
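A quick sanity check along these lines might show whether the two readers even see the same documents (a sketch; it assumes both folders are already populated and that handle is set as in the code below):

from nltk.corpus.reader import PlaintextCorpusReader

original = PlaintextCorpusReader(r"C:/Users/maart/Documents/Thesis/Data/", r".*\.txt", encoding="ISO-8859-1")
copied = PlaintextCorpusReader(r"C:/Users/maart/Documents/Thesis/n." + handle + "-all/", r".*\.txt", encoding="ISO-8859-1")

print(len(original.fileids()), len(copied.fileids()))            # same number of files?
print(sum(len(original.words(f)) for f in original.fileids()))   # total words in the originals
print(sum(len(copied.words(f)) for f in copied.fileids()))       # total words in the copies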

Code (apologies, it is long):

# GOTHICCORPUS EXCEL RETRIEVER

# Google Sheets to Dataframe
import gspread
from oauth2client.service_account import ServiceAccountCredentials

# general imports used further down
import os
import re
import csv
import pandas as pd
import nltk
from nltk.corpus.reader import PlaintextCorpusReader
from nltk.corpus import stopwords
import gensim
from gensim import corpora
from gensim.utils import simple_preprocess
import spacy
from pprint import pprint

stop_words = stopwords.words('english')   # stop word list used by remove_stopwords() below (assuming the standard NLTK English list)
scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name(r"C:\Users\maart\Documents\Thesis\gothiccorpus-d2bcb2d5f193.json", scope)
client = gspread.authorize(creds)
sheet = client.open("Gothic Fiction Corpus - skinned").sheet1

gc = pd.DataFrame(sheet.get_all_records())
cols = gc.columns
#print(cols)

# Handles
code = gc[['Code']]
corpus = gc[['Corpus']]
path = gc[['Path']]
link = gc[['Link']]
found = gc[['Found']]
quality = gc[['Quality']]
longs = gc[['F']]
epigraph = gc[['Complete']]
year = gc[['Year']]
title = gc[['Title']]
placegoed = gc[['Place']]
place = gc[['Place1']]
time = gc[['Time']]
year = gc[['Year']]
author = gc[['Author']]
author_gender = gc[['Au_gender']]
author_nation = gc[['Au_nationality']]
author_affilation = gc[['Au_affilation']]

# Overview
overview = gc[['Code','Title', 'Path']]
#print(overview)

# MAIN CORPUS - create/paste

mains = gc.loc[gc['Corpus'].isin(['IF', 'GN', 'BL', 'WG'])] #ALL
mainspath = mains['Path']
mainer = mainspath.tolist()
mainlist = []
for m in mainer:
    m = m.split(",")
    mainlist.extend(m)
mainfiles = []
for m in mainlist:
    # read each source file in full, in the same encoding the corpus reader uses later
    with open(m, "r", encoding="ISO-8859-1") as pakker:
        lees = pakker.read()
        #print(lees[:20])
    mainfiles.append(lees)

# Make new dir for the corpus ('handle' is a corpus tag set earlier in my script, not shown here)
maindirhandle = "C:/Users/maart/Documents/Thesis/n." + handle + "-all/"
maincorpusdir = maindirhandle
if not os.path.isdir(maincorpusdir):
    os.mkdir(maincorpusdir)    # create the dir
mainfilename = 0
for maintext in mainfiles:
    mainfilename += 1
    # write each selected text to the new directory under a numeric name,
    # in the same encoding the corpus reader expects
    with open(maincorpusdir + str(mainfilename) + '.txt', 'w', encoding="ISO-8859-1") as fout:
        fout.write(maintext)   # the with-block closes the file, so no explicit close() is needed
mainidgrabber = PlaintextCorpusReader("C:/Users/maart/Documents/Thesis/n." + handle + "-all/", ".*")
mainids = mainidgrabber.fileids()        #allfilesincorpus

# Rename Main files
mainpad =  ("C:/Users/maart/Documents/Thesis/n." + handle + "-all/")
maindirs = sorted(mainids, key=lambda x: int(re.sub(r'\D', '', x)))
mainshortfnames = []
for name in mainlist:
    mainshortfnames.append(name.replace("C:/Users/maart/Documents/Thesis/Data/", ""))    
with open('C:/Users/maart/Documents/Thesis/Handles/' + handle + '-all.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(zip(maindirs,mainshortfnames))
newpath = mainpad
os.chdir(newpath)
with open('C:/Users/maart/Documents/Thesis/Handles/' + handle + '-all.csv') as f:
    lines = csv.reader(f)
    for line in lines:
        os.rename(line[0], line[1])

# reader over the copied/selected texts; the next line overrides it with a reader over the
# original Data folder, so only one of the two is active at a time
gothiccorpus = nltk.corpus.reader.PlaintextCorpusReader(r"C:/Users/maart/Documents/Thesis/n." + handle + "-all/", r".*\.txt", encoding="ISO-8859-1")
gothiccorpus = nltk.corpus.reader.PlaintextCorpusReader(r"C:/Users/maart/Documents/Thesis/Data/", r".*\.txt", encoding="ISO-8859-1")
mainids = gothiccorpus.fileids()        #allfilesincorpus
print(mainids)

maindata = []
for fileid in gothiccorpus.fileids():
    document = ' '.join(gothiccorpus.words(fileid))
    maindata.append(document)    

print(len(maindata))
# print(maindata[173])

# MAIN CORPUS - data

maindata = [re.sub(r'\S*@\S*\s?', '', sent) for sent in maindata]   # strip e-mail-like tokens
maindata = [re.sub(r'\s+', ' ', sent) for sent in maindata]         # collapse whitespace
maindata = [re.sub("'", "", sent) for sent in maindata]             # drop apostrophes
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuations
maindata_words = list(sent_to_words(maindata))

#BUILD NLTK AND DIC
mainbigram = gensim.models.Phrases(maindata_words, min_count=5, threshold=100) # higher threshold fewer phrases.
maintrigram = gensim.models.Phrases(mainbigram[maindata_words], threshold=100)  
mainbigram_mod = gensim.models.phrases.Phraser(mainbigram)
maintrigram_mod = gensim.models.phrases.Phraser(maintrigram)

# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(maintexts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in maintexts]
def make_bigrams(maintexts):
    return [mainbigram_mod[doc] for doc in maintexts]
def make_trigrams(maintexts):
    return [maintrigram_mod[mainbigram_mod[doc]] for doc in maintexts]
def lemmatization(maintexts):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in maintexts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc])
    return texts_out

maindata_words_nostops = remove_stopwords(maindata_words)
maindata_words_bigrams = make_bigrams(maindata_words_nostops)
nlp = spacy.load('en', disable=['parser', 'ner'])

maindata_lemmatized = lemmatization(maindata_words_bigrams)
mainid2word = corpora.Dictionary(maindata_lemmatized)
maintexts = maindata_lemmatized
maincorpus = [mainid2word.doc2bow(maintext) for maintext in maintexts]

mainmmhandle = ("C:/Users/maart/Documents/Thesis/Handles/" + handle + "-all.mm")
corpora.MmCorpus.serialize(mainmmhandle, maincorpus)  #save dic

mainditcijfer = mainid2word[5432]
print(mainditcijfer)

# Build LDA model
mainlda_model = gensim.models.ldamodel.LdaModel(corpus=maincorpus,
                                           id2word=mainid2word,
                                           num_topics=100, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)


doc_lda = mainlda_model[maincorpus]
#???

NUM_TOPICS = 100
NUM_WORDS = 100
pprint(mainlda_model.print_topics(num_topics=NUM_TOPICS, num_words=NUM_WORDS))


Because I use the second kind of data retrieval to create topics for certain types of novels (novels written before a certain year, novels written by authors of a certain nationality), and because I want to stay consistent, I would very much like the pandas-based retrieval method to output the "correct" topics as well.
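One alternative I could try is to skip the copy/rename step entirely and read the selected files straight from the paths in the dataframe (a sketch; it assumes the 'Path' column holds comma-separated paths to the original .txt files, as in the code above):

selected = gc.loc[gc['Corpus'].isin(['IF', 'GN', 'BL', 'WG'])]
paths = [p for row in selected['Path'] for p in row.split(",")]

docs = []
for p in paths:
    with open(p, "r", encoding="ISO-8859-1") as fin:
        docs.append(fin.read())
# 'docs' could then go through the same preprocessing (simple_preprocess, Phrases, doc2bow)
# without ever writing the texts to a second directory.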

What am I doing wrong here?

Many thanks for this forum; it has already helped me solve plenty of errors.


0 Answers:

There are no answers yet.