I'm still a beginner when it comes to Python/gensim, but I'm finding it very useful and am using it for my thesis on themes in Gothic fiction.
The way I use gensim is to preprocess the texts, load them into a corpus with nltk.corpus.reader.PlaintextCorpusReader, and then apply gensim's topic models to that corpus.
My problem is this: when I load the texts directly from the Data folder,
gothiccorpus = nltk.corpus.reader.PlaintextCorpusReader(r"C:/Users/maart/Documents/Thesis/Data/", r".*\.txt", encoding = "ISO-8859-1")
I get the output I want. But when I use a pandas dataframe to select certain categories of texts, write those texts out to another folder that looks exactly the same (see twodatadirectories.jpg), and then load that folder with nltk.corpus.reader,
the output is very different:
gothiccorpus = nltk.corpus.reader.PlaintextCorpusReader(r"C:/Users/maart/Documents/Thesis/n." + handle + "-all/", r".*\.txt", encoding = "ISO-8859-1")
What is going on, and how can I get the select-with-pandas-then-write-to-a-new-folder route to produce the same results as loading the Data folder directly?
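By "looks exactly the same" I mean that I would expect a quick check like the sketch below to report no differences. It hashes the raw bytes of each copied file and compares them to the same-named file in the Data folder (hashlib is just my ad-hoc way of verifying this, not part of the pipeline, and it assumes the renaming step in my code gives the copies the same names as the originals):

import hashlib
import os

def md5_of(path):
    # hash the raw bytes, so encoding differences show up too
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

datadir = r"C:/Users/maart/Documents/Thesis/Data/"
copydir = r"C:/Users/maart/Documents/Thesis/n." + handle + "-all/"
for name in sorted(os.listdir(copydir)):
    if name.endswith(".txt") and md5_of(copydir + name) != md5_of(datadir + name):
        print(name, "differs from the original")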
The code (sorry, it's long):
# GOTHICCORPUS EXCEL RETRIEVER
# Google Sheets to Dataframe
import csv
import os
import re
from pprint import pprint

import gspread
import pandas as pd
import nltk
from nltk.corpus.reader import PlaintextCorpusReader
import gensim
from gensim import corpora
from gensim.utils import simple_preprocess
import spacy
from oauth2client.service_account import ServiceAccountCredentials

# `handle` (a label for this corpus run) and `stop_words` (my stop word list)
# are defined earlier in my script and not shown here.

scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name(r"C:\Users\maart\Documents\Thesis\gothiccorpus-d2bcb2d5f193.json", scope)
client = gspread.authorize(creds)
sheet = client.open("Gothic Fiction Corpus - skinned").sheet1
gc = pd.DataFrame(sheet.get_all_records())
cols = gc.columns
#print(cols)
# Handles
code = gc[['Code']]
corpus = gc[['Corpus']]
path = gc[['Path']]
link = gc[['Link']]
found = gc[['Found']]
quality = gc[['Quality']]
longs = gc[['F']]
epigraph = gc[['Complete']]
year = gc[['Year']]
title = gc[['Title']]
placegoed = gc[['Place']]
place = gc[['Place1']]
time = gc[['Time']]
author = gc[['Author']]
author_gender = gc[['Au_gender']]
author_nation = gc[['Au_nationality']]
author_affilation = gc[['Au_affilation']]
# Overview
overview = gc[['Code','Title', 'Path']]
#print(overview)
# MAIN CORPUS - create/paste
mains = gc.loc[gc['Corpus'].isin(['IF', 'GN', 'BL', 'WG'])] #ALL
mainspath = mains['Path']
mainer = mainspath.tolist()
mainlist = []
for m in mainer:
    m = m.split(",")
    mainlist.extend(m)
mainfiles = []
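# NOTE: `for pak in pakker` first consumes a line of the file, and only then
# does pakker.read() return the remainder, which is what gets appended below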
for m in mainlist:
    pakker = open(m, "r")
    for pak in pakker:
        lees = pakker.read()
        #print(lees[:20])
        mainfiles.append(lees)
# Make new dir for the corpus
maindirhandle = ("C:/Users/maart/Documents/Thesis/n." + handle + "-all/")
maincorpusdir = maindirhandle
if not os.path.isdir(maincorpusdir):
    os.mkdir(maincorpusdir)  # create the corpus directory if it does not exist yet
mainfilename = 0
for maintext in mainfiles:
    mainfilename += 1
    with open(maincorpusdir + str(mainfilename) + '.txt', 'w') as fout:
        fout.write(maintext)
        # the with-block closes the file automatically; no explicit close() needed
mainidgrabber = PlaintextCorpusReader("C:/Users/maart/Documents/Thesis/n." + handle + "-all/", ".*")
mainids = mainidgrabber.fileids() #allfilesincorpus
# Rename Main files
mainpad = ("C:/Users/maart/Documents/Thesis/n." + handle + "-all/")
maindirs = sorted(mainids, key=lambda x: int(re.sub(r'\D', '', x)))
mainshortfnames = []
for name in mainlist:
    mainshortfnames.append(name.replace("C:/Users/maart/Documents/Thesis/Data/", ""))
with open('C:/Users/maart/Documents/Thesis/Handles/' + handle + '-all.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(zip(maindirs, mainshortfnames))
newpath = mainpad
os.chdir(newpath)
with open('C:/Users/maart/Documents/Thesis/Handles/' + handle + '-all.csv') as f:
    lines = csv.reader(f)
    for line in lines:
        os.rename(line[0], line[1])
# Reading from the folder produced by the pandas selection + rewrite above:
gothiccorpus = nltk.corpus.reader.PlaintextCorpusReader(r"C:/Users/maart/Documents/Thesis/n." + handle + "-all/", r".*\.txt", encoding = "ISO-8859-1")
# Reading directly from the Data folder (this overrides the line above; I switch between the two):
gothiccorpus = nltk.corpus.reader.PlaintextCorpusReader(r"C:/Users/maart/Documents/Thesis/Data/", r".*\.txt", encoding = "ISO-8859-1")
mainids = gothiccorpus.fileids() #allfilesincorpus
print(mainids)
maindata = []
for fileid in gothiccorpus.fileids():
    document = ' '.join(gothiccorpus.words(fileid))
    maindata.append(document)
print(len(maindata))
# print(maindata[173])
# MAIN CORPUS - data
maindata = [re.sub(r'\S*@\S*\s?', '', sent) for sent in maindata]  # strip e-mail-like tokens
maindata = [re.sub(r'\s+', ' ', sent) for sent in maindata]  # collapse whitespace
maindata = [re.sub(r"\'", "", sent) for sent in maindata]  # drop apostrophes
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True removes punctuation
maindata_words = list(sent_to_words(maindata))
# BUILD N-GRAM MODELS AND DICTIONARY
mainbigram = gensim.models.Phrases(maindata_words, min_count=5, threshold=100) # higher threshold fewer phrases.
maintrigram = gensim.models.Phrases(mainbigram[maindata_words], threshold=100)
mainbigram_mod = gensim.models.phrases.Phraser(mainbigram)
maintrigram_mod = gensim.models.phrases.Phraser(maintrigram)
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(maintexts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in maintexts]

def make_bigrams(maintexts):
    return [mainbigram_mod[doc] for doc in maintexts]

def make_trigrams(maintexts):
    return [maintrigram_mod[mainbigram_mod[doc]] for doc in maintexts]

def lemmatization(maintexts):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in maintexts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc])
    return texts_out
maindata_words_nostops = remove_stopwords(maindata_words)
maindata_words_bigrams = make_bigrams(maindata_words_nostops)
nlp = spacy.load('en', disable=['parser', 'ner'])
maindata_lemmatized = lemmatization(maindata_words_bigrams)
mainid2word = corpora.Dictionary(maindata_lemmatized)
maintexts = maindata_lemmatized
maincorpus = [mainid2word.doc2bow(maintext) for maintext in maintexts]
mainmmhandle = ("C:/Users/maart/Documents/Thesis/Handles/" + handle + "-all.mm")
corpora.MmCorpus.serialize(mainmmhandle, maincorpus) #save dic
mainditcijfer = mainid2word[5432]
print(mainditcijfer)
# Build LDA model
mainlda_model = gensim.models.ldamodel.LdaModel(corpus=maincorpus,
                                                id2word=mainid2word,
                                                num_topics=100,
                                                random_state=100,
                                                update_every=1,
                                                chunksize=100,
                                                passes=10,
                                                alpha='auto',
                                                per_word_topics=True)
doc_lda = mainlda_model[maincorpus]
#???
NUM_TOPICS = 100
NUM_WORDS = 100
pprint(mainlda_model.print_topics(num_topics=NUM_TOPICS, num_words=NUM_WORDS))
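For reference, I assumed my write-out loop above amounts to a plain byte-for-byte copy of the selected files, i.e. something like this sketch with shutil.copyfile (using the paths from above); if my loop is not equivalent to this, that may be exactly where the two runs diverge:

import os
import shutil

# copy each selected text unchanged into the new corpus folder
for m in mainlist:
    shutil.copyfile(m, os.path.join(maindirhandle, os.path.basename(m)))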
Because I use this second kind of retrieval to build topics for particular subsets of novels (novels written before certain dates, novels by authors of certain nationalities), and I want to stay consistent, I would really like the pandas-based retrieval to output the "correct" topics.
What am I doing wrong, and what am I missing?
Many thanks to this forum; it has already helped me fix plenty of errors.