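I have a collection of about 1,200 text files, each roughly 5,000 words long. Each file is a transcript of a conference call and contains company names. I want to process the files to remove the company names, because they recur so often and are irrelevant to the other processing I want to do. To that end, I am trying to write a script that builds a custom set of stop names for the companies. My idea is to create a file in the same format as the NLTK stopword list, so that it can be used in the same way.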
Here is a snippet of the input file of company names ('stops_Analyst_Companies.txt'): 'Atlantic Equities LLP','Avon Capital Advisors','Barclays Capital','Bernstein','BGC Partners','BMO Capital Markets US', ... There are 80 names in total.
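I then use another script to remove the company names from a transcript file. It also removes the NLTK English stopwords from the transcript and then pickles the result for later use. The NLTK stopwords are removed successfully, but the words in the custom stopword file are not. What I am attempting is several steps beyond my limited command of Python, so any suggestions and guidance would be much appreciated.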
Here is the script that creates the custom stopword file for the analyst companies:
import os, os.path, sys, nltk, re, pprint, pickle
with open('stops_Analyst_Companies.txt','r') as stops_Analyst_Companies:
    stops_Analyst_Companies = stops_Analyst_Companies.read()
    stops_Analyst_Companies= [w.lower() for w in stops_Analyst_Companies]
    stops_Analyst_Companies = str(stops_Analyst_Companies)
outfile = open ('cln_stops_Analyst_Companies.txt', 'w')
outfile.write(stops_Analyst_Companies)
Here is a snippet from the resulting 'cln_stops_Analyst_Companies.txt' file: ["'", 'a', 't', 'l', 'a', 'n', 't', 'i', 'c', ' ', 'e', 'q', 'u', 'i', 't', 'i', 'e', 's', ' ', 'l', 'l', 'p', "'", ',', "'", 'a', 'v', 'o', 'n', ' ', 'c', 'a', 'p', 'i', 't', 'a', 'l', ' ', 'a', 'd', 'v', 'i', 's', 'o', 'r', 's', "'", ...
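The one-character entries appear because `read()` returns a single string, and iterating over a string yields its characters rather than the names. Below is a minimal sketch of one way to pull out whole names instead. It assumes the input file is a single run of single-quoted, comma-separated names, as in the snippet above. The helper name `parse_names` is hypothetical, and the sample string stands in for the file contents.

```python
import re

def parse_names(raw):
    """Return the lowercased names found between single quotes in raw."""
    # Assumption: every name is wrapped in single quotes and contains no quote itself.
    return [name.strip().lower() for name in re.findall(r"'([^']+)'", raw)]

# Sample text standing in for the contents of stops_Analyst_Companies.txt.
raw = "'Atlantic Equities LLP','Avon Capital Advisors','Bernstein','BGC Partners'"
names = parse_names(raw)
print(names)

# One name per line keeps the cleaned file easy to read back with splitlines().
cleaned_text = "\n".join(names)
```

Writing `cleaned_text` to 'cln_stops_Analyst_Companies.txt' would give a file that can be read back into a list or set of names, which is closer to how the NLTK stopword lists are stored.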
Here is the script that removes the stop names from the transcript file:
import os, os.path, sys, nltk, re, pprint, pickle
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))
with open('TestStopWords_1.txt','r') as fin:
    wordtokens=word_tokenize(fin.read())
lowcase= [w.lower() for w in wordtokens]
# Remove NLTK Stopwords
nostops = [w for w in lowcase if not w in stopset]
print ('NLTK Stopset Words Removed')
print (' ')
print (nostops)
print (' ')
with open ('cln_stops_Analyst_Companies.txt', 'r') as cln_stops_Analyst_Companies:
    customstops = cln_stops_Analyst_Companies.read()
nostops = [w for w in nostops if not w in customstops]
print ('Analyst Companies Names Removed')
print (' ')
print (nostops )
nostops = str(nostops)
with open ("nostops.pickle", 'wb') as outfile:
    pickle.dump (nostops, outfile)
print (' ')
print (' Pickle File Created')
print (' ')
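Two things appear to prevent the custom removal from working. First, `customstops` is one long string, so `w in customstops` is a substring test rather than a lookup in a collection of names. Second, `word_tokenize` splits a multi-word name such as 'Atlantic Equities LLP' into separate tokens, so no single token can equal the whole name. Below is a minimal sketch of one alternative, which deletes the names from the raw text before tokenizing. `remove_names` is a hypothetical helper, the sample names and text are illustrative, and `str.split` stands in for `word_tokenize` only to keep the sketch independent of NLTK.

```python
import pickle
import re

def remove_names(text, names):
    """Delete every whole-phrase, case-insensitive occurrence of names from text."""
    # Longest names first, so a longer name is tried before a shorter one it contains.
    ordered = sorted(names, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b"
    return re.sub(pattern, " ", text, flags=re.IGNORECASE)

# Illustrative data; in practice the names would come from the custom stop-name file.
names = ["atlantic equities llp", "avon capital advisors", "bernstein"]
text = "Next question from Atlantic Equities LLP and then Avon Capital Advisors please"

cleaned = remove_names(text, names)
tokens = [w.lower() for w in cleaned.split()]  # stand-in for word_tokenize
print(tokens)

# Pickle the list itself (not str(list)) so it loads back as a list of tokens.
restored = pickle.loads(pickle.dumps(tokens))
```

The resulting tokens can then be filtered against `stopset` as in the script above. The `\b` anchors assume each name starts and ends with a letter or digit; names ending in punctuation would need a different boundary check.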
Here is a snippet from the TestStopWords file:
a, analyst companies, Atlantic Equities LLP, Avon Capital Advisors
Here is a snippet from the 'nostops.pickle' file: