How to find the count of each distinct word in a dataset in Python: https://drive.google.com/open?id=1ADdzZp31SwiF70IZ13hbAtPNHBv5NmOY
I have already imported the dataset.
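I need the support of each element in order to perform contiguous sequential pattern mining. For example, if the phrase "parking lot" has an absolute support of 133, then the line in "b.txt" corresponding to this frequent contiguous sequential pattern should be: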
133:parking;lot
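To make the expected format concrete, here is a minimal sketch of how one output line is built from a support count and a phrase. The helper name `format_pattern` is hypothetical and only used for illustration.

```python
def format_pattern(support, phrase):
    """Return 'support:word1;word2;...' for a space-separated phrase."""
    words = phrase.split()
    return f"{support}:{';'.join(words)}"

# The example from the question: "parking lot" with absolute support 133
print(format_pattern(133, "parking lot"))
```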
Answer 0: (score: 0)
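This seems to work. The maximum phrase length sampled for the dictionary is the variable p_length (I set 3), the phrase length selected for the ranked list is p_size (I set 3; the smaller this is, the higher the top frequencies will be, of course), and the number of phrases in the final ranked list is the variable rank (I set 25). These settings appear right after the data-loading code. The length of the ranking list that it prints (see near the end of 'def top_list():') is the total number of distinct phrases of up to p_length words.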
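# Load the data: read every line of b.txt and collect all words into one flat list
translist = []
with open("b.txt", 'r') as fin:
    for line in fin:
        trans = line.strip().split(' ')
        translist.extend(trans)

# Settings
p_length = 3   # maximum phrase length (in words) counted in the dictionary
p_size = 3     # phrase length (in words) shown in the ranked list
rank = 25      # number of phrases to print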
# Use a dictionary to create a histogram of the frequencies of the phrases (this is not in order)
def histogram1(translist, p_length):
    global dict1
    dict1 = dict()
    phraseList = []
    # For every start position, build the phrases of 1..p_length words that begin there
    for transIndex in range(len(translist)):
        for i in range(p_length):
            if (transIndex+1+i) <= len(translist):
                phraseElementNow = translist[transIndex+i]
            else:
                continue
            if i > 0:
                # Extend the previous phrase by one word
                joinables = (newElement, phraseElementNow)
                newElement = ' '.join(joinables)
            else:
                newElement = phraseElementNow
            phraseList.append(newElement)
    # Count how often each phrase occurs
    for element2 in phraseList:
        if element2 not in dict1:
            dict1[element2] = 1
        else:
            dict1[element2] += 1
    return dict1
# Create the ranked list of phrases vs. their frequency (highest frequency first).
def top_list():
    global topList
    topList = []
    for key, value in dict1.items():
        topList.append((value, key))
    topList.sort(reverse = True)
    print("Length of ranking list is: ")  # Just a check
    print(len(topList))
    #print(topList[-(rank):])  Used this to check the format of the ranking list
# Choose the top x ranking to print (x is the rank setting, which I set to 25).
def short_list(p_size, rank):
    topTopList = []
    print("The "+str(rank)+" most common phrases "+str(p_size)+" words long are: ")
    # Keep only the phrases that are exactly p_size words long
    for phrase in topList:
        phraseParts = phrase[1].split(' ')
        if len(phraseParts) == p_size:
            topTopList.append(phrase)
        else:
            continue
    # Print each phrase as frequency:word1;word2;...
    for freq, word in topTopList[:rank]:
        wordParts = word.split(' ')
        wordForPrint = ';'.join(wordParts)
        completePrint = str(freq)+':'+wordForPrint
        print(completePrint)
print(histogram1(translist, p_length))
top_list()
short_list(p_size, rank)
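Note that the script above joins all lines into one word list, so it counts every occurrence of a phrase, including phrases that span two lines. If "absolute support" is instead meant as the number of lines that contain a phrase (my assumption, not something the question confirms), a more compact sketch using `collections.Counter` could look like the following. The names `count_phrases`, `top_phrases` and `sample` are made up for illustration, and the sample lines are not taken from the dataset.

```python
from collections import Counter

def count_phrases(lines, max_len=3):
    """For each contiguous phrase of 1..max_len words, count the lines containing it."""
    support = Counter()
    for line in lines:
        words = line.split()
        seen = set()  # phrases found in this line, each counted once per line
        for start in range(len(words)):
            for n in range(1, max_len + 1):
                if start + n > len(words):
                    break
                seen.add(tuple(words[start:start + n]))
        support.update(seen)
    return support

def top_phrases(support, size, rank):
    """Return up to `rank` 'count:w1;w2' lines for phrases exactly `size` words long."""
    chosen = [(count, phrase) for phrase, count in support.items() if len(phrase) == size]
    # Highest count first; ties are broken alphabetically by phrase
    chosen.sort(key=lambda item: (-item[0], item[1]))
    return [f"{count}:{';'.join(phrase)}" for count, phrase in chosen[:rank]]

# Made-up sample lines, only to show the call pattern
sample = ["parking lot is full", "the parking lot", "parking lot parking lot"]
for out_line in top_phrases(count_phrases(sample, max_len=3), size=2, rank=3):
    print(out_line)
```

To run it on the real file, pass the open file object (or a list of its lines) as `lines`. The third sample line contains "parking lot" twice but adds only one to its support, which is the difference from the occurrence counting in the script above.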