Finding the support of each word in text data using Python

Time: 2018-02-07 04:14:15

Tags: python data-mining

How can I find the count of each distinct word in a dataset using Python? The dataset is here: https://drive.google.com/open?id=1ADdzZp31SwiF70IZ13hbAtPNHBv5NmOY

I have already loaded the dataset into Python.

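I need the support of each element in order to perform contiguous sequential pattern mining. For example, suppose the phrase "parking lot" has an absolute support of 133; then the line corresponding to this frequent contiguous sequential pattern in "b.txt" should be: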

133:parking;lot
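
To make the target format concrete, here is a minimal sketch. It assumes that each line of the input is one transaction and that absolute support means the number of lines containing the phrase as a contiguous run of words. The helper names `contains_contiguous`, `absolute_support`, and `format_pattern`, as well as the sample lines, are hypothetical and not part of the dataset.

```python
def contains_contiguous(tokens, phrase_tokens):
    """Return True if phrase_tokens occurs as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens for i in range(len(tokens) - n + 1))


def absolute_support(lines, phrase):
    """Count the lines (transactions) that contain the phrase at least once."""
    phrase_tokens = phrase.split()
    return sum(contains_contiguous(line.split(), phrase_tokens) for line in lines)


def format_pattern(support, phrase):
    """Render one output line as 'support:word1;word2;...'."""
    return f"{support}:{';'.join(phrase.split())}"


# Hypothetical sample data, used only to illustrate the format
sample_lines = [
    "the parking lot was full",
    "parking lot parking lot",
    "lot of parking",
]
support = absolute_support(sample_lines, "parking lot")
print(format_pattern(support, "parking lot"))
```

Under this assumption a phrase repeated within one line still contributes only 1 to its support.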

1 Answer:

Answer 0: (score: 0)

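This seems to work. The maximum phrase length collected into the dictionary is the variable p_length (I set 3), the phrase length selected for the ranked list is p_size (I set 3; the smaller this is, the higher the top frequencies will be, of course), and the number of entries in the final ranked list is the variable rank (I set 25). These settings are on lines 8-10. The length of the ranking list that it prints (see near the end of 'def top_list():') is the total number of distinct phrases of up to p_length words.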

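# Load the data: every whitespace-separated token becomes one list element
translist = []
with open("b.txt", 'r') as fin:
    for line in fin:
        trans = line.split()  # split() with no argument drops empty tokens from blank lines
        translist.extend(trans)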

p_length = 3
p_size = 3
rank = 25

#Use a dictionary to build a histogram of phrase frequencies for phrases of up to p_length words (the dictionary is not ordered by frequency)
def histogram1(translist,p_length):
    global dict1
    dict1 = dict()
    phraseList = []
    for transIndex in range(len(translist)):
        for i in range(p_length):
            if (transIndex+1+i) <= len(translist):
                phraseElementNow = translist[transIndex+i]
            else:
                continue
            if i > 0:
                joinables = (newElement, phraseElementNow)
                newElement = ' '.join(joinables)
            else:
                newElement = phraseElementNow
            phraseList.append(newElement)
    for element2 in phraseList:
        if element2 not in dict1:
            dict1[element2] = 1
        else:
            dict1[element2] += 1
    return dict1

#Create the ranked list of phrases vs their frequency.
def top_list():
    global topList
    topList = []
    for key, value in dict1.items():
        topList.append((value, key))
    topList.sort(reverse = True)
    print("Length of ranking list is: ") #Just a check
    print(len(topList))
    #print(topList[-(rank):])   Used this to check format of ranking list

#Choose the top x ranking to print (I made it 25 on line 10).
def short_list(p_size, rank):
    topTopList = []
    print("The "+str(rank)+" most common phrases "+str(p_size)+" words long are: ")
    for phrase in topList:
        phraseParts = phrase[1].split(' ')
        if len(phraseParts) == p_size:
            topTopList.append(phrase)
        else:
            continue
    for freq, word in topTopList[:rank]:
        wordParts = word.split(' ')
        wordForPrint = ';'.join(wordParts)
        completePrint = str(freq)+':'+wordForPrint
        print(completePrint)

print(histogram1(translist, p_length))
top_list()
short_list(p_size, rank)
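
The script above counts every occurrence of a phrase in one concatenated token stream, so phrases can span line boundaries and a phrase repeated within a line is counted more than once. If support is instead meant per line, a variant along the following lines may be closer to the goal. It is only a sketch: it assumes one transaction per line, uses `collections.Counter` from the standard library, and the names `phrase_supports`, `top_patterns`, and the sample lines are hypothetical.

```python
from collections import Counter


def phrase_supports(lines, max_len):
    """Map each contiguous phrase of 1..max_len words to the number of lines containing it."""
    supports = Counter()
    for line in lines:
        tokens = line.split()  # split() without arguments skips empty tokens
        seen = set()           # phrases found in this line, each counted once
        for start in range(len(tokens)):
            for length in range(1, max_len + 1):
                if start + length > len(tokens):
                    break
                seen.add(tuple(tokens[start:start + length]))
        supports.update(seen)
    return supports


def top_patterns(supports, size, rank):
    """Return up to `rank` output lines for phrases of exactly `size` words."""
    chosen = [(count, phrase) for phrase, count in supports.items() if len(phrase) == size]
    # Highest support first; ties broken alphabetically for a stable order
    chosen.sort(key=lambda item: (-item[0], item[1]))
    return [f"{count}:{';'.join(phrase)}" for count, phrase in chosen[:rank]]


# Hypothetical sample data; with the real file, read the lines from "b.txt" instead
sample_lines = [
    "the parking lot was full",
    "parking lot parking lot",
    "a big parking lot",
]
for out_line in top_patterns(phrase_supports(sample_lines, 3), 2, 5):
    print(out_line)
```

Here `size` plays the role of p_size and `rank` the role of rank in the script above; storing phrases as tuples avoids re-splitting strings when filtering by length.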