Return a list of words counted across different lists

Time: 2016-10-28 14:49:26

Tags: python string count notepad++ python-3.5

Good afternoon everyone,

Today I was asked to write the following function:

def compareurl(url1,url2,enc,n)

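This function compares two URLs and returns a list containing:

[word,occ_in_url1,occ_in_url2]

where:

word ---> a word of length n

occ_in_url1 ---> the number of times the word occurs in url1

occ_in_url2 ---> the number of times the word occurs in url2

So I started writing the function, and this is what I have so far: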

def compare_url(url1,url2,enc,n):
    from urllib.request import urlopen
    with urlopen('url1') as f1:
        readpage1 = f1.read()
        decodepage1 = readpage1.decode('enc')
    with urlopen('url2') as f2:
        readpage2 = f2.read()
        decodepage2 = readpage2.decode('enc')
    all_lower1 = decodepage1.lower()
    all_lower2 = decodepage2.lower()
    import string
    all_lower1nopunctuation = "".join(l for l in all_lower1 if l not in string.punctuation)
    all_lower2nopunctuation = "".join(l for l in all_lower2 if l not in string.punctuation)
    for word1 in all_lower1nopunctuation:
        if len(word1) == k:
            all_lower1nopunctuation.count(word1)
    for word2 in all_lower2nopunctuation:
        if len(word2) == k:
            all_lower2opunctuation.count(word2)
    return(word1,all_lower1nopunctuation.count(word1),all_lower2opunctuation.count(word2))
    return(word2,all_lower1nopunctuation.count(word1),all_lower2opunctuation.count(word2))

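But this code does not do what I intended; in fact it does not work at all.

I would also like to:

  1. Sort the returned list in descending order (starting with the word that occurs most often)

  2. If two words occur the same number of times, return them in alphabetical order

1 Answer:

Answer 0: (score: 0)

Your code has some typos (watch out for those in future), but it also has some Python problems (or things that could be improved).

First, your imports should come at the top of the file: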

from urllib.request import urlopen
import string

You should call urlopen with a string, which is what you are doing, but that string is 'url1' rather than 'http://...'. Do not put variable names inside quotes:

with urlopen(url1) as f1: #remove quotes
    readpage1 = f1.read()
    decodepage1 = readpage1.decode(enc) #remove quotes
with urlopen(url2) as f2: #remove quotes
    readpage2 = f2.read()
    decodepage2 = readpage2.decode(enc) #remove quotes

You need to improve the all_lower1nopunctuation initialization. You are turning stackoverflow.com into stackoverflowcom, when it should actually become stackoverflow com:

#all_lower1nopunctuation = "".join(l for l in all_lower1 if l not in string.punctuation)
#the if statement should be after 'l' and before 'for'
#you should include 'else' to replace the punctuation with a space
all_lower1nopunctuation = ''.join(l if l not in string.punctuation else ' ' for l in all_lower1)
all_lower2nopunctuation = ''.join(l if l not in string.punctuation else ' ' for l in all_lower2)
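To see the difference on a small input, here is a minimal sketch. The sample string is made up for illustration and is not taken from either page.

```python
import string

sample = "stackoverflow.com/questions"

# Dropping punctuation glues neighbouring words together.
dropped = "".join(ch for ch in sample if ch not in string.punctuation)

# Replacing punctuation with a space keeps the words separate.
spaced = "".join(ch if ch not in string.punctuation else " " for ch in sample)

print(dropped)  # stackoverflowcomquestions
print(spaced)   # stackoverflow com questions
```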

Merge both for loops into one. Also add the words you find to a set (a collection of unique elements).

all_lower1nopunctuation.count(word1) returns the number of times word1 appears in all_lower1nopunctuation. It does not increment a counter.

for word1 in all_lower1nopunctuation does not work because all_lower1nopunctuation is a string (not a list), so the loop visits one character at a time. Use .split(' ') to turn it into a list of words.

.replace('\n', '') removes all line breaks; otherwise they would be counted as words too.

#for word1 in all_lower1nopunctuation:
#    if len(word1) == k: #also, this should be == n, not == k
#        all_lower1nopunctuation.count(word1)
#for word2 in all_lower2nopunctuation:
#    if len(word2) == k:
#        all_lower2opunctuation.count(word2)
word_set = set([])
for word in all_lower1nopunctuation.replace('\n', '').split(' '):
    if len(word) == n and word in all_lower2nopunctuation:
        word_set.add(word) #set uses .add() instead of .append()
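The points about strings, splitting and count() can be checked on a tiny made-up string. This sketch swaps in split() with no argument, which is not what the answer uses: it splits on any run of whitespace, newlines included. One thing to watch with removing newlines outright is that the last word of one line gets joined to the first word of the next.

```python
text = "the cat saw\nthe other cat"

# Iterating over a string yields single characters, not words.
first_items = [item for item in text][:3]

# split() with no argument splits on any run of whitespace,
# including newlines, so no separate newline handling is needed.
words = text.split()

# Deleting newlines before splitting on ' ' merges "saw" and "the".
merged = text.replace("\n", "").split(" ")

# count() returns a number; it has to be stored or used to matter.
cat_count = words.count("cat")

print(first_items)  # ['t', 'h', 'e']
print(words)        # ['the', 'cat', 'saw', 'the', 'other', 'cat']
print(merged)       # ['the', 'cat', 'sawthe', 'other', 'cat']
print(cat_count)    # 2
```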

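Now that you have the set of words that appear on both URLs, you need to store how many times each one occurs in each URL. The following code builds the list of tuples you asked for:

count_list = []
for final_word in word_set:
    count_list.append((final_word,
    all_lower1nopunctuation.count(final_word),
    all_lower2nopunctuation.count(final_word)))

"Return means the function is finished and the interpreter continues wherever it was before the function was called, so anything after the return is irrelevant."

As stated by RemcoGerlich.

Your code always stops at the first return, so you need to merge the two returns into one:

#return(word1,all_lower1nopunctuation.count(word1),all_lower2opunctuation.count(word2))
#return(word2,all_lower1nopunctuation.count(word1),all_lower2opunctuation.count(word2))
return(count_list) # which contains a list of tuples with all words and its counts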
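A small illustration of that behaviour, using two hypothetical functions that are not part of the original code: the second return in two_returns is never reached, while collect_results gathers everything into a list and returns once.

```python
def two_returns():
    return "first"
    return "second"  # never reached: the function already returned


def collect_results():
    results = []
    for label in ("first", "second"):
        results.append(label)
    return results  # a single return hands back everything collected


print(two_returns())      # first
print(collect_results())  # ['first', 'second']
```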

from urllib.request import urlopen
import string

def compare_url(url1,url2,enc,n):
    with urlopen(url1) as f1:
        readpage1 = f1.read()
        decodepage1 = readpage1.decode(enc)
    with urlopen(url2) as f2:
        readpage2 = f2.read()
        decodepage2 = readpage2.decode(enc)

    all_lower1 = decodepage1.lower()
    all_lower2 = decodepage2.lower()

    all_lower1nopunctuation = ''.join(l if l not in string.punctuation
    else ' ' for l in all_lower1)
    all_lower2nopunctuation = ''.join(l if l not in string.punctuation
    else ' ' for l in all_lower2)

    word_set = set([])
    for word in all_lower1nopunctuation.replace('\n', '').split(' '):
        if len(word) == n and word in all_lower2nopunctuation:
            word_set.add(word)

    count_list = []
    for final_word in word_set:
        count_list.append((final_word,
        all_lower1nopunctuation.count(final_word),
        all_lower2nopunctuation.count(final_word)))

    return(count_list)

url1 = 'https://www.tutorialspoint.com/python/list_count.htm'
url2 = 'https://stackoverflow.com/a/128577/7067541'

for word_count in compare_url(url1,url2, 'utf-8', 5):
    print (word_count)
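One caveat with the listing above: str.count and the in operator work on substrings, so a five-letter word such as "store" is also counted inside "stored". It also does not sort the result, which the question asked for. The following is only a sketch of one possible alternative, not the answer's method. It works on text that has already been downloaded and decoded, and the names count_words and compare_texts are hypothetical. It assumes that "most occurrences" means the combined count across both texts and, like the answer, keeps only words found in both.

```python
import string
from collections import Counter


def count_words(text, n):
    """Count whole words of length n in text, ignoring case and punctuation."""
    cleaned = "".join(
        ch if ch not in string.punctuation else " " for ch in text.lower()
    )
    return Counter(word for word in cleaned.split() if len(word) == n)


def compare_texts(text1, text2, n):
    """Return (word, count1, count2) tuples for words found in both texts."""
    counts1 = count_words(text1, n)
    counts2 = count_words(text2, n)
    common = counts1.keys() & counts2.keys()
    rows = [(word, counts1[word], counts2[word]) for word in common]
    # Most frequent first (by combined count, an assumption); ties alphabetical.
    return sorted(rows, key=lambda row: (-(row[1] + row[2]), row[0]))


# Made-up sample texts standing in for the two decoded pages.
page1 = "Apple pie, apple tart. Grape jam and lemon cake!"
page2 = "Lemon, grape, apple: grape juice."

for row in compare_texts(page1, page2, 5):
    print(row)
# ('apple', 2, 1)
# ('grape', 1, 2)
# ('lemon', 1, 1)
```

Negating the count in the sort key gives descending frequency while the word itself stays in ascending order, so ties fall back to alphabetical order without a second sort.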

TL;DR: the complete compare_url function above combines all of these fixes.
