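Good afternoon everyone,
Today I was asked to write the following function:
def compare_url(url1,url2,enc,n)
This function compares two URLs and returns a list containing:
[word,occ_in_url1,occ_in_url2]
where:
word ---> a word of length n
occ_in_url1 ---> number of times the word occurs in url1
occ_in_url2 ---> number of times the word occurs in url2
So I started writing the function, and this is what I have so far: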
def compare_url(url1,url2,enc,n):
    from urllib.request import urlopen
    with urlopen('url1') as f1:
        readpage1 = f1.read()
        decodepage1 = readpage1.decode('enc')
    with urlopen('url2') as f2:
        readpage2 = f2.read()
        decodepage2 = readpage2.decode('enc')
    all_lower1 = decodepage1.lower()
    all_lower2 = decodepage2.lower()
    import string
    all_lower1nopunctuation = "".join(l for l in all_lower1 if l not in string.punctuation)
    all_lower2nopunctuation = "".join(l for l in all_lower2 if l not in string.punctuation)
    for word1 in all_lower1nopunctuation:
        if len(word1) == k:
            all_lower1nopunctuation.count(word1)
    for word2 in all_lower2nopunctuation:
        if len(word2) == k:
            all_lower2opunctuation.count(word2)
    return(word1,all_lower1nopunctuation.count(word1),all_lower2opunctuation.count(word2))
    return(word2,all_lower1nopunctuation.count(word1),all_lower2opunctuation.count(word2))
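But this code does not behave the way I intended; in fact, it does not work at all.
I would also like to:
sort the returned list in descending order (starting with the word that occurs most often)
if two words occur the same number of times, return them in alphabetical order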
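Answer 0 (score: 0)
Your code contains some typos (watch out for those in the future), but there are also some Python problems (or things that could be improved).
First, your imports should come at the top of the file: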
from urllib.request import urlopen
import string
You should call urlopen with a string, which is what you are doing, but the string you pass is 'url1' and not 'http://...'. Do not put variables inside quotes:
with urlopen(url1) as f1: #remove quotes
    readpage1 = f1.read()
    decodepage1 = readpage1.decode(enc) #remove quotes
with urlopen(url2) as f2: #remove quotes
    readpage2 = f2.read()
    decodepage2 = readpage2.decode(enc) #remove quotes
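To see why the quotes matter, here is a small sketch that needs no network access. The byte string raw and its contents are made up for illustration and stand in for what f1.read() would return; the point is only that 'enc' is a literal string, not the variable enc.

```python
# Hypothetical sample data standing in for the bytes returned by f1.read().
raw = "café".encode("utf-8")
enc = "utf-8"

# Passing the variable uses its value, so the bytes decode correctly.
decoded = raw.decode(enc)

# Passing the quoted name asks Python for a codec literally called "enc".
# No such codec exists, so decode() raises LookupError.
try:
    raw.decode("enc")
    quoted_failed = False
except LookupError:
    quoted_failed = True

print(decoded, quoted_failed)
```

The same reasoning applies to urlopen('url1'): the function receives the four characters url1 rather than the address stored in the variable.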
You need to improve the initialization of all_lower1nopunctuation. You are turning stackoverflow.com into stackoverflowcom, when it should actually become stackoverflow com.
#all_lower1nopunctuation = "".join(l for l in all_lower1 if l not in string.punctuation)
#the if statement should be after 'l' and before 'for'
#you should include 'else' to replace the punctuation with a space
all_lower1nopunctuation = ''.join(l if l not in string.punctuation
                                  else ' ' for l in all_lower1)
all_lower2nopunctuation = ''.join(l if l not in string.punctuation
                                  else ' ' for l in all_lower2)
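The difference between dropping punctuation and replacing it with a space is easy to check on a short string. The sample sentence below is invented, and the str.translate variant is an alternative I am adding for comparison, not something the original code uses.

```python
import string

sample = "visit stackoverflow.com, please!"  # made-up input

# Dropping punctuation glues neighbouring words together.
removed = "".join(ch for ch in sample if ch not in string.punctuation)

# Replacing punctuation with a space keeps the words apart.
spaced = "".join(ch if ch not in string.punctuation else " " for ch in sample)

# Alternative (an assumption, not from the answer): a translation table that
# maps every punctuation character to a space gives the same result.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
translated = sample.translate(table)

print(removed)
print(spaced.split())
```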
Merge the two for loops into one. Also add the words you find to a set (a collection of unique elements).
all_lower1nopunctuation.count(word1) returns the number of times word1 appears in all_lower1nopunctuation. It does not increment a counter.
for word1 in all_lower1nopunctuation does not iterate over words, because all_lower1nopunctuation is a string (not a list), so the loop visits one character at a time. Use .split(' ') to turn it into a list.
.replace('\n', '') removes all line breaks; otherwise they would also be counted as words.
#for word1 in all_lower1nopunctuation:
#    if len(word1) == k: #also, this should be == n, not == k
#        all_lower1nopunctuation.count(word1)
#for word2 in all_lower2nopunctuation:
#    if len(word2) == k:
#        all_lower2opunctuation.count(word2)
word_set = set([])
for word in all_lower1nopunctuation.replace('\n', '').split(' '):
    if len(word) == n and word in all_lower2nopunctuation:
        word_set.add(word) #set uses .add() instead of .append()
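The splitting behaviour described above can be checked on a tiny string. The sample text is invented. One caveat that is worth verifying on real pages: removing '\n' outright joins the last word of one line to the first word of the next, whereas split() with no argument treats any whitespace as a separator.

```python
text = "the cat sat on the mat\nthe end"  # made-up sample text

# Iterating over a string yields single characters, not words.
chars = list(text)[:3]

# split(' ') only splits on spaces, so the newline stays inside one item.
by_space = text.split(' ')

# Deleting the newline first glues "mat" and "the" into "matthe".
glued = text.replace('\n', '').split(' ')

# split() with no argument splits on any run of whitespace.
by_whitespace = text.split()

print(chars)
print(by_space)
print(glued)
print(by_whitespace)
```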
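Now that you have a set of the words that appear on both URLs, you need to store how many times each word occurs on each URL. The following code makes sure you end up with a list of tuples, as you asked:
count_list = []
for final_word in word_set:
    count_list.append((final_word,
                       all_lower1nopunctuation.count(final_word),
                       all_lower2nopunctuation.count(final_word)))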
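Keep in mind that str.count counts substring matches, so a short word is also counted inside longer words. If whole-word counts are what you want, one possible sketch is to count the split words with collections.Counter. The two sample strings are invented, and this is a variation on the approach above rather than part of it.

```python
from collections import Counter

page1 = "the other theme is the best"  # made-up text
page2 = "the end"                      # made-up text

word = "the"
# Substring counting also finds "the" inside "other" and "theme".
substring_hits = page1.count(word)
# Counting in the list of words only matches whole words.
whole_word_hits = page1.split().count(word)

# Counter builds the whole-word counts for every word in one pass.
counts1 = Counter(page1.split())
counts2 = Counter(page2.split())
common = sorted(set(counts1) & set(counts2))
count_list = [(w, counts1[w], counts2[w]) for w in common]

print(substring_hits, whole_word_hits)
print(count_list)
```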
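return means that the function is finished, and the interpreter continues from wherever it was before the function was called, so anything after the return is irrelevant, as RemcoGerlich pointed out.
Your code always returns from the first return only, so you need to merge the two returns into one.
#return(word1,all_lower1nopunctuation.count(word1),all_lower2opunctuation.count(word2))
#return(word2,all_lower1nopunctuation.count(word1),all_lower2opunctuation.count(word2))
return(count_list) # which contains a list of tuples with every word and its counts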
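A minimal illustration of that behaviour, using two made-up functions: the second return in first_only is never reached, while both collects its values and returns them together.

```python
def first_only():
    return "first"
    return "second"  # unreachable: the function already ended above


def both():
    results = []
    results.append("first")
    results.append("second")
    return results  # a single return that carries every value


print(first_only())
print(both())
```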
TL;DR
from urllib.request import urlopen
import string

def compare_url(url1,url2,enc,n):
    with urlopen(url1) as f1:
        readpage1 = f1.read()
        decodepage1 = readpage1.decode(enc)
    with urlopen(url2) as f2:
        readpage2 = f2.read()
        decodepage2 = readpage2.decode(enc)
    all_lower1 = decodepage1.lower()
    all_lower2 = decodepage2.lower()
    all_lower1nopunctuation = ''.join(l if l not in string.punctuation
                                      else ' ' for l in all_lower1)
    all_lower2nopunctuation = ''.join(l if l not in string.punctuation
                                      else ' ' for l in all_lower2)
    word_set = set([])
    for word in all_lower1nopunctuation.replace('\n', '').split(' '):
        if len(word) == n and word in all_lower2nopunctuation:
            word_set.add(word)
    count_list = []
    for final_word in word_set:
        count_list.append((final_word,
                           all_lower1nopunctuation.count(final_word),
                           all_lower2nopunctuation.count(final_word)))
    return(count_list)

url1 = 'https://www.tutorialspoint.com/python/list_count.htm'
url2 = 'https://stackoverflow.com/a/128577/7067541'
for word_count in compare_url(url1,url2, 'utf-8', 5):
    print(word_count)
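The question also asks for the list to be sorted by descending count, with ties in alphabetical order. Below is a hedged sketch of one way to do that. compare_text is a hypothetical helper that takes already decoded text instead of URLs so it can run offline, it counts whole words rather than substrings, and it assumes "most occurrences" means the combined count over both pages, which the question does not spell out.

```python
import string
from collections import Counter


def compare_text(text1, text2, n):
    """Hypothetical offline variant: takes decoded page text, not URLs."""
    # Map every punctuation character to a space, then split on whitespace.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    counts1 = Counter(text1.lower().translate(table).split())
    counts2 = Counter(text2.lower().translate(table).split())
    # Keep only words of length n that occur in both texts.
    rows = [(w, counts1[w], counts2[w])
            for w in counts1.keys() & counts2.keys() if len(w) == n]
    # Assumption: rank by combined count, highest first; ties alphabetical.
    return sorted(rows, key=lambda row: (-(row[1] + row[2]), row[0]))


result = compare_text("Apple pears apple grape. Grape!",
                      "grape apple pears pears pears", 5)
print(result)
```

Negating the count in the sort key gives descending order for the numbers while leaving the word itself in ascending (alphabetical) order, so a single sorted call covers both requirements.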