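I am trying to build a web crawler.
I use wget to download the site and then extract all the words from each page.
I want to build a dictionary for each file, like this:
{'Activity':'index2.html','and':'index2.html','within':'index2.html',...}
{'Rutgers':'index.html','Central':'index.html','Service':'index.html',...}
But my output is: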
{'Activity':'i','and':'n','within':'d',...}
{'Rutgers':'i','Central':'n','Service':'d',...}
The file name is being split into individual characters. Here is my code:
import string
import os
from bs4 import BeautifulSoup as bs
from os import listdir
from os.path import isfile, join
#from os.path import isdir
mypath = "/Users/Tsu-AngChou/MasterProject/Practice/try_test/"
files = listdir(mypath)
translator = str.maketrans("","",string.punctuation)
storage = []
for f in files:
    fullpath = join(mypath, f)
    if f == '.DS_Store':
        os.remove(f)
    elif isfile(fullpath):
        print(f)
        for html_cont in range(1):
            response = open(f,'r',encoding='utf-8')
            html_cont = response.read()
            soup = bs(html_cont, 'html.parser',from_encoding ='utf-8')
            regular_string = soup.get_text()
            new_string = regular_string.translate(translator).split()
            new_list = [item[:14] for item in new_string]
            a = dict(zip(new_list,f))
            print(a)
Answer 0 (score: 0)
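You need to pair each word with f as a single element. zip steps through the elements of each sequence in parallel, and because f is a string, its elements are individual characters. Try something like this: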
sent = "Activity and within".split()
f = "index.html"
a = dict((word, f) for word in sent)
print(a)
Output:
{'Activity': 'index.html', 'and': 'index.html', 'within': 'index.html'}
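To see why the original code produced single characters, here is a minimal sketch that reproduces the behaviour and shows one alternative that keeps zip. The word list and file name are illustrative values taken from the question's expected output, and the use of itertools.repeat is a substitute technique, not something the answer above proposed.

```python
from itertools import repeat

# Illustrative values borrowed from the question's example output.
words = ["Activity", "and", "within"]
filename = "index2.html"

# zip walks both sequences in step. Iterating a str yields one character
# at a time, so each word is paired with a single letter of the file name.
split_pairs = dict(zip(words, filename))

# repeat(filename) yields the whole string on every step, so zip pairs
# each word with the complete file name.
whole_pairs = dict(zip(words, repeat(filename)))

print(split_pairs)
print(whole_pairs)
```

zip also stops at the shorter input, so with the original code a page containing more words than the file name has characters would silently lose the extra words.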
Answer 1 (score: 0)
You can use dict.fromkeys:
a = dict.fromkeys(new_list, f)
This uses the items of new_list as the keys and gives every key the same value, f.
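A minimal sketch of dict.fromkeys in this setting; new_list and f below are illustrative stand-ins for the question's variables. It also shows a caveat worth knowing: every key refers to the same value object, which is harmless for an immutable string like a file name but can surprise you if the value is a list.

```python
# Illustrative stand-ins for the question's word list and file name.
new_list = ["Rutgers", "Central", "Service", "Central"]
f = "index.html"

# Every distinct word becomes a key mapped to the whole file name;
# the repeated word "Central" collapses into one key.
a = dict.fromkeys(new_list, f)
print(a)

# Caveat: the value is shared, not copied. With a mutable value,
# a change made through one key is visible through every key.
shared = dict.fromkeys(["x", "y"], [])
shared["x"].append(1)
print(shared)
```

If each word eventually needs its own list of files, a dict comprehension such as {word: [f] for word in new_list} creates a separate list per key.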