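Help: I am scraping multiple href links from a website and trying to append each link's title and body text to another array. However, when I run something like the code below, I only get one title, and the text from all the other links is lumped together.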
request = requests.get(url)
somecontents = request.content
soup = BeautifulSoup(somecontents, "html.parser")
soup.prettify()
# urllinks is one link tag from the page; the loop that produces it is not shown
gethref = urllinks.get("href")
if gethref is not None and\
        "http" in gethref and\
        "photo" not in gethref and\
        "img" not in gethref:
    page_links = []
    tags_in_link = gethref
    page_links.append(tags_in_link)
    hrefdataset = ','.join(page_links)
    for each_link in i:
        # soup is the original page, so this title is the same for every link
        website_header_title = soup.title.string
        parse_title = re.sub('[^A-Za-z]+', ' ', website_header_title)
        time.sleep(.05)
        done = grab_web_text(each_link)
        # the list is recreated on every iteration, so earlier entries are lost
        testintry = []
        testintry.append("Website Title: " + parse_title + "," + " ")
        text = testintry.append("Body: " + done)
I want one entry per link, in the format below. How can I structure the output this way?
[{"Website Title": "title", "Body": "body"},
 {"Website Title": "title", "Body": "body"},
 {"Website Title": "title", "Body": "body"},
 {"Website Title": "title", "Body": "body"}]
Answer 0 (score: 1)
You can create a list of dictionaries like this:
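def get_link_info(link):
    # Assumption: the title should come from the linked page itself,
    # so fetch and parse that page instead of reusing the original soup.
    link_soup = BeautifulSoup(requests.get(link).content, "html.parser")
    website_header_title = link_soup.title.string if link_soup.title else None
    parse_title = re.sub('[^A-Za-z]+', ' ', website_header_title or '')
    done = grab_web_text(link)
    return (parse_title, done)

print([{t: d} for t, d in (get_link_info(i) for i in links)])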
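How does this work?
for i in links loops over all the links.
get_link_info returns a tuple containing the title and done.
for t, d in (...) loops over the resulting tuples and unpacks each one into t and d.
{t: d} builds a one-entry dict that maps the title to the body text.
The surrounding [...] is a list comprehension, so the result is a list of those dicts built from the generator.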
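If you want the exact shape from the question, with fixed "Website Title" and "Body" keys, the same pattern should work with a slightly different dict. Here is a minimal sketch of that idea. FAKE_PAGES and its contents are made up so the example runs without any network access, and this get_link_info is a stub standing in for the real fetching code.

```python
import re

# Hypothetical stand-in for fetched pages: link -> (raw title, body text).
FAKE_PAGES = {
    "http://example.com/a": ("First Page! 2017", "Body of the first page."),
    "http://example.com/b": ("Second-Page", "Body of the second page."),
}


def get_link_info(link):
    """Return (cleaned_title, body) for one link, using the stub data above."""
    raw_title, body = FAKE_PAGES[link]
    # Same cleanup as in the question: collapse runs of non-letters to a space.
    parse_title = re.sub('[^A-Za-z]+', ' ', raw_title).strip()
    return (parse_title, body)


links = list(FAKE_PAGES)

# One dict per link, with fixed keys, all collected in a single list.
records = [
    {"Website Title": title, "Body": body}
    for title, body in (get_link_info(link) for link in links)
]

for record in records:
    print(record)
```

The important part is that the list is built once, outside any per-link loop, and each link contributes its own dict. In the question's code the list was recreated on every iteration, which is probably why only one entry survived.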