Question

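Help: I'm scraping multiple href URLs from a website and trying to append each URL's title and body text to another array. However, when I run something like the code below, I only get a single title, and the text from all the other links ends up lumped together.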

request = requests.get(url)
somecontents = request.content
soup = BeautifulSoup(somecontents, "html.parser")
soup.prettify()
gethref = urllinks.get("href")

if gethref is not None and\
  "http" in gethref and\
  "photo" not in gethref and\
  "img" not in gethref:
    page_links = []
    tags_in_link = gethref
    page_links.append(tags_in_link)
    hrefdataset = ','.join(page_links)

for each_link in i:
    website_header_title = soup.title.string
    parse_title = re.sub('[^A-Za-z]+', ' ', website_header_title)
    time.sleep(.05)

    done = grab_web_text(each_link)

    testintry = []
    testintry.append("Website Title: " + parse_title + "," + " ")
    text = testintry.append("Body: " + done)

I want each link to have its own entry. How can I format the output like this?

[{"Website Title": "title", "Body": "Body"},
 {"Website Title": "title", "Body": "Body"},
 {"Website Title": "title", "Body": "Body"},
 {"Website Title": "title", "Body": "Body"}]

Answer 1

You can create a list of dictionaries like this:

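def get_link_info(link):
    # Fetch and parse this link's own page so the title belongs to this link,
    # rather than reusing the soup of the page the links were collected from.
    soup = BeautifulSoup(requests.get(link).content, "html.parser")
    parse_title = re.sub('[^A-Za-z]+', ' ', soup.title.string)
    done = grab_web_text(link)
    return (parse_title, done)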

print([{t: d} for t, d in (get_link_info(i) for i in links)])

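How does this work?

for i in links loops over all the links.
get_link_info returns a tuple containing the title and done (the body text).
for t, d in (...) loops over the resulting tuples, unpacking each one into t and d.
{t: d} builds a one-entry dict for each tuple, mapping the title to the body.
The outer [] collects those dicts from the generator into a list.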
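
The same pattern can be tried without touching the network. The following is only a minimal sketch: FAKE_PAGES and build_records are hypothetical names that do not appear in the question, and FAKE_PAGES stands in for the real requests/BeautifulSoup fetch and for grab_web_text. It also uses fixed "Website Title" and "Body" keys instead of {t: d}, on the assumption that this is closer to the format the question asks for, and prints the result as JSON.

```python
import json
import re

# Hypothetical stand-in data: maps each link to (raw title, body text).
# In the real scraper these values would come from the fetched page
# and from grab_web_text.
FAKE_PAGES = {
    "http://example.com/a": ("First Page!", "Body of the first page."),
    "http://example.com/b": ("Second-Page 2", "Body of the second page."),
}


def get_link_info(link):
    """Return a (cleaned_title, body) tuple for one link."""
    raw_title, body = FAKE_PAGES[link]
    # Same cleanup as in the question: collapse runs of non-letters to a space.
    title = re.sub('[^A-Za-z]+', ' ', raw_title).strip()
    return (title, body)


def build_records(links):
    """Build one dict per link, each with the same two keys."""
    return [
        {"Website Title": title, "Body": body}
        for title, body in (get_link_info(link) for link in links)
    ]


records = build_records(FAKE_PAGES)
# json.dumps turns the list of dicts into a JSON string;
# json.dump(records, some_file) would write it to an open file instead.
print(json.dumps(records, indent=2))
```

Because every record has the same keys, each link keeps its own title and body, and the list can be written out as JSON without further reshaping.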

Python: append all links, titles, and body text to one array or JSON file
