Python: append every link's title and body text to one array or JSON file

Asked: 2017-04-27 20:41:51

Tags: python web-scraping beautifulsoup

Help: I'm scraping multiple href URLs from a website and trying to append each URL's title and body text to another array. However, when I run something similar to the code below, I only capture one title, and the text from all the other links ends up lumped together.

request = requests.get(url)
somecontents = request.content
soup = BeautifulSoup(somecontents, "html.parser")
soup.prettify()
gethref = urllinks.get("href")

if gethref is not None and\
  "http" in gethref and\
  "photo" not in gethref and\
  "img" not in gethref:
    page_links = []
    tags_in_link = gethref
    page_links.append(tags_in_link)
    hrefdataset = ','.join(page_links)

for each_link in i:
    website_header_title = soup.title.string
    parse_title = re.sub('[^A-Za-z]+', ' ', website_header_title)
    time.sleep(.05)

    done = grab_web_text(each_link)

    testintry = []
    testintry.append("Website Title: " + parse_title + "," + " ")
    text = testintry.append("Body: " + done)

I want the output for each link to look like the following. How can I format it this way?

[{"Website Title: " "title", "Body: " "Body}, 
[{"Website Title: " "title", "Body: " "Body}, 
[{"Website Title: " "title", "Body: " "Body}, 
[{"Website Title: " "title", "Body: " "Body}]

1 Answer:

Answer 0 (score: 1)

You can create a list of dictionaries like this:

def get_link_info(l):
    # Fetch the page for this specific link and pull its title
    page = requests.get(l)
    soup = BeautifulSoup(page.content, "html.parser")
    parse_title = re.sub('[^A-Za-z]+', ' ', soup.title.string)
    # Reuse your existing helper to grab the body text for this link
    done = grab_web_text(l)
    return (parse_title, done)

print([{t: d} for t, d in (get_link_info(i) for i in links)])
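The inner part, (get_link_info(i) for i in links), is a generator expression, so each page is processed as the outer list comprehension consumes it. Note that {t: d} uses the page title itself as the dictionary key; if you want the exact "Website Title" / "Body" keys from your example, see the sketch after the explanation below.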

How does this work?

  1. for i in links loops over all the links.
  2. get_link_info returns a tuple containing title and done.
  3. for t, d in (...) loops over the resulting tuples.
  4. {t: d} for t, d in (...) is a dict comprehension.
  5. [] builds a list from the generator.
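
If you also want the exact "Website Title" / "Body" key names from the question and a JSON file on disk, a minimal sketch reusing the same get_link_info helper could look like this (the pages.json filename is just an example):

import json

# Build one dictionary per link, using the key names from the question
results = [{"Website Title": t, "Body": d}
           for t, d in (get_link_info(i) for i in links)]

# Write the whole list out as a JSON file
with open("pages.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)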