How do I save a txt file for each set of links?

Asked: 2016-07-01 14:21:23

Tags: python-3.x web-scraping beautifulsoup python-requests

I am trying to scrape several Yellow Pages pages and store the printed output in txt files. I know that no login is required to get the data on these pages; I just wanted to try logging in with requests.Session().

I want to store the title of every URL in set_1 in the txt file YP_set_1.txt, and likewise for the URLs in set_2.

Here is my code.

import requests
from bs4 import BeautifulSoup
import requests.cookies
import time



s = requests.Session()

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
           'Referer': "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_login&vrid=63dbd394-afff-4794-aeb0-51dd19957ebc&merge_history=true"}

url = "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_register&vrid=cc9cb936-50d8-493b-83c6-842ec2f068ed&register=true"
r = s.get(url).content
page = s.get(url)
soup = BeautifulSoup(page.content, "lxml")
soup.prettify()

csrf = soup.find("input", value=True)["value"]

USERNAME = '****.*****@*****.***'
PASSWORD = '*******'

cj = s.cookies
requests.utils.dict_from_cookiejar(cj)

login_data = dict(email=USERNAME, password=PASSWORD, _csrf=csrf)

s.post(url, data=login_data, headers=headers)

set_1 = "This is the first set."

targeted_pages = ['https://www.yellowpages.com/brookfield-wi/business',
                  'https://www.yellowpages.com/bronx-ny/cheap-party-halls',
                  'https://www.yellowpages.com/bronx-ny/24-hour-liquor-store',
                  'https://www.yellowpages.com/bronx-ny/24-hour-oil-change',
                  'https://www.yellowpages.com/bronx-ny/auto-insurance',
                  'https://www.yellowpages.com/bronx-ny/awnings-canopies',
                  'https://www.yellowpages.com/bronx-ny/golden-corral',
                  'https://www.yellowpages.com/bronx-ny/concrete-contractors',
                  'https://www.yellowpages.com/bronx-ny/automobile-salvage',
                  'https://www.yellowpages.com/bronx-ny/24-hour-daycare-centers',
                  'https://www.yellowpages.com/bronx-ny/movers',
                  'https://www.yellowpages.com/bronx-ny/nursing-homes',
                  'https://www.yellowpages.com/bronx-ny/signs'
                  ]
for target_urls in targeted_pages:
    targeted_page = s.get(target_urls, headers=headers, cookies=cj)
    targeted_soup = BeautifulSoup(targeted_page.content, "lxml")

    for record in targeted_soup.findAll('title'):
        with open("YP_Set_1.txt", "w") as text_file:
            print(set_1 + '\n' + record.text, file=text_file)
time.sleep(5)

set_2 = "This is the second set."

targeted_pages_2 = ['https://www.yellowpages.com/north-miami-beach-fl/attorneys',
                    'https://www.yellowpages.com/north-miami-beach-fl/employment-agencies',
                    'https://www.yellowpages.com/north-miami-beach-fl/dentists',
                    'https://www.yellowpages.com/north-miami-beach-fl/general-contractors',
                    'https://www.yellowpages.com/north-miami-beach-fl/electricians',
                    'https://www.yellowpages.com/north-miami-beach-fl/pawnbrokers',
                    'https://www.yellowpages.com/north-miami-beach-fl/lighting-fixtures',
                    'https://www.yellowpages.com/north-miami-beach-fl/towing'
                    ]
for target_urls_2 in targeted_pages_2:
    targeted_page_2 = s.get(target_urls_2, headers=headers, cookies=cj)
    targeted_soup_2 = BeautifulSoup(targeted_page_2.content, "lxml")

    for record in targeted_soup_2.findAll('title'):
        with open("YP_Set_2.txt", "w") as text_file:
            print(set_2 + '\n' + record.text, file=text_file)

When I run the code, this is the output in YP_Set_1.txt:

This is the first set.
Signs in Bronx, New York with Reviews & Ratings - YP.com

And the output in YP_Set_2.txt:

This is the second set.
Towing in North Miami Beach, Florida with Reviews & Ratings - YP.com

Is there a quick fix that would let me store the titles of all the URLs in a set in the text file, instead of only the title of the last URL in the set? Thanks for any input.

1 Answer:

Answer 0 (score: 0):

You keep reopening the file inside the loop, so you keep overwriting its contents. You could reopen with mode "a" (append) in place of "w" (which truncates the file), but it is easier to open the file once, outside the loop:

with open("YP_Set_2.txt", "w") as text_file:
    for target_urls_2 in targeted_pages_2:
        targeted_page_2 = s.get(target_urls_2, headers=headers, cookies=cj)
        targeted_soup_2 = BeautifulSoup(targeted_page_2.content, "lxml")

        for record in targeted_soup_2.find_all('title'):
            # end each entry with a newline so successive titles
            # do not run together in the file
            text_file.write(set_2 + '\n' + record.text + '\n')

Do the same for both blocks.
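
If you want to avoid repeating the loop for each set, both blocks can be folded into one helper. A minimal sketch, reusing the session s, the headers dict, and the cookie jar cj from the question (the write_titles name is hypothetical, not part of the original code):

def write_titles(filename, label, urls):
    # open the file once so earlier titles are not overwritten on each pass
    with open(filename, "w") as text_file:
        for url in urls:
            page = s.get(url, headers=headers, cookies=cj)
            soup = BeautifulSoup(page.content, "lxml")
            # a page normally has a single <title>, but this mirrors the
            # question's loop over all matches
            for record in soup.find_all('title'):
                text_file.write(label + '\n' + record.text + '\n')

write_titles("YP_Set_1.txt", set_1, targeted_pages)
write_titles("YP_Set_2.txt", set_2, targeted_pages_2)

Alternatively, keeping open() inside the loop but switching "w" to "a" would also accumulate the titles, at the cost of reopening the file on every iteration.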