How to call URLs from a text file one by one

Date: 2016-09-26 21:00:30

Tags: python parsing web-scraping beautifulsoup

I want to parse some pages on a website, and I have created a text file containing all the links I want to parse. How can I read these URLs from the text file and call them one by one in my Python program?

from bs4 import BeautifulSoup
import requests
import json

soup = BeautifulSoup(requests.get("https://www.example.com").content, "html.parser")

for d in soup.select("div[data-selenium=itemDetail]"):
    # Follow each item's link and look for its UPC on the detail page
    url = d.select_one("h3[data-selenium] a")["href"]
    upc = BeautifulSoup(requests.get(url).content, "html.parser").select_one("span.upcNum")
    if upc:
        data = json.loads(d["data-itemdata"])
        text = upc.text.strip()
        print(upc.text)
        # Append the item data and its UPC to the output file
        outFile = open('/Users/Burak/Documents/new_urllist.txt', 'a')
        outFile.write(str(data))
        outFile.write(",")
        outFile.write(str(text))
        outFile.write("\n")
        outFile.close()

Contents of urllist.txt:

https://www.example.com/category/1
category/2
category/3
category/4

Thanks in advance

1 answer:

Answer 0 (score: 0)

Use a context manager:

# Read every line of the file and drop the trailing newline
with open("/file/path") as f:
    urls = [u.strip('\n') for u in f]

That gives you a list of all the URLs in the file, which you can then request one by one as needed.
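
As a minimal sketch of how that list could feed the original loop, assuming the relative entries such as category/2 in urllist.txt should be resolved against the site's base URL (the base URL and file path below are placeholders taken from the question):

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

BASE = "https://www.example.com/"  # assumed base for the relative entries in urllist.txt

with open("/Users/Burak/Documents/urllist.txt") as f:
    urls = [u.strip('\n') for u in f]

for url in urls:
    # Relative entries like "category/2" are resolved against BASE;
    # absolute entries pass through urljoin unchanged.
    page = BeautifulSoup(requests.get(urljoin(BASE, url)).content, "html.parser")
    # ... run the existing per-item parsing on `page` here ...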