How to iterate over multiple links, scrape each one, and save the output to a CSV using Python, BeautifulSoup and requests

Date: 2017-02-26 08:54:48

Tags: web-scraping beautifulsoup python-requests

I have this code, but I don't know how to read the links from a CSV or a list. I want to read the links, scrape the details from each one, and save the data for each link into its own columns in an output CSV.

Below is the code I built to extract the specific data:

from bs4 import BeautifulSoup
import requests

url = "http://www.ebay.com/itm/282231178856"
r = requests.get(url)

x = BeautifulSoup(r.content, "html.parser")

# print(x.prettify().encode('utf-8'))

# time to find some tags!!

# y = x.find_all("tag")

z = x.find_all("h1", {"itemprop": "name"})

# print z
# loop over the matches to extract the title
for item in z:
    try:
        print(item.text.replace('Details about ', ''))
    except:
        pass
# category extraction

m = x.find_all("span", {"itemprop": "name"})

# print m

for item in m:
    try:
        print(item.text)
    except:
        pass

# item condition extraction
n = x.find_all("div", {"itemprop": "itemCondition"})

# print n

for item in n:
    try:
        print(item.text)
    except:
        pass

# sold-quantity extraction

k = x.find_all("span", {"class": "vi-qtyS vi-bboxrev-dsplblk vi-qty-vert-algn vi-qty-pur-lnk"})

# print k

for item in k:
    try:
        print(item.text)
    except:
        pass

# watcher-count extraction

u = x.find_all("span", {"class": "vi-buybox-watchcount"})

# print u

for item in u:
    try:
        print(item.text)
    except:
        pass

# returns-details extraction

t = x.find_all("span", {"id": "vi-ret-accrd-txt"})

# print t

for item in t:
    try:
        print(item.text)
    except:
        pass

# per-hour/per-day views extraction
a = x.find_all("div", {"class": "vi-notify-new-bg-dBtm"})

# print a

for item in a:
    try:
        print(item.text)
    except:
        pass

# "trending at" price extraction
b = x.find_all("span", {"class": "mp-prc-red"})

# print b

for item in b:
    try:
        print(item.text)
    except:
        pass
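
For reference, the part the question asks about, reading the links from a CSV into a list, could be as simple as the sketch below (the file name links.csv and the one-URL-per-row layout are assumptions, not part of the original code):

import csv

# Collect one URL per row from the first column of links.csv
with open("links.csv") as f:
    urls = [row[0] for row in csv.reader(f) if row]

print(urls)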

1 Answer:

Answer 0 (score: 2):

Your question is a bit vague!

Which links are you talking about? There are a hundred of them on a single eBay page. And which information do you want? Likewise, there is a ton of it.

But anyway, here goes:

# First, create a list of urls you want to iterate on
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebay.com/itm/282231178856")
soup = BeautifulSoup(r.text, "html.parser")
urls = []

# Assuming your links of interest are values of "href" attributes within <a> tags
# (href=True skips <a> tags that have no href attribute)
a_tags = soup.find_all("a", href=True)
for tag in a_tags:
    urls.append(tag["href"])

# Second, start to iterate while storing the info
info_1, info_2 = [], []
for link in urls:
    # Do stuff here, maybe it's time to define your existing loops as functions?
    # Fetch and parse each link before extracting from it
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "html.parser")
    info_a, info_b = YourFunctionReturningValues(soup)
    info_1.append(info_a)
    info_2.append(info_b)
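
To make that loop runnable, the extraction blocks from the question can be folded into one such function. A minimal sketch, reusing the question's selectors for the title and the item condition (the name YourFunctionReturningValues comes from the snippet above; returning just these two values is an illustrative choice):

def YourFunctionReturningValues(soup):
    # Title: <h1 itemprop="name">, stripping the "Details about " prefix
    h1 = soup.find("h1", {"itemprop": "name"})
    title = h1.text.replace('Details about ', '') if h1 else ''

    # Condition: <div itemprop="itemCondition">
    cond = soup.find("div", {"itemprop": "itemCondition"})
    condition = cond.text if cond else ''

    return title, condition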

Then, if you want a nice CSV output:

# Don't forget to import the csv module
import csv

# In Python 3, open in text mode with newline="" (in Python 2, use "wb")
with open(r"path_to_file.csv", "w", newline="") as my_file:
    csv_writer = csv.writer(my_file, delimiter=",")
    csv_writer.writerows(zip(urls, info_1, info_2))
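
One thing to keep in mind: zip() stops at the shortest of its inputs, so urls, info_1 and info_2 must stay the same length, which the loop above guarantees by appending exactly once per link. You may also want a header row before the data, e.g. csv_writer.writerow(["url", "title", "condition"]).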

Hope this helps?

Of course, don't hesitate to give more information so that you can get a more detailed answer.

Working with attributes in BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes

About the csv module: https://docs.python.org/2/library/csv.html