Python web scraping and writing the data to a CSV

Date: 2016-10-01 17:20:59

Tags: python python-2.7 csv web-scraping beautifulsoup

I am trying to save all the data (i.e. all pages) into a single CSV file, but this code only saves the data from the final page. For example, url[] here contains two URLs, and the resulting CSV contains only the second URL's data. I am obviously doing something wrong in the loop, but I don't know what. Each page also contains 100 data points, yet this code only writes the first 44 rows. Please help me fix this.

from bs4 import BeautifulSoup
import requests
import csv
url = ["http://sfbay.craigslist.org/search/sfc/npo","http://sfbay.craigslist.org/search/sfc/npo?s=100"]
for ur in url:
    r = requests.get(ur)
    soup = BeautifulSoup(r.content)
    g_data = soup.find_all("a", {"class": "hdrlnk"})
    gen_list=[]
    for row in g_data:
       try:
            name = row.text
       except:
            name=''
       try:
            link = "http://sfbay.craigslist.org"+row.get("href")
       except:
            link=''
       gen=[name,link]
       gen_list.append(gen)

with open ('filename2.csv','wb') as file:
    writer=csv.writer(file)
    for row in gen_list:
        writer.writerow(row)

2 Answers:

Answer 0 (score: 3)

gen_list is re-initialized on every pass through your loop over the URLs, so only the last page's results are left when you write the file.

gen_list=[]

Move this line outside the for loop:

...
url = ["http://sfbay.craigslist.org/search/sfc/npo","http://sfbay.craigslist.org/search/sfc/npo?s=100"]
gen_list=[]
for ur in url:
...
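
For reference, here is a minimal sketch of the corrected script with that single change applied: gen_list is created once before the URL loop, and the CSV is written only after both pages have been scraped. The selectors and URLs are taken from the question; passing "html.parser" explicitly and opening the file with "w", newline="" (Python 3 style; the question's 'wb' mode is the Python 2 equivalent) are my additions.

from bs4 import BeautifulSoup
import requests
import csv

urls = ["http://sfbay.craigslist.org/search/sfc/npo",
        "http://sfbay.craigslist.org/search/sfc/npo?s=100"]

gen_list = []  # initialized once, so results from every page accumulate here
for ur in urls:
    r = requests.get(ur)
    soup = BeautifulSoup(r.content, "html.parser")
    for row in soup.find_all("a", {"class": "hdrlnk"}):
        name = row.text
        link = "http://sfbay.craigslist.org" + row.get("href", "")
        gen_list.append([name, link])

# write everything in one go, after all pages have been processed
with open("filename2.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(gen_list)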

Answer 1 (score: 0)

I came across your post later and wanted to try this approach:

import requests
from bs4 import BeautifulSoup
import csv

final_data = []
url = "https://sfbay.craigslist.org/search/sss"
r = requests.get(url)
data = r.text

soup = BeautifulSoup(data, "html.parser")
get_details = soup.find_all(class_="result-row")

for details in get_details:
    getclass = details.find_all(class_="hdrlnk")
    for link in getclass:
        link1 = link.get("href")
        sublist = []
        sublist.append(link1)
        final_data.append(sublist)
print(final_data)

filename = "sfbay.csv"
with open("./"+filename, "w") as csvfile:
    csvfile = csv.writer(csvfile, delimiter = ",")
    csvfile.writerow("")
    for i in range(0, len(final_data)):
        csvfile.writerow(final_data[i])
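
The snippet above only scrapes a single results page. To match the original goal of putting every page into one CSV, the same approach can loop over both URLs from the question before writing. Below is a hedged sketch under the assumption that the result-row and hdrlnk classes are present on those listing pages:

import csv
import requests
from bs4 import BeautifulSoup

urls = ["http://sfbay.craigslist.org/search/sfc/npo",
        "http://sfbay.craigslist.org/search/sfc/npo?s=100"]

final_data = []
for url in urls:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for details in soup.find_all(class_="result-row"):
        for link in details.find_all(class_="hdrlnk"):
            # one row per listing: title text plus its link
            final_data.append([link.get_text(strip=True), link.get("href")])

with open("sfbay.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(final_data)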