Writing crawler output to multiple CSV files

Time: 2016-06-09 17:58:30

Tags: python csv beautifulsoup web-crawler

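I would like to know how to export my search results to multiple CSV files, one for each city I have crawled. Somehow I have hit a wall and cannot find a proper way to solve it.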

Here is my code:

import requests
from bs4 import BeautifulSoup
import csv

user_agent = {'User-agent': 'Chrome/43.0.2357.124'}
output_file= open("TA.csv", "w", newline='')
RegionIDArray = [187147,187323,186338]
dict = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()

for reg in RegionIDArray:
    for page in range(1,700,30):
        r = requests.get("https://www.tripadvisor.de/Attractions-c47-g" + str(reg) + "-oa" + str(page) + ".html")
        soup = BeautifulSoup(r.content)

        g_data = soup.find_all("div", {"class": "element_wrap"})

        for item in g_data:
            header = item.find_all("div", {"class": "property_title"})
            item = (header[0].text.strip())
            if item not in already_printed:
                already_printed.add(item)

                print("POI: " + str(item) + " | " + "Location: " + str(dict[reg]))

                writer = csv.writer(output_file)
                csv_fields = ['POI', 'Location']
                if g_data:
                    writer.writerow([str(item), str(dict[reg])])

My goal is to get three separate CSV files, one each for Paris, Berlin and London, instead of having all results in one big CSV file.

Could you help me out? Thanks for your feedback :)

1 answer:

Answer 0: (score: 1)

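I made a few small changes to your code. To create one file per city, I moved the output file name into the loop, so a new file is opened for each region ID.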

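Note: I am short on time right now, so the last lines are a hack to ignore Unicode errors. It simply skips any row that fails when trying to output a non-ASCII character. That is not good. Maybe someone can fix this?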

import requests
from bs4 import BeautifulSoup
import csv

user_agent = {'User-agent': 'Chrome/43.0.2357.124'}
RegionIDArray = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()

for reg in RegionIDArray:
    output_file= open("TA" + str(reg) + ".csv", "w")
    for page in range(1,700,30):
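        # Send the User-agent header defined above (it was previously unused)
        r = requests.get("https://www.tripadvisor.de/Attractions-c47-g" + str(reg) + "-oa" + str(page) + ".html", headers=user_agent)
        # Name the parser explicitly so BeautifulSoup does not have to guess one
        soup = BeautifulSoup(r.content, "html.parser")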

        g_data = soup.find_all("div", {"class": "element_wrap"})

        for item in g_data:
            header = item.find_all("div", {"class": "property_title"})
            item = (header[0].text.strip())
            if item not in already_printed:
                already_printed.add(item)

                # print("POI: " + str(item) + " | " + "Location: " + str(RegionIDArray[reg]))

                writer = csv.writer(output_file)
                csv_fields = ['POI', 'Location']
                if g_data:
                    try:
                        writer.writerow([str(item), str(RegionIDArray[reg])])
                    except UnicodeEncodeError:
                        # Hack: silently skip rows that contain characters that cannot be encoded
                        pass
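
One possible way to avoid the Unicode hack, assuming Python 3, is to open each city's file with an explicit UTF-8 encoding and let a with block close it. The sketch below only covers the writing step. The helper name write_city_csv, the file name pattern "TA_<city>.csv" and the header row are assumptions for illustration, and pois stands in for the titles scraped for one city.

```python
import csv
import os


def write_city_csv(city, pois, directory="."):
    """Write one CSV file for a single city and return its path (hypothetical helper)."""
    path = os.path.join(directory, "TA_" + city + ".csv")
    seen = set()  # de-duplicate titles within this city
    # newline="" is what the csv module expects; utf-8 can encode non-ASCII titles
    with open(path, "w", newline="", encoding="utf-8") as output_file:
        writer = csv.writer(output_file)  # create the writer once per file
        writer.writerow(["POI", "Location"])  # header row (an assumption)
        for poi in pois:
            if poi not in seen:
                seen.add(poi)
                writer.writerow([poi, city])
    return path
```

Inside the loop over RegionIDArray, the titles for one region could be collected into a list and then passed in a single call such as write_city_csv(RegionIDArray[reg], titles), which yields one file per city.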