Python scrape of a Wikipedia table, then export to csv

Time: 2019-06-18 19:49:01

Tags: python web-scraping beautifulsoup python-requests export-to-csv

I followed the steps in a tutorial to scrape a table and then export the data to a csv file. When I try to run the file through PyCharm, I get this error:

Traceback (most recent call last):
  File "I:/Scrape/MediumCode.py", line 1, in <module>
    import requests
ModuleNotFoundError: No module named 'requests'

I assume there are other mistakes in the code and its logic as well, but this is the first problem I hit, and I can't investigate further without understanding why the library isn't being recognized.

pip install requests ran successfully.
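A common cause of this error is that pip installed requests into a different interpreter than the one PyCharm is configured to use. A quick diagnostic is to print the interpreter path from both the terminal and from inside PyCharm and compare:

import sys

# path of the interpreter running this script; compare it with the
# interpreter that `pip install requests` used
print(sys.executable)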

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://en.wikipedia.org/wiki/Public_holidays_in_Switzerland'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

containers = page_soup.findAll("table", {"class": "wikitable"})

filename = "holidays.csv"
f = open(filename, "w")

# the header row needs a trailing newline so data rows start on a new line
headers = "holiday,holiday_date\n"
f.write(headers)

for container in containers:
    # each container is already the <table>, so iterate its rows directly;
    # header rows made up of <th> cells have no <td> and are skipped
    for row in container.findAll("tr"):
        cells = row.findAll("td")
        if len(cells) < 2:
            continue

        # assumes the table's first column is the date and the second the name
        date = cells[0].text.strip()
        holiday_name = cells[1].text.strip()

        print("holiday_name: " + holiday_name)
        print("date: " + date)

        f.write(holiday_name.replace(",", "|") + "," + date + "\n")

# close the file once, after the loop; closing it inside the loop raises
# "I/O operation on closed file" on the second iteration
f.close()
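As an aside, the standard library's csv module handles quoting automatically, which would avoid the manual replace(",", "|") workaround above; a minimal sketch, reusing the containers list from the code above:

import csv

# csv.writer quotes any field containing a comma, so no manual escaping is needed
with open("holidays.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["holiday", "holiday_date"])
    for container in containers:
        for row in container.findAll("tr"):
            cells = row.findAll("td")
            if len(cells) >= 2:
                # same column assumption as above: date first, name second
                writer.writerow([cells[1].text.strip(), cells[0].text.strip()])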

2 answers:

Answer 0 (score: 0)

With your code, I can get page_html just fine, so for some reason your system doesn't like urllib.request. Note that request and requests are not quite the same thing: as I understand it, requests is built on top of urllib3, whereas urllib.request is part of the standard library, even though they overlap in what they do.

Does this code work for you?

from urllib import request

my_url = 'https://en.wikipedia.org/wiki/Public_holidays_in_Switzerland'
p = request.urlopen(my_url)
print(p.read())
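For comparison, the same fetch with the third-party requests package (assuming it installs correctly on your system) would be:

import requests

my_url = 'https://en.wikipedia.org/wiki/Public_holidays_in_Switzerland'
r = requests.get(my_url)
print(r.text[:200])  # first 200 characters of the page HTML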

Answer 1 (score: 0)

Use the pandas library to save the holiday table data into a holiday_data.csv file; the csv file is created in the current project directory.

import requests
import pandas as pd

url = 'https://en.wikipedia.org/wiki/Public_holidays_in_Switzerland'
response = requests.get(url)

tables = pd.read_html(response.text)

# write holiday table data into `holiday_data` csv file
tables[0].to_csv("holiday_data.csv")
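
By default, to_csv also writes the DataFrame's numeric row index as the first column; passing index=False omits it:

tables[0].to_csv("holiday_data.csv", index=False)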
  

Install the pandas library:

pip3 install pandas
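Note that pd.read_html also needs an HTML parser backend; if it raises an ImportError about a missing parser, installing lxml usually resolves it:

pip3 install lxml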

If the requests library still raises an error on your system, try the following:

from urllib.request import urlopen as uReq
import pandas as pd

url = 'https://en.wikipedia.org/wiki/Public_holidays_in_Switzerland'
response = uReq(url)
tables = pd.read_html(response.read())

# select only the holiday column
select_table_column = ["Holiday"]
# or select multiple columns:
# select_table_column = ["Holiday", "Date"]

# filter the table data down to the selected columns
holiday = tables[0][select_table_column]

# write the holiday table data into `holiday_data.csv`, keeping the csv header
holiday.to_csv("holiday_data.csv", header=True)
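
To confirm the export worked, the file can be read back and inspected with pandas:

import pandas as pd

check = pd.read_csv("holiday_data.csv")
print(check.head())  # first few rows of the exported table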