Saving web page content with BeautifulSoup

Date: 2014-08-12 05:38:22

Tags: python python-3.x web-scraping beautifulsoup

I am trying to scrape a web page with BeautifulSoup using the following code:

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen("http://en.wikipedia.org//wiki//Markov_chain.htm") as url:
    s = url.read()

soup = BeautifulSoup(s)

with open("scraped.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text())
    f.close()

The problem is that it saves Wikipedia's main page rather than the specific article. Why doesn't that address work, and how should I change it?

2 Answers:

Answer 0 (score: 2):

The correct URL for the page is http://en.wikipedia.org/wiki/Markov_chain (the original had doubled slashes and a spurious .htm extension):

>>> import urllib.request
>>> from bs4 import BeautifulSoup
>>> url = "http://en.wikipedia.org/wiki/Markov_chain"
>>> soup = BeautifulSoup(urllib.request.urlopen(url))
>>> soup.title
<title>Markov chain - Wikipedia, the free encyclopedia</title>
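Note that the snippet above does not name a parser, which triggers the warning discussed in the next answer. A minimal sketch of constructing the soup with an explicit parser, using an inline HTML snippet in place of the downloaded page (the snippet's content is illustrative, not fetched):

```python
from bs4 import BeautifulSoup

# A tiny document standing in for the downloaded page (hypothetical content)
html = ("<html><head><title>Markov chain - Wikipedia, the free encyclopedia"
        "</title></head><body><p>A Markov chain is a stochastic model.</p>"
        "</body></html>")

# Naming the parser explicitly keeps behaviour identical across systems
# and silences the GuessedAtParserWarning
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)  # Markov chain - Wikipedia, the free encyclopedia
```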

Answer 1 (score: 0):

@alecxe's answer produces:

**GuessedAtParserWarning**: 
No parser was explicitly specified, so I'm using the best 
available HTML parser for this system ("html.parser"). This usually isn't a problem, 
but if you run this code on another system, or in a different virtual environment, it 
may use a different parser and behave differently. The code that caused this warning
is on line 25 of the file crawl.py. 

To get rid of this warning, pass the additional argument 'features="html.parser"' to
the BeautifulSoup constructor.

Here is a solution that uses requests and avoids the GuessedAtParserWarning:

# crawl.py

from os import path

import requests
from bs4 import BeautifulSoup

url = 'https://www.sap.com/belgique/index.html'
page = requests.get(url)
# Naming the parser explicitly silences the GuessedAtParserWarning
soup = BeautifulSoup(page.content, 'html.parser')

# Save the text next to this script
file = path.join(path.dirname(__file__), 'downl.txt')

# Either print the title/text or save it to a file
print(soup.title)
# download the text
with open(file, 'w', encoding='utf-8') as f:
    f.write(soup.text)
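As a side note, opening the output file with an explicit encoding (as the question's code does) matters whenever the page contains characters outside the platform's default encoding. A minimal sketch of just the save step, using a temporary path and an inline snippet in place of the real page (both hypothetical):

```python
import tempfile
from os import path
from bs4 import BeautifulSoup

# A tiny snippet with a non-ASCII character stands in for the real page
soup = BeautifulSoup("<p>État stationnaire</p>", "html.parser")

# Write to the system temp directory for this illustration
out = path.join(tempfile.gettempdir(), "downl.txt")

# An explicit encoding avoids UnicodeEncodeError on platforms whose
# default file encoding cannot represent every character on the page
with open(out, "w", encoding="utf-8") as f:
    f.write(soup.get_text())
```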