Scraping visible text

Time: 2016-11-12 18:08:54

Tags: python web-scraping beautifulsoup urllib2

I am a complete beginner at web scraping, and I want to extract the visible text from a web page. I found this piece of code online:

import urllib2
from bs4 import BeautifulSoup

url = "http://www.espncricinfo.com/"
web_page = urllib2.urlopen(url)

soup = BeautifulSoup(url , "lxml")
print (soup.prettify())

For the code above, I get the following result:

    /usr/local/lib/python2.7/site-packages/bs4/__init__.py:282: UserWarning: "http://www.espncricinfo.com/" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
  ' that document to Beautiful Soup.' % decoded_markup
<html>
 <body>
  <p>
   http://www.espncricinfo.com/
  </p>
 </body>
</html>

Is there any way I can get more specific results, and what is going wrong in the code? Sorry for being clueless.

2 answers:

Answer 0: (score: 1)

Try passing the HTML document, rather than the URL, to BeautifulSoup before calling prettify:

import urllib2
from bs4 import BeautifulSoup

url = "http://www.espncricinfo.com/"
web_page = urllib2.urlopen(url)

soup = BeautifulSoup(web_page , 'html.parser')
print (soup.prettify().encode('utf-8'))
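
To illustrate what "passing the HTML document" means, here is a minimal sketch that runs without a network connection. It assumes Python 3 (the thread itself uses Python 2 and urllib2) and uses a made-up inline page; the `io.BytesIO` object is only a stand-in for the response returned by `urlopen(url)`.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for the bytes a server would send.
html_bytes = b"<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Stand-in for the file-like response object returned by urlopen(url).
web_page = io.BytesIO(html_bytes)

# Passing a file-like object: Beautiful Soup reads it and parses the markup.
soup_from_file = BeautifulSoup(web_page, "html.parser")

# Passing the markup itself as a string works as well.
soup_from_string = BeautifulSoup(html_bytes.decode("utf-8"), "html.parser")

print(soup_from_file.title.string)
print(soup_from_string.p.get_text())
```

In both cases Beautiful Soup receives markup to parse. In the question it received only the URL string, so it wrapped that string in a paragraph instead of fetching the page.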

Answer 1: (score: 1)

soup = BeautifulSoup(web_page, "lxml")

You should pass a file-like object to BeautifulSoup, not the URL.

The URL is handled by urllib2.urlopen(url), and the resulting response is stored in web_page.
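
Once the soup is built from the response rather than from the URL, the visible text the question asks about could be pulled out roughly as follows. This is only a sketch under assumptions: it targets Python 3, `visible_text` is a hypothetical helper name, the sample markup is invented, and the list of tags treated as invisible is a simple heuristic, not a complete rule.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; in practice this would be the fetched document.
html = """<html><head><title>Demo page</title><style>p { color: red; }</style></head>
<body><h1>Scores</h1><script>var x = 1;</script><p>India won by <b>5</b> wickets.</p></body></html>"""


def visible_text(markup):
    """Return the text of a page, skipping tags that browsers do not render as text."""
    soup = BeautifulSoup(markup, "html.parser")
    # Assumption: these tags are treated as non-visible; adjust the list as needed.
    for tag in soup(["script", "style", "noscript", "title"]):
        tag.extract()  # detach the tag (and its contents) from the tree
    # stripped_strings yields each text fragment with surrounding whitespace removed.
    return " ".join(soup.stripped_strings)


print(visible_text(html))
```

With a live page, the same helper could be given the response object instead of the inline string, since BeautifulSoup accepts either.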