Question

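So, I'm looking into Python again. I looked at it a long time ago but never learned much of the language, and now I'm picking it back up.

What I'm looking into right now is a web scraper, but I'm not sure that's the right idea for the project I have in mind. Please correct me if I'm wrong, but this is the project I'm thinking of:

I want to write a program that I just start, then enter a website URL (a specific page or a whole site). It then scans the site for Embed / iFrame code and writes the links to a table, like: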

Page title -|- # of iFrames found -|- Embed1 -|- Embed2 -|- and so on.

Am I looking at the right language and approach for this, or should I be looking for something else?

Thanks in advance for any feedback/support!

Answer 1

There are several ways to scrape a website. Here is an example using BeautifulSoup.

You can install BeautifulSoup with pip install beautifulsoup4 on Windows, or with apt-get install python3-bs4 on Debian/Ubuntu Linux (python-bs4 is the Python 2 package).

You can get started here.

Working code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the raw HTML (Python 3; urlopen raises urllib.error.HTTPError on 4xx/5xx responses)
r = urlopen('http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts').read()
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(r, 'html.parser')
print(type(soup))
print(soup.prettify()[0:1000])

Output (when this was run, the site served a Cloudflare "Access denied" page instead of the real content):

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html class="no-js" lang="en-US">
<head>
<title> Access denied | www.aflcio.org used Cloudflare to restrict access </title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=Edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="noindex, nofollow" name="robots"/>
<meta content="width=device-width,initial-scale=1,maximum-scale=1" name="viewport"/>
<link href="/cdn-cgi/styles/cf.errors.css" id="cf_styles-css" media="screen,projection" rel="stylesheet" type="text/css"/>
<!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]-->

You can then filter the output for the content you need. More details here.
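
To make the filtering step concrete for the table described in the question, here is a minimal sketch. It deliberately swaps BeautifulSoup for the standard library's html.parser so that it runs without any extra install. The names EmbedCollector, embed_row and SAMPLE_HTML are hypothetical, the sample page is made up and stands in for a downloaded page, and the " | " separator is only one reading of the table layout in the question.

```python
from html.parser import HTMLParser


class EmbedCollector(HTMLParser):
    """Collects the page title and the src of every <iframe>/<embed> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.sources = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("iframe", "embed"):
            src = dict(attrs).get("src")
            if src:  # skip tags that carry no src attribute
                self.sources.append(src)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title>...</title> is kept
        if self._in_title:
            self.title += data


def embed_row(html):
    """Return one table row: title | number of embeds | src1 | src2 | ..."""
    collector = EmbedCollector()
    collector.feed(html)
    collector.close()
    cells = [collector.title.strip(), str(len(collector.sources))]
    return " | ".join(cells + collector.sources)


# Hypothetical sample page, used here instead of a live download
SAMPLE_HTML = """
<html><head><title>Demo page</title></head>
<body>
<iframe src="https://example.com/player/1"></iframe>
<embed src="https://example.com/clip.swf">
</body></html>
"""

print(embed_row(SAMPLE_HTML))
```

With BeautifulSoup the same idea would likely be a call such as soup.find_all(['iframe', 'embed']) followed by reading each tag's src attribute. Scanning a whole site would additionally need a loop over the site's internal links, which this sketch does not attempt.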

Python | Web crawler | Am I using it right?
