Question

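I'm sorry if this is a duplicate, but I have been looking through a lot of StackOverflow questions on this and can't find a similar situation. I might be barking up the wrong tree here, but I'm new to programming, so even if someone could just set me on the right path it would help immensely.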

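I'm trying to scrape data from a website that can only be accessed from inside our network, using python 3.7 and Beautiful Soup 4. My first question is: is this a best-practice approach for a novice programmer, or should I be looking into something like javascript instead of python?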

My second question is that the website's root html file carries the following in its html tag: xmlns="http://www.w3.org/1999/xhtml". Does BeautifulSoup4 work with xhtml?

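I admit that I know nothing about web development, so even if someone can give me a few keywords or tips to start researching, to get me on a more productive path, it would be appreciated. Right now my biggest problem is that I don't know what I don't know, and all the python web scraping examples work on much simpler .html pages, whereas the tree of this page is made up of multiple html/css/jpg and gif files.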

Thanks, -Dane

Answer 1

Python, requests and BeautifulSoup are definitely the way to go, especially for a beginner. BeautifulSoup works with all variations of html, xml and so on.
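To illustrate the xhtml point, here is a minimal sketch that parses a small XHTML string with the built-in 'html.parser'. The document text is made up for this example and is not the asker's page. The xmlns attribute is simply read as an ordinary attribute of the html tag, and the usual lookups work as they would on plain html.

```python
from bs4 import BeautifulSoup

# Made-up XHTML document, used only for illustration
xhtml = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Internal page</title></head>
  <body>
    <a href="report.html">Report</a>
    <img src="logo.gif" alt="logo" />
  </body>
</html>"""

soup = BeautifulSoup(xhtml, 'html.parser')

# The title text, the namespace attribute, and the linked resources
page_title = soup.title.string
namespace = soup.html.get('xmlns')
hrefs = [a['href'] for a in soup.find_all('a', href=True)]
image_sources = [img['src'] for img in soup.find_all('img', src=True)]

print(page_title)
print(namespace)
print(hrefs)
print(image_sources)
```

If the lxml package happens to be installed, passing 'lxml' instead of 'html.parser' is another option, but the built-in parser needs no extra install.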

You will need to install python and then install requests and bs4. Both are easy to do by reading the requests docs and the bs4 docs.

I would suggest you learn a bit of the basics of python3 if you don't know them yet.

Here is a simple example that gets the title of the page you request:

import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

url = 'http://some.local.domain/'

response = requests.get(url)
soup = bs(response.text, 'html.parser')

# let's get title of the page
title = soup.title
print(title)

# let's get all the links in the page, resolved against the page url
# so that relative links such as 'images/a.gif' become full urls
links = soup.find_all('a', href=True)
hrefs = [urljoin(url, link['href']) for link in links]
for href in hrefs:
    print(href)

# pick two links to follow (this assumes the page has at least two)
link1 = hrefs[0]
link2 = hrefs[1]

# let's follow a link we find in the page (we'll go for the first)
response = requests.get(link1, stream=True)
# if we have an image and we want to download it
if response.status_code == 200:
    # name the file after the last path segment of the link
    filename = link1.rstrip('/').split('/')[-1] or 'download'
    with open(filename, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)

# if the link is another web page
response = requests.get(link2)
soup = bs(response.text, 'html.parser')

# let's get title of the page
title = soup.title
print(title)
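
The example above picks by hand which link is an image and which is another page. For a site made of many html/jpg/gif files, one possible approach is to sort the links by file extension before fetching anything. The sketch below is only an illustration: classify_links, IMAGE_EXTENSIONS and the sample html are hypothetical names and data invented here, and an extension check is a rough heuristic (checking the Content-Type header of the response would be more reliable).

```python
import posixpath
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

# Assumed list of extensions to treat as images; adjust for the real site
IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.gif', '.png'}


def classify_links(html, base_url):
    """Split the <a href> targets of a page into (pages, images)."""
    soup = BeautifulSoup(html, 'html.parser')
    pages, images = [], []
    for tag in soup.find_all('a', href=True):
        # Resolve relative links against the url of the page they came from
        absolute = urljoin(base_url, tag['href'])
        # Look only at the path part, so query strings don't confuse the check
        extension = posixpath.splitext(urlparse(absolute).path)[1].lower()
        if extension in IMAGE_EXTENSIONS:
            images.append(absolute)
        else:
            pages.append(absolute)
    return pages, images


# Made-up page content, used only to show the behaviour
sample_html = '''
<a href="/docs/index.html">Docs</a>
<a href="images/plan.JPG">Plan</a>
<a href="http://some.local.domain/about">About</a>
'''

pages, images = classify_links(sample_html, 'http://some.local.domain/home/')
print(pages)
print(images)
```

The urls in images could then be downloaded with requests.get(..., stream=True), and the urls in pages fetched and parsed in turn.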

Go ahead and look for tutorials on requests and BeautifulSoup; there are plenty of them ... like this one
