Question

This is my code so far:

import urllib2
with urllib2.urlopen("https://quora.com") as response:
    html = response.read()

I'm new to Python and have somehow managed to fetch the web page. How do I now extract the IDs and classes from it?

Answer 1

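A better approach is to use the BeautifulSoup (bs4) web-scraping library together with requests.

After installing both with pip, you can start like this: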

import requests 
from bs4 import BeautifulSoup

r = requests.get("http://quora.com")
soup = BeautifulSoup(r.content, "html.parser")

To find the element with a specific ID:

soup.find(id="your_id")

To find all elements with the class "Answer":

soup.find_all(class_="Answer")

You can then call .get_text() on an element to strip the HTML tags, and use ordinary Python string operations to organize the data.
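As a rough sketch of how these calls fit together, the snippet below collects every id and class name from a page and pulls the text out of the "Answer" elements. The markup in SAMPLE_HTML and the helper name extract_ids_and_classes are made up for illustration; in practice you would pass r.content from the request above instead.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page (r.content above).
SAMPLE_HTML = """
<html><body>
  <div id="main" class="Answer featured"><p>First <b>answer</b></p></div>
  <div id="side" class="Answer"><p>Second answer</p></div>
  <span class="note">No id here</span>
</body></html>
"""


def extract_ids_and_classes(markup):
    """Return (ids, classes) found in the markup, in document order."""
    soup = BeautifulSoup(markup, "html.parser")

    # id=True matches every tag that carries an id attribute.
    ids = [tag["id"] for tag in soup.find_all(id=True)]

    # class is multi-valued, so bs4 exposes it as a list of names per tag.
    classes = []
    for tag in soup.find_all(class_=True):
        for name in tag["class"]:
            if name not in classes:
                classes.append(name)
    return ids, classes


ids, classes = extract_ids_and_classes(SAMPLE_HTML)

# Text of each "Answer" element, with tags removed and whitespace trimmed.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
answer_texts = [
    tag.get_text(" ", strip=True) for tag in soup.find_all(class_="Answer")
]

print(ids)
print(classes)
print(answer_texts)
```

Passing " " as the separator to get_text() keeps words from adjacent tags from running together.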

Answer 2

You could try parsing the HTML with a dedicated library such as BeautifulSoup.

Answer 3

You can do this easily with XML parsing:

from lxml import html
import requests

# Fetch the page and save the raw response bytes to a file
page = requests.get('http://google.com')
with open('/home/Desktop/test.txt', 'wb') as f:
    f.write(page.content)
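The snippet above only saves the page and never uses the imported html module. A possible next step, sketched here under the assumption that lxml is the parser you want, is to build a tree and query it with XPath. SAMPLE_HTML is invented markup standing in for page.content.

```python
from lxml import html

# Hypothetical markup standing in for page.content from the request above.
SAMPLE_HTML = """
<html><body>
  <div id="main" class="Answer featured"><p>First <b>answer</b></p></div>
  <div id="side" class="Answer"><p>Second answer</p></div>
  <span class="note">No id here</span>
</body></html>
"""

tree = html.fromstring(SAMPLE_HTML)

# //@id selects the value of every id attribute in the document.
ids = tree.xpath("//@id")

# lxml returns each class attribute as one raw string, so split it into names.
classes = sorted(
    {name for value in tree.xpath("//@class") for name in value.split()}
)

# Look up a single element by id, or all elements carrying a class.
main_text = tree.get_element_by_id("main").text_content().strip()
answers = tree.find_class("Answer")

print(ids)
print(classes)
print(main_text, len(answers))
```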

How do I extract IDs and classes from a web page using Python?
