Question

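This is the link to the web page I want to scrape: https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html

I have also applied additional filters by clicking the circled heading [1].

This is how the page looks after clicking the heading [2].

I want to get the names of all the places displayed on the page, but I seem to be having trouble because the URL does not change when the filter is applied. I am using Python's urllib for this. Here is my code: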

from urllib.request import urlopen

# Download the page and decode the raw bytes into a string
url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)

Answer 1

You can use bs4. bs4 (Beautiful Soup 4) is a Python module that lets you extract specific content from a web page. This gets the text from the site:

from bs4 import BeautifulSoup as bs
soup = bs(html, features='html5lib')
text = soup.get_text()
print(text)

If you want something other than plain text, such as elements with a specific tag, you can also use bs4:

soup.find_all('p')  # get all p tags
soup.find_all('p', class_='Title')  # get all p tags with a class of Title

Find the tag and class used for the place names, then use the calls above to get all of the names.
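As a minimal sketch of that last step: the markup, the `a` tag, and the `place-name` class below are made up for illustration and are not TripAdvisor's real structure, so inspect the page source in your browser and substitute the real values. The helper `extract_place_names` is likewise a hypothetical name. The built-in `html.parser` is used here only so the example runs without installing html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page. The real tag
# and class names must be read from the actual page source.
sample_html = """
<div class="listing"><a class="place-name" href="/r1">Cafe One</a></div>
<div class="listing"><a class="place-name" href="/r2">Spice House</a></div>
<div class="listing"><a class="other" href="/ad">Sponsored</a></div>
"""

def extract_place_names(html, tag="a", css_class="place-name"):
    """Return the stripped text of every `tag` element with class `css_class`."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.find_all(tag, class_=css_class)]

names = extract_place_names(sample_html)
print(names)
```

If the result is an empty list on the real page, the names may not be present in the HTML that urlopen downloads (for example, if the filtered list is filled in by JavaScript after the page loads), in which case changing the tag or class will not help.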

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
