Question

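I'm very new to Python, but I'd like to know whether I can use Python and its modules to retrieve an href from page 1 and then retrieve the first paragraph on page 2.

Q2: Also, how can I scrape the first 10 link hrefs that share the same div class on the first page, and then scrape the first 10 paragraphs while looping?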

Answer 1

Yes, I believe you should be able to.

Try looking up the requests and beautifulsoup Python modules.
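To make that suggestion concrete, here is a minimal sketch of how the two modules could fit together for both parts of the question. It assumes requests and beautifulsoup4 are installed; the function names, the `div_class` parameter, and the 10-second timeout are my own illustrative choices, not anything from the question.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch(url):
    """Download a page and return its HTML (raises on HTTP errors)."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()
    return response.text


def links_in_divs(html, base_url, div_class, limit=10):
    """Return up to `limit` absolute hrefs of <a> tags inside <div class=div_class>."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.select(f"div.{div_class} a[href]")[:limit]
    # urljoin turns relative hrefs such as "/page2" into full URLs.
    return [urljoin(base_url, a["href"]) for a in anchors]


def first_paragraph(html):
    """Return the text of the first <p> element, or None if there is none."""
    para = BeautifulSoup(html, "html.parser").find("p")
    return para.get_text(" ", strip=True) if para else None


def scrape(start_url, div_class, limit=10):
    """Follow up to `limit` links from start_url; return (url, first paragraph) pairs."""
    links = links_in_divs(fetch(start_url), start_url, div_class, limit)
    return [(link, first_paragraph(fetch(link))) for link in links]
```

Calling `scrape(url, "some-class", limit=1)` would cover the first question (one href, one paragraph), and the default `limit=10` covers Q2. The class name has to match whatever the real page uses.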

Answer 2

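I would use two Python modules: requests and re (regular expressions). I'd use requests to fetch the site's raw HTML and then a regular expression to pull out, for example, your paragraph: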

import requests, re
site = requests.get("http://somewebsite.com").text
paragraphs = re.findall(r"<p>(.*?)</p>", site, re.DOTALL)
firstPara = paragraphs[0]
print(firstPara)

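The requests line here is self-explanatory. The regular expression says: find a <p> tag, and the parentheses mean return only the .*? part, where (.) matches any character, (*) means any number of them, and (?) makes the match lazy so that it stops at the first closing </p> tag. Finally, re.DOTALL means that newlines count as part of the match, so a paragraph that spans several lines is still found.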
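The same regex idea can be stretched to cover the href step too. The sketch below is only an illustration: the patterns and function names are my own, they assume quoted href values, and regexes remain fragile on real-world HTML. One change from the pattern above is that `<p\b[^>]*>` also accepts tags with attributes such as `<p class="intro">`, which a bare `<p>` would miss.

```python
import re
from urllib.parse import urljoin

# First group is the paragraph body; tolerates attributes on the <p> tag.
PARAGRAPH_RE = re.compile(r"<p\b[^>]*>(.*?)</p>", re.DOTALL | re.IGNORECASE)
# Group 1 is the opening quote, group 2 the href value up to the matching quote.
HREF_RE = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1""", re.DOTALL | re.IGNORECASE
)
# Used to drop nested tags such as <b> from the paragraph text.
TAG_RE = re.compile(r"<[^>]+>")


def first_href(html, base_url):
    """Return the first <a href="..."> value as an absolute URL, or None."""
    match = HREF_RE.search(html)
    return urljoin(base_url, match.group(2)) if match else None


def first_paragraph(html):
    """Return the first paragraph's text with tags removed, or None."""
    match = PARAGRAPH_RE.search(html)
    if match is None:
        return None
    text = TAG_RE.sub("", match.group(1))
    return " ".join(text.split())  # collapse newlines and repeated spaces
```

Both functions take HTML strings, so they can be fed with `requests.get(url).text` exactly as in the answer: get the href from page 1, fetch that URL, then pass its HTML to `first_paragraph`.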

Answer 3

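An alternative to using beautifulsoup is the webbrowser module. With webbrowser you can open a page in the default web browser, or even specify a preferred browser to open it in (though it's best to stick with the default, since of course there's no guarantee that the user's preference matches yours!).

So you can open a URL like this: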

import webbrowser

webbrowser.open_new('https://stackoverflow.com/help/formatting')

Or like this:

import webbrowser

a = webbrowser.get('chrome') #target chrome (e.g)
a.open('https://www.stackoverflow.com')

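Unfortunately, if you just stick a # tag (for an anchor) onto the end of the URL, webbrowser doesn't seem to like that. Instead, you should define the anchor in a variable and pass it as an argument to a function: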

import webbrowser

def open_anchor(anchor):
    """Open the selected anchor in the default web browser."""
    webbrowser.open(anchor)
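If the anchor and the page URL live in separate variables, one way to combine them before handing the result to webbrowser is sketched below. The helper names `with_anchor` and `open_at_anchor` are hypothetical, and this is just one reading of the advice above, using only the standard library.

```python
import webbrowser
from urllib.parse import quote, urldefrag


def with_anchor(url, anchor):
    """Return url with any existing #fragment replaced by anchor."""
    base, _old_fragment = urldefrag(url)
    # Accept the anchor with or without a leading '#'.
    return f"{base}#{quote(anchor.lstrip('#'))}"


def open_at_anchor(url, anchor):
    """Open url at the given anchor in the default web browser."""
    return webbrowser.open(with_anchor(url, anchor))
```

For example, `open_at_anchor('https://stackoverflow.com/help/formatting', 'code')` would ask the default browser to open that page at `#code`, assuming the page defines such an anchor.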

There are more webbrowser examples on this page.

Hope this helps.

Can Python get an href link from the first page and then get a paragraph from page 2?
