Question

I'm working on a project where I need to get information from a web page. I'm using Python and Ghost. I saw this code in the documentation:

links = gh.evaluate("""
                    var linksobj = document.querySelectorAll("a");
                    var links = [];
                    for (var i=0; i<linksobj.length; i++){
                        links.push(linksobj[i].value);
                    }
                    links;
                """)

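This code is definitely not Python. What language is it, and where can I learn how to work with it? Also, how can I find a string inside a tag, e.g. in: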

title>this is title of the webpage

How can I get

this is title of the page

Thanks.

Answer 1

Use requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.google.com/")
# Pass the parser explicitly; recent bs4 versions warn when it is omitted.
soup = BeautifulSoup(r.text, "html.parser")

In [3]: soup.title.string
Out[3]: u'Google'
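
If installing third-party packages is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is only a minimal sketch: the class name TitleParser and the sample HTML string are made up for illustration, and it assumes static HTML (nothing generated by JavaScript).

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

    @property
    def title(self):
        return "".join(self.parts).strip()


# Hypothetical sample document, used here instead of a live HTTP request.
html = "<html><head><title>this is title of the webpage</title></head><body></body></html>"
parser = TitleParser()
parser.feed(html)
print(parser.title)
```

For real pages, r.text from the requests call above could be fed to the parser instead of the sample string.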

Answer 2

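ghost.py is a WebKit client. It lets you load a web page and interact with its DOM and JavaScript runtime. The string passed to evaluate is JavaScript, which is executed inside the loaded page.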

This means that once you have everything installed and running, you can do this:

from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://stackoverflow.com/')
if page.http_status == 200:
    result, extra = ghost.evaluate('document.title;')
    print('The title is: {}'.format(result))

Answer 3

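Edit: After reading Padraic Cunningham's answer, it seems I unfortunately misunderstood your question. Anyhow, I'm leaving my answer here for future reference, or possibly for downvotes. :P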

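If the output you receive is a string, then ordinary string operations in Python can produce the result you describe in the question.

You receive: title>this is title of the webpage

You want: this is title of the webpage

Assuming the output you receive always has the same format, you can use the following string operations to get the result you want. Using split: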

>>> s = 'title>this is title of the webpage'
>>> p = s.split('>')
>>> p
['title', 'this is title of the webpage']
>>> p[1]
'this is title of the webpage'

Here p is a list, so you have to index the element that contains the text you want.
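
One caveat with split: if the text itself contains another '>', the list ends up with more than two elements. A small variation, sketched here with str.partition (which splits only at the first occurrence), avoids that. The helper name text_after is just an illustrative choice.

```python
def text_after(s, marker='>'):
    """Return everything after the first marker, or s unchanged if marker is absent."""
    head, sep, tail = s.partition(marker)
    return tail if sep else s


print(text_after('title>this is title of the webpage'))
print(text_after('title>a > b'))
```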

A simpler way is to take a substring:

>>> s = 'title>this is title of the webpage'
>>> p = s[6:]
>>> p
'this is title of the webpage'

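In the snippet above, p = s[6:] means you want the part of the string 'title>this is title of the webpage' that runs from the 7th character to the end. In other words, you skip the first 6 characters.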
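
The hard-coded 6 only works because 'title>' happens to be six characters long. A slightly more defensive sketch derives the offset from the prefix itself and leaves the string alone if the prefix is missing. The function name strip_prefix is hypothetical, chosen for this example.

```python
def strip_prefix(s, prefix='title>'):
    """Drop prefix from the start of s if present; otherwise return s unchanged."""
    if s.startswith(prefix):
        return s[len(prefix):]
    return s


print(strip_prefix('title>this is title of the webpage'))
```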

If the output you receive does not always have the same format, you may prefer to use regular expressions.
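
As a rough sketch of the regular-expression route: the pattern below is my own guess at what the variations might look like (an optional leading '<', any letter case, an optional closing tag), not something taken from the question, so adjust it to the output you actually get.

```python
import re

# Optional "<", the tag name, ">", then capture the text up to an
# optional closing tag or the end of the string.
TITLE_RE = re.compile(r'<?title>\s*(.*?)\s*(?:</title>|$)', re.IGNORECASE | re.DOTALL)


def extract_title(s):
    """Return the captured title text, or None if the pattern does not match."""
    m = TITLE_RE.search(s)
    return m.group(1) if m else None


print(extract_title('title>this is title of the webpage'))
print(extract_title('<TITLE> Hello </title>'))
```

For full HTML documents, a real parser is usually the safer choice; a regex like this is only reasonable for short, predictable strings.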

Your second question has already been answered in the comments. I hope I understood your question correctly.

Getting information from a Ghost.py document
