Question

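I am trying to write a program that fetches all the URLs on a Google web search results page and returns them as a list, ordered by their position on the page. For example, if I search Google for "random" and the top result links to random.org, the first item in the returned list should be "https://www.random.org/", because that is the first result link in the page source for that search. I am using the urllib and re modules because I don't really know how to use Beautiful Soup or lxml, but a solution that uses Beautiful Soup and/or lxml is fine too. This is my code so far: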

import urllib.request
import re

def find(start,end):

    urls = []

    with open('data.txt', 'r') as myFile:
        pass  # Needs to append every URL found between the start and end strings in data.txt

    # Returns all URLs found between the start and end parameters in data.txt

    return urls


def parse(query):

    # Creates the url with the query

    url = 'https://www.google.com/search?q=' + query

    # Sends a browser User-Agent to get past Google's attempt to block scraping

    headers = {}
    headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"

    # Fetches data

    req = urllib.request.Request(url, headers = headers)
    resp = urllib.request.urlopen(req)
    respData = resp.read()

    # Saves the source code in a txt file

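    # Decode the bytes first; str(respData) would write the b'...' repr instead of the HTML
    with open('data.txt', 'w', encoding='utf-8') as saveFile:
        saveFile.write(respData.decode('utf-8', errors='replace'))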

    # Finds the urls and returns them

    newUrl = find('<h3 class="r"><a href="','"')
    return newUrl

print(parse("random"))

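Problem: I can't get the find() function to work. I don't know how to extract the URLs from the source code saved in data.txt (and held in the variable respData). I want this to be efficient, so I would like to use regular expressions. But I don't know how to pull a URL out of the source based on where it starts (the class part, which is one parameter of find) and where it ends (the double quote, which is the other parameter of find).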

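Simplified question: given some text data, how do you build a list of every substring of data that lies between two strings, start and finish? How do you make that efficient for a large amount of page source, and then apply it to the find() function in the original code?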
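The simplified question can be sketched with a non-greedy regular expression. This is only a minimal sketch: the name find_between and the sample string are made up for illustration, and the sample merely imitates the `<h3 class="r">` markup quoted in the question, which the real page may not match.

```python
import re

def find_between(data, start, finish):
    """Return every substring of data that sits between start and finish, in order."""
    # re.escape makes both delimiters match literally. (.*?) is non-greedy, so each
    # match stops at the first finish after start. DOTALL lets a match span newlines.
    pattern = re.compile(re.escape(start) + r'(.*?)' + re.escape(finish), re.DOTALL)
    return pattern.findall(data)

# Hypothetical sample shaped like the markup quoted in the question.
sample = ('<h3 class="r"><a href="https://www.random.org/">A</a></h3>'
          '<h3 class="r"><a href="https://example.com/">B</a></h3>')
urls = find_between(sample, '<h3 class="r"><a href="', '"')
print(urls)
```

The two delimiter strings are the same ones passed to find() in the question, so the body of find() could read data.txt and return the result of a call like this one.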

Note: I am on Python 3.6.3, so I am using urllib.request rather than urllib2. If getting every URL on the Google results page takes too long, the first 10 URLs are enough.
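If only the first 10 URLs are needed, one possible approach is to scan lazily and stop early. The sketch below assumes the page source was saved as UTF-8 text; first_matches, its path parameter, and the temporary demo file are hypothetical names used only for illustration.

```python
import os
import re
import tempfile
from itertools import islice

def first_matches(path, start, end, limit=10):
    """Return at most `limit` strings found between start and end in the file at path."""
    with open(path, 'r', encoding='utf-8', errors='replace') as my_file:
        data = my_file.read()
    pattern = re.compile(re.escape(start) + r'(.*?)' + re.escape(end), re.DOTALL)
    # finditer is lazy, so islice stops the scan as soon as `limit` matches are found.
    return [m.group(1) for m in islice(pattern.finditer(data), limit)]

# Demo on a throwaway file that holds 25 fake result headings.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'data.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write(''.join(
            '<h3 class="r"><a href="https://example.com/%d">x</a></h3>' % i
            for i in range(25)))
    top = first_matches(demo_path, '<h3 class="r"><a href="', '"', limit=10)

print(len(top), top[0])
```

For a single results page the whole file fits comfortably in memory, so reading it in one go and limiting the number of matches is likely simpler than processing it line by line.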

Answer 1

Use Beautiful Soup and you should be good.

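import urllib.request
from bs4 import BeautifulSoup
# ... build `req` as in the question's code ...

resp = urllib.request.urlopen(req)
soup = BeautifulSoup(resp, 'html.parser')  # name the parser explicitly

# In the question's markup the "r" class is on the <h3>; the link is inside it
for h3 in soup.find_all('h3', class_='r'):
    link = h3.find('a')
    if link is not None:
        print(link.get('href'))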

I haven't tested this, but that is how you search with Beautiful Soup.

Also, parsing HTML with regex alone can get very tricky. It is better to let Beautiful Soup 4 or Scrapy handle the parsing.
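To illustrate why a parser tends to be more robust than a fixed start/end string, here is a rough sketch that uses the standard library's html.parser in place of Beautiful Soup or Scrapy (a swapped-in technique, chosen only so the example runs without third-party packages). The class name ResultLinkParser and the sample markup are made up; the sample varies the quoting, spacing, and attribute order in ways that an exact-string match on `<h3 class="r"><a href="` would miss.

```python
from html.parser import HTMLParser

class ResultLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside an <h3> whose class includes "r"."""

    def __init__(self):
        super().__init__()
        self.in_result_heading = False
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h3' and 'r' in (attrs.get('class') or '').split():
            self.in_result_heading = True
        elif tag == 'a' and self.in_result_heading and attrs.get('href'):
            self.urls.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'h3':
            self.in_result_heading = False

# Hypothetical markup: single quotes, extra spaces, extra classes and attributes.
sample = (
    '<div><a href="https://ignored.example/">nav</a>'
    "<h3 class='r'><a  href='https://www.random.org/'>RANDOM.ORG</a></h3>"
    '<h3 class="r other"><a class="x" href="https://example.com/">Example</a></h3></div>'
)
parser = ResultLinkParser()
parser.feed(sample)
parser.close()
link_urls = parser.urls
print(link_urls)
```

The navigation link outside any result heading is skipped, and both result links are found in page order even though neither matches the literal start string from the question.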
