我想定期检查Google列出的子域名。
要获取子域列表,请在Google搜索框中键入“site:example.com” - 这会列出所有子域结果(我们的域名超过20页)。
仅提取“site:example.com”搜索返回的地址的网址的最佳方法是什么?
我正在考虑编写一个小的python脚本,它将执行上述搜索并从搜索结果中重新编写URL(在所有结果页面上重复)。这是一个好的开始吗?可以有更好的方法吗?
干杯。
答案 0 :(得分:16)
正则表达式是解析HTML的一个坏主意。阅读并依赖格式良好的HTML是神秘的。
为Python尝试BeautifulSoup。这是一个示例脚本,它返回网站前10页的网址:domain.com Google查询。
import sys # Used to add the BeautifulSoup folder the import path
import urllib2 # Used to read the html document
if __name__ == "__main__":
### Import Beautiful Soup
### Here, I have the BeautifulSoup folder in the level of this Python script
### So I need to tell Python where to look.
sys.path.append("./BeautifulSoup")
from BeautifulSoup import BeautifulSoup
### Create opener with Google-friendly user agent
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
### Open page & generate soup
### the "start" variable will be used to iterate through 10 pages.
for start in range(0,10):
url = "http://www.google.com/search?q=site:stackoverflow.com&start=" + str(start*10)
page = opener.open(url)
soup = BeautifulSoup(page)
### Parse and find
### Looks like google contains URLs in <cite> tags.
### So for each cite tag on each page (10), print its contents (url)
for cite in soup.findAll('cite'):
print cite.text
输出:
stackoverflow.com/
stackoverflow.com/questions
stackoverflow.com/unanswered
stackoverflow.com/users
meta.stackoverflow.com/
blog.stackoverflow.com/
chat.meta.stackoverflow.com/
...
当然,您可以将每个结果附加到列表中,以便为子域解析它。我刚刚进入Python并在几天前刮擦,但这应该让你开始。
答案 1 :(得分:3)
Google Custom Search API可以以ATOM XML格式提供结果
答案 2 :(得分:0)
另一种使用 requests
, bs4
的方法:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {'q': 'site:minecraft.fandom.com'}
html = requests.get(f'https://www.google.com/search?q=',
headers=headers,
params=params).text
soup = BeautifulSoup(html, 'lxml')
for container in soup.findAll('div', class_='tF2Cxc'):
link = container.find('a')['href']
print(link)
输出:
https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff
要使用分页从每个页面获取这些结果:
from bs4 import BeautifulSoup
import requests, urllib.parse
import lxml
def print_extracted_data_from_url(url):
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get(url, headers=headers).text
soup = BeautifulSoup(response, 'lxml')
print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
print(f'Current URL: {url}')
print()
for container in soup.findAll('div', class_='tF2Cxc'):
head_link = container.a['href']
print(head_link)
return soup.select_one('a#pnnext')
def scrape():
next_page_node = print_extracted_data_from_url(
'https://www.google.com/search?hl=en-US&q=site:minecraft.fandom.com')
while next_page_node is not None:
next_page_url = urllib.parse.urljoin('https://www.google.com', next_page_node['href'])
next_page_node = print_extracted_data_from_url(next_page_url)
scrape()
部分输出:
Results via beautifulsoup
Current page: 1
Current URL: https://www.google.com/search?hl=en-US&q=site:minecraft.fandom.com
https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff
或者,您可以使用来自 SerpApi 的 Google Search Engine Results API。这是一个付费 API,可免费试用 5,000 次搜索。
要集成的代码:
from serpapi import GoogleSearch
import os
params = {
"engine": "google",
"q": "site:minecraft.fandom.com",
"api_key": os.getenv('API_KEY')
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
link = result['link']
print(link)
输出:
https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff
使用分页:
import os
from serpapi import GoogleSearch
def scrape():
params = {
"engine": "google",
"q": "site:minecraft.fandom.com",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
print(f"Current page: {results['serpapi_pagination']['current']}")
for result in results["organic_results"]:
print(f"Title: {result['title']}\nLink: {result['link']}\n")
while 'next' in results['serpapi_pagination']:
search.params_dict["start"] = results['serpapi_pagination']['current'] * 10
results = search.get_dict()
print(f"Current page: {results['serpapi_pagination']['current']}")
for result in results["organic_results"]:
print(f"Title: {result['title']}\nLink: {result['link']}\n")
scrape()
<块引用>
免责声明,我为 SerpApi 工作。