I am trying to get the HTML of Google's search results. For example, I send a GET request to:
https://www.google.ru/?q=1111
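Everything works in the browser, but when I fetch the page with curl, or use "View source" on the Google page, I get only some JavaScript code and no search results. Is this some kind of protection? What can I do?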
Answer 0 (score: 6)
You now have to use the Google Search API to make GET requests.
All other methods have been blocked.
Answer 1 (score: 2)
The page in your question is the Google search page with the input field.
The search results page looks like this:
https://www.google.ru/search?q=1111
To get the HTML of Google search results pages with fewer bans, rotate proxies and user agents, and add delays between similar requests.
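The rotation and delay idea could be sketched roughly as follows, using only the Python standard library. The user agent strings, the proxy addresses, the delay range, and the names `build_request` and `fetch_html` are all placeholders chosen for illustration, not values from this answer, and nothing here guarantees that Google will serve results.

```python
import random
import time
import urllib.request
from urllib.parse import urlencode

# Hypothetical pools: replace with user agents and proxies you actually control.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]
PROXIES = ["http://127.0.0.1:8080", "http://127.0.0.1:8081"]  # placeholders


def build_request(query, host="www.google.ru"):
    """Build a request for the /search results page with a random user agent."""
    url = f"https://{host}/search?" + urlencode({"q": query})
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)


def fetch_html(query, min_delay=5.0, max_delay=15.0):
    """Wait a random interval, pick a random proxy, and return the page HTML."""
    time.sleep(random.uniform(min_delay, max_delay))  # assumed delay range
    proxy = random.choice(PROXIES)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    with opener.open(build_request(query), timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_html("1111")` would then request https://www.google.ru/search?q=1111 through one of the listed proxies after a randomized pause.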
Alternatively, use SerpApi to access the HTML and the data extracted from it. It has a free trial.
curl -s 'https://serpapi.com/search?q=coffee'
Output
{
// Omitted
"organic_results": [
{
"position": 1,
"title": "Coffee - Wikipedia",
"link": "https://en.wikipedia.org/wiki/Coffee",
"displayed_link": "en.wikipedia.org › wiki › Coffee",
"snippet": "Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red ...",
"sitelinks": {
"expanded": [
{
"title": "History",
"link": "https://en.wikipedia.org/wiki/History_of_coffee",
"snippet": "The history of coffee dates back to the 15th century, and possibly ..."
},
{
"title": "International Coffee Day",
"link": "https://en.wikipedia.org/wiki/International_Coffee_Day",
"snippet": "International Coffee Day (1 October) is an occasion that is ..."
},
{
"title": "List of coffee drinks",
"link": "https://en.wikipedia.org/wiki/List_of_coffee_drinks",
"snippet": "Milk coffee - Nitro cold brew coffee - List of coffee dishes - ..."
},
{
"title": "Portal:Coffee",
"link": "https://en.wikipedia.org/wiki/Portal:Coffee",
"snippet": "Coffee is a brewed drink prepared from roasted coffee beans, the ..."
},
{
"title": "Coffee bean",
"link": "https://en.wikipedia.org/wiki/Coffee_bean",
"snippet": "A coffee bean is a seed of the Coffea plant and the source for ..."
},
{
"title": "Geisha",
"link": "https://en.wikipedia.org/wiki/Geisha_(coffee)",
"snippet": "Geisha coffee, sometimes referred to as Gesha coffee, is a type of ..."
}
],
"list": [
{
"date": "Color: Black, dark brown, light brown, beige"
}
]
},
"rich_snippet": {
"bottom": {
"detected_extensions": {
"introduced_th_century": 15
},
"extensions": [
"Introduced: 15th century",
"Color: Black, dark brown, light brown, beige"
]
}
},
"cached_page_link": "https://webcache.googleusercontent.com/search?q=cache:U6oJMnF-eeUJ:https://en.wikipedia.org/wiki/Coffee+&cd=2&hl=sv&ct=clnk&gl=se",
"related_pages_link": "https://www.google.se/search?gl=se&hl=sv&q=related:https://en.wikipedia.org/wiki/Coffee+coffee&sa=X&ved=2ahUKEwjJ9p2p_KXuAhVlRN8KHf22D8wQHzABegQIAhAJ"
}
},
// ...
}
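Once a response like the one above has been parsed, the organic results can be read as ordinary dictionaries. This is a minimal sketch that uses a trimmed stand-in for the payload, since the output printed above is abridged and its `//` comments are not valid JSON. The helper name `organic_links` is made up for illustration; only the field names `organic_results`, `position`, `title`, and `link` come from the output shown.

```python
import json


def organic_links(payload):
    """Return (position, title, link) tuples for each organic result."""
    return [
        (result.get("position"), result.get("title"), result.get("link"))
        for result in payload.get("organic_results", [])
    ]


# Trimmed stand-in for a real response body.
sample = json.loads("""
{
  "organic_results": [
    {
      "position": 1,
      "title": "Coffee - Wikipedia",
      "link": "https://en.wikipedia.org/wiki/Coffee"
    }
  ]
}
""")

for position, title, link in organic_links(sample):
    print(position, title, link)
```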
Disclaimer: I work at SerpApi.
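Answer 2 (score: 0)
Adding some more sauce to the answers, since they are incorrect and do not even address your question.
First, scraping Google is perfectly legal as long as you do not harm their service by doing so (DoS-like).
Those methods are not blocked either; it is just not that simple.
Speed depends on your approach; it does not have to be very slow.
If needed, you can scrape tens of thousands of keyword pages in a minute.
You can find a better answer on the topic here: Is it ok to scrape data from Google results?
Your curl problem does come from protection: Google does not allow automated access and has a very sophisticated set of detection algorithms.
They range from simple user agent checks (which is what blocks you right away) to artificial intelligence that tries to detect unusual or related queries.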
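Answer 3 (score: -2)
You can load it in a browser and then scrape the results with JavaScript.
Or you can use the Google API, but you have to pay if you make more than 100 requests per day.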
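The answer does not say which Google API it means. Assuming it is the Custom Search JSON API, a request could be sketched as below. The endpoint and the `key`, `cx`, and `q` parameters belong to that API; the function names are illustrative, and the API key and search engine ID have to come from your own Google account.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Endpoint of the Custom Search JSON API (an assumption about which API is meant).
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_api_url(query, api_key, engine_id):
    """Build the request URL from the query, API key, and search engine ID."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    return API_ENDPOINT + "?" + urlencode(params)


def search(query, api_key, engine_id):
    """Call the API and return (title, link) pairs from the "items" list."""
    url = build_api_url(query, api_key, engine_id)
    with urllib.request.urlopen(url, timeout=30) as response:
        data = json.load(response)
    return [(item["title"], item["link"]) for item in data.get("items", [])]
```

This returns structured JSON rather than the HTML of the results page, so it fits only if the extracted data is what you need.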