Question

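In the following robots.txt file, the directive disallows every directory for magpie-crawler. Suppose I use a different web crawler, such as Scrapy. The robots.txt file lists nothing else, so is that crawler allowed to crawl the site?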

User-agent: magpie-crawler
Disallow: /


Sitemap: https://www.digitaltrends.com/sitemap_index.xml
Sitemap: https://www.digitaltrends.com/news.sitemap.google.xml
Sitemap: https://www.digitaltrends.com/image-sitemap-index.xml

Answer 1

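According to the official website, this does mean that only that single bot is disallowed. You can use Scrapy if you want.

If they wanted to, they could instead allow only a single bot: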

User-agent: Google
Disallow: 

User-agent: * 
Disallow: /
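
To check how a parser reads these rules, here is a minimal sketch using Python's standard-library urllib.robotparser. The helper name is_allowed and the page URL are hypothetical and used only for illustration; the two rule sets are the ones quoted in this thread. Other parsers may match user-agent names slightly differently, so treat the results as what this particular parser reports.

```python
from urllib.robotparser import RobotFileParser

# Rules from the question: only magpie-crawler is blocked.
QUESTION_ROBOTS = """\
User-agent: magpie-crawler
Disallow: /

Sitemap: https://www.digitaltrends.com/sitemap_index.xml
"""

# Rules from this answer: only "Google" is allowed, everyone else is blocked.
ALLOW_ONE_ROBOTS = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""

# Hypothetical page URL, chosen only for the example.
URL = "https://www.digitaltrends.com/news/"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(QUESTION_ROBOTS, "magpie-crawler", URL))  # False
    print(is_allowed(QUESTION_ROBOTS, "Scrapy", URL))          # True
    print(is_allowed(ALLOW_ONE_ROBOTS, "Google", URL))         # True
    print(is_allowed(ALLOW_ONE_ROBOTS, "Scrapy", URL))         # False
```

With no "User-agent: *" group in the file, this parser falls back to allowing any agent that is not named explicitly, which matches the reading given above.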

Answer 2

You can use Scrapy to scrape the data. Just identify yourself as a web browser in the request headers, via the Scrapy settings:

'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36'
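
For context, a sketch of where that entry might live. The dictionary below is written as plain Python so it runs on its own. In a real project it would typically be assigned to a spider's custom_settings attribute, or the keys set as module-level names in settings.py. ROBOTSTXT_OBEY is an existing Scrapy setting that is not mentioned in the answer above; it is added here as an assumption, to show how to keep honoring robots.txt alongside a custom user agent.

```python
# Hypothetical settings dictionary for a Scrapy spider (illustration only).
custom_settings = {
    # Browser-like user agent string, copied from the answer above.
    "USER_AGENT": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/68.0.3440.84 Safari/537.36"
    ),
    # Assumption: keep Scrapy's robots.txt handling switched on.
    "ROBOTSTXT_OBEY": True,
}
```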

Reading a robots.txt file