How do I keep web crawlers from impacting my caches (Redis, CDN, etc.)?

Asked: 2016-08-05 15:57:09

Tags: caching web-crawler robots.txt cache-control

Premise: if a web crawler crawls my entire site, my default caching mechanism (e.g. Redis) will be flooded, and good data may be aged out prematurely (depending on the cache policy).

Assuming a crawler doesn't need the performance boost I give end users, can I modify my application to protect the cache?

Questions

  • Is this a good idea?
  • Do web crawlers measure the time difference between delivered content?
  • Aside from the user agent, should I "tag" a session that requests robots.txt and assume it is a crawler?
  • Should I handle this delivery administratively or programmatically?
  • As an extreme measure, can I throttle a web crawler?

If I implement this programmatically, I need a way to tell GetFromCacheAsync not to update the cache, based on certain client information.

  • Does adding a method overload to decide whether the cache should be updated violate any domain-driven design principles?
  • Where should the "update Redis" / "don't update Redis" logic live? I think this aspect is the one most relevant to DDD.
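To make the DDD question concrete, here is one possible shape (a sketch, not a prescription): keep the caching decision out of the domain service and let the data layer accept a flag. The `updateCache` parameter below is a hypothetical addition to the existing `GetFromCacheAsync`, not something in my current code.

```csharp
// Hypothetical overload of the existing Cache method: a cache miss
// still serves fresh data, but only writes back to Redis when
// updateCache is true.
public async Task<T> GetFromCacheAsync<T>(string key, Func<Task<T>> missedCacheCall,
    TimeSpan timeToLive, bool updateCache)
{
    if (!IsCacheAvailable)
        return await missedCacheCall();

    IDatabase cache = Connection.GetDatabase();
    var obj = await cache.GetAsync<T>(key);
    if (obj == null)
    {
        obj = await missedCacheCall();
        // Crawlers read through to the source without evicting or
        // refreshing entries that real users depend on.
        if (obj != null && updateCache)
            cache.Set(key, obj);
    }
    return obj;
}
```

Arguably the decision itself belongs in the layer that can see the HTTP request: the controller computes something like `bool updateCache = !IsCrawler(Request)` and passes it down, so the domain service stays free of both HTTP and Redis concerns.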

HomeController.cs

 public async Task<ActionResult> Events()
 {
     ViewBag.Events = await eventSvc.GetLiveEvents(DateTime.Now);
     return View();
 }

Services.EventManagementService.cs

public async Task<List<Event>> GetLiveEvents(DateTime currentDate)
{
    // Previously queried the database directly:
    // return ctx.Events.Where(e => e.StatusId == (int)EventStatus.Live && e.EventDate >= currentDate).ToList();

    // Pass the parameter through instead of re-reading DateTime.Now here.
    return await cloudCtx.GetLiveEvents(currentDate);
}

Data.CloudContext.cs

    public async Task<List<Event>> GetLiveEvents(DateTime currentDate)
    {
        string year = currentDate.Year.ToString();
        var key = GenerateLiveEventsKey(year); 

        var yearEvents = await cache.GetFromCacheAsync<List<Event>>(key, async () =>
        { 
            List<Event> events = new List<Event>();
            string partitionKey = year;

            TableQuery<EventRead> query = new TableQuery<EventRead>().Where(TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey));
            TableQuerySegment<EventRead> currentSegment = null;
            while (currentSegment == null || currentSegment.ContinuationToken != null)
            {
                currentSegment = await tableEvents.ExecuteQuerySegmentedAsync(query, currentSegment != null ? currentSegment.ContinuationToken : null);
                foreach (EventRead nosqlEvent in currentSegment.Results)
                {
                    var eventObj = nosqlEvent.ToEvent(true);
                    events.Add(eventObj);
                }
            }

            return events;
        }, TimeSpan.FromHours(1)); // example TTL; GetFromCacheAsync requires a timeToLive argument
        return yearEvents.Where(e => e.EventDate >= currentDate).ToList();
    }

Data.Cache.cs

  public async Task<T> GetFromCacheAsync<T>(string key, Func<Task<T>> missedCacheCall, TimeSpan timeToLive)
    {
        if (!IsCacheAvailable)
        {
            var ret = await missedCacheCall();
            return ret;
        }

        IDatabase cache = Connection.GetDatabase();
        var obj = await cache.GetAsync<T>(key);
        if (obj == null)
        {
            obj = await missedCacheCall();
            if (obj != null)
            {
                // Store with the requested TTL so entries actually expire
                // (assumes the custom Set extension accepts an expiry).
                cache.Set(key, obj, timeToLive);
            }
        }
        return obj;
    }

1 Answer:

Answer 0 (score: 0)

Do web crawlers measure the time difference between delivered content?

If it's not broken (or there isn't a real problem in the first place), don't fix it.

Aside from user agent, should I "tag" a session that references
robots.txt and assume they are a crawler?

Google does. Its bots slow down their crawl rate when necessary.

How should I administratively, or programmatically handle this delivery?

Checking the user agent is enough.
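As a minimal sketch of what such a user-agent check could look like in ASP.NET MVC — the marker list is illustrative, not exhaustive, and `IsCrawler` is a made-up helper name:

```csharp
// Crude user-agent check. Honest bots identify themselves; anything
// spoofing a browser user agent can't be caught this way regardless.
private static readonly string[] BotMarkers = { "bot", "crawler", "spider", "slurp" };

private static bool IsCrawler(HttpRequestBase request)
{
    var ua = request.UserAgent;
    if (string.IsNullOrEmpty(ua))
        return true; // no user agent at all: treat as a bot

    return BotMarkers.Any(m => ua.IndexOf(m, StringComparison.OrdinalIgnoreCase) >= 0);
}
```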

In an extreme example, can I throttle a web crawler?

Return HTTP 503 to tell bots they are requesting too frequently.
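A rough sketch of that idea as an MVC action filter. The per-minute limit, the in-memory counter, and the class name are all illustrative assumptions; a production version would prune old entries and likely use a sliding window:

```csharp
public class ThrottleCrawlersAttribute : ActionFilterAttribute
{
    // Request counts per client per minute; left unbounded here for brevity.
    private static readonly ConcurrentDictionary<string, int> Counts =
        new ConcurrentDictionary<string, int>();

    private const int MaxPerMinute = 30; // illustrative limit

    public override void OnActionExecuting(ActionExecutingContext filterContext)
    {
        var request = filterContext.HttpContext.Request;
        var ua = request.UserAgent ?? string.Empty;
        if (ua.IndexOf("bot", StringComparison.OrdinalIgnoreCase) < 0)
            return; // only throttle self-identified bots

        var key = request.UserHostAddress + "|" + DateTime.UtcNow.ToString("yyyyMMddHHmm");
        if (Counts.AddOrUpdate(key, 1, (_, c) => c + 1) > MaxPerMinute)
        {
            // 503 plus Retry-After: well-behaved crawlers back off and retry later.
            filterContext.HttpContext.Response.AddHeader("Retry-After", "60");
            filterContext.Result = new HttpStatusCodeResult(503, "Crawl rate too high");
        }
    }
}
```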

{{1}}

Yes, see above.