Theory: if a web crawler crawls my entire site, my default caching mechanism (e.g. Redis) will be flooded and may age out the wrong data (depending on the cache policy).
Assuming a web crawler does not need the performance boost I give end users, I could modify my application to protect the cache.
Question
If I were to implement this programmatically, I would need a way to tell GetFromCacheAsync not to update the cache, based on some piece of client information (see the sketch after the Data.Cache.cs listing below).
HomeController.cs
public async Task<ActionResult> Events()
{
    ViewBag.Events = await eventSvc.GetLiveEvents(DateTime.Now);
    return View();
}
Services.EventManagementService.cs
public async Task<List<Event>> GetLiveEvents(DateTime currentDate)
{
    // Previously read straight from the database context:
    //return ctx.Events.Where(e => e.StatusId == (int)EventStatus.Live && e.EventDate >= DateTime.Now).ToList();

    // Pass the supplied currentDate through rather than re-reading the clock.
    return await cloudCtx.GetLiveEvents(currentDate);
}
Data.CloudContext.cs
public async Task<List<Event>> GetLiveEvents(DateTime currentDate)
{
    // Events are cached per calendar year.
    string year = currentDate.Year.ToString();
    var key = GenerateLiveEventsKey(year);

    var yearEvents = await cache.GetFromCacheAsync<List<Event>>(key, async () =>
    {
        // Cache miss: page through the Azure Table partition for this year.
        List<Event> events = new List<Event>();
        string partitionKey = year;
        TableQuery<EventRead> query = new TableQuery<EventRead>().Where(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey));

        TableQuerySegment<EventRead> currentSegment = null;
        while (currentSegment == null || currentSegment.ContinuationToken != null)
        {
            currentSegment = await tableMyEvents.ExecuteQuerySegmentedAsync(
                query, currentSegment != null ? currentSegment.ContinuationToken : null);
            foreach (EventRead nosqlEvent in currentSegment.Results)
            {
                var eventObj = nosqlEvent.ToEvent(true);
                events.Add(eventObj);
            }
        }
        return events;
    });

    // Callers only see events on or after the requested date.
    return yearEvents.Where(e => e.EventDate >= currentDate).ToList();
}
Data.Cache.cs
public async Task<T> GetFromCacheAsync<T>(string key, Func<Task<T>> missedCacheCall, TimeSpan timeToLive)
{
    // If Redis is unavailable, fall through to the underlying data source.
    if (!IsCacheAvailable)
    {
        return await missedCacheCall();
    }

    IDatabase cache = Connection.GetDatabase();
    var obj = await cache.GetAsync<T>(key);
    if (obj == null)
    {
        // Cache miss: load from the data source and populate the cache.
        // Note: timeToLive is accepted but not passed to Set here.
        obj = await missedCacheCall();
        if (obj != null)
        {
            cache.Set(key, obj);
        }
    }
    return obj;
}
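For reference, one way to express "do not update the cache for this client" would be an extra parameter on GetFromCacheAsync that suppresses the write on a cache miss. This is only a minimal sketch under that assumption; the skipCacheWrite parameter is not part of the existing code.

public async Task<T> GetFromCacheAsync<T>(string key, Func<Task<T>> missedCacheCall, TimeSpan timeToLive, bool skipCacheWrite)
{
    if (!IsCacheAvailable)
    {
        return await missedCacheCall();
    }

    IDatabase cache = Connection.GetDatabase();
    var obj = await cache.GetAsync<T>(key);
    if (obj == null)
    {
        obj = await missedCacheCall();
        // A crawler still gets live data, but never populates the cache,
        // so it cannot evict entries that real users depend on.
        if (obj != null && !skipCacheWrite)
        {
            cache.Set(key, obj);
        }
    }
    return obj;
}

The flag would then have to be threaded from the controller, where the user agent is visible, through GetLiveEvents down to this method.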
Answer (score: 0)
Do web crawlers measure the time difference between delivered content?
If it isn't broken (or there is no real problem in the first place), don't fix it.
Aside from user agent, should I "tag" a session that references robots.txt and assume they are a crawler?
Google does. Its bots will slow down their crawl rate if necessary.
How should I administratively, or programmatically handle this delivery?
The user agent is enough.
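A minimal controller-side sketch of that idea, assuming GetLiveEvents has been given a matching skipCacheWrite parameter as outlined above, and that IsCrawler is a hypothetical helper doing a crude substring check on the user agent (uses System.Linq):

public async Task<ActionResult> Events()
{
    // Crawler traffic still gets live data, but is not allowed to warm the cache.
    bool isCrawler = IsCrawler(Request.UserAgent);
    ViewBag.Events = await eventSvc.GetLiveEvents(DateTime.Now, skipCacheWrite: isCrawler);
    return View();
}

private static bool IsCrawler(string userAgent)
{
    // Illustrative token list only; real detection lists are much longer.
    string[] tokens = { "bot", "crawler", "spider", "slurp" };
    if (string.IsNullOrEmpty(userAgent)) return false;
    return tokens.Any(t => userAgent.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0);
}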
In an extreme example, can I throttle a web crawler?
Return a 503 to tell bots they are requesting too frequently.
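If throttling is really needed, a sketch of that 503 approach as an ASP.NET MVC action filter might look like the following; CrawlerRateLimiter is a hypothetical per-client counter, not an existing type:

using System;
using System.Web.Mvc;

public class CrawlerThrottleAttribute : ActionFilterAttribute
{
    public override void OnActionExecuting(ActionExecutingContext filterContext)
    {
        var request = filterContext.HttpContext.Request;
        string userAgent = request.UserAgent ?? string.Empty;
        bool looksLikeCrawler = userAgent.IndexOf("bot", StringComparison.OrdinalIgnoreCase) >= 0;

        // Hypothetical rate tracker keyed by client address.
        if (looksLikeCrawler && CrawlerRateLimiter.IsOverLimit(request.UserHostAddress))
        {
            // 503 plus Retry-After tells well-behaved bots to back off and try later.
            filterContext.HttpContext.Response.AddHeader("Retry-After", "120");
            filterContext.Result = new HttpStatusCodeResult(503, "Crawl rate too high");
            return;
        }
        base.OnActionExecuting(filterContext);
    }
}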
{{1}}
Yes, see above.