Filtering records from a third-party REST API

Time: 2019-01-20 20:08:00

Tags: c# asp.net-web-api

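I have to search through millions of records. For this I have a RESTful service (a third-party API) that returns 25 records per call. Each response contains an array of 25 records, the page number, and the total number of pages; for example, if I pass PageNumber=2, I get the next 25 records. So to fetch all the data I have to loop up to the last page number, make one call per page, and append each call's records to the collection built so far. The limitation is that after 100 calls per minute the (third-party) server starts rejecting calls. As a result I never get the complete data set on which to run the required search.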

I tried looping over all the pages. I also tried Elasticsearch, but I don't think I understand it well enough to implement it.

using System.Collections.Generic;
using Newtonsoft.Json;

public class EmpResponse
{
    public int Pages;
    public int PageNumber;
    public List<Employee> TotalRecords;

    // Fetches page 1 to learn the page count, then appends the records of every other page.
    // The filter parameters are not used yet.
    public EmpResponse GetAllEmployees(string empId = "", string EmpName = "", string Manager = "")
    {
        string url = "thirdPartyurl?PageNumber=";
        EmpResponse baseRes = JsonConvert.DeserializeObject<EmpResponse>(DataHelpers.GetDataFromUrl(url + 1));
        for (int i = 2; i <= baseRes.Pages; i++)
        {
            EmpResponse currentRes = JsonConvert.DeserializeObject<EmpResponse>(DataHelpers.GetDataFromUrl(url + i));
            if (currentRes != null)
            {
                baseRes.TotalRecords.AddRange(currentRes.TotalRecords);
            }
        }
        return baseRes;
    }
}

DataHelpers.GetDataFromUrl calls the given URL and returns the response for that URL.

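baseRes.Pages now comes out as 100000, i.e. 100000 pages, which means 100000 calls. That takes a lot of time, and once there are more than 100 calls in a minute the third-party API starts rejecting them. How can I fetch the entire data set quickly under this limit?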
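As a rough sanity check on what "quickly" can mean here, the sketch below computes a lower bound on the total fetch time. It assumes that every page needs exactly one call and that the 100 calls per minute limit is the only constraint; the function name `min_fetch_minutes` is made up for this illustration.

```python
import math


def min_fetch_minutes(total_pages, calls_per_minute):
    """Lower bound on wall-clock minutes when each page costs one call."""
    return math.ceil(total_pages / calls_per_minute)


minutes = min_fetch_minutes(100_000, 100)
print(minutes, "minutes, about", round(minutes / 60, 1), "hours")
# prints: 1000 minutes, about 16.7 hours
```

Under these assumptions a full download cannot finish in under roughly 16.7 hours, however the client loop is written, so a client-side change can only help it stay within the limit, not go faster than the limit allows.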

1 Answer:

Answer 0: (score: -1)

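I tried to solve your problem with a batch-loading technique. This is what I did :) The interesting part is the Python part, because the Go web server only stands in for the rate-limited third-party service :)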

  

The HTTP server that handles all requests...

main.go

package main

import (
    "fmt"
    "io"
    "math/rand"
    "net/http"
    "sync"
    "time"
)

const (
    letterBytes   = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    letterIdxBits = 6                    // 6 bits to represent a letter index
    letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits
    letterIdxMax  = 63 / letterIdxBits   // # of letter indices fitting in 63 bits
)

var (
    src = rand.NewSource(time.Now().UnixNano())
    mu  sync.Mutex // guards src and the apiKeys map; handlers run concurrently
)

// generateData returns 64 random letters.
// Callers must hold mu, because src is not safe for concurrent use.
func generateData() string {
    b := make([]byte, 64)
    for i, cache, remain := 63, src.Int63(), letterIdxMax; i >= 0; {
        if remain == 0 {
            cache, remain = src.Int63(), letterIdxMax
        }
        if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
            b[i] = letterBytes[idx]
            i--
        }
        cache >>= letterIdxBits
        remain--
    }

    return string(b)
}

func main() {
    apiKeys := map[string]int{"abc": 0}

    // Reset the request counter every 5 seconds
    // (a shortened stand-in for the real one-minute window).
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    go func() {
        for range ticker.C {
            mu.Lock()
            apiKeys["abc"] = 0
            mu.Unlock()
        }
    }()

    fmt.Printf("Loading basic http server\r\n")
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Printf("Got request: %s\r\n", r.URL.Query())
        key := r.URL.Query().Get("api")

        mu.Lock()
        defer mu.Unlock()

        // check api key
        count, ok := apiKeys[key]
        if !ok {
            w.WriteHeader(http.StatusBadRequest)
            return
        }

        // check limitation
        if count >= 100 {
            w.WriteHeader(http.StatusForbidden)
            return
        }

        // save request
        apiKeys[key] = count + 1

        // generate data; the first write sends an implicit 200 OK,
        // so no explicit WriteHeader call is needed afterwards
        io.WriteString(w, generateData())
    })

    if err := http.ListenAndServe(":1337", nil); err != nil {
        fmt.Println(err)
    }
}

  

Our requester...

main.py

"""
load data
"""
import time

import requests

thirdPartyUrl = "http://localhost:1337/?api=abc"
requestsPerWindow = 100  # limit enforced by the test server
windowSeconds = 6        # slightly longer than the server's 5-second reset


def main():
    print("Loading program...")
    allPages = 100000
    sentInWindow = 0  # requests sent since the last pause
    r = requests.session()

    # Page numbers run from 1 to allPages; the rate counter is kept separately
    # so that pausing never resets the page number.
    for currentPage in range(1, allPages + 1):
        # Pause once the quota of the current window is used up.
        if sentInWindow >= requestsPerWindow:
            time.sleep(windowSeconds)
            sentInWindow = 0

        url = "{0}&PageNumber={1}".format(thirdPartyUrl, currentPage)
        data = r.get(url)
        sentInWindow = sentInWindow + 1

        # 403 means the server's window and ours drifted apart: wait and retry.
        while data.status_code == 403:
            time.sleep(windowSeconds)
            data = r.get(url)
            sentInWindow = 1

        print("CurrentPage: {0} | Data: {1}".format(currentPage, data.content))


if __name__ == "__main__":
    main()

This is just a prototype :)
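The fixed pause in the prototype can be swapped for a sliding-window limiter that only waits as long as needed. The following is a minimal sketch of that idea, not the original author's method: `SlidingWindowLimiter`, `fetch_all`, and the `fetch_page` callback are hypothetical names introduced here, and the clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any span of `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.sent = deque()  # timestamps of the calls still inside the window

    def wait(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self.clock()
            # Forget calls that have already left the window.
            while self.sent and now - self.sent[0] >= self.period:
                self.sent.popleft()
            if len(self.sent) < self.max_calls:
                self.sent.append(now)
                return
            # Window is full: sleep until the oldest call expires, then re-check.
            self.sleep(self.period - (now - self.sent[0]))


def fetch_all(fetch_page, total_pages, limiter):
    """Collect the records of pages 1..total_pages, throttled by limiter."""
    records = []
    for page in range(1, total_pages + 1):
        limiter.wait()
        records.extend(fetch_page(page))
    return records


if __name__ == "__main__":
    # Stub fetch function standing in for the real HTTP call (assumption).
    def fake_fetch(page):
        return ["record-{0}-{1}".format(page, i) for i in range(2)]

    demo_limiter = SlidingWindowLimiter(max_calls=2, period=0.05)
    print(len(fetch_all(fake_fetch, 5, demo_limiter)), "records fetched")
```

Against the real service one would presumably pass a `fetch_page` that requests the third-party URL and construct the limiter as `SlidingWindowLimiter(100, 60)`. How the server actually counts its window is not stated in the question, so a slightly lower `max_calls` may be the safer choice.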