I have to search across several million records. For this I have a REST service (a third-party API) that returns 25 records per call. Each response contains an array of 25 records, the current page number, and the total number of pages; if I pass PageNumber = 2, I get the next 25 records, and so on. So to fetch all the data I have to loop up to the last page number, make one call per PageNumber, and append each page's records to the previously collected set. The restriction is that after 100 calls per minute the (third-party) server starts rejecting calls. As a result I never get the complete data and can never run the search I need.
I have tried looping over all the pages. I have also tried Elasticsearch, but I don't think I understand it well enough to implement it.
public class EmpResponse
{
    public int Pages;
    public int PageNumber;
    public List&lt;Employee&gt; TotalRecords;

    // Returns EmpResponse: the method was declared as returning Employee,
    // which does not compile against the "return baseRes;" below.
    public EmpResponse GetAllEmployees(string empId = "", string EmpName = "", string Manager = "")
    {
        string url = "thirdPartyurl?PageNumber=";
        string baseUrl = "thirdPartyurl?PageNumber=1";

        // Fetch page 1 to learn the total page count.
        EmpResponse baseRes = JsonConvert.DeserializeObject&lt;EmpResponse&gt;(DataHelpers.GetDataFromUrl(baseUrl));

        // Fetch every remaining page and append its records to page 1's list.
        for (int i = 2; i <= baseRes.Pages; i++)
        {
            EmpResponse currentRes = JsonConvert.DeserializeObject&lt;EmpResponse&gt;(DataHelpers.GetDataFromUrl(url + i));
            if (currentRes != null)
            {
                foreach (var item in currentRes.TotalRecords)
                {
                    baseRes.TotalRecords.Add(item);
                }
            }
        }
        return baseRes;
    }
}
DataHelpers.GetDataFromUrl calls the given URL and returns that URL's response.
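For context, a minimal sketch of what such a helper might look like; the question does not show DataHelpers, so this implementation is an assumption:

using System.Net.Http;

public static class DataHelpers
{
    private static readonly HttpClient client = new HttpClient();

    // Assumed implementation: synchronously GET the URL and return the body.
    public static string GetDataFromUrl(string url)
    {
        return client.GetStringAsync(url).GetAwaiter().GetResult();
    }
}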
Now baseRes.Pages comes back as 100000 (i.e. baseRes.Pages = 100000), that is 100,000 pages, which means 100,000 calls. That takes a very long time, and once the number of calls in one minute exceeds 100 the third-party API starts rejecting them. So how can I fetch the entire data set quickly within this limit?
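For what it's worth, one way to work within such a limit is to fire the calls in batches: up to 100 concurrent requests, then wait out the remainder of the minute before the next batch. Below is a rough sketch under assumptions: ThrottledFetcher, FetchAllPagesAsync, and the async helper GetDataFromUrlAsync are hypothetical names, while EmpResponse and Employee are the types above, and the batch size comes from the stated 100-calls-per-minute budget.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json;

public static class ThrottledFetcher
{
    private static readonly HttpClient client = new HttpClient();

    // Hypothetical async counterpart of DataHelpers.GetDataFromUrl.
    private static Task<string> GetDataFromUrlAsync(string url) => client.GetStringAsync(url);

    public static async Task<List<Employee>> FetchAllPagesAsync(string url, int totalPages)
    {
        var records = new List<Employee>();
        const int batchSize = 100; // the per-minute budget (use 99 to stay strictly under it)

        for (int start = 2; start <= totalPages; start += batchSize)
        {
            var window = Stopwatch.StartNew();

            // Issue up to one batch of page requests concurrently.
            var batch = Enumerable.Range(start, Math.Min(batchSize, totalPages - start + 1))
                                  .Select(page => GetDataFromUrlAsync(url + page));
            foreach (var body in await Task.WhenAll(batch))
            {
                var res = JsonConvert.DeserializeObject<EmpResponse>(body);
                if (res != null)
                    records.AddRange(res.TotalRecords);
            }

            // Wait out the rest of the one-minute window before the next batch.
            var remaining = TimeSpan.FromMinutes(1) - window.Elapsed;
            if (remaining > TimeSpan.Zero && start + batchSize <= totalPages)
                await Task.Delay(remaining);
        }
        return records;
    }
}

Even at a fully saturated 100 calls per minute, 100,000 pages still take about 1,000 minutes (~17 hours); batching only removes the dead time between calls. A server-side search or bulk-export endpoint, if the third party offers one, would be the only substantially faster route.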
Answer 0 (score: -1)
Try solving your problem with a bulk-loading technique. This is what I did :) The interesting part is the Python side, because the Go web server just plays the role of the rate-limited third-party service :)
An HTTP server that handles all the requests...
main.go
package main

import (
	"fmt"
	"io"
	"math/rand"
	"net/http"
	"sync"
	"time"
)

const (
	letterBytes   = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
	letterIdxBits = 6                    // 6 bits to represent a letter index
	letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits
	letterIdxMax  = 63 / letterIdxBits   // # of letter indices fitting in 63 bits
)

var src = rand.NewSource(time.Now().UnixNano())

// generateData returns a random 64-character string standing in for one page of records.
func generateData() string {
	b := make([]byte, 64)
	for i, cache, remain := 63, src.Int63(), letterIdxMax; i >= 0; {
		if remain == 0 {
			cache, remain = src.Int63(), letterIdxMax
		}
		if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
			b[i] = letterBytes[idx]
			i--
		}
		cache >>= letterIdxBits
		remain--
	}
	return string(b)
}

func main() {
	var mu sync.Mutex // guards apiKeys: the handler and the reset goroutine both touch it
	apiKeys := make(map[string]int)
	apiKeys["abc"] = 0

	// Reset every key's call counter on a fixed interval (the simulated rate window).
	ticker := time.NewTicker(5 * time.Second)
	quit := make(chan struct{})
	go func() {
		for {
			select {
			case <-ticker.C:
				mu.Lock()
				apiKeys["abc"] = 0
				mu.Unlock()
			case <-quit:
				ticker.Stop()
				return
			}
		}
	}()

	fmt.Printf("Loading basic http server\r\n")
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Printf("Got request: %s\r\n", r.URL.Query())
		key := r.URL.Query().Get("api")

		mu.Lock()
		count, ok := apiKeys[key]
		mu.Unlock()

		// check api key
		if !ok {
			w.WriteHeader(http.StatusBadRequest)
			return
		}
		// check limitation
		if count >= 100 {
			w.WriteHeader(http.StatusForbidden)
			return
		}
		// get page number & generate data...
		data := generateData()
		io.WriteString(w, data)

		// save request (the body write above already sent an implicit 200,
		// so the explicit WriteHeader(http.StatusOK) the original had here
		// was superfluous and has been dropped)
		mu.Lock()
		apiKeys[key] = apiKeys[key] + 1
		mu.Unlock()
	})
	http.ListenAndServe(":1337", nil)
}
Our requester:
main.py
import requests
import time

"""
load data
"""

thirdPartyUrl = "http://localhost:1337/?api=abc"


def main():
    print("Loading program...")
    allPages = 100000
    currentPage = 0
    callsInWindow = 0  # calls made in the current rate-limit window
    r = requests.session()
    res = r.get("{0}&PageNumber={1}".format(thirdPartyUrl, currentPage))
    if res.status_code == 200:
        print("Everything fine! Go and get other content...")
        while currentPage < allPages:
            # The demo server resets its counter every 5 seconds, so after
            # 99 calls wait 6 seconds for a fresh window. (The original code
            # reset currentPage here as well, which would re-fetch pages
            # 1-99 forever; only the window counter should reset.)
            if callsInWindow >= 99:
                time.sleep(6)
                callsInWindow = 0
            currentPage = currentPage + 1
            callsInWindow = callsInWindow + 1
            print("CurrentPage: {0} | Calls in window: {1}".format(currentPage, callsInWindow))
            data = r.get("{0}&PageNumber={1}".format(thirdPartyUrl, currentPage))
            print("Data: {0}".format(data.content))


main()
This is just a prototype :)
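Note that the demo server resets its counter every 5 seconds, which is why the requester sleeps for 6; against the real API the window would be 60 seconds for 100 calls, so the sleep and threshold would need adjusting accordingly.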