Based on "How to collect paginated API responses using spring boot WebClient?", I created the following crawler class:
    class GitlabCrawler(private val client: WebClient, private val token: String) {

        fun fetchCommits(project: URI): Flux<Commit> {
            return fetchCommitsInternal(project).expand { cr: ClientResponse? ->
                val nextUrl = getNextUrl(cr)
                nextUrl?.let { fetchCommitsInternal(URI.create(it)) }
                    ?: Mono.empty<ClientResponse>()
            }.limitRate(1)
                .flatMap { cr: ClientResponse? -> cr?.bodyToFlux(Commit::class.java) ?: Flux.empty() }
        }

        private fun getNextUrl(cr: ClientResponse?): String? {
            // TODO replace with proper link parsing
            return cr?.headers()?.header(HttpHeaders.LINK)?.firstOrNull()
                ?.splitToSequence(",")
                ?.find { it.endsWith("rel=\"next\"") }
                ?.let { it.substring(it.indexOf('<') + 1, it.lastIndexOf('>')) }
        }

        private fun fetchCommitsInternal(url: URI): Mono<ClientResponse> {
            return client.get()
                .uri(url)
                .accept(MediaType.APPLICATION_JSON_UTF8)
                .header("Private-Token", token)
                .exchange()
        }
    }

    data class Commit(
        val id: String,
        val message: String,
        @JsonProperty("parent_ids") val parentIds: List<String>,
        @JsonProperty("created_at") val createdAt: String)
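I want to avoid unnecessary requests, but it performs more requests than are needed to satisfy the demand.

    gitlabCrawler.fetchCommits(URI.create("https://...")).take(15).collectList().block()

This should need only one request, because each page contains 20 entries, yet it starts a request for the second page. It always seems to request one page more than necessary. I tried using limitRate, but that appears to have no effect.

Is there a way to make it lazy, i.e. to request the next page only when the current one is exhausted?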
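The TODO in getNextUrl above could be addressed along the following lines. This is only a sketch in plain Java: the class name LinkHeaders and the example URLs are made up for illustration, and it assumes that link parameters never contain commas. It also takes the header as a single string, so all Link header values would have to be joined first rather than only the first one being inspected.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original crawler.
final class LinkHeaders {
    // One link-value: <url> followed by its parameters, up to the next comma.
    // Assumption: parameter values contain no commas.
    private static final Pattern LINK = Pattern.compile("<([^>]*)>([^,]*)");
    // The rel parameter, quoted or unquoted.
    private static final Pattern REL =
            Pattern.compile(";\\s*rel\\s*=\\s*\"?([^\";]*)\"?", Pattern.CASE_INSENSITIVE);

    // Returns the target of the first link whose rel list contains "next".
    static Optional<String> next(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher link = LINK.matcher(header);
        while (link.find()) {
            Matcher rel = REL.matcher(link.group(2));
            if (rel.find()) {
                // rel may hold several space-separated relation types.
                for (String type : rel.group(1).trim().split("\\s+")) {
                    if (type.equalsIgnoreCase("next")) {
                        return Optional.of(link.group(1));
                    }
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String header = "<https://gitlab.example.com/commits?page=1>; rel=\"first\", "
                + "<https://gitlab.example.com/commits?page=2>; rel=\"next\"";
        System.out.println(next(header).orElse("no next page"));
    }
}
```

Unlike the endsWith check, this does not depend on rel being the last parameter of a link or on the absence of trailing whitespace.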
Answer 0 (score: 0)
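Are you sure it actually performs the request? The fact that fetchCommitsInternal is called only means that WebFlux "prepared" the request, not necessarily that the request was executed (i.e. subscribed to).

The following simplification of your use case shows the difference: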
    private static Tuple2<Integer, Flux<Integer>> nextPage(int index, int pageSize) {
        System.out.println("prepared a request for page " + index);
        return Tuples.of(index, Flux.range((pageSize * (index - 1)) + 1, pageSize));
    }

    @Test
    public void expandLimitedRequest() {
        int pageSize = 5;
        Flux.just(nextPage(1, pageSize))
            .doOnSubscribe(sub -> System.out.println("requested first page"))
            .expand(page -> {
                int currentPage = page.getT1();
                if (currentPage < 3) {
                    int nextPage = currentPage + 1;
                    return Flux.just(nextPage(nextPage, pageSize))
                        .doOnSubscribe(sub -> System.out.println("requested page " + nextPage));
                }
                return Flux.empty();
            })
            .doOnNext(System.out::println)
            .flatMap(Tuple2::getT2)
            .doOnNext(System.out::println)
            .take(8)
            .blockLast();
    }
It prints:

    prepared a request for page 1
    requested first page
    [1,FluxRange]
    1
    2
    3
    4
    5
    prepared a request for page 2
    requested page 2
    [2,FluxRange]
    6
    7
    8
    prepared a request for page 3
As you can see, it prepares the request for page 3 but never executes it (because the downstream take cancels the expand).
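The same distinction between preparing and executing can be illustrated without Reactor. The sketch below is a plain-Java analogy, not WebFlux code: the class PreparedVsExecuted and its prepare method are hypothetical, with a Supplier standing in for a cold Mono and get() standing in for a subscription.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Plain-Java analogy (not Reactor); all names here are hypothetical.
final class PreparedVsExecuted {
    // Records what happened, in order.
    final List<String> log = new ArrayList<>();

    // "Preparing" only builds a description of the work,
    // much like calling a method that returns a Mono.
    Supplier<String> prepare(int page) {
        log.add("prepared page " + page);
        return () -> {
            // Runs only when get() is called, much like subscribing.
            log.add("executed page " + page);
            return "body of page " + page;
        };
    }

    public static void main(String[] args) {
        PreparedVsExecuted demo = new PreparedVsExecuted();
        demo.prepare(1).get(); // prepared and executed
        demo.prepare(2);       // prepared only, never executed
        demo.log.forEach(System.out::println);
    }
}
```

Running main logs "prepared page 1", "executed page 1" and "prepared page 2". Page 2 is prepared but no work is done for it, which mirrors what happens to page 3 in the test above.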