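I have some fairly simple F# async code that downloads a hundred random articles from Wikipedia (for research).
For some reason, the code hangs at arbitrary points during the download. Sometimes it happens after 50 articles, sometimes after 80.
The async code itself is pretty straightforward: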
    let parseWikiAsync(url:string, count:int ref) =
        async {
            use wc = new WebClientWithTimeout(Timeout = 5000)
            let! html = wc.AsyncDownloadString(Uri(url))
            let ret =
                try html |> parseDoc |> parseArticle
                with | ex -> printfn "%A" ex; None
            lock count (fun () ->
                if !count % 10 = 0 then
                    printfn "%d" !count
                count := !count + 1
            )
            return ret
        }
Because I couldn't figure out what the problem was from fsi, I created WebClientWithTimeout, a System.Net.WebClient wrapper that lets me specify a timeout:
    type WebClientWithTimeout() =
        inherit WebClient()
        member val Timeout = 60000 with get, set
        override x.GetWebRequest uri =
            let r = base.GetWebRequest(uri)
            r.Timeout <- x.Timeout
            r
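Then I use the async combinators to retrieve a little over a hundred pages and weed out every result where parseWikiAsync returned None (most of which are "disambiguation pages") until I have exactly 100 articles: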
    let en100 =
        let count = ref 0
        seq { for _ in 1..110 -> parseWikiAsync("http://en.wikipedia.org/wiki/Special:Random", count) }
        |> Async.Parallel
        |> Async.RunSynchronously
        |> Seq.choose id
        |> Seq.take 100
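When I compile the code and run it in the debugger, there are only three threads, and only one of them is running actual code: the Async pipeline. The other two show "not available" for their location and have nothing in the call stack.
I think this means it isn't stuck anywhere in AsyncDownloadString or parseWikiAsync. What else could be causing this?
Oh, and initially it takes about a full minute before the async code actually starts. After that it runs at a fairly reasonable pace until it hangs again indefinitely.
Here is the call stack for the main thread: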
> mscorlib.dll!System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle waitableSafeHandle, long millisecondsTimeout, bool hasThreadAffinity, bool exitContext) + 0x22 bytes
mscorlib.dll!System.Threading.WaitHandle.WaitOne(int millisecondsTimeout, bool exitContext) + 0x28 bytes
FSharp.Core.dll!Microsoft.FSharp.Control.AsyncImpl.ResultCell<Microsoft.FSharp.Control.AsyncBuilderImpl.Result<Microsoft.FSharp.Core.FSharpOption<Program.ArticleData>[]>>.TryWaitForResultSynchronously(Microsoft.FSharp.Core.FSharpOption<int> timeout) + 0x36 bytes
FSharp.Core.dll!Microsoft.FSharp.Control.CancellationTokenOps.RunSynchronously<Microsoft.FSharp.Core.FSharpOption<Program.ArticleData>[]>(System.Threading.CancellationToken token, Microsoft.FSharp.Control.FSharpAsync<Microsoft.FSharp.Core.FSharpOption<Program.ArticleData>[]> computation, Microsoft.FSharp.Core.FSharpOption<int> timeout) + 0x1ba bytes
FSharp.Core.dll!Microsoft.FSharp.Control.FSharpAsync.RunSynchronously<Microsoft.FSharp.Core.FSharpOption<Program.ArticleData>[]>(Microsoft.FSharp.Control.FSharpAsync<Microsoft.FSharp.Core.FSharpOption<Program.ArticleData>[]> computation, Microsoft.FSharp.Core.FSharpOption<int> timeout, Microsoft.FSharp.Core.FSharpOption<System.Threading.CancellationToken> cancellationToken) + 0xb9 bytes
WikiSurvey.exe!<StartupCode$WikiSurvey>.$Program.main@() Line 97 + 0x55 bytes F#
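Answer 0 (score: 8)
Wikipedia is not to blame here; this is a result of how Async.Parallel works internally. The type signature of Async.Parallel is seq<Async<'T>> -> Async<'T[]>. It returns a single Async value containing all of the results from the sequence, so it does not return until every computation in the seq<Async<'T>> has returned.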
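To make the "does not return until everything has returned" behaviour concrete, here is a minimal sketch that needs no network access. The fakeDownload function is a hypothetical stand-in for a real request (it just sleeps), and the delays are arbitrary values chosen for illustration.

```fsharp
open System.Diagnostics

// Hypothetical stand-in for a download: waits `ms` milliseconds, then returns `ms`.
let fakeDownload (ms : int) =
    async {
        do! Async.Sleep ms
        return ms
    }

let stopwatch = Stopwatch.StartNew ()

// Async.Parallel : seq<Async<'T>> -> Async<'T[]>
let results =
    [ 50; 100; 400 ]
    |> Seq.map fakeDownload
    |> Async.Parallel
    |> Async.RunSynchronously

// The array only becomes available once the slowest computation has finished,
// and the results keep the order of the input sequence.
printfn "results = %A, elapsed = %d ms" results stopwatch.ElapsedMilliseconds
```

Even though two of the three computations finish quickly, nothing is handed to the caller until the slowest one completes. With 110 real requests, a single stalled request is enough to hold back the whole array.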
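To illustrate, I've modified your code so that it tracks the number of outstanding requests, i.e., requests that have been sent to the server but whose responses have not yet been received and parsed.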
    open Microsoft.FSharp.Control
    open Microsoft.FSharp.Control.WebExtensions
    open System
    open System.Net
    open System.Threading

    type WebClientWithTimeout() =
        inherit WebClient()

        let mutable timeout = -1

        member __.Timeout
            with get () = timeout
            and set value = timeout <- value

        override x.GetWebRequest uri =
            let r = base.GetWebRequest(uri)
            r.Timeout <- x.Timeout
            r

    // Placeholder types and parsers so the example compiles on its own.
    type ParsedDoc = ParsedDoc
    type ParsedArticle = ParsedArticle

    let parseDoc (str : string) = ParsedDoc
    let parseArticle (doc : ParsedDoc) = Some ParsedArticle

    /// A synchronized wrapper around Console.Out so we don't
    /// get garbled console output.
    let synchedOut =
        System.Console.Out
        |> System.IO.TextWriter.Synchronized

    let parseWikiAsync(url : string, outstandingRequestCount : int ref) =
        async {
            use wc = new WebClientWithTimeout(Timeout = 5000)
            wc.Headers.Add ("User-Agent", "Friendly Bot 1.0 (FriendlyBot@friendlybot.com)")

            // Increment the outstanding request count just before we send the request.
            do
                // NOTE : The message must be created THEN passed to synchedOut.WriteLine --
                // piping it (|>) into synchedOut.WriteLine or using fprintfn causes a closure
                // to be created which somehow defeats the synchronization and garbles the output.
                let msg =
                    Interlocked.Increment outstandingRequestCount
                    |> sprintf "Outstanding requests: %i"
                synchedOut.WriteLine msg

            let! html = wc.AsyncDownloadString(Uri(url))
            let ret =
                try html |> parseDoc |> parseArticle
                with ex ->
                    let msg = sprintf "%A" ex
                    synchedOut.WriteLine msg
                    None

            // Decrement the outstanding request count now that we've
            // received a response and parsed it.
            do
                let msg =
                    Interlocked.Decrement outstandingRequestCount
                    |> sprintf "Outstanding requests: %i"
                synchedOut.WriteLine msg

            return ret
        }

    /// Writes a message to the console, passing a value through
    /// so it can be used within a function pipeline.
    let inline passThruWithMessage (msg : string) value =
        Console.WriteLine msg
        value

    let en100 =
        let outstandingRequestCount = ref 0
        seq { for _ in 1..120 ->
                parseWikiAsync("http://en.wikipedia.org/wiki/Special:Random", outstandingRequestCount) }
        |> Async.Parallel
        |> Async.RunSynchronously
        |> passThruWithMessage "Finished running all of the requests."
        |> Seq.choose id
        |> Seq.take 100
If you compile and run that code, you'll see output like this:
Outstanding requests: 4
Outstanding requests: 2
Outstanding requests: 1
Outstanding requests: 3
Outstanding requests: 5
Outstanding requests: 6
Outstanding requests: 7
Outstanding requests: 8
Outstanding requests: 9
Outstanding requests: 10
Outstanding requests: 12
Outstanding requests: 14
Outstanding requests: 15
Outstanding requests: 16
Outstanding requests: 17
Outstanding requests: 18
Outstanding requests: 13
Outstanding requests: 19
Outstanding requests: 20
Outstanding requests: 24
Outstanding requests: 22
Outstanding requests: 26
Outstanding requests: 27
Outstanding requests: 28
Outstanding requests: 29
Outstanding requests: 30
Outstanding requests: 25
Outstanding requests: 21
Outstanding requests: 23
Outstanding requests: 11
Outstanding requests: 29
Outstanding requests: 28
Outstanding requests: 27
Outstanding requests: 26
Outstanding requests: 25
Outstanding requests: 24
Outstanding requests: 23
Outstanding requests: 22
Outstanding requests: 21
Outstanding requests: 20
Outstanding requests: 19
Outstanding requests: 18
Outstanding requests: 17
Outstanding requests: 16
Outstanding requests: 15
Outstanding requests: 14
Outstanding requests: 13
Outstanding requests: 12
Outstanding requests: 11
Outstanding requests: 10
Outstanding requests: 9
Outstanding requests: 8
Outstanding requests: 7
Outstanding requests: 6
Outstanding requests: 5
Outstanding requests: 4
Outstanding requests: 3
Outstanding requests: 2
Outstanding requests: 1
Outstanding requests: 0
Finished running all of the requests.
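As you can see, all of the requests are issued before any of them is parsed. So if you're on a slower connection, or you're trying to retrieve a large number of documents, the server could be dropping connections because it assumes you aren't retrieving the responses it is trying to send. Another issue with the code is that you have to specify explicitly how many elements to generate in the seq, which makes the code less reusable.
A better solution is to retrieve and parse the pages as some consuming code needs them. (If you think about it, that's exactly what an F# seq is good for.) We'll start by creating a function that takes a Uri and produces a seq<Async<'T>>, i.e., an infinite sequence of Async<'T> values, each of which retrieves the content from the Uri and returns it (here as a string option; parsing happens further down the pipeline).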
    /// Given a Uri, creates an infinite sequence whose elements are retrieved
    /// from the Uri.
    let createDocumentSeq (uri : System.Uri) =
    #if DEBUG
        let outstandingRequestCount = ref 0
    #endif
        Seq.initInfinite <| fun _ ->
            async {
                use wc = new WebClientWithTimeout(Timeout = 5000)
                wc.Headers.Add ("User-Agent", "Friendly Bot 1.0 (FriendlyBot@friendlybot.com)")
    #if DEBUG
                // Increment the outstanding request count just before we send the request.
                do
                    // NOTE : The message must be created THEN passed to synchedOut.WriteLine --
                    // piping it (|>) into synchedOut.WriteLine or using fprintfn causes a closure
                    // to be created which somehow defeats the synchronization and garbles the output.
                    let msg =
                        Interlocked.Increment outstandingRequestCount
                        |> sprintf "Outstanding requests: %i"
                    synchedOut.WriteLine msg
    #endif
                let! html = wc.AsyncDownloadString uri
                // NOTE : An exception thrown by the download itself surfaces at the
                // let! above, so this try block only guards the expression inside it.
                let ret =
                    try Some html
                    with ex ->
                        let msg = sprintf "%A" ex
                        synchedOut.WriteLine msg
                        None
    #if DEBUG
                // Decrement the outstanding request count now that we've
                // received a response.
                do
                    let msg =
                        Interlocked.Decrement outstandingRequestCount
                        |> sprintf "Outstanding requests: %i"
                    synchedOut.WriteLine msg
    #endif
                return ret
            }
Now we use this function to retrieve the pages as a stream:
    let en100_Streaming =
    #if DEBUG
        let documentCount = ref 0
    #endif
        Uri ("http://en.wikipedia.org/wiki/Special:Random")
        |> createDocumentSeq
        |> Seq.choose (fun asyncDoc ->
            Async.RunSynchronously asyncDoc
            |> Option.bind (parseDoc >> parseArticle))
    #if DEBUG
        |> Seq.map (fun x ->
            let msg =
                Interlocked.Increment documentCount
                |> sprintf "Parsed documents: %i"
            synchedOut.WriteLine msg
            x)
    #endif
        |> Seq.take 50
        // None of the computations actually take place until
        // this point, because Seq.toArray forces evaluation of the sequence.
        |> Seq.toArray
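If you run that code, you'll see that it pulls results from the server one at a time and leaves no outstanding requests hanging. It is also very easy to change the number of results to retrieve: just change the value you pass to Seq.take.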
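The laziness this relies on can be checked without any network access. The sketch below is only an illustration: numbers is a hypothetical stand-in for the document sequence, and the produced counter exists purely to show how many elements were actually generated.

```fsharp
// Counts how many elements of the sequence have actually been generated.
let produced = ref 0

// Hypothetical stand-in for the document sequence: an infinite, lazy sequence
// whose generator records each time it is invoked.
let numbers =
    Seq.initInfinite (fun i ->
        produced.Value <- produced.Value + 1
        i * i)

// Building the pipeline does no work yet...
let firstThree = numbers |> Seq.take 3

// ...the generator only runs when the sequence is consumed.
let result = Seq.toArray firstThree

printfn "result = %A, elements produced = %d" result produced.Value
```

Only the elements requested through Seq.take are ever generated, which is why the streaming version never issues more requests than the consumer asks for.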
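Now, while the streaming code works fine, it does not execute requests in parallel, so it could be slow for a large number of documents. This is easy to fix, though the solution may be a little non-intuitive. Instead of trying to execute the entire sequence of requests in parallel, which is the problem in the original code, we create a function that uses Async.Parallel to execute small batches of requests in parallel, then uses Seq.collect to combine the results back into one flat sequence.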
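    /// Given a sequence of Async<'T>, creates a new sequence whose elements
    /// are computed in batches of a specified size.
    let parallelBatch batchSize (sequence : seq<Async<'T>>) =
        sequence
        // Seq.chunkBySize (FSharp.Core 4.0+) yields disjoint batches. Seq.windowed
        // would yield overlapping windows, so most requests would run several times.
        |> Seq.chunkBySize batchSize
        |> Seq.collect (fun batch ->
            batch
            |> Async.Parallel
            |> Async.RunSynchronously)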
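One detail worth checking when splitting a sequence into batches is whether the batches overlap. The sketch below compares two FSharp.Core functions on a small list: Seq.windowed produces sliding windows that share elements, while Seq.chunkBySize (available from FSharp.Core 4.0) produces disjoint chunks. The items list is just sample data.

```fsharp
let items = [ 1; 2; 3; 4; 5; 6 ]

// Sliding windows overlap: an element can appear in several windows.
let windows =
    items |> Seq.windowed 3 |> Seq.map List.ofArray |> List.ofSeq

// Chunks are disjoint: every element appears exactly once.
let chunks =
    items |> Seq.chunkBySize 3 |> Seq.map List.ofArray |> List.ofSeq

printfn "windowed:    %A" windows
printfn "chunkBySize: %A" chunks
```

For batching requests, disjoint chunks are what guarantee that each Async value in the sequence is run only once.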
To use this function, we only need a few small tweaks to the streaming version of the code:
    let en100_Batched =
        let batchSize = 10
    #if DEBUG
        let documentCount = ref 0
    #endif
        Uri ("http://en.wikipedia.org/wiki/Special:Random")
        |> createDocumentSeq
        // Execute batches in parallel
        |> parallelBatch batchSize
        |> Seq.choose (Option.bind (parseDoc >> parseArticle))
    #if DEBUG
        |> Seq.map (fun x ->
            let msg =
                Interlocked.Increment documentCount
                |> sprintf "Parsed documents: %i"
            synchedOut.WriteLine msg
            x)
    #endif
        |> Seq.take 50
        // None of the computations actually take place until
        // this point, because Seq.toArray forces evaluation of the sequence.
        |> Seq.toArray
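Again, you can easily change the number of documents to retrieve, and the batch size is just as easy to modify (though I suggest keeping it reasonably small). If you want, you can make a few tweaks to the 'streaming' and 'batching' code so that you can switch between them at run time.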
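As a different technique from the batching shown here, newer versions of FSharp.Core (4.7 and later, as far as I know) let Async.Parallel itself cap the number of computations running at once through an optional maxDegreeOfParallelism argument. The sketch below assumes such a version; fakeRequest, the counters, and the limit of 4 are hypothetical and only there to observe the cap without touching the network.

```fsharp
// Tracks how many simulated requests are in flight and the highest value seen.
let inFlight = ref 0
let maxInFlight = ref 0
let gate = obj ()

// Hypothetical stand-in for a page download.
let fakeRequest (i : int) =
    async {
        lock gate (fun () ->
            inFlight.Value <- inFlight.Value + 1
            maxInFlight.Value <- max maxInFlight.Value inFlight.Value)
        do! Async.Sleep 50
        lock gate (fun () -> inFlight.Value <- inFlight.Value - 1)
        return i
    }

// Run 20 simulated requests, but never more than 4 at the same time.
let results =
    Async.Parallel (Seq.init 20 fakeRequest, maxDegreeOfParallelism = 4)
    |> Async.RunSynchronously

printfn "completed %d requests, at most %d in flight at once" results.Length maxInFlight.Value
```

Unlike the batched pipeline, this still needs a finite input sequence and only returns once every computation has finished, so it limits the load on the server but does not stream results.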
One last thing: with my code the requests shouldn't time out, so you can get rid of the WebClientWithTimeout class and use WebClient directly.
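Answer 1 (score: 2)
Your code doesn't seem to be doing anything particularly special, so I'm going to assume that Wikipedia doesn't like your activity. Have a look at their bot policy. Digging a little deeper, they also appear to have a strict User-Agent policy:
"As of February 15, 2010, Wikimedia sites require a HTTP User-Agent header for all requests. This was an operative decision made by the technical staff and was announced and discussed on the technical mailing list.[1][2] The rationale is, that clients that do not send a User-Agent string are mostly ill behaved scripts that cause a lot of load on the servers, without benefiting the projects. Note that non-descriptive default values for the User-Agent string, such as used by Perl's libwww, may also be blocked from using Wikimedia web sites (or parts of the web sites, such as api.php)."
"User agents (browsers or scripts) that do not send a User-Agent header may now encounter an error message like this:"
"Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice."
So from what I can tell, they may not like what you're doing even if you add a proper user agent, but you might as well give it a try:
    wc.Headers.Add ("User-Agent", "Friendly Bot 1.0 (FriendlyBot@friendlybot.com)")
It also wouldn't hurt to avoid opening so many connections to the server.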