Google URL scraper does not loop through the rows in the ListBox

Asked: 2019-01-14 07:54:48

Tags: c# backgroundworker

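I am trying to scrape results from a Google search, and the tool needs to go through the result pages one by one. The problem is that it does not process every entry in the ListBox; it only works for the first row.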

Startbtn code:

    foreach (string url in urlList.Items)
    {
        webBrowser1.Navigate("https://www.google.com/search?q=" + url);
        await PageLoad(30, 5);
        MessageBox.Show("sdsaD3");
        string pageSource = webBrowser1.DocumentText;
        Scrape(pageSource);
    }

Scrape method:

    private async void Scrape(string pageSource)
    {
        string regexExpression = "(?<=><div class=\"rc\"><div class=\"r\"><a href=\")(.*?)(?=\" onmousedown=)";
        Regex match = new Regex(regexExpression, RegexOptions.Singleline);
        MatchCollection collection = Regex.Matches(pageSource, regexExpression);
        for (int i = 0; i < collection.Count; i++)
        {
            CommonCodes.WriteToTxt(collection[i].ToString(), "googlescrapedurls.txt");
            if (i == collection.Count - 1)
            {
                var elementid = webBrowser1.Document.GetElementById("pnnext");
                if (elementid != null)
                {
                    for (int w = 0; w < 1; w++)
                    {
                        BackgroundWorker worker = new BackgroundWorker();
                        worker.DoWork += new DoWorkEventHandler(backgroundWorker1_DoWork);
                        worker.RunWorkerAsync(w);
                    }
                }
                else if (webBrowser1.Document.GetElementById("pnnext") == null)
                {
                    for (int pg = 0; pg < urlList.Items.Count; pg++)
                    {
                        webBrowser1.Navigate("https://www.google.com/search?q=" + urlList.Items[pg + 1]);
                        CommonCodes.WaitXSeconds(10);
                        //await PageLoad(30, 5);
                        Scrape(webBrowser1.DocumentText);
                    }
                }
            }
        }
    }
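
One detail in the Scrape method above: the `Regex match` instance built with `RegexOptions.Singleline` is never used, because the static `Regex.Matches(pageSource, regexExpression)` overload is called instead, so that option is not applied. Below is a minimal sketch of the extraction step on its own. It assumes a hypothetical helper class `ResultLinkExtractor` (not part of the original code) and reuses the same pattern unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class ResultLinkExtractor
{
    // Same pattern as in Scrape; built once so RegexOptions.Singleline is actually applied.
    private static readonly Regex LinkPattern = new Regex(
        "(?<=><div class=\"rc\"><div class=\"r\"><a href=\")(.*?)(?=\" onmousedown=)",
        RegexOptions.Singleline);

    // Returns every href captured by the pattern, in document order.
    public static List<string> ExtractLinks(string pageSource)
    {
        var links = new List<string>();
        foreach (Match m in LinkPattern.Matches(pageSource))
        {
            links.Add(m.Value);
        }
        return links;
    }
}
```

Keeping extraction separate from navigation would also let Scrape stay a plain synchronous function that only turns page source into a list of links.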


Background worker code:
    BackgroundWorker backgroundWorker = sender as BackgroundWorker;
    webBrowser1.Invoke(new Action(() => { gCaptcha(); }));
    webBrowser1.Invoke(new Action(() => { webBrowser1.Document.GetElementById("pnnext").InvokeMember("Click"); }));
    await PageLoad(30, 5);
    webBrowser1.Invoke(new Action(() => { Scrape(webBrowser1.DocumentText); }));

PageLoad code:

    try
    {
        TaskCompletionSource<bool> PageLoaded = null;
        PageLoaded = new TaskCompletionSource<bool>();
        int TimeElapsed = 0;
        webBrowser1.DocumentCompleted += (s, e) =>
        {
            if (webBrowser1.ReadyState != WebBrowserReadyState.Complete) return;
            if (PageLoaded.Task.IsCompleted) return;
            PageLoaded.SetResult(true);
        };
        //
        while (PageLoaded.Task.Status != TaskStatus.RanToCompletion)
        {
            await Task.Delay(delay * 1000);//interval of 10 ms worked good for me
            TimeElapsed++;
            if (TimeElapsed >= TimeOut * 100) PageLoaded.TrySetResult(true);
        }
    }
    catch (Exception ex)
    {
        CommonCodes.WriteLog(ex.ToString());
        MessageBox.Show(ex.Message);
    }
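
Two things stand out in the PageLoad body above. A new `DocumentCompleted` handler is added on every call and never removed. Also, if the parameters are `(TimeOut, delay)`, as the call `PageLoad(30, 5)` suggests, the loop re-checks only every 5 seconds and the forced timeout would need 3000 iterations. Below is a sketch of one possible alternative. It assumes a hypothetical extension method `NavigateAndWaitAsync` (not in the original code) and must be called on the UI thread.

```csharp
using System;
using System.Threading.Tasks;
using System.Windows.Forms;

// Hypothetical helper, not part of the original code.
public static class WebBrowserExtensions
{
    // Navigates to url and waits until the document has loaded or the timeout passes.
    // Returns true if the page finished loading, false if the timeout came first.
    public static async Task<bool> NavigateAndWaitAsync(
        this WebBrowser browser, string url, TimeSpan timeout)
    {
        var pageLoaded = new TaskCompletionSource<bool>();

        WebBrowserDocumentCompletedEventHandler handler = (s, e) =>
        {
            // DocumentCompleted can fire for frames too; only accept the fully loaded state.
            if (browser.ReadyState == WebBrowserReadyState.Complete)
            {
                pageLoaded.TrySetResult(true);
            }
        };

        // Subscribe before navigating so the event cannot be missed.
        browser.DocumentCompleted += handler;
        try
        {
            browser.Navigate(url);
            Task finished = await Task.WhenAny(pageLoaded.Task, Task.Delay(timeout));
            return finished == pageLoaded.Task;
        }
        finally
        {
            // Remove the handler so repeated calls do not accumulate subscriptions.
            browser.DocumentCompleted -= handler;
        }
    }
}
```

A call might look like `bool loaded = await webBrowser1.NavigateAndWaitAsync("https://www.google.com/search?q=" + url, TimeSpan.FromSeconds(30));`. The 30-second value is only an illustration.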

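The main problem is that when the ListBox has 5 rows, only the first row goes through every result page and scrapes the URLs; the other rows do not work. I do not understand what is wrong with the code. Some of the code, such as

    MessageBox.Show("sdsaD3");

is executed multiple times (with 5 rows in the ListBox, this message box pops up 5 times). Thanks for your help.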

Edit: I found the problem. It seems to be in `await PageLoad(30, 5);`, but I am not sure how to call an async method. Does anyone know?
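
On the question of how to call an async method: a method declared `async void`, such as Scrape, cannot be awaited by its caller, whereas one that returns `Task` can. Below is a minimal sketch of processing the rows strictly one after another. It assumes a hypothetical helper `SequentialScrape.RunAsync` and a caller-supplied `scrapeOneAsync` delegate, neither of which is in the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original code.
public static class SequentialScrape
{
    // Runs scrapeOneAsync for each query in list order.
    // The next query starts only after the previous task has completed.
    public static async Task RunAsync(
        IEnumerable<string> queries, Func<string, Task> scrapeOneAsync)
    {
        foreach (string query in queries)
        {
            await scrapeOneAsync(query);
        }
    }
}
```

From the button handler this might be called as `await SequentialScrape.RunAsync(urlList.Items.Cast<string>(), ScrapeAllPagesAsync);`. Here `ScrapeAllPagesAsync` is a hypothetical `Task`-returning method that loads and scrapes every result page for one query, and `Cast<string>()` comes from `System.Linq`. The handler itself can remain `async void`, since event handlers are the usual exception to the rule.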

0 answers:

No answers