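I'm trying to scrape results from Google search, and the tool needs to walk through the result pages one by one. The problem is that it doesn't process every entry in the list box; it only works for the first row.
Startbtn code: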
foreach (string url in urlList.Items)
{
    webBrowser1.Navigate("https://www.google.com/search?q=" + url);
    await PageLoad(30, 5);
    MessageBox.Show("sdsaD3");
    string pageSource = webBrowser1.DocumentText;
    Scrape(pageSource);
}
- Scrape method:
private async void Scrape(string pageSource)
{
    string regexExpression = "(?<=><div class=\"rc\"><div class=\"r\"><a href=\")(.*?)(?=\" onmousedown=)";
    Regex match = new Regex(regexExpression, RegexOptions.Singleline);
    MatchCollection collection = Regex.Matches(pageSource, regexExpression);
    for (int i = 0; i < collection.Count; i++)
    {
        CommonCodes.WriteToTxt(collection[i].ToString(), "googlescrapedurls.txt");
        // After the last match on this page, look for the "next page" link.
        if (i == collection.Count - 1)
        {
            var elementid = webBrowser1.Document.GetElementById("pnnext");
            if (elementid != null)
            {
                // There is a next page: click it from a background worker.
                for (int w = 0; w < 1; w++)
                {
                    BackgroundWorker worker = new BackgroundWorker();
                    worker.DoWork += new DoWorkEventHandler(backgroundWorker1_DoWork);
                    worker.RunWorkerAsync(w);
                }
            }
            else if(webBrowser1.Document.GetElementById("pnnext") == null)
            {
                // No next page: move on to the other list box entries.
                for(int pg=0; pg< urlList.Items.Count; pg++)
                {
                    webBrowser1.Navigate("https://www.google.com/search?q=" + urlList.Items[pg+1]);
                    CommonCodes.WaitXSeconds(10);
                    //await PageLoad(30, 5);
                    Scrape(webBrowser1.DocumentText);
                }
            }
        }
    }
}
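For comparison, the paging step could also be written as a plain loop that awaits each page load, instead of starting a BackgroundWorker and calling Scrape again from inside itself. This is only a sketch under assumptions: ScrapeAllPagesAsync, waitForLoad and handlePage are hypothetical names that do not appear in the code above. waitForLoad stands for whatever waits for navigation to finish (for example PageLoad), and handlePage stands for the regex extraction. It is meant to be called from the UI thread that owns the WebBrowser.

```csharp
using System;
using System.Threading.Tasks;
using System.Windows.Forms;

public static class PagingSketch
{
    // Hypothetical helper: handles the current result page, then keeps clicking
    // the "pnnext" link until it is no longer present.
    public static async Task ScrapeAllPagesAsync(
        WebBrowser browser,
        Func<Task> waitForLoad,      // assumed: completes when navigation is done
        Action<string> handlePage)   // assumed: extracts and stores the links
    {
        while (true)
        {
            handlePage(browser.DocumentText);

            HtmlElement next = browser.Document?.GetElementById("pnnext");
            if (next == null)
            {
                return; // last page for this query
            }

            next.InvokeMember("Click");
            await waitForLoad();
        }
    }
}
```

Because the method returns a Task, a caller can await it and only then navigate to the next list box entry.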
- Background worker code:
BackgroundWorker backgroundWorker = sender as BackgroundWorker;
webBrowser1.Invoke(new Action(() => { gCaptcha(); }));
webBrowser1.Invoke(new Action(() => { webBrowser1.Document.GetElementById("pnnext").InvokeMember("Click"); }));
await PageLoad(30, 5);
webBrowser1.Invoke(new Action(() => { Scrape(webBrowser1.DocumentText); }));
PageLoad code:
try
{
    TaskCompletionSource<bool> PageLoaded = null;
    PageLoaded = new TaskCompletionSource<bool>();
    int TimeElapsed = 0;
    webBrowser1.DocumentCompleted += (s, e) =>
    {
        if (webBrowser1.ReadyState != WebBrowserReadyState.Complete) return;
        if (PageLoaded.Task.IsCompleted) return;
        PageLoaded.SetResult(true);
    };
    //
    while (PageLoaded.Task.Status != TaskStatus.RanToCompletion)
    {
        await Task.Delay(delay * 1000);//interval of 10 ms worked good for me
        TimeElapsed++;
        if (TimeElapsed >= TimeOut * 100) PageLoaded.TrySetResult(true);
    }
}
catch (Exception ex)
{
    CommonCodes.WriteLog(ex.ToString());
    MessageBox.Show(ex.Message);
}
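Two things stand out in the PageLoad body above: every call subscribes another lambda to DocumentCompleted and never removes it, and the loop only checks for completion once every delay * 1000 milliseconds. A possible alternative, given here as a sketch and not as a drop-in replacement, awaits the TaskCompletionSource directly with a timeout and unsubscribes afterwards. WaitForLoadAsync and its timeout parameter are hypothetical names introduced for illustration.

```csharp
using System;
using System.Threading.Tasks;
using System.Windows.Forms;

public static class WebBrowserLoadSketch
{
    // Hypothetical helper: returns true when DocumentCompleted fires with
    // ReadyState == Complete, or false if the timeout elapses first.
    public static async Task<bool> WaitForLoadAsync(this WebBrowser browser, TimeSpan timeout)
    {
        var loaded = new TaskCompletionSource<bool>();

        WebBrowserDocumentCompletedEventHandler handler = (s, e) =>
        {
            if (browser.ReadyState == WebBrowserReadyState.Complete)
            {
                loaded.TrySetResult(true);
            }
        };

        browser.DocumentCompleted += handler;
        try
        {
            // Whichever finishes first wins: the page load or the timeout.
            Task finished = await Task.WhenAny(loaded.Task, Task.Delay(timeout));
            return finished == loaded.Task;
        }
        finally
        {
            // Unsubscribe so handlers do not accumulate across calls.
            browser.DocumentCompleted -= handler;
        }
    }
}
```

Assuming it is called on the UI thread right after Navigate, usage would look like `bool ok = await webBrowser1.WaitForLoadAsync(TimeSpan.FromSeconds(30));`.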
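- The main problem is that when the list box has 5 rows, only the first row goes through every result page and scrapes the URLs; the other rows don't work. I don't understand what is wrong with the code. Some of the code, for example
MessageBox.Show("sdsaD3");
is executed multiple times (with 5 rows in the list box, this message box pops up 5 times). Thanks for your help.
EDIT: I found the problem. It seems to be related to await PageLoad(30, 5); but I'm not sure how to call an async method correctly. Does anyone know?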
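On the question of how to call an async method: a method declared async void cannot be awaited, so a caller has no way to wait for it to finish, whereas one that returns Task can be awaited in a loop. The console sketch below illustrates only that general point. ScrapeAsync and PageLoadAsync are hypothetical stand-ins (PageLoadAsync just simulates a delay), not the methods from the post.

```csharp
using System;
using System.Threading.Tasks;

public static class Program
{
    // Stand-in for a real page-load wait; here it only simulates a delay.
    private static Task PageLoadAsync() => Task.Delay(100);

    // Returns Task instead of void, so callers can await its completion.
    private static async Task ScrapeAsync(string query)
    {
        Console.WriteLine("start " + query);
        await PageLoadAsync();
        Console.WriteLine("done  " + query);
    }

    public static async Task Main()
    {
        foreach (string query in new[] { "first", "second", "third" })
        {
            // The loop does not advance until this query is fully processed.
            await ScrapeAsync(query);
        }
    }
}
```

Each "done" line is printed before the next "start" line. In a WinForms application, the awaiting loop would typically live in an async void event handler such as the button's Click handler, since event handlers are the usual exception to the "return Task" guideline.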