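I am trying to create a web scraper that collects all the download links for CSS / JS / images from an HTML file.
Problem
The first breakpoint is hit, but after I press "Continue" the second breakpoint is never hit.
The code I am talking about: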
private static async void GetHtml(string url, string downloadDir)
{
    //Get html data, create and load htmldocument
    HttpClient httpClient = new HttpClient();
    //This code gets executed
    var html = await httpClient.GetStringAsync(url);
    //This code is never reached
    Console.ReadLine();
    var htmlDocument = new HtmlDocument();
    htmlDocument.LoadHtml(html);
    //Get all css download urls
    var linkUrl = htmlDocument.DocumentNode.Descendants("link")
        .Where(node => node.GetAttributeValue("type", "")
            .Equals("text/css"))
        .Select(node => node.GetAttributeValue("href", ""))
        .ToList();
    //Downloading css, js, images and source code
    using (var client = new WebClient())
    {
        for (var i = 0; i < scriptUrl.Count; i++)
        {
            Uri uri = new Uri(scriptUrl[i]);
            client.DownloadFile(uri,
                downloadDir + @"\js\" + uri.Segments.Last());
        }
    }
}
Edit
I call the GetHtml method from here:
private static void Start()
{
    //Create a list that will hold the names of all the subpages
    List<string> subpagesList = new List<string>();
    //Ask user for url and assign that to var url, also add the url to the url list
    Console.WriteLine("Geef url van de website:");
    string url = "https://www.hethwc.nl";
    //Ask user for download directory and assign that to var downloadDir
    Console.WriteLine("Geef locatie voor download:");
    var downloadDir = @"C:\Users\Daniel\Google Drive\Almere\C# II\Download tests\hethwc\";
    //Download and save the index file
    var htmlSource = new System.Net.WebClient().DownloadString(url);
    System.IO.File.WriteAllText(@"C:\Users\Daniel\Google Drive\Almere\C# II\Download tests\hethwc\index.html", htmlSource);
    // Creating directories
    string jsDirectory = System.IO.Path.Combine(downloadDir, "js");
    string cssDirectory = System.IO.Path.Combine(downloadDir, "css");
    string imagesDirectory = System.IO.Path.Combine(downloadDir, "images");
    System.IO.Directory.CreateDirectory(jsDirectory);
    System.IO.Directory.CreateDirectory(cssDirectory);
    System.IO.Directory.CreateDirectory(imagesDirectory);
    GetHtml("https://www.hethwc.nu", downloadDir);
}
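Answer 0 (score: 4)
How are you calling GetHtml? Presumably this is from a synchronous Main method, and you do not have any other non-worker thread in play, so once your main thread exits the process terminates. Something like: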
static void Main() {
    GetHtml();
}
The above terminates the process as soon as GetHtml returns and the Main method ends, and GetHtml returns at its first incomplete await point.
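To make that behaviour visible without any network access, here is a minimal sketch. The class name AwaitDemo, the method FireAndForget and the flag ReachedCodeAfterAwait are hypothetical names invented for this illustration, and Task.Delay stands in for the HTTP request from the question.

```csharp
using System;
using System.Threading.Tasks;

static class AwaitDemo
{
    // Hypothetical flag, set only once the code after the await has run.
    public static bool ReachedCodeAfterAwait;

    // async void, like GetHtml in the question: the caller gets nothing to wait on.
    public static async void FireAndForget()
    {
        Console.WriteLine("before await");
        // Incomplete await: control returns to the caller at this point.
        await Task.Delay(500);
        ReachedCodeAfterAwait = true;
        Console.WriteLine("after await");
    }

    static void Main()
    {
        FireAndForget();
        // The call has already returned, although the method has not finished.
        Console.WriteLine($"back in Main, finished = {ReachedCodeAfterAwait}");
        // Main ends here. The continuation would run on a background thread,
        // so the process normally exits before "after await" is printed.
    }
}
```

The same thing happens to the second breakpoint in the question: it sits after the await, in code the process never lives long enough to run.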
In current C# versions (C# 7.1 onwards) you can create an async Task Main() method, which lets you properly await your GetHtml method, as long as you change GetHtml to return Task:
async static Task Main() {
    await GetHtml();
}
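Putting both changes together might look roughly like the sketch below. It keeps the parameters of GetHtml from the question but is not the asker's full method: the parsing and download steps are left out, the class name Scraper and the shared Client field are assumptions made for the example, and the folder C:\Temp\hethwc\ is a placeholder path.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

static class Scraper
{
    // Assumption: one shared HttpClient (the question creates one per call).
    private static readonly HttpClient Client = new HttpClient();

    // Same parameters as in the question, but returning Task instead of void,
    // so the caller can await completion and observe any exception.
    public static async Task GetHtml(string url, string downloadDir)
    {
        var html = await Client.GetStringAsync(url);
        // The HtmlDocument parsing and file downloads from the question would follow here.
        Console.WriteLine($"Fetched {html.Length} characters; target folder: {downloadDir}");
    }

    // async Main requires C# 7.1 or later.
    static async Task Main()
    {
        // Placeholder download folder, not the path from the question.
        await GetHtml("https://www.hethwc.nl", @"C:\Temp\hethwc\");
    }
}
```

In the question the call sits inside Start(), so Start would need to return Task and be awaited as well; otherwise the wait is lost again one level up.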