I am writing a program to collect links to my university's faculty bio pages. I am using HtmlAgilityPack. Here is my code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;
using System.IO;

namespace Get_Professor_Data
{
    class Program
    {
        static void Main(string[] args)
        {
            FileStream fs = new FileStream("Links.txt", FileMode.OpenOrCreate, FileAccess.ReadWrite);
            string url, previousurl = "";
            char c = '@';
            StreamWriter writer = new StreamWriter(fs);
            HtmlWeb web = new HtmlWeb();
            for (int i = 0; i < 26; i++)
            {
                HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + c);
                foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
                {
                    c++;
                    url = link.Attributes["href"].Value.ToString();
                    //if (url == previousurl)
                    //    continue;
                    try
                    {
                        if (url.Substring(0, 25).Equals(@"/facultybios/profile.php?", StringComparison.Ordinal))
                        {
                            writer.WriteLine(@"https://www2.aus.edu" + url);
                            writer.Flush();
                        }
                    }
                    catch (Exception ex)
                    {
                    }
                    previousurl = url;
                }
            }
            writer.Close();
        }
    }
}
This is my output:
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=jabdalla
https://www2.aus.edu/facultybios/profile.php?faculty=jsater
https://www2.aus.edu/facultybios/profile.php?faculty=jgriffin
https://www2.aus.edu/facultybios/profile.php?faculty=jfedtke
https://www2.aus.edu/facultybios/profile.php?faculty=jyounas
https://www2.aus.edu/facultybios/profile.php?faculty=jsqualli
https://www2.aus.edu/facultybios/profile.php?faculty=jboisvert
https://www2.aus.edu/facultybios/profile.php?faculty=jvinke
https://www2.aus.edu/facultybios/profile.php?faculty=jbaker
https://www2.aus.edu/facultybios/profile.php?faculty=jhassan
https://www2.aus.edu/facultybios/profile.php?faculty=jpalmer
https://www2.aus.edu/facultybios/profile.php?faculty=jkolo
https://www2.aus.edu/facultybios/profile.php?faculty=jmarch
https://www2.aus.edu/facultybios/profile.php?faculty=jinhyuk
https://www2.aus.edu/facultybios/profile.php?faculty=giesen
https://www2.aus.edu/facultybios/profile.php?faculty=jvangorp
https://www2.aus.edu/facultybios/profile.php?faculty=jswanstrom
https://www2.aus.edu/facultybios/profile.php?faculty=jking
https://www2.aus.edu/facultybios/profile.php?faculty=jmontague
https://www2.aus.edu/facultybios/profile.php?faculty=jallee
https://www2.aus.edu/facultybios/profile.php?faculty=jkatsos
https://www2.aus.edu/facultybios/profile.php?faculty=jbley
https://www2.aus.edu/facultybios/profile.php?faculty=jwallis
https://www2.aus.edu/facultybios/profile.php?faculty=jgibbs
https://www2.aus.edu/facultybios/profile.php?faculty=jroldan
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https
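For some strange reason, only the links from the J page are printed. Some of the links are empty. The last line contains only https (which is why I think the problem is with the writer rather than with the logic of my code). I have been trying to fix this for a while with no luck.
These are the pages I am scraping: https://www2.aus.edu/facultybios/
Any help would be greatly appreciated.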
Answer 0 (score: 0)
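I agree 100% with Jon's observations: you should not need to catch an exception at all (instead, just check the length before calling Substring()!), and in any case you should catch only the exceptions you expect to get. You should also use using statements to dispose of the FileStream and StreamWriter objects (technically, the latter disposes of the former for you, but IMHO it is good to be explicit).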
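As a minimal sketch of the "no exception needed" point: instead of checking the length and then calling Substring(), the prefix test can be done with String.StartsWith, which simply returns false for strings that are too short. The class name PrefixCheck and the helper IsProfileLink are hypothetical names used only for illustration; the prefix string is the one from the question's code.

```csharp
using System;

static class PrefixCheck
{
    // Prefix copied from the question's code.
    const string ProfilePrefix = "/facultybios/profile.php?";

    // Hypothetical helper: true when href starts with the profile prefix.
    // StartsWith never throws for short strings, so no try/catch is required.
    public static bool IsProfileLink(string href)
    {
        return href != null && href.StartsWith(ProfilePrefix, StringComparison.Ordinal);
    }

    static void Main()
    {
        Console.WriteLine(IsProfileLink("/facultybios/profile.php?faculty=jking")); // True
        Console.WriteLine(IsProfileLink("#"));                                      // False
    }
}
```

Note that this differs slightly from a Length > 25 check: a href that is exactly the bare prefix would also be accepted here.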
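As for the actual problem, it seems to me there is one definite bug and one possible bug:
The definite bug is where you increment c (the variable you use to select which page to scrape): it happens inside the foreach loop, so the value is incremented once for every URL processed rather than once per page. Presumably you actually want to increment that variable before the loop, not inside it. That is, instead of this: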
HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + c);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    c++;
you probably want to write this:
HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + c);
c++;
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
Or possibly even this:
HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + (c++));
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
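The possible bug is that c is initialized to the character @. I do not see anything on that page to suggest that this is a valid value; the page shows links only when the sort parameter is set to a letter from A to Z (case-insensitive). With all that in mind, IMHO a better way to write this code would be something like this: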
// Body of Main; assumes the same using directives as the question's code.
// Note: FileMode.OpenOrCreate does not truncate an existing Links.txt; FileMode.Create would.
using (FileStream fs = new FileStream("Links.txt", FileMode.OpenOrCreate, FileAccess.ReadWrite))
using (StreamWriter writer = new StreamWriter(fs))
{
    string url;
    HtmlWeb web = new HtmlWeb();
    for (int i = 0; i < 26; i++)
    {
        // One index page per letter, 'A' through 'Z'.
        char c = (char)('A' + i);
        HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + c);
        foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
        {
            url = link.Attributes["href"].Value.ToString();
            // The length check replaces the try/catch around Substring().
            if (url.Length > 25 &&
                url.Substring(0, 25).Equals(@"/facultybios/profile.php?", StringComparison.Ordinal))
            {
                writer.WriteLine(@"https://www2.aus.edu" + url);
                writer.Flush();
            }
        }
    }
}
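To see the per-page letter computation in isolation, here is a small sketch that only builds the 26 index URLs and performs no network access. The class name SortPages and the method BuildIndexUrls are hypothetical; the base URL is the one used in the code above.

```csharp
using System;
using System.Collections.Generic;

static class SortPages
{
    // Base URL taken from the question; the sort letter is appended to it.
    const string IndexUrl = "https://www2.aus.edu/facultybios/index.php?sort=";

    // Hypothetical helper: one index URL per letter, 'A' through 'Z'.
    public static List<string> BuildIndexUrls()
    {
        var urls = new List<string>();
        for (int i = 0; i < 26; i++)
        {
            char c = (char)('A' + i); // derived from i, so it advances once per page
            urls.Add(IndexUrl + c);
        }
        return urls;
    }

    static void Main()
    {
        List<string> urls = BuildIndexUrls();
        Console.WriteLine(urls.Count); // 26
        Console.WriteLine(urls[0]);    // https://www2.aus.edu/facultybios/index.php?sort=A
        Console.WriteLine(urls[25]);   // https://www2.aus.edu/facultybios/index.php?sort=Z
    }
}
```

Deriving the letter from the loop index, rather than keeping a separate counter, means nothing inside the inner loop can change which page is requested next.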