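StreamWriter.WriteLine() isn't writing everything. Strange output

Date: 2014-11-14 15:54:21

Tags: c# web-scraping html-agility-pack

I am writing a program to collect the links to my university's faculty bios pages. I am using HtmlAgilityPack. Here is my code: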

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;
using System.IO;

namespace Get_Professor_Data
{
    class Program
    {
        static void Main(string[] args)
        {
            FileStream fs = new FileStream("Links.txt", FileMode.OpenOrCreate, FileAccess.ReadWrite);
            string url, previousurl = "";
            char c = '@';
            StreamWriter writer = new StreamWriter(fs);
            HtmlWeb web = new HtmlWeb();
            for (int i = 0; i < 26; i++)
            {
                HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + c);
                foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
                {
                    c++;
                    url = link.Attributes["href"].Value.ToString();
                    //if (url == previousurl)
                    //    continue;
                    try
                    {
                        if (url.Substring(0, 25).Equals(@"/facultybios/profile.php?", StringComparison.Ordinal))
                        {
                            writer.WriteLine(@"https://www2.aus.edu" + url);
                            writer.Flush();
                        }
                    }
                    catch (Exception ex)
                    {
                    }
                    previousurl = url;
                }
            }
            writer.Close();
        }
    }
}

Here is my output:

https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=jabdalla
https://www2.aus.edu/facultybios/profile.php?faculty=jsater
https://www2.aus.edu/facultybios/profile.php?faculty=jgriffin
https://www2.aus.edu/facultybios/profile.php?faculty=jfedtke
https://www2.aus.edu/facultybios/profile.php?faculty=jyounas
https://www2.aus.edu/facultybios/profile.php?faculty=jsqualli
https://www2.aus.edu/facultybios/profile.php?faculty=jboisvert
https://www2.aus.edu/facultybios/profile.php?faculty=jvinke
https://www2.aus.edu/facultybios/profile.php?faculty=jbaker
https://www2.aus.edu/facultybios/profile.php?faculty=jhassan
https://www2.aus.edu/facultybios/profile.php?faculty=jpalmer
https://www2.aus.edu/facultybios/profile.php?faculty=jkolo
https://www2.aus.edu/facultybios/profile.php?faculty=jmarch
https://www2.aus.edu/facultybios/profile.php?faculty=jinhyuk
https://www2.aus.edu/facultybios/profile.php?faculty=giesen
https://www2.aus.edu/facultybios/profile.php?faculty=jvangorp
https://www2.aus.edu/facultybios/profile.php?faculty=jswanstrom
https://www2.aus.edu/facultybios/profile.php?faculty=jking
https://www2.aus.edu/facultybios/profile.php?faculty=jmontague
https://www2.aus.edu/facultybios/profile.php?faculty=jallee
https://www2.aus.edu/facultybios/profile.php?faculty=jkatsos
https://www2.aus.edu/facultybios/profile.php?faculty=jbley
https://www2.aus.edu/facultybios/profile.php?faculty=jwallis
https://www2.aus.edu/facultybios/profile.php?faculty=jgibbs
https://www2.aus.edu/facultybios/profile.php?faculty=jroldan
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https

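For some strange reason, only the links from the J page are printed. Some of the links are empty. The last line contains only https (which is why I think the problem lies with the writer rather than with the logic of my code). I have been trying to fix this for a while with no luck.

These are the pages I am scraping: https://www2.aus.edu/facultybios/

Any help would be greatly appreciated.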

1 answer:

Answer 0 (score: 0)

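I agree 100% with Jon's observations: you shouldn't need to catch exceptions at all (instead, just check the length before calling Substring()), but if you do, you should catch only the exceptions you expect to get. You should also use using statements to dispose of the FileStream object and the StreamWriter object (technically, the latter disposes the former for you, but IMHO it's good to be explicit).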
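As an illustration of the first point, here is a minimal sketch of a prefix check that needs no try/catch. The class and method names (LinkFilter, IsProfileLink) are made up for this example; the prefix string is taken from the question's code, and string.StartsWith with StringComparison.Ordinal is standard .NET.

```csharp
using System;

static class LinkFilter
{
    // Prefix copied from the question's code.
    const string ProfilePrefix = "/facultybios/profile.php?";

    // Hypothetical helper: returns true when href looks like a faculty profile link.
    // StartsWith returns false for strings shorter than the prefix,
    // so neither exception handling nor an explicit length check is needed.
    public static bool IsProfileLink(string href)
    {
        return href != null && href.StartsWith(ProfilePrefix, StringComparison.Ordinal);
    }
}
```

Note one small difference from a Length > 25 check: this helper also accepts an href that is exactly the prefix with nothing after it.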

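As for the actual problem, it looks to me like there is one obvious bug and one possible bug:

  • The obvious bug is that you are incrementing c (the variable you use to select which page to scrape) in the wrong scope. That is, you increment the value once for every URL processed. Presumably you actually want to increment that variable before the foreach loop, not inside it.

I.e. instead of this: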

HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + c);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    c++;
You probably want to write this:

HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + c);
c++;
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{

Or maybe even this:

HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + (c++));
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
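  • The possible bug is that you initialize c to the character @. I don't see anything on that page to suggest that this is a valid character to use; it displays links only when the sort parameter is set to a letter from A to Z (case-insensitive).

With all that in mind, IMHO a better way to write this code would be something like this: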

// Both objects are disposed (and the writer flushed) when the using blocks exit.
using (FileStream fs = new FileStream("Links.txt", FileMode.OpenOrCreate, FileAccess.ReadWrite))
using (StreamWriter writer = new StreamWriter(fs))
{
    string url;
    HtmlWeb web = new HtmlWeb();
    for (int i = 0; i < 26; i++)
    {
        // One index page per letter, 'A' through 'Z'.
        char c = (char)('A' + i);
        HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + c);
        foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
        {
            url = link.Attributes["href"].Value.ToString();
            // Check the length first so Substring() cannot throw.
            if (url.Length > 25 &&
                url.Substring(0, 25).Equals(@"/facultybios/profile.php?", StringComparison.Ordinal))
            {
                writer.WriteLine(@"https://www2.aus.edu" + url);
                writer.Flush();
            }
        }
    }
}
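One further detail, offered as a possibility rather than a diagnosis: FileMode.OpenOrCreate does not truncate an existing file. If Links.txt is left over from an earlier, longer run, any bytes beyond what the new run writes stay in the file, which could account for a stray partial line such as the trailing https. The sketch below shows the difference; the class and method names (TruncationDemo, WriteAndRead) are invented for this example.

```csharp
using System;
using System.IO;

static class TruncationDemo
{
    // Writes text to path using the given FileMode, then returns the whole file's contents.
    public static string WriteAndRead(string path, FileMode mode, string text)
    {
        using (FileStream fs = new FileStream(path, mode, FileAccess.Write))
        using (StreamWriter writer = new StreamWriter(fs))
        {
            writer.Write(text);
        }
        // The using blocks above have flushed and closed the file by this point.
        return File.ReadAllText(path);
    }
}
```

If the file already holds a long line and you call WriteAndRead with FileMode.OpenOrCreate and a shorter string, the new text overwrites the beginning and the old tail remains. With FileMode.Create the file is truncated first, so only the new text is left.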