我想抓一个维基页面。具体而言,this one.
我的应用程序将允许用户输入车辆的注册号(例如,SBS8988Z),它将显示相关信息(位于页面本身)。
例如,如果用户在我的应用程序的文本字段中输入SBS8988Z,它应该在该Wiki页面上查找该行
SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)
并返回SBS8988Z(SLBP 192/194 *) - F& N NutriSoy鲜奶:新加坡第一大豆奶! (第二代)。
到目前为止我的代码是(从各种网站复制和编辑)......
WebClient getdeployment = new WebClient();
string url = "http://sgwiki.com/wiki/Scania_K230UB_(Batch_1_Euro_V)";
getdeployment.Headers["User-Agent"] = "NextBusApp/GetBusData UserAgent";
string sgwikiresult = getdeployment.DownloadString(url); // <<< EXCEPTION
MessageBox.Show(sgwikiresult); //for debugging only!
HtmlAgilityPack.HtmlDocument sgwikihtml = new HtmlAgilityPack.HtmlDocument();
sgwikihtml.Load(new StreamReader(sgwikiresult));
HtmlNode root = sgwikihtml.DocumentNode;
List<string> anchorTags = new List<string>();
foreach(HtmlNode deployment in root.SelectNodes("SBS8988Z"))
{
string att = deployment.OuterHtml;
anchorTags.Add(att);
}
但是,我得到一个ArgumentException
未处理 - 路径中的非法字符。
代码有什么问题?有更简单的方法吗?我正在使用HtmlAgilityPack,但如果有更好的解决方案,我很乐意遵守。
答案 0 :(得分:2)
代码有什么问题?坦率地说,一切。 :P
页面的格式不符合您的阅读方式。你无法希望以这种方式获得所需的内容。
页面内容(我们感兴趣的部分)看起来像这样:
<h2>
<span id="Deployments" class="mw-headline">Deployments</span>
</h2>
<p>
<!-- ... -->
<b>SBS8987B</b>
(SLBP 192/194*)
<br>
<b>SBS8988Z</b>
(SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)
<br>
<b>SBS8989X</b>
(SLBP SP)
<br>
<!-- ... -->
</p>
基本上我们需要找到包含我们正在寻找的注册号的b
元素。找到该元素后,获取文本并将其组合在一起以形成结果。这是代码:
static string GetVehicleInfo(string reg)
{
var url = "http://sgwiki.com/wiki/Scania_K230UB_%28Batch_1_Euro_V%29";
// HtmlWeb is a helper class to get pages from the web
var web = new HtmlAgilityPack.HtmlWeb();
// Create an HtmlDocument from the contents found at given url
var doc = web.Load(url);
// Create an XPath to find the `b` elements which contain the registration numbers
var xpath = "//h2[span/@id='Deployments']" // find the `h2` element that has a span with the id, 'Deployments' (the header)
+ "/following-sibling::p[1]" // move to the first `p` element (where the actual content is in) after the header
+ "/b"; // select the `b` elements
// Get the elements from the specified XPath
var deployments = doc.DocumentNode.SelectNodes(xpath);
// Create a LINQ query to find the requested registration number and generate a result
var query =
from b in deployments // from the list of registration numbers
where b.InnerText == reg // find the registration we're looking for
select reg + b.NextSibling.InnerText; // and create the result combining the registration number with the description (the text following the `b` element)
// The query should yield exactly one result (or we have a problem) or none (null)
var content = query.SingleOrDefault();
// Decode the content (to convert stuff like "&" to "&")
var decoded = System.Net.WebUtility.HtmlDecode(content);
return decoded;
}