Automatically downloading web content with C#

Asked: 2015-11-30 17:20:36

Tags: c# html visual-studio

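I want to develop a program that automatically looks up words in the Longman online dictionary and copies their definitions and meanings. I am using Visual Studio and C#, and I have already written the part that browses to the site and searches for a word. The problem is navigating the Longman site when a word has several forms. For example, for this link, the HTML of the suggested words is as follows: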

    <div class="content1">
  <style>
    .dictionary-results-title .topic_bullet {
      margin: 0px;
    }
  </style>
    <div class="border-search">
        <div class="dictionary-results-title">
            Results from the Longman Dictionary of Contemporary English:
        </div>

        <div class="dictionary-results-title">
            <span class="dictionary-results-title-topic-new">
                Click on topic labels to navigate through our Topic Dictionary
            </span>
        </div>
          <!-- google_ad_section_start -->
        <div id="42385" class="folded">
            <table id="hwdfolded" class="hwdfolded" cellspacing="0" cellpadding="0"> 
                <tr>  
                    <td class="hwdunSelHG"></td> 
                    <td class="hwdunSelHM"></td> 
                    <td class="hwdunSelHD"></td>
                </tr> 
                <tr>  
                    <td class="hwdunSelMG"></td> 
                    <td class="hwdunSelMM">
                        <a href="/dictionary/superman">
                        <span class="headword">superman</span></a> 
                        <span class="homographs"></span> 
                        <span class="wordclass">noun</span>
                        <span class="topiclinks"></span>
                    </td> 
                    <td class="hwdunSelMD"></td>
                </tr> 
                <tr>    
                    <td class="hwdunSelBG"></td> 
                    <td class="hwdunSelBM"></td> 
                    <td class="hwdunSelBD"></td>
                </tr>
            </table>
        </div> 
        <div id="42386" class="folded">
            <table id="hwdfolded" class="hwdfolded" cellspacing="0" cellpadding="0"> 
                <tr>  
                    <td class="hwdunSelHG"></td> 
                    <td class="hwdunSelHM"></td> 
                    <td class="hwdunSelHD"></td>
                </tr> 
                <tr>  
                    <td class="hwdunSelMG"></td> 
                    <td class="hwdunSelMM">
                        <a href="/dictionary/Superman">
                        <span class="headword">Superman</span></a> 
                        <span class="homographs"></span> 
                        <span class="wordclass"></span>
                        <span class="topiclinks"></span>
                    </td> 
                    <td class="hwdunSelMD"></td>
                </tr> 
                <tr>  
                    <td class="hwdunSelBG"></td> 
                    <td class="hwdunSelBM"></td> 
                    <td class="hwdunSelBD"></td>
                </tr>
            </table>
        </div>
        <script language="JavaScript" type="text/javascript"> 
            parent.curEntryId=42385; parent.prevEntryId=42385; parent.nextEntryId=42385; 
            parent.gsSenseId=null; parent.giPhrId=null; 
        </script>
    </div>
</div>

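I found a way to locate the word IDs, such as id="42385" and id="42386", but I cannot navigate to the entries. Each element with one of these IDs contains a table. As you can see in the HTML, the second cell of the second row of that table holds the link for each word. The code I wrote to click on them is: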

HtmlElement Word = webBrowser1.Document.GetElementById("hwdfolded");
foreach (HtmlElement ele in Word.Parent.Parent.Children)
{                
    if (ele.Id != null && ele.InnerText.ToLower().Contains(Stword))
    {
        HtmlElement clickon = webBrowser1.Document.GetElementById(ele.Id);
        clickon.InvokeMember("click");
        //ele.InvokeMember("click");
        while (webBrowser1.ReadyState != WebBrowserReadyState.Interactive)
            Application.DoEvents();
        do
        {
            Application.DoEvents();
        } while (webBrowser1.ReadyState != WebBrowserReadyState.Complete);
        break;
    }
}

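Note that Stword holds the word I am searching for ("superman" in this example) and that ele.Id holds a valid ID; I checked this in debug mode. However, the click call does not work. I would appreciate it if you could tell me how to fix this or suggest a better approach.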

1 Answer:

Answer 0 (score: 1)

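I suggest you use a scraping tool to perform the page navigation. With Selenium, it is very easy to get elements by XPath, navigate through them, and read the text inside them. Hope it helps.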
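
To illustrate the XPath idea without depending on a browser driver, here is a minimal sketch that uses only System.Xml from the .NET base library. It assumes the suggestion markup is well-formed XML, which appears to hold for the snippet in the question but usually does not for real-world HTML. The class name HeadwordLinks and the method Extract are made up for this example and are not part of Selenium or any other library. The same XPath expression could be passed to Selenium's By.XPath.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for illustration only; not part of any library.
public static class HeadwordLinks
{
    // Returns (headword, href) pairs, one per suggestion block.
    // Assumption: the input is well-formed XML with a single root element.
    public static List<KeyValuePair<string, string>> Extract(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        var results = new List<KeyValuePair<string, string>>();

        // Select every <a> that directly contains <span class="headword">
        // and sits anywhere inside a <div class="folded"> block.
        XmlNodeList anchors = doc.SelectNodes(
            "//div[@class='folded']//a[span[@class='headword']]");

        foreach (XmlNode anchor in anchors)
        {
            XmlAttribute hrefAttr = anchor.Attributes["href"];
            if (hrefAttr == null)
            {
                continue; // skip anchors that carry no link target
            }

            string word = anchor
                .SelectSingleNode("span[@class='headword']")
                .InnerText
                .Trim();

            results.Add(new KeyValuePair<string, string>(word, hrefAttr.Value));
        }

        return results;
    }
}
```

Calling HeadwordLinks.Extract on the markup from the question should yield the pairs (superman, /dictionary/superman) and (Superman, /dictionary/Superman). The hrefs are site-relative, so they would still need to be combined with the site's base address before being handed to something like webBrowser1.Navigate. Going straight to the link target this way sidesteps the click on the outer div, which may simply have no click behavior of its own.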