Question

I am trying to do some web scraping with R and rvest, but I am having trouble removing child nodes from an xml_nodeset. The HTML I am trying to scrape looks like this:

<div id="post_message_1234">
 <div style="margin:20px; margin-top:5px; ">
  <div class="smallfont" style="margin-bottom:2px">Quote:</div>
  <table cellpadding="6" cellspacing="0" border="0" width="100%">
    <tr>
        <td class="alt2" style="border:1px inset">
            <div>
                Originally Posted by <strong>John Doe</strong>
            </div>
                This is the post inside the quote
        </td>
    </tr>
  </table>
 </div>
  This is the post outside the quote
</div>

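What I need from this HTML is "This is the post outside the quote", which is the original post. What I do not want is the quoted post inside the "alt2" class, "This is the post inside the quote".
In addition, there are multiple post_messages on each page, and each post_message can contain multiple quotes.
The code I am using now gets all the text in each post, but that also includes the text inside the quotes (which I do not want).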

link %>%
   read_html() %>%
   html_nodes(xpath = '//*[contains(@id, "post_message_")]') %>%
   html_text()

How can I get only the text outside the quotes ('This is the post outside the quote'), preferably using XPath?

Answer 1

How about removing the child div?
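One way to read that suggestion literally is to delete the quote wrappers from the parsed document before extracting the text. The following is only a minimal sketch: it assumes every quote is a direct `div` child of the post element (as in the HTML from the question), it uses a shortened copy of that HTML as a string instead of the real `link`, and the variable names are made up for illustration.

```r
library(rvest)
library(xml2)

# Shortened copy of the HTML from the question, used here in place of `link`.
html <- '
<div id="post_message_1234">
 <div style="margin:20px; margin-top:5px; ">
  <div class="smallfont" style="margin-bottom:2px">Quote:</div>
  <table><tr><td class="alt2">
    <div>Originally Posted by <strong>John Doe</strong></div>
    This is the post inside the quote
  </td></tr></table>
 </div>
  This is the post outside the quote
</div>'

page <- read_html(html)

# Assumption: each quote block is a direct <div> child of a post element.
quote_blocks <- xml_find_all(page, '//*[contains(@id, "post_message_")]/div')

# xml_remove() modifies the parsed document in place.
xml_remove(quote_blocks)

# With the quote blocks gone, the remaining text is the post itself.
posts <- html_nodes(page, xpath = '//*[contains(@id, "post_message_")]')
cleaned <- trimws(html_text(posts))
```

Because `xml_remove()` changes `page` itself, re-read the page if the quoted text is needed later. A post that uses its own direct `div` children for formatting would lose that content too, so the XPath may need tightening for a real forum.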

<!DOCTYPE html>
<html>
   <head>
      <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
      <script>
         $(window).resize(function() {
         if(this.resizeTO) clearTimeout(this.resizeTO);
             this.resizeTO = setTimeout(function() {
               $(this).trigger('windowResize');
             }, 200); 
         });

         $(window).on('windowResize', function() {
            console.log($(window).width()); 
             var tpReportWidth = $("#tpReport").width();

             //var squareWidth = $("#square").width();
             displayWindowSize(5, tpReportWidth,"#square");

             //var rectangularWidth = $("#rectangular").width();
             displayWindowSize(3, tpReportWidth,"#rectangular");

         });

         function displayWindowSize(value, tpReportWidth, selector) {  
           var newIconWidth = Math.round(tpReportWidth/value).toFixed(2);
           console.log('iconWidth after: ' + newIconWidth);
           //$(selector).attr('height',newIconWidth);
          $(selector).attr('width',newIconWidth);
         };         
      </script>
   </head>
   <body>
      <p id="demo"></p>
      <div id="tpReport"  style="float:right;background-color:yellow; display:inline-block; width:45%;">
         <p>Try to resize the browser window to display the windows height and width.</p>
         <span  style="bottom: 60%; left: 41%;clear:both;" >
           <a href="#"> 
            <img id="rectangular" src="rectangular.png" /> 
           </a>
         </span> 

          <span  style="bottom: 10%; left: 41%;float:right;" >
           <a href="#"> 
             <img id="square" src="square.png" /> 
           </a>
         </span> 
      </div>
   </body>
</html>

See this working example for IMDb, which I tested using this compiler:

link %>%
   read_html() %>%
   html_nodes(xpath = '//*[contains(@id, "post_message_")]/node()[not(self::div)]') %>%
   html_text()

I got just "The Lego Movie" as the output, which is what you need.
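Note that `node()[not(self::div)]` returns every non-div child as a separate node, so a page with several posts comes back as one flat vector. If one string per post is wanted, a possible variation is to select each post first and then take only its direct text children. This is a sketch under assumptions: the second post and the names `posts` and `post_text` are invented for illustration, and the HTML is a shortened copy of the one in the question.

```r
library(rvest)

# Shortened HTML from the question plus a made-up second post.
html <- '
<div id="post_message_1234">
 <div style="margin:20px; margin-top:5px; ">
  <div class="smallfont" style="margin-bottom:2px">Quote:</div>
  <table><tr><td class="alt2">
    <div>Originally Posted by <strong>John Doe</strong></div>
    This is the post inside the quote
  </td></tr></table>
 </div>
  This is the post outside the quote
</div>
<div id="post_message_5678">A second post without any quote</div>'

page <- read_html(html)
posts <- html_nodes(page, xpath = '//*[contains(@id, "post_message_")]')

# For each post, keep only its direct text-node children,
# join them into one string, and trim surrounding whitespace.
post_text <- vapply(posts, function(post) {
  pieces <- html_text(html_nodes(post, xpath = "./text()"))
  trimws(paste(pieces, collapse = " "))
}, character(1))
```

This leaves the parsed document untouched and keeps one element per post, so `post_text` lines up with `posts`. It does drop text wrapped in inline tags such as `<b>` or `<a>` directly inside a post, since those are element children rather than text nodes.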
