Question

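When scraping multiple HTML elements, I cannot specify the correct CSS path. The problem is that the pages are set up slightly differently, so the element selected by nth-child(#) is off by 1 from one page to the next. The element I'm interested in, the "Unit code", is nth-child(20) on some pages and nth-child(21) on others.

I will be running this across hundreds of websites, so I need to work out how to handle this change in position. The code below runs with nth-child(21) and, predictably, returns the wrong text for the second URL.

I am using the rvest package.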

library(rvest)
urls <- data.frame('site' = 1:2, 'urls' = c('https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010',
                        'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'))

urls$urls <- as.character(urls$urls)

uCode<- sapply(1:length(urls[,1]), function(x)
               html(urls[x,2]) %>% 
               html_nodes(css='#wmt_content > div:nth-child(21) > p.STANDARD') %>% 
               html_text())

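The HTML for each web page is very large; the two pages are the first and second URLs above. The HTML containing the unit code, along with a couple of extra divs, looks like this: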

 <div class="UnitGuideElementItem">
    <a name="0-UNIT-CODE"></a>
    <p style="font-size: 100%;" class="BOLD">
        "Unit code"
        <br>
        "&nbsp;"
        <br>
    <p style="font-size: 100%" class="STANDARD">
        "SLE334"
        <br>
    </p>
  </div>
  <div class="UnitGuideElementItem">
    <a name="0-UNIT-TITLE"></a>
    <p style="font-size: 100%;" class="BOLD">
       "Unit title"
       <br>
       "&nbsp;"
       <br>
    <p style="font-size: 100%" class="STANDARD">
       "Medical Microbiology and Immunology"
        <br>
  </div>
  <div class="UnitGuideElementItem">
     <a name="0-CONTACT-HOURS"></a>
     <p style="font-size: 100%;" class="BOLD">
        "Contact hours"
        <br>
        "&nbsp;"
        <br>
     <p style="font-size: 100%" class="STANDARD">
        "3 x 1 hour class per week, 5 x 3 hour practicals per trimester."
     <br>
  </div>

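Nothing distinguishes this section of HTML from the other sections except the <a> tag with name="0-UNIT-CODE". By looking at the w3schools page I was able to reach the <a> tag, but I cannot figure out how to specify its <p> sibling. Getting to the <a> tag: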

uCode<- sapply(1:length(urls[,1]), function(x)
               html(urls[x,2]) %>% 
               html_nodes(css='[name$=CODE]') %>% 
               html_text())

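Does anyone know how to select the 'same' element, for example the sibling of name="0-UNIT-CODE" in the HTML, when the element's position changes from one page to another? Or, how to return information from tags that can only be located through a different type of tag that shares the same parent?

Edit: Included the package name. Included links to the websites and more HTML for clarification.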

Answer 1

You can use the XPath "following-sibling::" axis to find the <p class=STANDARD> that is a sibling of <a name=0-UNIT-CODE>:

uCode <- sapply(seq_len(nrow(urls)), function(x)
               # read_html() is used here because html() was deprecated in rvest 0.3.0
               read_html(urls[x, 2]) %>%
               html_nodes(xpath = "//a[@name='0-UNIT-CODE']/following-sibling::p[@class='STANDARD']") %>%
               html_text())

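//a[@name='0-UNIT-CODE'] finds the <a> with name="0-UNIT-CODE" (note: I thought one would usually write //a[local-name()='0-UNIT-CODE'] in XPath, but that matches nothing here; local-name() returns the element's tag name, "a", not the value of its name attribute, so @name is the right test).
/following-sibling::p[@class='STANDARD'] selects the following siblings of that <a> that are <p> elements with class STANDARD.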
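
The selection can be tried without network access. The sketch below is only an illustration under stated assumptions: page_a and page_b are made-up miniature pages (not the real Deakin HTML) in which the unit-code block sits at a different child position, and unit_code_xpath() and unit_code_css() are hypothetical helper names. It also shows what appears to be the CSS equivalent of the XPath above, using the general sibling combinator (~), since the question asked for a CSS path.

```r
library(rvest)

# Made-up miniature pages: the unit-code div is the 1st child in page_a
# and the 2nd child in page_b, mimicking the nth-child(20)/(21) shift.
page_a <- '<div id="wmt_content">
  <div class="UnitGuideElementItem"><a name="0-UNIT-CODE"></a>
    <p class="BOLD">Unit code</p><p class="STANDARD">SLE010</p></div>
  <div class="UnitGuideElementItem"><a name="0-UNIT-TITLE"></a>
    <p class="BOLD">Unit title</p><p class="STANDARD">Some title</p></div>
</div>'
page_b <- '<div id="wmt_content">
  <div class="Notice"><p class="STANDARD">Extra block</p></div>
  <div class="UnitGuideElementItem"><a name="0-UNIT-CODE"></a>
    <p class="BOLD">Unit code</p><p class="STANDARD">SLE339</p></div>
</div>'

# Hypothetical helper: anchor on the <a name="0-UNIT-CODE"> marker (XPath).
unit_code_xpath <- function(doc) {
  doc %>%
    html_nodes(xpath = "//a[@name='0-UNIT-CODE']/following-sibling::p[@class='STANDARD']") %>%
    html_text(trim = TRUE)
}

# Hypothetical helper: same idea in CSS; "~" matches later siblings of the <a>.
unit_code_css <- function(doc) {
  doc %>%
    html_nodes(css = "a[name='0-UNIT-CODE'] ~ p.STANDARD") %>%
    html_text(trim = TRUE)
}

# read_html() also accepts a literal HTML string.
pages <- lapply(list(page_a, page_b), read_html)
codes_xpath <- vapply(pages, unit_code_xpath, character(1))
codes_css   <- vapply(pages, unit_code_css, character(1))
print(codes_xpath)
print(codes_css)
```

Because both selectors anchor on the name attribute rather than on a child index, the position of the enclosing div no longer matters. vapply() with character(1) will stop with an error if a page yields zero or several matches, which may be useful when running over many pages.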

Specifying CSS when nth-child() changes between pages
