Question

I am trying to scrape data from the following website: http://goodcompanies.com/company/31-bits/

I want to scrape the value "Most <$100":

<div class="company-info-section no-flex-grow no-flex-shink">
    <h4 class="all-caps title-line-right no-margin company-section-title">
        <span>Price Range</span>
    </h4>
    <b>Most <$100</b>
</div>

The code I am using:

   library(rvest)

   html <- read_html("http://goodcompanies.com/company/31-bits/")

   info <- html %>%
       html_nodes('.company-info-section') %>%
       html_text() %>%
       .[1]

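What I get: "\n\t\t\n\t\t\tPrice Range\n\t\t\n\t\tMost \n\t"

What I want, and should be getting: "\n\t\t\n\t\t\tPrice Range\n\t\t\n\t\tMost < $100\n\t"

It seems that the missing space between < and $ in the actual HTML is causing the problem. How can I fix this?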

Answer 1

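The space is not the problem. The actual problem is that the website uses invalid HTML: a literal < in HTML text must be escaped (as &lt;), and the website does not do this.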

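Unfortunately, rvest does not seem to cope well with invalid HTML. The best solution would be to find an HTML parser that can handle messy or invalid HTML. Unfortunately, I don't know of one for R.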

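A hacky solution is to download the page into a string, fix the problem there (i.e. run gsub('<$', '&lt;$', page_text, fixed = TRUE) or similar), and then pass the result to rvest. Note that fixed = TRUE is needed because $ would otherwise be read as a regular-expression anchor.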
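
As a rough sketch of that workaround, the snippet below applies the replacement to an inline copy of the fragment from the question instead of the live page, so that it runs without network access. The variable names page_text, fixed_text and price are illustrative only, and the CSS selector assumes the value sits in a b element inside .company-info-section, as in the markup shown in the question. For the real page, the source would first have to be read into a single string, for example with paste(readLines(url, warn = FALSE), collapse = "\n").

```r
library(rvest)

# Stand-in for the downloaded page source (an assumption for illustration;
# the real page would be fetched into one string first).
page_text <- '<div class="company-info-section no-flex-grow no-flex-shink">
    <h4 class="all-caps title-line-right no-margin company-section-title">
        <span>Price Range</span>
    </h4>
    <b>Most <$100</b>
</div>'

# Escape the bare "<" that precedes "$". fixed = TRUE makes gsub treat the
# pattern as a literal string, so "$" is not an end-of-string anchor.
fixed_text <- gsub("<$", "&lt;$", page_text, fixed = TRUE)

# Parse the repaired markup and pull out the text of the <b> element.
price <- read_html(fixed_text) %>%
    html_nodes(".company-info-section b") %>%
    html_text()

print(price)
```

This only repairs the one pattern known to be broken on this page. Any other unescaped < would need its own replacement, which is why a more tolerant parser would be the cleaner fix.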
