Question

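Hello StackOverflow users,

Sorry for the silly question.

My question is somewhat general, but here is an example: suppose I am scraping the official website of US cities from the Wikipedia infobox. For a given list of Wikipedia URLs, I need the last row of the infobox (the box on the right side of the page), which holds the website information. In Python I would do it as shown here, but I can't figure out how to do the same in R: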

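import requests
from bs4 import BeautifulSoup

def get_website(soup):
    # Scan the infobox rows for the one whose header cell mentions "Website"
    for tr in soup.find("table", class_="infobox").find_all("tr"):
        if tr.th and tr.td and 'Website' in tr.th.text:
            # Return the visible text of the matching data cell
            return tr.td.get_text(strip=True)
    return None

r = requests.get("https://en.wikipedia.org/wiki/Los_Angeles")
r.raise_for_status()  # stop early on an HTTP error instead of parsing nothing
soup = BeautifulSoup(r.text, 'lxml')
print(get_website(soup))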

Answer 1

There are better ways to do this in both Python and R, via XPath.


library(rvest)

pg <- read_html("https://en.wikipedia.org/wiki/Los_Angeles")

last_row_link <- html_node(pg, xpath=".//table[contains(@class,'infobox') and tr[contains(., 'Website')]]/tr[last()]/td//a")

html_text(last_row_link)
## [1] "Official website"

html_attr(last_row_link, "href")
## [1] "https://www.lacity.org/"

I'm assuming you really want the href attribute of the link in the last <tr>, but the last() expression in the XPath is the essential ingredient. The final td//a says (basically) "once you find the <td> in the <tr> we found, look anywhere in the element subtree for an anchor tag".
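Since the question comes from Python, here is a rough sketch of the same last() idea using only the standard library. The sample markup is made up for illustration, and xml.etree.ElementTree supports only a subset of XPath (no contains()), so it matches the class attribute exactly and needs well-formed markup. For a real Wikipedia page you would likely want a proper HTML parser with full XPath support instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a Wikipedia infobox (not real page content)
SAMPLE = """
<html><body>
<table class="infobox">
  <tr><th>Country</th><td>United States</td></tr>
  <tr><th>Website</th><td><span class="url"><a href="https://www.lacity.org/">Official website</a></span></td></tr>
</table>
</body></html>
"""

def last_row_link(markup):
    """Return (text, href) of the first link in the last infobox row, or None."""
    root = ET.fromstring(markup)
    # tr[last()] picks the last row; td//a searches the whole <td> subtree for an anchor
    link = root.find(".//table[@class='infobox']/tr[last()]/td//a")
    if link is None:
        return None
    return link.text, link.get("href")

print(last_row_link(SAMPLE))
```

The path mirrors the R expression: the predicate on the table narrows to the infobox, tr[last()] selects the final row, and td//a reaches the link however deeply it is nested.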

Answer 2

Is it a particular td that you want, or does it have a specific identifier?

However, if you want the url-link element from the tr of the table with class infobox, similar to your code, then I would do this:

require(rvest)

# read the webpage
webpage <- read_html("https://en.wikipedia.org/wiki/Los_Angeles")

# extract the url-link element of the table with class infobox
your_infobox_tr <- webpage %>% html_nodes(".infobox") %>% html_nodes(".url>a")

# extract the href link content
your_href <- your_infobox_tr %>% html_attr(name = "href")

Or, if you prefer a one-liner:

your_wanted_link <- read_html("https://en.wikipedia.org/wiki/Los_Angeles") %>% html_nodes(".infobox") %>% html_nodes(".url>a") %>% html_attr(name = "href")

FYI: in case you don't know what %>% is, it is a pipe operator, which you can get by installing the magrittr package.
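For comparison with the original Python, the class-based selection (.infobox, then .url>a) could be sketched roughly like this with the standard library alone. The has_class helper and the sample markup are made up for illustration. A class attribute can hold several names, so the helper checks for one token instead of comparing the whole string.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for an infobox (not real page content)
SAMPLE = """<table class="infobox vcard">
  <tr><th>Area code</th><td><a href="/wiki/Area_code_213">213</a></td></tr>
  <tr><th>Website</th><td><span class="url"><a href="https://www.lacity.org/">Official website</a></span></td></tr>
</table>"""

def has_class(element, name):
    """True if `name` is one of the space-separated class tokens of the element."""
    return name in (element.get("class") or "").split()

def infobox_url_links(markup):
    """Collect href values of <a> elements that are direct children of a
    .url element inside a table with class infobox."""
    root = ET.fromstring(markup)
    hrefs = []
    for table in root.iter("table"):
        if not has_class(table, "infobox"):
            continue
        for holder in table.iter():
            if has_class(holder, "url"):
                # findall("a") looks at direct children only, like the ">" in ".url>a"
                hrefs.extend(a.get("href") for a in holder.findall("a"))
    return hrefs

print(infobox_url_links(SAMPLE))
```

Selecting by class rather than by row position keeps working if the website row is not the last one in the infobox.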

rvest scraping: getting a specific td (translation from Python)

2 answers: