So, I'm in Italy, playing around in R with the 'Best Picture' Oscar winners list on IMDb. Running this code:
library(XML)
fileUrl <- "http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year%2Cdesc&ref_=nv_ch_osc_3"
doc <- htmlTreeParse(fileUrl, useInternalNodes = TRUE)
scores <- xpathSApply(doc,"//td[@class='title']",xmlValue)
head(scores, 3)
produces the following output:
[1] "\n \n\n\n\n 12 anni schiavo\n (2013)\n\n\n\n \n \n\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\n8.2/10\nX\n \n\n\nIn the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery.\n\n Dir: Steve McQueen\n With: Chiwetel Ejiofor, Michael K. Williams, Michael Fassbender\n\n Biography | Drama | History\n \n 134 mins.\n"
[2] "\n \n\n\n\n Argo\n (2012)\n\n\n\n \n \n\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\n7.8/10\nX\n \n\n\nActing under the cover of a Hollywood producer scouting a location for a science fiction film, a CIA agent launches a dangerous operation to rescue six Americans in Tehran during the U.S. hostage crisis in Iran in 1980.\n\n Dir: Ben Affleck\n With: Ben Affleck, Bryan Cranston, John Goodman\n\n Drama | Thriller\n \n 120 mins.\n"
[3] "\n \n\n\n\n The Artist\n (2011)\n\n\n\n \n \n\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\n8.0/10\nX\n \n\n\nA silent movie star meets a young dancer, but the arrival of talking pictures sends their careers in opposite directions.\n\n Dir: Michel Hazanavicius\n With: Jean Dujardin, Bérénice Bejo, John Goodman\n\n Comedy | Drama | Romance\n \n 100 mins.\n"
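Look at the first field after the line breaks. Notice how the title of movie 1 is translated into Italian (the English title is '12 Years a Slave'), while for movie 3 only the English title is given? Fast-forwarding a bit, here is a snippet just to give an idea (intermediate steps omitted):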
> head(scores.df[,1],10)
[1] "12 anni schiavo" "Argo"
[3] "The Artist" "Il discorso del re"
[5] "The Hurt Locker" "The Millionaire"
[7] "Non è un paese per vecchi" "The Departed - Il bene e il male"
[9] "Million Dollar Baby" "Crash: Contatto fisico"
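The intermediate steps are not shown in the post. As a rough sketch of how the title column could be obtained, one option is to take the first non-empty line of each raw cell string. The helper extract_title and the shortened sample strings are assumptions made for illustration; only the name scores.df comes from the snippet above.

```r
# Hypothetical helper (not from the original post): return the first
# non-empty line of one raw cell string, which is where the title sits.
extract_title <- function(cell) {
  lines <- trimws(strsplit(cell, "\n", fixed = TRUE)[[1]])
  lines <- lines[nzchar(lines)]
  if (length(lines) == 0) NA_character_ else lines[1]
}

# Shortened stand-ins for the strings returned by xpathSApply() above.
sample_cells <- c(
  "\n \n\n\n\n 12 anni schiavo\n (2013)\n\n\n\n \n 134 mins.\n",
  "\n \n\n\n\n Argo\n (2012)\n\n\n\n \n 120 mins.\n"
)

# Apply the helper to every cell and collect the titles in a data frame.
titles <- vapply(sample_cells, extract_title, character(1), USE.NAMES = FALSE)
scores.df <- data.frame(title = titles, stringsAsFactors = FALSE)
head(scores.df[, 1], 10)
```

With the real data, sample_cells would simply be replaced by the scores vector.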
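I do run a web proxy, so when I visit the site in Chrome it naturally gives me everything in English. But even in incognito mode and in Internet Explorer it serves everything in English, so why are some titles being partially translated, and how can I force it to stop?
Thanks!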
Answer 0 (score: 5)
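IMDB must be making assumptions about your request based on the originating IP. You have probably set a default locale in Chrome that requests the en-US version of the page, or your proxy has a more "English"-looking IP, but htmlTreeParse does not download the file through the same mechanism as your browser. I don't see any obvious way to change the headers used by the XML library. However, here is a version that uses the httr library to handle the HTTP requests: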
library(XML)
library(httr)
fileUrl <- "http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year%2Cdesc&ref_=nv_ch_osc_3"
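# Request a specific language via Accept-Language (comma-separated tags, most preferred first).
# The body is fetched as text and parsed with XML::htmlParse so that xpathSApply() works on it;
# recent httr versions return an xml2 document from content() by default.
en <- htmlParse(content(GET(fileUrl, add_headers("Accept-Language" = "en-US,en")), as = "text"), asText = TRUE)
it <- htmlParse(content(GET(fileUrl, add_headers("Accept-Language" = "it-IT,it")), as = "text"), asText = TRUE)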
Now we can compare the results:
head(xpathSApply(en,"//td[@class='title']//a[1]", xmlValue))
# [1] "12 Years a Slave" "Argo" "The Artist"
# [4] "The King's Speech" "The Hurt Locker" "Slumdog Millionaire"
head(xpathSApply(it,"//td[@class='title']//a[1]", xmlValue))
# [1] "12 anni schiavo" "Argo" "The Artist"
# [4] "Il discorso del re" "The Hurt Locker" "The Millionaire"
So we can see that IMDB honors the language requested in the request header.
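For reference, an Accept-Language value is a comma-separated list of language tags, each optionally followed by a ;q= weight that expresses preference. A minimal sketch of building such a value in base R follows; the helper accept_language is hypothetical and not part of httr, and the descending 0.1 steps are just one reasonable choice of weights.

```r
# Hypothetical helper (not part of httr): build an Accept-Language value
# from language tags given in order of preference. The first tag gets the
# implicit weight 1; each later tag gets a weight 0.1 lower.
accept_language <- function(tags) {
  stopifnot(is.character(tags), length(tags) >= 1, length(tags) <= 10)
  q <- seq(1, by = -0.1, length.out = length(tags))
  parts <- ifelse(q == 1, tags, sprintf("%s;q=%.1f", tags, q))
  paste(parts, collapse = ",")
}

accept_language(c("en-US", "en"))
# [1] "en-US,en;q=0.9"
```

The result can be passed straight to the request, for example add_headers("Accept-Language" = accept_language(c("en-US", "en"))).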