R htmlTreeParse: unwanted partial translation

Date: 2014-10-01 00:18:44

Tags: r parsing proxy language-translation

So, I'm in Italy, playing around in R with the "Best Picture" Oscar winners list on IMDb. Running this code:

library(XML)
# Keep the URL on one line; a line break inside the string would become part of the URL.
fileUrl <- "http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year%2Cdesc&ref_=nv_ch_osc_3"
doc <- htmlTreeParse(fileUrl, useInternalNodes = TRUE)
scores <- xpathSApply(doc, "//td[@class='title']", xmlValue)
head(scores, 3)

produces the following output:

[1] "\n    \n\n\n\n    12 anni schiavo\n    (2013)\n\n\n\n \n \n\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\n8.2/10\nX\n \n\n\nIn the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery.\n\n    Dir: Steve McQueen\n    With: Chiwetel Ejiofor, Michael K. Williams, Michael Fassbender\n\n    Biography | Drama | History\n    \n    134 mins.\n"                                                       
[2] "\n    \n\n\n\n    Argo\n    (2012)\n\n\n\n \n \n\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\n7.8/10\nX\n \n\n\nActing under the cover of a Hollywood producer scouting a location for a science fiction film, a CIA agent launches a dangerous operation to rescue six Americans in Tehran during the U.S. hostage crisis in Iran in 1980.\n\n    Dir: Ben Affleck\n    With: Ben Affleck, Bryan Cranston, John Goodman\n\n    Drama | Thriller\n    \n    120 mins.\n"
[3] "\n    \n\n\n\n    The Artist\n    (2011)\n\n\n\n \n \n\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\n8.0/10\nX\n \n\n\nA silent movie star meets a young dancer, but the arrival of talking pictures sends their careers in opposite directions.\n\n    Dir: Michel Hazanavicius\n    With: Jean Dujardin, Bérénice Bejo, John Goodman\n\n    Comedy | Drama | Romance\n    \n    100 mins.\n"    

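Look at the first field after the line breaks. Notice how the title of movie 1 is translated into Italian (the English title is "12 Years a Slave"), while for movie 3 only the English title is given? Fast-forwarding a bit, here is a snippet just to give an idea (intermediate steps omitted):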

> head(scores.df[,1],10)
 [1] "12 anni schiavo"                  "Argo"                            
 [3] "The Artist"                       "Il discorso del re"              
 [5] "The Hurt Locker"                  "The Millionaire"                 
 [7] "Non è un paese per vecchi"        "The Departed - Il bene e il male"
 [9] "Million Dollar Baby"              "Crash: Contatto fisico"  
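
The omitted intermediate steps are not shown in the post. A minimal sketch of one way to pull that first field out of each raw string, assuming the title is always the first non-blank line of the cell text (first_field is a hypothetical helper name, and the sample string is a shortened copy of the first result above):

```r
# Hypothetical helper: return the first non-blank line of each string.
first_field <- function(x) {
  vapply(strsplit(x, "\n", fixed = TRUE), function(parts) {
    parts <- trimws(parts)          # drop the indentation around each line
    parts[nzchar(parts)][1]         # first line that is not empty
  }, character(1), USE.NAMES = FALSE)
}

# Shortened copy of the first raw value from the output above.
raw <- "\n    \n\n\n\n    12 anni schiavo\n    (2013)\n\n\n\n \n \n\n1\n2\n3\n"
print(first_field(raw))
```

Applied to the whole vector, first_field(scores) would give a character vector of titles along the lines of the one shown above.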

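I do run a web proxy, so when I visit the site in Chrome it naturally gives me everything in English. But even in incognito mode and in Internet Explorer it serves everything in English. So why are some of the titles translated when I fetch the page from R, and how do I force it to stop?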

Thanks!

1 Answer:

Answer 0 (score: 5)

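IMDb is presumably guessing your language from the originating IP address of the request. You have probably set a default locale in Chrome that requests the en-US version of web pages, or your proxy has a more "English"-looking IP, but htmlTreeParse does not download the file through the same mechanism as your browser, so it does not send the same language preferences. I don't see any obvious way to change the headers used by the XML library. However, here is a version that uses the httr library to handle the HTTP request and set the Accept-Language header explicitly: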

library(XML)
library(httr)
fileUrl <- "http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year%2Cdesc&ref_=nv_ch_osc_3"
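# Accept-Language lists languages separated by commas, with optional ";q=" weights.
en <- content(GET(fileUrl, add_headers("Accept-Language" = "en-US,en;q=0.9")))
it <- content(GET(fileUrl, add_headers("Accept-Language" = "it-IT,it;q=0.9")))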

Now we can compare the results:

head(xpathSApply(en,"//td[@class='title']//a[1]", xmlValue))
# [1] "12 Years a Slave"    "Argo"                "The Artist"          
# [4] "The King's Speech"   "The Hurt Locker"     "Slumdog Millionaire"

head(xpathSApply(it,"//td[@class='title']//a[1]", xmlValue))
# [1] "12 anni schiavo"    "Argo"               "The Artist"         
# [4] "Il discorso del re" "The Hurt Locker"    "The Millionaire"

So we can see that IMDb honors the language requested in the request headers.
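
One caveat if you try this with a more recent httr: as far as I know, content() now parses HTML with xml2 rather than XML, so xpathSApply() may not accept its result directly. A minimal sketch that sidesteps this by requesting the body as text and parsing it with htmlParse(). The names extract_titles and fetch_titles are hypothetical helpers, the sample HTML is made up to mimic the structure the XPath expects, and the network call is only defined, not run:

```r
library(XML)

# Hypothetical helper: first link text inside each <td class="title">.
extract_titles <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE, encoding = "UTF-8")
  xpathSApply(doc, "//td[@class='title']//a[1]", xmlValue)
}

# Hypothetical helper: fetch a page in a given language and extract titles.
# Needs httr and network access, so it is defined here but not called.
fetch_titles <- function(url, lang = "en-US,en;q=0.9") {
  resp <- httr::GET(url, httr::add_headers("Accept-Language" = lang))
  extract_titles(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Made-up fragment mimicking the page structure (not real IMDb markup).
sample_html <- paste0(
  '<table><tr><td class="title"><a href="/a">12 Years a Slave</a> (2013)</td></tr>',
  '<tr><td class="title"><a href="/b">Argo</a> (2012)</td></tr></table>'
)
titles <- extract_titles(sample_html)
print(titles)
```

With this in place, fetch_titles(fileUrl) and fetch_titles(fileUrl, "it-IT,it;q=0.9") would be the equivalents of the en and it comparisons above, assuming the page still uses the same markup.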