Question

I am trying to extract the "source" field from samples of this type:

sample : { "_id" : { "$oid" : "53b16540f4f0687e95da8843" }, "text" : "FCH urge a la FIFA el uso de tecnologÃa en decisiones:  2-1 fuel marcador final entre Holanda y MÃ©xico, que si... http://t.co/wiyp8AIxBO", "created_at" : 1404114920, "geo" : null, "source" : "twitterfeed" }

It is a continuous JSON feed: one continuous text containing millions of such samples. I am using JSON (the jsonlite package) to import the data.

pat<-".*?\"source\":\"(.*?).*?\""
l<-regexpr(pattern=pat,text=df$source)
regmatches(df$source,l)

[1] "twitterfeed\"}{\"_id\":{\"$oid\":\"53b16540f4f0687e95da8844\"},\"text\":\"RT @fifaworldcup_es: #Brasil2014 | Â¡Oct............**(complete text)**

I also tried gregexpr, but it too prints the complete text, with my search result at the end of each line.

pat<-".*?\"source\":\"(.*?).*?\""
poo<-gsub(pat,"\\1",df$source)
poo
[1] "Twitter for AndroidIFTTTTwitter for iPhonetwitterfeedTwitter for iPhoneTwitter for AndroidTwitter Web ClientTwitter for AndroidTwitter for iPhone...**(correct answers)**

However, I cannot split this result into separate values, so I tried the gsubfn package:

poo<-strapplyc(df$source,pat,simplify=rbind)
poo

 [,1] [,2] [,3] [,4]
 ""    ""    ""    ""

Please tell me why regmatches does not behave the same way as gsub for the same pattern, and why strapplyc cannot detect this pattern while gsub can.
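
For comparison, here is a minimal base-R sketch of how one pattern is treated by the different functions. The string `x` and the pattern `pat` are made-up illustrations, not the feed above. As far as I understand it, `regmatches` with `regexpr` returns the whole matched substring and ignores capture groups, `sub` replaces the match with the group and keeps everything outside the match, and `regexec` reports the groups separately.

```r
# Hypothetical toy input and pattern, for illustration only
x   <- "id=7;source=web"
pat <- "source=([a-z]+)"

# regexpr + regmatches: the whole matched substring, capture group ignored
whole <- regmatches(x, regexpr(pat, x))

# sub: the match is replaced by group 1; text outside the match is kept
replaced <- sub(pat, "\\1", x)

# regexec + regmatches: whole match first, then each capture group
groups <- regmatches(x, regexec(pat, x))[[1]]

print(whole)     # "source=web"
print(replaced)  # "id=7;web"
print(groups)    # "source=web" "web"
```

So a backreference such as `\\1` only has an effect in the replacement functions (`sub`, `gsub`); the matching functions report positions of the full match.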

What is the difference between the patterns required by gsub and by gregexpr?
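
To make the goal concrete, this is a rough base-R sketch of the result I am after: one element per record. The `feed` string is a shortened, made-up stand-in for the real data, and the sketch assumes that "source" values never contain an escaped double quote. It first collects every `"source":"..."` pair with `gregexpr`, then strips the key with `sub`.

```r
# Made-up stand-in for the feed: two records concatenated without a separator
feed <- paste0(
  '{"_id":{"$oid":"a1"},"text":"first","source":"twitterfeed"}',
  '{"_id":{"$oid":"a2"},"text":"second","source":"Twitter for iPhone"}'
)

# Match the key and its quoted value; [^"]* stops at the next double quote
# (assumption: values contain no escaped quotes)
key_value <- '"source"\\s*:\\s*"[^"]*"'

# gregexpr finds all matches; regmatches returns one whole match per record
hits <- regmatches(feed, gregexpr(key_value, feed))[[1]]

# Keep only the value by replacing each whole match with its capture group
sources <- sub('^"source"\\s*:\\s*"([^"]*)"$', "\\1", hits)

print(sources)  # "twitterfeed" "Twitter for iPhone"
```

Using `[^"]*` instead of `.*?` avoids relying on lazy quantifiers, which may behave differently between R's default regex engine and `perl = TRUE`.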

0 answers: