Extracting cities from each row of an Excel file using R

Asked: 2016-06-01 16:16:58

Tags: r tweets

I have extracted tweets into a .csv file. The data looks like this:

(row 1) The latest The Admin Resources Daily! Thanks to @officerenegade @roberthalf @elliottdotorg #airfare #jobsearch
(row 2) RT @airfarewatchdog: Los Angeles #LAX to Cabo #SJD $312 nonstop on @AmericanAir for summer travel. #airfare
(row 3) RT @TheFlightDeal: #Airfare Deal: [AA] New York - Mexico City, Mexico. $270 r/t. Details:
(row 4) The latest The Nasir Muhammad Daily! Thanks to @Matt_Revel @Roddee @JaeKay #lefforum #airfare
(row 5) RT @BarefootNomads: So cool! <U+2708> <U+2764><U+FE0F> #airfare deals w @Skyscanner Facebook Messenger Bot #traveldeals #cheapflights ht…
(row 6) Flights to #Oranjestad #Aruba are £169 for a 15 day trip departing Tue, Jun 7th. #airfare via @hitlist_app

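I have used NLP techniques to extract city names from the tweets, but the output is a single flat list of cities, one per line. It simply recognizes every city name and lists it, with no indication of which tweet each city came from.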

Output: 
1   Los Angeles
2   New York
3   Mexico City
4   Mexico
5   Tue
6   London
7   New York
8   Fort Lauderdale
9   Los Angeles
10  Paris

I would like the output to look like this instead:

1  Los Angeles Cabo (from the first tweet in row 2)
2  New York Mexico City Mexico (from the second tweet in row 3)

Code:

#Named Entity Recognition (NER)

bio <- readLines("C:\\xyz\\tweets.csv")
print(bio)

install.packages(c("NLP", "openNLP", "RWeka", "qdap"))

install.packages("openNLPmodels.en",
             repos = "http://datacube.wu.ac.at/",
             type = "source")

library(NLP)
library(openNLP)
library(RWeka)
library(qdap)
library(openNLPmodels.en)
library(magrittr)

# as.String() joins all lines into one string, so row boundaries are not tracked after this point
bio <- as.String(bio)

word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()

bio_annotations <- annotate(bio, list(sent_ann, word_ann))

class(bio_annotations)
head(bio_annotations)

bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
sents(bio_doc) %>% head(2)
words(bio_doc) %>% head(10)

location_ann <- Maxent_Entity_Annotator(kind = "location")

pipeline <- list(sent_ann,
                 word_ann,
                 location_ann)
bio_annotations <- annotate(bio, pipeline)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)

entities <- function(doc, kind) {
  s <- doc$content
  a <- annotations(doc)[[1]]
  if(hasArg(kind)) {
    k <- sapply(a$features, `[[`, "kind")
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
}   

entities(bio_doc, kind = "location")
cities <- entities(bio_doc, kind = "location")

library(xlsx)
write.xlsx(cities, "C:\\xyz\\xyz.xlsx")
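
One possible way to keep the cities grouped by tweet is to annotate each row on its own instead of collapsing the whole file into a single String. The following is only a sketch under assumptions: cities_per_row and make_location_extractor are hypothetical helper names, the output file name in the commented usage is made up, and the extractor assumes Java plus the openNLP and openNLPmodels.en packages are installed as in the code above.

```r
# Hypothetical helper: run an extractor on each tweet separately and
# return one output row per tweet, with its cities joined by spaces.
cities_per_row <- function(tweets, extract_fun) {
  found <- lapply(tweets, extract_fun)
  data.frame(
    row = seq_along(tweets),
    cities = vapply(found, paste, character(1), collapse = " "),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: build the annotators once and return a function
# that extracts the location entities from a single piece of text.
make_location_extractor <- function() {
  sent_ann <- openNLP::Maxent_Sent_Token_Annotator()
  word_ann <- openNLP::Maxent_Word_Token_Annotator()
  loc_ann <- openNLP::Maxent_Entity_Annotator(kind = "location")
  function(text) {
    s <- NLP::as.String(text)
    a <- NLP::annotate(s, list(sent_ann, word_ann, loc_ann))
    ents <- a[a$type == "entity"]
    if (!length(ents)) return(character(0))
    as.character(s[ents])
  }
}

# Example use (assumed paths; needs the packages listed above):
# tweets <- readLines("C:\\xyz\\tweets.csv")
# result <- cities_per_row(tweets, make_location_extractor())
# write.csv(result, "C:\\xyz\\cities_per_row.csv", row.names = FALSE)
```

Passing the extractor in as an argument keeps the per-row grouping separate from the NER step, so a different entity recognizer could be swapped in without touching the grouping logic.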

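Also, is there a way to further separate the cities into origin and destination, i.e. by classifying the cities that come before a separator (such as ' - ') as origin cities and the rest as destination cities?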
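
A rough sketch of that origin/destination idea, using only base R, might look like the following. It assumes the cities for a single tweet are already known (here they are typed in by hand) and that the first " to " or " - " in the tweet separates origin from destination. The function name split_route and its sep_pattern parameter are hypothetical, and the heuristic will misfire on tweets where "to" is used in another sense.

```r
# Hypothetical helper: split the cities found in one tweet into origin and
# destination, based on where they sit relative to the first " to " or " - ".
split_route <- function(tweet, cities,
                        sep_pattern = "[[:space:]](to|-)[[:space:]]") {
  # Position of the first separator, or -1 if there is none
  sep_pos <- as.integer(regexpr(sep_pattern, tweet))
  # Position of the first literal occurrence of each city, or -1 if absent
  city_pos <- vapply(
    cities,
    function(city) as.integer(regexpr(city, tweet, fixed = TRUE)),
    integer(1),
    USE.NAMES = FALSE
  )
  found <- city_pos > 0
  if (sep_pos == -1) {
    # No separator: leave the cities unclassified rather than guess
    return(list(origin = character(0),
                destination = character(0),
                unclassified = cities[found]))
  }
  list(origin = cities[found & city_pos < sep_pos],
       destination = cities[found & city_pos > sep_pos],
       unclassified = character(0))
}

tweet <- "RT @airfarewatchdog: Los Angeles #LAX to Cabo #SJD $312 nonstop on @AmericanAir for summer travel. #airfare"
route <- split_route(tweet, c("Los Angeles", "Cabo"))
print(route)
```

For the tweet in row 2 this puts Los Angeles under origin and Cabo under destination. A tweet such as row 6 ("Flights to #Oranjestad ...") would yield an empty origin, since every city there follows the separator.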

0 answers:

No answers