I have extracted tweets into .csv format; the data looks like this:
(row 1) The latest The Admin Resources Daily! Thanks to @officerenegade @roberthalf @elliottdotorg #airfare #jobsearch
(row 2) RT @airfarewatchdog: Los Angeles #LAX to Cabo #SJD $312 nonstop on @AmericanAir for summer travel. #airfare
(row 3) RT @TheFlightDeal: #Airfare Deal: [AA] New York - Mexico City, Mexico. $270 r/t. Details:
(row 4) The latest The Nasir Muhammad Daily! Thanks to @Matt_Revel @Roddee @JaeKay #lefforum #airfare
(row 5) RT @BarefootNomads: So cool! <U+2708> <U+2764><U+FE0F> #airfare deals w @Skyscanner Facebook Messenger Bot #traveldeals #cheapflights ht…
(row 6) Flights to #Oranjestad #Aruba are £169 for a 15 day trip departing Tue, Jun 7th. #airfare via @hitlist_app"
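I have used NLP techniques to extract city names from the tweets, but the output is a flat list of cities, one per line. It simply identifies every city name and lists it, with no link back to the tweet it came from.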
Output:
1 Los Angeles
2 New York
3 Mexico City
4 Mexico
5 Tue
6 London
7 New York
8 Fort Lauderdale
9 Los Angeles
10 Paris
I would like the output to look something like:
1 Los Angeles Cabo (from the first tweet in row 2)
2 New York Mexico City Mexico (from the second tweet in row 3)
Code:
# Named Entity Recognition (NER)
bio <- readLines("C:\\xyz\\tweets.csv")
print(bio)
install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
install.packages("openNLPmodels.en",
                 repos = "http://datacube.wu.ac.at/",
                 type = "source")
library(NLP)
library(openNLP)
library(RWeka)
library(qdap)
library(openNLPmodels.en)
library(magrittr)
# Note: as.String() joins all lines (tweets) into one single document
bio <- as.String(bio)
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
class(bio_annotations)
head(bio_annotations)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
sents(bio_doc) %>% head(2)
words(bio_doc) %>% head(10)
location_ann <- Maxent_Entity_Annotator(kind = "location")
pipeline <- list(sent_ann,
                 word_ann,
                 location_ann)
bio_annotations <- annotate(bio, pipeline)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
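# Return the text of all entity annotations in doc, optionally of one kind
entities <- function(doc, kind) {
  s <- doc$content
  a <- annotations(doc)[[1]]
  if (hasArg(kind)) {
    # Keep only annotations whose "kind" feature matches, e.g. "location"
    k <- sapply(a$features, `[[`, "kind")
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
}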
entities(bio_doc, kind = "location")
cities <- entities(bio_doc, kind = "location")
library(xlsx)
write.xlsx(cities, "C:\\xyz\\xyz.xlsx")
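
A possible reason for the flat list is that the whole file is annotated as one document, so the link between a city and its row is lost. Below is a minimal sketch of the per-tweet grouping I am after, assuming the extractor is applied to one tweet at a time. `cities_per_tweet` and `toy_extractor` are hypothetical names, and the toy extractor is only a stand-in lookup against a fixed list so that the example runs without Java or the openNLP models; it is not the NER step itself.

```r
# Hypothetical helper: apply an extractor to each tweet separately and
# return one space-separated string of cities per tweet.
cities_per_tweet <- function(tweets, extract_fn) {
  vapply(tweets, function(tweet) {
    found <- extract_fn(tweet)
    paste(found, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Stand-in extractor (assumption): looks up a small fixed list of names.
toy_extractor <- function(tweet) {
  known <- c("Los Angeles", "Cabo", "New York", "Mexico City")
  hits <- vapply(known,
                 function(city) regexpr(city, tweet, fixed = TRUE)[1],
                 integer(1))
  # Keep matched names, ordered by where they appear in the tweet
  names(sort(hits[hits > 0]))
}

tweets <- c(
  "RT @airfarewatchdog: Los Angeles #LAX to Cabo #SJD $312 nonstop on @AmericanAir for summer travel. #airfare",
  "RT @TheFlightDeal: #Airfare Deal: [AA] New York - Mexico City, Mexico. $270 r/t. Details:"
)

# One row per tweet, with all of its cities kept together
result <- data.frame(row = seq_along(tweets),
                     cities = cities_per_tweet(tweets, toy_extractor),
                     stringsAsFactors = FALSE)
print(result)
```

With the real pipeline, `extract_fn` would presumably convert a single tweet with `as.String()`, call `annotate()` with the same `pipeline`, and return the location spans; that variant is not shown here because it needs the model package installed.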
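Also, is there a way to further classify the cities as origin and destination, i.e. by treating the cities that appear before "to" or ' - ' as origin cities and the rest as destination cities?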
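
To make the origin/destination idea concrete, here is a minimal sketch, assuming the cities for a tweet are already known and the tweet contains a separator (the word "to" or a " - ") surrounded by spaces. `split_route` is a hypothetical helper; it labels each city by whether it appears before or after the first separator.

```r
# Hypothetical helper: label cities as origin or destination depending on
# whether they occur before or after the first " to " or " - " in the tweet.
split_route <- function(tweet, cities) {
  sep_pos <- regexpr(" (to|-) ", tweet)[1]
  if (sep_pos < 0) {
    # No separator found: the direction cannot be inferred
    return(list(origin = character(0), destination = character(0)))
  }
  # Position of each city in the tweet (-1 if it does not occur)
  city_pos <- vapply(cities,
                     function(city) regexpr(city, tweet, fixed = TRUE)[1],
                     integer(1))
  found <- city_pos > 0
  list(origin = cities[found & city_pos < sep_pos],
       destination = cities[found & city_pos > sep_pos])
}

route1 <- split_route(
  "Los Angeles #LAX to Cabo #SJD $312 nonstop on @AmericanAir for summer travel.",
  c("Los Angeles", "Cabo"))
route2 <- split_route(
  "#Airfare Deal: [AA] New York - Mexico City, Mexico. $270 r/t.",
  c("New York", "Mexico City"))
str(route1)
str(route2)
```

This rule is only a heuristic: a tweet such as row 6 ("Flights to #Oranjestad ...") names no origin, so every city would land on the destination side.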