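I have two data frames, each with two columns.
The "event" data frame holds a district name and a description of an event that took place in a village of that district. The "City" column is free-text sentence data, and the name of a village from the corresponding district is hidden somewhere inside it.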
data in event:
District City
Dst1 Dance program occured in village1 near highway
Dst1 Regional gatherting atvillage2 --> note: "at" and "village2" are run together (typo), so a partial search is required!
Dst2 Music showsin village3with famous songs --> note: I am not concerned about "showsin", but "village3with" contains the village3 name
Dst3 Sunset is pretty nice in these area
.
.
.
.
.
The "villages" data frame holds the district name and the complete list of villages under each district.
data in villages:
District Village
Dst1 Village1
Dst1 Village2
Dst2 Village1
.
.
.
.
.
Requirement:
Final desired output (a "Village" column must be added to "event"):
event:
District City Village
--------------------------------------------------------------------------------------------
Dst1 Dance program occured in village1 near highway Village1 --> present in df2
Dst1 Regional gatherting atvillage2 Village2 --> present in df2
Dst2 Music showsin village3with famous songs Village3 --> present in df2
Dst3 Sunset is pretty nice in these area Dst3 --> no village name mentioned and not found in df2
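I tried the R code below.
I tried using grepl and pmatch, but I am not satisfied with the results. Can anyone help me adjust the code to get more accurate results?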
for (i in (1:length(event$City))){
  found<-FALSE
  if ((!Trim(event$District[i]) == "")&(!Trim(event$District[i]) == "fatehgarh")){
    # Pick all the villages from a district
    village<-filter(villages, DISTRICT.NAME == Trim(event$District[i]))
    # Village column has district name as well. just dropping
    data1<-as.data.frame(sapply(village$Area.Name, function(x)
      gsub(paste(Trim(event$District[i]), collapse = '|'), '', x)))
    data1=data1[(!data1=="")]
    # Remove any unwanted special chars from the City sentences
    text<-str_replace_all(event$City[i], "[[:punct:]]", "")
    len=length(data1[[1]])
    print (paste0("i is : ", i))
    x=strsplit(text, " ")[[1]]
    # Loop through each village
    for (j in (1:len)){
      y<-paste0(strsplit(as.character(data1[j])," ")[[1]], collapse = " ")
      if (!isTRUE(all.equal(y, character(0)))){
        bool<-grepl(paste0(x, collapse = "|"), y)
        # match<-pmatch(strsplit(y, " ")[[1]], x)
        # bool=all(is.na(match))
        if (!bool){
          event$Village[i]=y
          found<-TRUE
          count1 = count1 +1
          print(paste0("Village - ", y))
          break()
        }
      }
    }
    if (!found){
      event$Village[i]=event$District[i]
      count2 = count2 +1
    }
  }
}
Sample data frames:
city=c("Regional priests vising **basavapattana**in karnataka",
"Showman of bollywood visitedhonnali lastweek",
"Reliance has launched jio tower in chennagiri",
"Railway track damaged near sargara",
"Electioncommittee passed strict order to SorabaMLA",
"Rural villages are now provided with WiFi facility"
)
district = c("Davanagere", "Shimoga", "Shimoga", "Shimoga", "Shimoga", "Dharwad")
event = data.frame(city, district)
district=c("Davanagere", "Davanagere", "Davanagere", "Davanagere", "Davanagere", "Davanagere", "Davanagere", "Davanagere", "Davanagere", "Davanagere", "Shimoga", "Shimoga", "Shimoga", "Shimoga", "Shimoga")
villages=c("Arasapura", "Attigere", "Avaragolla", "Bada", "Balluru", "Basavanahalu", "**Basavapatna**", "Batlekatte", "Bavihalu", "Belavanur", "Hosanagara", "Sagar", "Shikarpur", "Sorab", "Tirthahalli")
villages = data.frame(district, villages)
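Also, if you look closely, the two names marked in bold (double asterisks) refer to the same village and differ only by a spelling mistake, but the match must still pick Basavapatna.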
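To make the spelling-mistake case concrete, here is a minimal sketch of how approximate matching could score one sentence against the villages of its district. It uses base R's adist() with partial = TRUE; the variable names (candidates, sentence, score, best_village) and the idea of dividing by the name length are my own assumptions for illustration, not something taken from the code above.

```r
# Villages of the sentence's district, copied from the sample data.
candidates <- c("Arasapura", "Attigere", "Avaragolla", "Bada", "Balluru",
                "Basavanahalu", "Basavapatna", "Batlekatte", "Bavihalu", "Belavanur")
sentence <- "Regional priests vising **basavapattana**in karnataka"

# Edit distance from each village name to its closest substring of the sentence.
# partial = TRUE means the name only has to match part of the sentence.
d <- adist(candidates, sentence, partial = TRUE, ignore.case = TRUE)[, 1]

# Assumption: divide by the name length so that a short name such as "Bada"
# (one edit away from "basa") does not beat a long name that is also one edit away.
score <- d / nchar(candidates)

best_village <- candidates[which.min(score)]
print(best_village)
```

Without the length normalisation, "Bada" and "Basavapatna" would tie at one edit each, which is the kind of inaccurate result I would like to avoid.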
In addition, the data volume is very large, so I also need some help with optimization.
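One direction for the optimization, sketched under assumptions: instead of filtering the villages again for every event row, split the village list by district once and make a single adist() call per district that scores all of that district's sentences at the same time. The function name match_villages, the max_score cutoff of 0.2, and the reduced sample data are hypothetical choices for illustration; the sketch also assumes there are no NA sentences.

```r
event <- data.frame(
  city = c("Regional priests vising **basavapattana**in karnataka",
           "Electioncommittee passed strict order to SorabaMLA",
           "Rural villages are now provided with WiFi facility"),
  district = c("Davanagere", "Shimoga", "Dharwad"),
  stringsAsFactors = FALSE
)
villages <- data.frame(
  district = c(rep("Davanagere", 3), rep("Shimoga", 3)),
  villages = c("Bada", "Basavanahalu", "Basavapatna", "Sagar", "Shikarpur", "Sorab"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns one village (or the district as fallback) per event row.
match_villages <- function(event, villages, max_score = 0.2) {
  out <- event$district                          # fallback when nothing matches
  by_district <- split(villages$villages, villages$district)
  for (d in intersect(names(by_district), unique(event$district))) {
    rows <- which(event$district == d)
    cand <- by_district[[d]]
    # One call per district: rows are villages, columns are sentences.
    dmat <- adist(cand, event$city[rows], partial = TRUE, ignore.case = TRUE)
    score <- dmat / nchar(cand)                  # each row divided by its name length
    best <- apply(score, 2, which.min)           # best village index per sentence
    ok <- score[cbind(best, seq_along(rows))] <= max_score
    out[rows[ok]] <- cand[best[ok]]
  }
  out
}

event$Village <- match_villages(event, villages)
print(event[, c("district", "Village")])
```

This only removes the per-row filtering and the inner loop over villages; the edit-distance work itself still grows with the number of sentence and village pairs in each district, so I am not sure it is enough for very large data.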