Removing stop words in Neo4j using a text file

Time: 2018-09-28 10:20:13

Tags: neo4j cypher graph-databases

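I have successfully loaded a CSV file into Neo4j, and I want to remove the stop words from the dataset. I keep a separate list of stop words in a text file. I found sample code that filters stop words, but it uses an inline list, and I want to use my stop word file instead. How should I proceed? Can two datasets (kbv5.txt and stopwords.txt) be loaded in one query?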

I want to include the stop word list file in my code:

LOAD CSV FROM "file:///kbv5.txt"  as row fieldterminator "."
with row
unwind row as text
with reduce(t=tolower(text), delim in ["","",",",".","!","?",'"',":",";","'","-"] | replace(t,delim,"")) as normalized
with [w in split(normalized," ") | trim(w)] as words
unwind range(0,size(words)-2) as idx
MERGE (w1:Word {name:words[idx]})
ON CREATE SET w1.count = 1
ON MATCH SET w1.count = w1.count + 1
MERGE (w2:Word {name:words[idx+1]})
ON CREATE SET w2.count = 1
ON MATCH SET w2.count = w2.count + (case when idx = size(words)-2 then 1 else 0 end)
MERGE (w1)-[r:NEXT]->(w2)
 ON CREATE SET r.count = 1 ON MATCH SET r.count = r.count +1

Sample code that uses stop words:

with "Great device, but the calls drop too frequently." as text
with replace(replace(tolower(text),".",""),",","") as normalized
with [w in split(normalized," ") | trim(w)] as words
with [w in words WHERE NOT w IN ["the","an","on"]] as words
UNWIND range(0,size(words)-2) as idx
MERGE (w1:Word {name:words[idx]})
MERGE (w2:Word {name:words[idx+1]})
MERGE (w1)-[:NEXT]->(w2)

Thanks in advance.

1 answer:

Answer 0 (score: 0)

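This code demonstrates how to remove stop words from text. Try it out; it does not write anything to your database. In your own query, you would do this near the top, right after the import.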

WITH SPLIT('some of these words are unnecessary',' ') AS text, 
     SPLIT('are but of in the these',' ') AS stopwords
// List comprehension instead of FILTER(), which was removed in Neo4j 4.0
RETURN [word IN text WHERE NOT word IN stopwords]
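
To use the stop word file directly, the same filter can be fed from a second LOAD CSV in the same query. The following is only a minimal sketch under some assumptions: stopwords.txt sits in the Neo4j import directory next to kbv5.txt and holds one stop word per line, and the delimiter list is the one from the question without its two empty strings. The count bookkeeping on the Word nodes is left out for brevity and could be added back as in the question.

```cypher
// Assumption: stopwords.txt contains one stop word per line.
LOAD CSV FROM "file:///stopwords.txt" AS line
// collect() skips nulls, so blank lines do not end up in the list
WITH collect(toLower(trim(line[0]))) AS stopwords

// Second file, loaded in the same query; stopwords is carried along in each WITH
LOAD CSV FROM "file:///kbv5.txt" AS row FIELDTERMINATOR "."
UNWIND row AS text
WITH stopwords, text
WHERE text IS NOT NULL
WITH stopwords,
     reduce(t = toLower(text),
            delim IN [",", ".", "!", "?", '"', ":", ";", "'", "-"] |
            replace(t, delim, "")) AS normalized
WITH stopwords, [w IN split(normalized, " ") | trim(w)] AS words
// Drop empty tokens and anything found in the stop word list
WITH [w IN words WHERE w <> "" AND NOT w IN stopwords] AS words
UNWIND range(0, size(words) - 2) AS idx
MERGE (w1:Word {name: words[idx]})
MERGE (w2:Word {name: words[idx + 1]})
MERGE (w1)-[r:NEXT]->(w2)
  ON CREATE SET r.count = 1
  ON MATCH SET r.count = r.count + 1
```

Collecting the stop words first reduces them to a single row, so the second LOAD CSV runs once per line of kbv5.txt rather than once per stop word. To check the list before writing anything, run only the first three lines followed by RETURN stopwords.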