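I have successfully loaded a CSV file into Neo4j, and I want to remove the stopwords from the dataset. I keep a separate stopword list in a text file. I found sample code that filters out stopwords, but it uses a hardcoded list, and I want to use my stopword list file instead. How should I proceed? Can two datasets (kbv5.txt and stopwords.txt) be loaded in a single query?
This is the code in which I want to include the stopword list file: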
LOAD CSV FROM "file:///kbv5.txt" as row fieldterminator "."
with row
unwind row as text
with reduce(t=tolower(text), delim in ["","",",",".","!","?",'"',":",";","'","-"] | replace(t,delim,"")) as normalized
with [w in split(normalized," ") | trim(w)] as words
unwind range(0,size(words)-2) as idx
MERGE (w1:Word {name:words[idx]})
ON CREATE SET w1.count = 1
ON MATCH SET w1.count = w1.count + 1
MERGE (w2:Word {name:words[idx+1]})
ON CREATE SET w2.count = 1
ON MATCH SET w2.count = w2.count + (case when idx = size(words)-2 then 1 else 0 end)
MERGE (w1)-[r:NEXT]->(w2)
ON CREATE SET r.count = 1 ON MATCH SET r.count = r.count +1
Sample code that filters stopwords with a hardcoded list:
with "Great device, but the calls drop too frequently." as text
with replace(replace(tolower(text),".",""),",","") as normalized
with [w in split(normalized," ") | trim(w)] as words
with [w in words WHERE NOT w IN ["the","an","on"]] as words
UNWIND range(0,size(words)-2) as idx
MERGE (w1:Word {name:words[idx]})
MERGE (w2:Word {name:words[idx+1]})
MERGE (w1)-[:NEXT]->(w2)
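Thanks in advance.
Answer 0 (score: 0)
This code demonstrates how to remove stopwords from text. Try it out; it does not write anything to your database. In your own query, you would do this filtering near the top, right after the import.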
WITH SPLIT('some of these words are unnecessary',' ') AS text,
SPLIT('are but of in the these',' ') AS stopwords
// FILTER() was removed in Neo4j 4.0; a list comprehension works in 3.x and later
RETURN [word IN text WHERE NOT word IN stopwords] AS filtered
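To address the "two files in one query" part of the question: a single Cypher query can contain more than one LOAD CSV clause, so the stopword file can be collected into a list first and then carried along with WITH while the main file is processed. The following is only a minimal sketch under some assumptions that are not stated in the question: stopwords.txt is assumed to hold one stopword per line with no commas, both files are assumed to sit in the Neo4j import directory, and the per-word counts from the original query are left out for brevity.

```cypher
// Assumption: stopwords.txt has one stopword per line and contains no commas,
// so each line arrives as a single-element list.
LOAD CSV FROM "file:///stopwords.txt" AS line
WITH collect(toLower(trim(line[0]))) AS stopwords

// Second LOAD CSV in the same query; stopwords stays in scope.
LOAD CSV FROM "file:///kbv5.txt" AS row FIELDTERMINATOR "."
UNWIND row AS text

// Skip empty fields so the string functions below never see null.
WITH stopwords, text WHERE text IS NOT NULL

// Lower-case the sentence and strip punctuation.
WITH stopwords,
     reduce(t = toLower(text),
            delim IN [",", ".", "!", "?", '"', ":", ";", "'", "-"] |
            replace(t, delim, "")) AS normalized
WITH stopwords, [w IN split(normalized, " ") | trim(w)] AS words

// Drop empty tokens and anything found in the stopword list.
WITH [w IN words WHERE w <> "" AND NOT w IN stopwords] AS words

// Link each remaining word to the word that follows it.
UNWIND range(0, size(words) - 2) AS idx
MERGE (w1:Word {name: words[idx]})
MERGE (w2:Word {name: words[idx + 1]})
MERGE (w1)-[r:NEXT]->(w2)
  ON CREATE SET r.count = 1
  ON MATCH SET r.count = r.count + 1
```

The stopword list is built once by collect() and then passed through every WITH until the filtering step, after which it is no longer needed. If the stopword file uses a different layout (for example, all words on one comma-separated line), the first two lines would need to change accordingly.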