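I built a Neo4j graph database containing about 50,000 nodes labeled DIAGNOSE, each with a string property TEXT of at most 50 characters. The same database contains about 120,000 nodes labeled BASETEXT, each with a string property TEXTVALUE of up to 175,000 characters. My goal is to create a relationship (b:BASETEXT) -[:ASSOCIATED]-> (d:DIAGNOSE) whenever DIAGNOSE.TEXT is contained in BASETEXT.TEXTVALUE, which amounts to about 2.9 * 10^9 searches in total. I tried the following two approaches in Cypher:
Approach 1: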
match (b:BASETEXT), (d:DIAGNOSE)
where b.TEXTVALUE =~ (".* " + d.TEXT + " .*")
merge (b) -[:ASSOCIATED]-> (d);
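Approach 2 (create a relationship between every DIAGNOSE node and every BASETEXT node, set the relationship property CONTAINED to true if TEXT occurs in TEXTVALUE and to false otherwise, and finally delete all relationships with ASSOCIATED.CONTAINED = false):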
match (b:BASETEXT), (d:DIAGNOSE)
where not (b) -[:ASSOCIATED]-> (d)
with b, d limit 20000
create (b) -[a:ASSOCIATED]-> (d)
with b, d, a
set a.CONTAINED =
case
when (b.TEXTVALUE =~ (".* " + d.TEXT + " .*")) then true
else false
end
return count(a);
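Neither approach works. Approach 1 did not finish within half an hour; approach 2 does finish, but would take me about 60 days. Any suggestions on how to implement this text search properly in Neo4j and solve the problem, preferably in Cypher?
Answer 0: (score: 0)
I also created a graph-gist for it, for illustration.
I think this is a use case that is not well supported in Cypher.
If you want to try a simple Cypher variant that should work (albeit slowly), try this: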
MATCH (d:DIAGNOSE)
WHERE NOT () -[:ASSOCIATED]-> (d)
WITH d
SKIP 0 LIMIT 1000
MATCH (b:BASETEXT)
WHERE (b.TEXTVALUE =~ (".* " + d.TEXT + " .*"))
CREATE (b) -[:ASSOCIATED]-> (d)
RETURN count(*);
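One caveat about the regular expression used in the queries above, assuming Cypher's =~ follows Java regular-expression semantics and has to match the whole string: '.' does not cross line breaks by default, the search text is interpreted as a pattern rather than literally, and a hit at the very start or end of TEXTVALUE is missed because a space is required on both sides. The plain-Java sketch below only illustrates these effects. The class and method names are hypothetical, and quotedPattern is one possible stricter variant, not the method used in this answer.

```java
import java.util.regex.Pattern;

final class RegexContainmentSketch {

    // Mirrors the predicate b.TEXTVALUE =~ (".* " + d.TEXT + " .*").
    // String.matches has to match the entire string.
    static boolean originalPattern(String textValue, String text) {
        return textValue.matches(".* " + text + " .*");
    }

    // Hypothetical stricter variant (an assumption, not from the answer):
    // (?s) lets '.' span line breaks, Pattern.quote treats the search text
    // literally, and (^|\s) / (\s|$) also accept a hit at the start or end.
    static boolean quotedPattern(String textValue, String text) {
        Pattern p = Pattern.compile(
                "(?s).*(^|\\s)" + Pattern.quote(text) + "(\\s|$).*");
        return p.matcher(textValue).matches();
    }

    public static void main(String[] args) {
        String multiLine = "line one\nfoo 123 bar";
        System.out.println(originalPattern(multiLine, "123"));
        System.out.println(quotedPattern(multiLine, "123"));
        System.out.println(originalPattern("x a+b y", "a+b"));
        System.out.println(quotedPattern("x a+b y", "a+b"));
    }
}
```

If the TEXTVALUE strings contain line breaks or the TEXT values contain characters such as + or (, the two variants can give different answers, so it may be worth checking a few samples before running the full job.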
In Java, with a Lucene full-text index, it should be much faster:
import org.junit.Before;
import org.junit.Test;
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.index.impl.lucene.LuceneIndexImplementation;
import org.neo4j.test.TestGraphDatabaseFactory;
import org.neo4j.tooling.GlobalGraphOperations;

public class ConnectIndexTest {
    private static final String PATH = "target/connect.db";
    public static final Label BASETEXT = DynamicLabel.label("BASETEXT");
    public static final Label DIAGNOSE = DynamicLabel.label("DIAGNOSE");
    public static final String TEXTVALUE = "TEXTVALUE";
    public static final String TEXT = "TEXT";
    public static final String INDEX_NAME = "basetext";
    private static final RelationshipType ASSOCIATED = DynamicRelationshipType.withName("ASSOCIATED");
    private GraphDatabaseService db;

    @Before
    public void setUp() throws Exception {
        // db = new GraphDatabaseFactory().newEmbeddedDatabase(PATH);
        db = new TestGraphDatabaseFactory().newImpermanentDatabase();
        // Synthetic BASETEXT nodes of the form "foo <number> bar"
        try (Transaction tx = db.beginTx()) {
            for (int i = 100_000; i < 250_000; i++) db.createNode(BASETEXT).setProperty(TEXTVALUE, "foo " + i + " bar");
            tx.success();
        }
        // Synthetic DIAGNOSE nodes: every second number, so half of the base texts get a match
        try (Transaction tx = db.beginTx()) {
            for (int i = 100_000; i < 250_000; i += 2) db.createNode(DIAGNOSE).setProperty(TEXT, String.valueOf(i));
            tx.success();
        }
    }

    // Real data set from the question: about 120k BASETEXT nodes
    // and about 50k DIAGNOSE nodes
    @Test
    public void testConnect() throws Exception {
        GlobalGraphOperations ops = GlobalGraphOperations.at(db);
        // Step 1: add every BASETEXT node to a Lucene full-text index
        try (Transaction tx = db.beginTx()) {
            Index<Node> index = db.index().forNodes(INDEX_NAME, LuceneIndexImplementation.FULLTEXT_CONFIG);
            for (Node baseText : ops.getAllNodesWithLabel(BASETEXT)) {
                index.add(baseText, TEXTVALUE, baseText.getProperty(TEXTVALUE));
            }
            tx.success();
        }
        // Step 2: one index lookup per DIAGNOSE node instead of scanning all base texts
        int count = 0;
        Transaction tx = db.beginTx();
        try {
            Index<Node> index = db.index().forNodes(INDEX_NAME);
            for (Node diagnose : ops.getAllNodesWithLabel(DIAGNOSE)) {
                String text = (String) diagnose.getProperty(TEXT);
                IndexHits<Node> hits = index.query(TEXTVALUE, "\"" + text + "\""); // quote in case text contains spaces
                for (Node baseText : hits) {
                    baseText.createRelationshipTo(diagnose, ASSOCIATED);
                    // batch transaction: commit every 50,000 relationships
                    if (++count % 50000 == 0) {
                        System.out.println("count = " + count);
                        tx.success();
                        tx.close();
                        tx = db.beginTx();
                    }
                }
            }
            tx.success();
        } finally {
            tx.close();
        }
        System.out.println("count = " + count);
    }
}
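Why the index-based version is expected to be faster can be illustrated without Neo4j. Each base text is tokenized once, and every search term then becomes a lookup instead of a scan over all texts. The following is a toy in-memory sketch of that idea in plain Java. It is not how Lucene is implemented; the class InvertedIndexSketch, its methods, and the lowercase whitespace tokenization are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

final class InvertedIndexSketch {
    // token -> ids of the base texts that contain this token
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> baseTexts = new ArrayList<>();

    // Assumed tokenization: lowercase, split on whitespace
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        for (String t : s.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Index one base text and return its id (its insertion position)
    int add(String text) {
        int id = baseTexts.size();
        baseTexts.add(text);
        for (String token : tokenize(text)) {
            postings.computeIfAbsent(token, k -> new HashSet<>()).add(id);
        }
        return id;
    }

    // Ids of base texts containing the phrase as consecutive whole words
    SortedSet<Integer> search(String phrase) {
        SortedSet<Integer> result = new TreeSet<>();
        List<String> tokens = tokenize(phrase);
        if (tokens.isEmpty()) return result;
        // Candidates: texts that contain every token of the phrase
        Set<Integer> candidates = null;
        for (String token : tokens) {
            Set<Integer> ids = postings.getOrDefault(token, Collections.emptySet());
            if (candidates == null) candidates = new HashSet<>(ids);
            else candidates.retainAll(ids);
            if (candidates.isEmpty()) return result;
        }
        // Verify word order only on the (usually few) candidates
        String needle = " " + String.join(" ", tokens) + " ";
        for (int id : candidates) {
            String haystack = " " + String.join(" ", tokenize(baseTexts.get(id))) + " ";
            if (haystack.contains(needle)) result.add(id);
        }
        return result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("foo 100000 bar");
        index.add("foo 100001 bar");
        index.add("Acute bronchitis with fever");
        System.out.println(index.search("100001"));
        System.out.println(index.search("acute bronchitis"));
    }
}
```

With this structure the cost per DIAGNOSE text depends mainly on how many base texts share its tokens, not on the total number of base texts, which is the property the Lucene-backed test above relies on.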