Creating relationships via full-text search in Neo4j

Date: 2014-09-18 21:30:20

Tags: neo4j full-text-search cypher

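I have built a Neo4j graph database that contains about 50,000 nodes with the label DIAGNOSE, each with a string property TEXT of at most 50 characters. The same database contains about 120,000 nodes with the label BASETEXT, each with a string property TEXTVALUE of up to 175,000 characters. My goal is to create a relationship (b:BASETEXT) -[:ASSOCIATED]-> (d:DIAGNOSE) whenever DIAGNOSE.TEXT is contained in BASETEXT.TEXTVALUE, which results in a total of about 2.9 * 10^9 searches. I tried the following two approaches in Cypher: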

Approach 1:

match (b:BASETEXT), (d:DIAGNOSE)
where b.TEXTVALUE =~ (".* " + d.TEXT + " .*")
merge (b) -[:ASSOCIATED]-> (d);

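Approach 2 (create a relationship between every DIAGNOSE node and every BASETEXT node, set the relationship property CONTAINED to true if TEXT occurs in TEXTVALUE and to false otherwise, and finally delete all relationships with ASSOCIATED.CONTAINED = false):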

match (b:BASETEXT), (d:DIAGNOSE)
where not (b) -[:ASSOCIATED]-> (d)
with b, d limit 20000
create (b) -[a:ASSOCIATED]-> (d)
with b, d, a
set a.CONTAINED =
case
when (b.TEXTVALUE =~ (".* " + d.TEXT + " .*")) then true
else false
end 
return count(a);

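Neither approach worked. Approach 1 did not finish within half an hour. Approach 2 does finish, but at the rate I observed it would take me 60 days. Any suggestions on how to implement the text search properly in Neo4j and solve this problem, preferably in Cypher?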

1 answer:

Answer 0: (score: 0)

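I also created a graph-gist for it, for illustration.

I think this is a use case that is not well supported in Cypher.

If you want to try a simple Cypher variant that should work (but slowly), try this: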

MATCH (d:DIAGNOSE)
WHERE NOT () -[:ASSOCIATED]-> (d)
WITH d 
SKIP 0 LIMIT 1000
MATCH (b:BASETEXT)
WHERE (b.TEXTVALUE =~ (".* " + d.TEXT + " .*"))
CREATE (b) -[:ASSOCIATED]-> (d)
RETURN count(*);
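A side note on the predicate itself, sketched in plain Java because Cypher's =~ operator is based on Java regular expressions and has to match the whole string. The class and method names below are hypothetical, and Pattern.quote is an addition that is not part of the queries above. The sketch illustrates three properties of the pattern ".* " + TEXT + " .*": the term must have a space on both sides, "." does not cross line breaks by default, and regex metacharacters inside TEXT are interpreted unless they are quoted.

```java
import java.util.regex.Pattern;

class ContainmentCheck {

    // Mirrors the shape of the Cypher predicate: TEXTVALUE =~ (".* " + TEXT + " .*").
    // Pattern.quote (an assumption added here) keeps metacharacters in the term literal.
    static boolean regexContains(String textValue, String term) {
        Pattern p = Pattern.compile(".* " + Pattern.quote(term) + " .*");
        return p.matcher(textValue).matches(); // whole-string match
    }

    // Plain substring test with the same "space on both sides" requirement.
    static boolean substringContains(String textValue, String term) {
        return textValue.contains(" " + term + " ");
    }

    public static void main(String[] args) {
        System.out.println(regexContains("foo 100000 bar", "100000"));          // true
        System.out.println(regexContains("100000 bar", "100000"));              // false: no space before the term
        System.out.println(regexContains("line one\nfoo 100000 bar", "100000")); // false: "." stops at the line break
        System.out.println(substringContains("line one\nfoo 100000 bar", "100000")); // true
        System.out.println(regexContains("foo axb bar", "a.b"));                // false: "." in the term is literal
    }
}
```

If the TEXTVALUE strings contain line breaks, or the TEXT values contain characters such as ".", "(" or "+", the regex variants may therefore miss or over-match nodes regardless of how fast they run.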

In Java, using a Lucene full-text index, it should be much faster:

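// Imports assume the Neo4j 2.x embedded API (with the neo4j-kernel test jar) and JUnit 4.
import org.junit.Before;
import org.junit.Test;
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.index.impl.lucene.LuceneIndexImplementation;
import org.neo4j.test.TestGraphDatabaseFactory;
import org.neo4j.tooling.GlobalGraphOperations;

public class ConnectIndexTest {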
    private static final String PATH = "target/connect.db";
    public static final Label BASETEXT = DynamicLabel.label("BASETEXT");
    public static final Label DIAGNOSE = DynamicLabel.label("DIAGNOSE");
    public static final String TEXTVALUE = "TEXTVALUE";
    public static final String TEXT = "TEXT";
    public static final String INDEX_NAME = "basetext";
    private static final RelationshipType ASSOCIATED = DynamicRelationshipType.withName("ASSOCIATED");
    private GraphDatabaseService db;

    @Before
    public void setUp() throws Exception {
//      db = new GraphDatabaseFactory().newEmbeddedDatabase(PATH);
        db = new TestGraphDatabaseFactory().newImpermanentDatabase();
        try (Transaction tx = db.beginTx()) {
            for (int i = 100_000; i < 250_000; i++) db.createNode(BASETEXT).setProperty(TEXTVALUE, "foo " + i + " bar");
            tx.success();
        }
        try (Transaction tx = db.beginTx()) {
            for (int i = 100_000; i < 250_000; i += 2) db.createNode(DIAGNOSE).setProperty(TEXT, String.valueOf(i));
            tx.success();
        }
    }

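    // The question's data set: about 120k BASETEXT nodes and 50k DIAGNOSE nodes.
    // setUp() above generates 150k BASETEXT and 75k DIAGNOSE nodes as test data.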
    @Test
    public void testConnect() throws Exception {
        GlobalGraphOperations ops = GlobalGraphOperations.at(db);
        // Step 1: add the TEXTVALUE of every BASETEXT node to a full-text index.
        try (Transaction tx = db.beginTx()) {
            Index<Node> index = db.index().forNodes(INDEX_NAME, LuceneIndexImplementation.FULLTEXT_CONFIG);
            for (Node baseText : ops.getAllNodesWithLabel(BASETEXT)) {
                index.add(baseText, TEXTVALUE, baseText.getProperty(TEXTVALUE));
            }
            tx.success();
        }
        // Step 2: run one index query per DIAGNOSE node and connect the hits.
        int count = 0;
        Transaction tx = db.beginTx();
        try {
            Index<Node> index = db.index().forNodes(INDEX_NAME);
            for (Node diagnose : ops.getAllNodesWithLabel(DIAGNOSE)) {
                String text = (String) diagnose.getProperty(TEXT);
                IndexHits<Node> hits = index.query(TEXTVALUE, "\"" + text + "\"");// quote in case text contains spaces
                for (Node baseText : hits) {
                    baseText.createRelationshipTo(diagnose, ASSOCIATED);
                    // batch transaction
                    if (++count % 50000 == 0) {
                        System.out.println("count = " + count);
                        tx.success();
                        tx.close();
                        tx = db.beginTx();
                    }
                }
            }

            tx.success();
        } finally {
            tx.close();
        }
        System.out.println("count = " + count);
    }
}
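
Why the index helps, as a minimal self-contained sketch: an inverted index maps each token to the texts that contain it, so each DIAGNOSE lookup only touches the few candidate texts instead of scanning all BASETEXT values. The class below is a hypothetical toy illustration in plain Java, not Lucene's or Neo4j's actual implementation. It assumes whitespace tokenization and lower-casing, which may differ from what the configured Lucene analyzer does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class InvertedIndexSketch {
    // token -> ids of the documents that contain the token
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    // Lower-case, trim, and collapse whitespace runs to single spaces.
    private static String normalize(String s) {
        return String.join(" ", s.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Stores the text and returns its id (0, 1, 2, ... in insertion order).
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String token : normalize(text).split(" ")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Ids of documents that contain the phrase as a run of whole tokens.
    List<Integer> phrase(String phrase) {
        String normalized = normalize(phrase);
        Set<Integer> candidates = null;
        // Intersect the posting lists: only these documents can contain the phrase.
        for (String token : normalized.split(" ")) {
            Set<Integer> ids = postings.getOrDefault(token, Collections.emptySet());
            if (candidates == null) {
                candidates = new TreeSet<>(ids);
            } else {
                candidates.retainAll(ids);
            }
        }
        List<Integer> result = new ArrayList<>();
        if (candidates == null) {
            return result;
        }
        // Verify token order on the (usually small) candidate set only.
        String needle = " " + normalized + " ";
        for (int id : candidates) {
            if ((" " + normalize(docs.get(id)) + " ").contains(needle)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("foo 100000 bar");
        index.add("foo 100002 bar");
        index.add("Acute Bronchitis noted");
        System.out.println(index.phrase("100000"));           // [0]
        System.out.println(index.phrase("acute bronchitis")); // [2]
        System.out.println(index.phrase("bronchitis acute")); // []
    }
}
```

Note that token-based matching of this kind is not identical to the regex predicate: in this sketch a term also matches at the very start or end of a text and regardless of case, so the resulting relationships can differ slightly from what the Cypher queries would produce.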