Question

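I have an application that streams Twitter data and stores it in a Neo4j database. The data I store covers tweets, users, hashtags, and their relationships (a user posts a tweet, a tweet tags a hashtag, a user retweets a tweet). Each time I receive a new tweet, I do the following:

check whether the database already contains the tweet: if so, I update it with the new information (retweet count, like count); otherwise I save it
check whether the database already contains the user: if so, I update it with the new information; otherwise I save it
check whether the database already contains the hashtag: if not, I add it

and so on; the process for saving the relationships is the same.

Here are the queries: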

static String cqlAddTweet = "merge (n:Tweet{tweet_id: {2}}) on create set n.text={1}, n.location={3}, n.likecount={4}, n.retweetcount={5}, n.topic={6}, n.created_at={7} on match set n.likecount={4}, n.retweetcount={5}";
static String cqlAddHT = "merge (n:Hashtag{text:{1}})";
static String cqlHTToTweet = "match (n:Tweet),(m:Hashtag) where n.tweet_id={1} and m.text={2} merge (n)-[:TAGS]->(m)";
static String cqlAddUser = "merge (n:User{user_id:{3}}) on create set n.name={1}, n.username={2}, n.followers={4}, n.following={5}, n.profilePic={6} on match set n.name={1}, n.username={2}, n.followers={4}, n.following={5}, n.profilePic={6}";
static String cqlUserToTweet = "match (n:User),(m:Tweet) where m.tweet_id={2} and n.user_id={1} merge (n)-[:POSTS]->(m)";
static String cqlUserRetweets = "match (n:Tweet{tweet_id:{1}}), (u:User{user_id:{2}}) create (u)-[:RETWEETS]->(n)";

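Saving the data is very slow, so I suspect the system would perform better if I did not run all of these queries, each of which scans the data, for every tweet.

Do you have any suggestions for improving my application?

My apologies in advance if this seems like a silly question.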

Answer 1

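Make sure you have indexes (or uniqueness constraints, if appropriate) on the following label/property pairs. This lets your queries find their starting nodes without scanning every node that has the same label.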

:Tweet(tweet_id)
:Hashtag(text)
:User(user_id)
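
As a rough sketch, the schema statements for those label/property pairs could be kept next to the other query strings. The class name `TwitterSchema` is hypothetical. The statements use the older Neo4j 3.x syntax, which is an assumption based on the `{1}`-style parameters in the question; newer Neo4j releases use a different syntax for these statements, so check the manual for your version. Uniqueness constraints are shown on the assumption that each `tweet_id`, `user_id`, and hashtag `text` identifies exactly one node.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of the original application): holds the
// one-time schema statements as plain strings.
class TwitterSchema {
    // Assumed Neo4j 3.x syntax. A uniqueness constraint also creates a
    // backing index, so MERGE/MATCH on these keys can use an index lookup.
    // If uniqueness is not appropriate, a plain index such as
    // "CREATE INDEX ON :Tweet(tweet_id)" can be used instead.
    static final List<String> STATEMENTS = Arrays.asList(
        "CREATE CONSTRAINT ON (t:Tweet) ASSERT t.tweet_id IS UNIQUE",
        "CREATE CONSTRAINT ON (h:Hashtag) ASSERT h.text IS UNIQUE",
        "CREATE CONSTRAINT ON (u:User) ASSERT u.user_id IS UNIQUE"
    );

    public static void main(String[] args) {
        // Print the statements; each one is run once against the database,
        // for example from the Neo4j browser or through your driver.
        for (String statement : STATEMENTS) {
            System.out.println(statement);
        }
    }
}
```

These statements only need to run once, before the stream starts, not for every incoming tweet.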

By the way, a couple of your queries can be simplified (although this will not affect performance):

static String cqlAddTweet = "MERGE (n:Tweet{tweet_id: {2}}) ON CREATE SET n.text={1}, n.location={3}, n.topic={6}, n.created_at={7} SET n.likecount={4}, n.retweetcount={5}";
static String cqlAddUser = "MERGE (n:User{user_id:{3}}) SET n.name={1}, n.username={2}, n.followers={4}, n.following={5}, n.profilePic={6}";
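
The shortened `cqlAddTweet` relies on a plain `SET` after `MERGE` running whether the node was created or matched, while `ON CREATE SET` runs only on creation. The toy in-memory model below is only an illustration of that behaviour, not Neo4j code. The class `MergeSemantics` and method `mergeTweet` are made up for this sketch, and only a few of the tweet properties are modelled.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of "MERGE ... ON CREATE SET ... SET ..." using a HashMap
// in place of the database. Names here are hypothetical.
class MergeSemantics {
    // Keyed by tweet_id; each value is that node's property map.
    static final Map<Long, Map<String, Object>> tweets = new HashMap<>();

    static Map<String, Object> mergeTweet(long tweetId, String text,
                                          int likeCount, int retweetCount) {
        Map<String, Object> node = tweets.get(tweetId);
        if (node == null) {
            // ON CREATE SET: runs only when the node did not exist yet.
            node = new HashMap<>();
            node.put("tweet_id", tweetId);
            node.put("text", text);
            tweets.put(tweetId, node);
        }
        // Plain SET after MERGE: runs on both the create and the match path.
        node.put("likecount", likeCount);
        node.put("retweetcount", retweetCount);
        return node;
    }

    public static void main(String[] args) {
        mergeTweet(42L, "hello", 1, 0);            // creates the node
        mergeTweet(42L, "ignored on match", 5, 2); // updates counts only
        System.out.println(tweets.get(42L));
    }
}
```

After the second call the text is still the one stored at creation, while both counters hold the latest values, which is what the original `ON CREATE` / `ON MATCH` pair did.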

Queries make my Twitter streaming application too slow at saving data
