Question

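I hope you can help me!

I am collecting tweets, each with a created_at date (DataPublicacao) and some hashtags. Every tweet refers to a broadcaster (redeId) and a TV show (programaId). I want to query the database for the 20 most used hashtags within a given period.

I have to map each hashtag, when it was used, and the broadcaster and TV show it refers to.

Then I need to be able to count the occurrences of each hashtag within a given period (this is the part I don't know how to do).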

public class Tweet : IModelo
{
    public string Id { get; set; }
    public string RedeId { get; set; }
    public string ProgramaId { get; set; }
    public DateTime DataPublicacao { get; set; }
    public string Conteudo { get; set; }
    public string Aplicacao { get; set; }
    public Autor Autor { get; set; }
    public Twitter.Monitor.Dominio.Modelo.TweetJson.Geo LocalizacaoGeo { get; set; }
    public Twitter.Monitor.Dominio.Modelo.TweetJson.Place Localizacao { get; set; }
    public Twitter.Monitor.Dominio.Modelo.TweetJson.Entities Entidades { get; set; }
    public string Imagem { get; set; }
    public Autor Para_Usuario { get; set; }
    public string Retweet_Para_Status_Id { get; set; }
}

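"Entidades" holds the entities: hashtags, usernames, and URLs.

I tried grouping the hashtags by broadcaster, TV show, and text, and listing the dates on which each one occurred. Then I have to transform the results so that I can count the occurrences within the period.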

    public class EntityResult
    {
        public string hashtagText { get; set; }
        public string progId { get; set; }
        public string redeId { get; set; }
        public int listCount { get; set; }
    }

    public class HashtagsIndex : AbstractIndexCreationTask<Tweet, HashtagsIndex.ReduceResults>
    {
        public class ReduceResults
        {
            public string hashtagText { get; set; }
            public DateTime createdAt { get; set; }
            public string progId { get; set; }
            public string redeId { get; set; }
            public List<DateTime> datesList { get; set; }
        }

        public HashtagsIndex()
        {
            Map = tweets => from tweet in tweets
                            from hts in tweet.Entidades.hashtags
                            where tweet.Entidades != null
                            select new
                            {
                                createdAt = tweet.DataPublicacao,
                                progId = tweet.ProgramaId,
                                redeId = tweet.RedeId,
                                hashtagText = hts.text,
                                datesList = new List<DateTime>(new DateTime[] { tweet.DataPublicacao })
                            };

            Reduce = results => from result in results
                                group result by new { result.progId, result.redeId, result.hashtagText }
                                    into g
                                    select new
                                    {
                                        createdAt = DateTime.MinValue,
                                        progId = g.Key.progId,
                                        redeId = g.Key.redeId,
                                        hashtagText = g.Key.hashtagText,
                                        datesList = g.ToList().Select(t => t.createdAt).ToList()
                                    };
        }
    }

My query so far is:

                    var hashtags2 = session.Query<dynamic, HashtagsIndex>().Customize(t => t.TransformResults((query, results) =>
                        results.Cast<dynamic>().Select(g =>
                        {
                            Expression<Func<DateTime, bool>> exp = o => o >= dtInit && o <= dtEnd;

                            int count = g.Where(exp);
                            return new EntityResult
                            {
                                redeId = g.redeId,
                                progId = g.progId,
                                hashtagText = g.hashtagText,
                                listCount = count
                            };
                        }))).Take(20).ToList();

Now I need OrderByDescending(t => t.count) to run before Take(20), so as it stands I can't get the 20 most used hashtags in that period.

How do I do this?

Answer 1

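Is it possible to filter items before the map/reduce process?

A map/reduce index is just like any other index. All documents are always processed through all indexes. So, taking your wording of "before" literally, the answer is clearly "no".

But I think you are only interested in filtering items as they are indexed, and that is easily done in the map: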

    Map = items => from item in items
                   where item.foo == whatever   // this is how you filter
                   select new
                   {
                       // whatever you want to map
                   }

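This index will still process every document, but the resulting index will only contain the items that match the filter you specify in the where clause.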
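To make the effect of the where clause concrete, here is a minimal in-memory sketch using plain LINQ to Objects. It is not a RavenDB index; it only mimics what the map step emits. The `Item` type, the `MapFilterSketch` class, and the sample IDs are hypothetical names invented for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified document type (not part of the original post).
public record Item(string Id, string Foo);

public static class MapFilterSketch
{
    // Mimics a map with a where clause: every item is examined,
    // but only matching items produce an entry.
    public static List<string> MapEntries(IEnumerable<Item> items, string whatever) =>
        (from item in items
         where item.Foo == whatever          // this is how you filter
         select item.Id).ToList();

    public static void Main()
    {
        var items = new[]
        {
            new Item("items/1", "a"),
            new Item("items/2", "b"),
            new Item("items/3", "a"),
        };

        // All three items are visited; only two end up as entries.
        Console.WriteLine(string.Join(", ", MapEntries(items, "a")));
        // prints "items/1, items/3"
    }
}
```

The same idea would apply to the tweets in the question: a where clause on the publication date or on a null check decides which tweets contribute entries at all.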

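Is it possible to subsequently group by a feature, for example users by age and then by region?

Grouping is done in the reduce step. That is what map/reduce is all about.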
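As a rough illustration of grouping by more than one feature, the sketch below groups by a composite key (age, then region) with plain LINQ to Objects. It is an in-memory analogy for what a reduce step does, not RavenDB code, and the `User`, `GroupCount`, and `GroupingSketch` names are assumptions made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only.
public record User(string Name, int Age, string Region);
public record GroupCount(int Age, string Region, int Count);

public static class GroupingSketch
{
    // Groups by an anonymous composite key, the same pattern a
    // reduce uses when it groups by several fields at once.
    public static List<GroupCount> CountByAgeAndRegion(IEnumerable<User> users) =>
        (from u in users
         group u by new { u.Age, u.Region } into g
         orderby g.Key.Age, g.Key.Region
         select new GroupCount(g.Key.Age, g.Key.Region, g.Count())).ToList();

    public static void Main()
    {
        var users = new[]
        {
            new User("Ana", 30, "North"),
            new User("Bruno", 30, "North"),
            new User("Carla", 30, "South"),
            new User("Davi", 41, "North"),
        };

        foreach (var row in CountByAgeAndRegion(users))
            Console.WriteLine($"{row.Age} {row.Region}: {row.Count}");
    }
}
```

The index in the question already uses this pattern when it groups by progId, redeId, and hashtagText.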

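My advice to you (and I mean no disrespect) is to walk before you try to run. Build a simple prototype or a set of unit tests and try basic storage and retrieval first. Then try basic indexing and querying. Then try a simple map/reduce, such as counting all tweets. Only then should you attempt a more advanced map/reduce with additional groupings. If you run into trouble, you will then have code that you can post here to get help.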
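For the "count all tweets" exercise, a minimal in-memory sketch might look like the following. It uses plain LINQ to Objects rather than a real index, and the `CountEntry` and `CountSketch` names are hypothetical. The point it tries to show is that a reduce should sum the counts it receives instead of counting rows, because RavenDB may run the reduce again over its own output; the batching here only imitates that behaviour and assumes .NET 6 or later for `Chunk`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entry shape shared by map output and reduce output.
public record CountEntry(int Count);

public static class CountSketch
{
    // Map: every tweet contributes one entry with Count = 1.
    public static IEnumerable<CountEntry> Map(IEnumerable<string> tweetIds) =>
        tweetIds.Select(_ => new CountEntry(1));

    // Reduce: sums Count, so it can safely be fed its own output.
    public static IEnumerable<CountEntry> Reduce(IEnumerable<CountEntry> entries) =>
        from e in entries
        group e by 1 into g
        select new CountEntry(g.Sum(x => x.Count));

    // Reduces each batch separately, then reduces the partial results again.
    public static int CountInBatches(IReadOnlyList<string> tweetIds, int batchSize)
    {
        var partials = tweetIds
            .Chunk(batchSize)                          // requires .NET 6+
            .SelectMany(batch => Reduce(Map(batch)));
        return Reduce(partials).Select(e => e.Count).SingleOrDefault();
    }

    public static void Main()
    {
        var ids = new[] { "tweets/1", "tweets/2", "tweets/3", "tweets/4", "tweets/5" };
        Console.WriteLine(CountInBatches(ids, 2)); // prints 5
    }
}
```

Had the reduce used g.Count() instead of g.Sum(x => x.Count), the second pass would have counted the three partial results rather than the five tweets.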

Is it possible?

Sure. Anything is possible. :)
