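How to split a string into multiple strings in MongoDB

Date: 2018-05-16 17:44:00

Tags: mongodb

I am trying to split a string so that I can later use map-reduce to count how many words of each length it contains.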

For example, for the sentence

  SUPPOSING that Truth is a woman--what then?

I would get:

[
  {length:"1", number:"1"},
  {length:"2", number:"1"},
  {length:"4", number:"3"},
  {length:"5", number:"2"},
  {length:"9", number:"1"}
]

How can I do this?

2 answers:

Answer 0: (score: 0)

This sample aggregation counts the words that have the same length. I hope it helps:

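db.some.deleteMany({})   // start from an empty collection
db.some.insertOne({str:"red brown fox jumped over the hil"})

var res = db.some.aggregate(
    [
    // split the string on spaces into an array of words
    { $project : { word : { $split: ["$str", " "] }} },
    // emit one document per word
    { $unwind : "$word" },
    // replace each word with its length in code points
    { $project : { len : { $strLenCP: "$word" }} },
    // collect equal lengths into one group
    { $group : { _id : { len : "$len"}, same: {$push:"$len"}}},
    // the group key lives under _id, so the length is read from "$_id.len"
    { $project : { _id : 0, len : "$_id.len", count : {$size : "$same"} }}
    ]
)

printjson(res.toArray());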
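
Since the question mentions map-reduce, the same counting can also be pictured as a map step that emits a (length, 1) pair per word and a reduce step that sums the pairs per length. The following is only a plain-JavaScript sketch of that idea, not MongoDB's mapReduce command; the function names `mapWordLengths`, `reduceCounts` and `countWordLengths` are hypothetical, and `word.length` counts UTF-16 code units, which matches `$strLenCP` only for characters in the Basic Multilingual Plane.

```javascript
// Plain-JavaScript sketch of the map-reduce idea. All helper names here are
// hypothetical and chosen for illustration; none of them is a MongoDB API.

// "Map" step: emit one { key: length, value: 1 } pair per space-separated word.
function mapWordLengths(str) {
    return str
        .split(" ")
        .filter(function (word) { return word.length > 0; }) // drop empty pieces left by repeated spaces
        .map(function (word) { return { key: word.length, value: 1 }; });
}

// "Reduce" step: sum the emitted values for each key.
function reduceCounts(pairs) {
    var totals = {};
    pairs.forEach(function (pair) {
        totals[pair.key] = (totals[pair.key] || 0) + pair.value;
    });
    return totals;
}

// Combine both steps and return the result sorted by word length.
function countWordLengths(str) {
    var totals = reduceCounts(mapWordLengths(str));
    return Object.keys(totals)
        .map(Number)
        .sort(function (a, b) { return a - b; })
        .map(function (len) { return { len: len, count: totals[len] }; });
}

console.log(countWordLengths("red brown fox jumped over the hil"));
```

For the sample string used in this answer, the sketch reports four words of length 3 and one word each of lengths 4, 5 and 6.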

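Answer 1: (score: 0)

The answer to your question depends heavily on how you define a word. If a word is simply a consecutive run of A-Z or a-z characters, then here is a completely insane approach that nevertheless gives you exactly the result you asked for.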

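What this code effectively does is:

  1. Parse the input string and replace every non-matching character (anything that is not A-Z or a-z) with a space.
  2. Concatenate the result into a cleansed string that contains only valid characters and spaces.
  3. Split the cleansed string by the space character.
  4. Calculate the length of every word found.
  5. Group by length and count the instances.
  6. Beautify the output a little.

Given the following input document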

    {
        "text" : "SUPPOSING that Truth is a woman--what then?"
    }
    

the following pipeline

    db.collection.aggregate({
        $project: { // lots of magic to calculate an array that will hold the lengths of all words
            "lengths": {
                $map: { // translate a given word into its length
                    input: {
                        $split: [ // split cleansed string by space character
                            { $reduce: { // join the characters that are between A and z
                                    input: {
                                        $map: { // to traverse the original input string character by character
                                            input: {
                                                $range: [ 0, { $strLenCP: "$text" } ] // we want to traverse the entire string from index 0 all the way to the last character
                                            },
                                            as: "index",
                                            in: {
                                                $let: {
                                                    vars: {
                                                        "char": { // temp. result which will be reused several times below
                                                            $substrCP: [ "$text", "$$index", 1 ] // the single character we look at in this loop
                                                        }
                                                    },
                                                    in: {
                                                        $cond: [ // some value that depends on whether the character we look at is between 'A' and 'z'
                                                            { $and: [
                                                                { $eq: [ { $cmp: [ "$$char", "@" /* ASCII 64,  65  would be 'A' */] },  1 ] }, // is our character greater than or equal to 'A'
                                                                { $eq: [ { $cmp: [ "$$char", "{" /* ASCII 123, 122 would be 'z' */] }, -1 ] }  // is our character less than    or equal to 'z' 
                                                            ]},
                                                            '$$char', // in which case that character will be taken
                                                            ' ' // and otherwise a space character to add a word boundary
                                                        ]
                                                    }
                                                }
                                            }
                                        }
                                    },
                                    initialValue: "", // starting with an empty string
                                    in: {
                                        $concat: [ // we join all array values by means of concatenating
                                            "$$value", // the current value with
                                            "$$this"
                                        ]
                                    }
                                }
                            },
                            " "
                        ]
                    },
                    as: "word",
                    in: {
                        $strLenCP: "$$word" // we map a word into its length, e.g. "the" --> 3
                    }
                }
            }
        }
    }, {
        $unwind: "$lengths" // flatten the array which holds all our word lengths
    }, {
        $group: {
            _id : "$lengths", // group by the length of our words
            "number": { $sum: 1 }  // count number of documents per group
        } 
    }, {
        $match: {
            "_id": { $ne: 0 } // $split might leave us with strings of length 0 which we do not want in the result
        }
    }, {
        $project: {
            "_id": 0, // remove the "_id" field
            "length" : "$_id", // length is our group key
            "number" : "$number" // and this is the number of findings
        }
    }, {
        $sort: { "length": 1 } // sort by length ascending
    })
    

produces the desired output

    [
        { "length" : 1, "number" : 1.0 },
        { "length" : 2, "number" : 1.0 },
        { "length" : 4, "number" : 3.0 },
        { "length" : 5, "number" : 2.0 },
        { "length" : 9, "number" : 1.0 }
    ]
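
To make the numbered steps of this answer easier to follow, here is a minimal plain-JavaScript sketch that mirrors them outside of MongoDB. It is an illustration under assumptions, not a replacement for the pipeline: the function name `wordLengthHistogram` is hypothetical, and the character test copies the pipeline's comparison against "@" and "{".

```javascript
// Hypothetical helper that mirrors the pipeline's steps in plain JavaScript.
function wordLengthHistogram(text) {
    // Steps 1-2: keep characters strictly between "@" and "{" (the same
    // bounds the pipeline compares against) and turn everything else into
    // a space, then join the characters back into one cleansed string.
    var cleansed = Array.from(text)
        .map(function (ch) { return (ch > "@" && ch < "{") ? ch : " "; })
        .join("");

    // Steps 3-4: split by the space character and map each piece to its length.
    var lengths = cleansed.split(" ").map(function (word) { return word.length; });

    // Step 5: group by length and count; zero-length pieces are skipped,
    // which corresponds to the $match stage in the pipeline.
    var counts = {};
    lengths.forEach(function (len) {
        if (len !== 0) { counts[len] = (counts[len] || 0) + 1; }
    });

    // Step 6: shape the output and sort by length ascending.
    return Object.keys(counts)
        .map(Number)
        .sort(function (a, b) { return a - b; })
        .map(function (len) { return { length: len, number: counts[len] }; });
}

console.log(wordLengthHistogram("SUPPOSING that Truth is a woman--what then?"));
```

For the sample sentence this yields lengths 1, 2, 4, 5 and 9 with counts 1, 1, 3, 2 and 1. One thing the sketch makes visible: the range between "@" and "{" also admits the characters `[ \ ] ^ _` and the backtick, so an input such as "snake_case" is counted as a single word of length 10. The pipeline's `$cmp` bounds presumably behave the same way under the default binary string comparison.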