转到SDK Apache Beam:单边输入Single int定义不正确

时间:2019-09-09 12:07:10

标签: go apache-beam

使用用于Apache Beam的Go SDK,我正在尝试使用侧面输入创建PCollection的视图。

但是我遇到了这个奇怪的错误:

Failed to execute job: on ctx=      making side input 0:
singleton side input Singleton for int ill-defined
exit status 1

这是我正在使用的代码:

// A PCollection of key/value pairs
pairedWithOne := beam.ParDo(s, func(r models.Review) (string, int) {
        return r.DoRecommend, 1
    }, col)

// A PCollection of ints (demo)
pcollInts := beam.CreateList(s, [3]int{
        1, 2, 3,
})

// A PCollection of key/values pairs
summed := stats.SumPerKey(s, pairedWithOne)

// Here is where I'd like to use my side input.
mapped := beam.ParDo(s, func(k string, v int, side int, emit func(ratio 
models.RecommendRatio)) {
        var ratio = models.RecommendRatio{
            DoRecommend: k,
            NumVotes:    v,
        }

        emit(ratio)
    }, summed, beam.SideInput{Input: pcollInts})

我在git上找到了此示例:

// Side Inputs
//
// While a ParDo processes elements from a single "main input" PCollection, it
// can take additional "side input" PCollections. These SideInput along with
// the DoFn parameter form express styles of accessing PCollection computed by
// earlier pipeline operations, passed in to the ParDo transform using SideInput
// options, and their contents accessible to each of the DoFn operations. For
// example:
//
//     words := ...
//     cufoff := ...  // Singleton PCollection<int>
//     smallWords := beam.ParDo(s, func (word string, cutoff int, emit func(string)) {
//           if len(word) < cutoff {
//                emit(word)
//           }
//     }, words, beam.SideInput{Input: cutoff})

更新:似乎Impulse(scope)函数在这里起作用,但是我不知道是什么。从GoDoc:

Impulse emits a single empty []byte into the global window. The resulting PCollection is a singleton of type []byte.

The purpose of Impulse is to trigger another transform, such as ones that take all information as side inputs.

如果这可以帮助,请在这里查看我的结构:

type Review struct {
    Date        time.Time `csv:"date" json:"date"`
    DoRecommend string    `csv:"doRecommend" json:"doRecommend"`
    NumHelpful  int       `csv:"numHelpful" json:"numHelpful"`
    Rating      int       `csv:"rating" json:"rating"`
    Text        string    `csv:"text" json:"text"`
    Title       string    `csv:"title" json:"title"`
    Username    string    `csv:"username" json:"username"`
}

type RecommendRatio struct {
    DoRecommend string `json:"doRecommend"`
    NumVotes    int    `json:"numVotes"`
}

有什么解决办法吗?

谢谢

1 个答案:

答案 0 :(得分:0)

更新

这可以通过删除beam.Impulse()函数来简化(我认为错误的类型在这里引起了麻烦):

mapped := beam.ParDo(s,
        func(k string, v int,
            sideCounted int,
            emit func(ratio models.RecommendRatio)) {

            p := percent.PercentOf(v, sideCounted)

            emit(models.RecommendRatio{
                DoRecommend: k,
                NumVotes:    v,
                Percent:     p,
            })

        }, summed,
        beam.SideInput{Input: counted})

旧: 似乎我已经找到了解决方案,也许只是一种解决方法,寻求快速审核并为改进留有余地。 (我认为该函数不是幂等的,因为如果它可以在多个节点工作程序上执行多次,则append()函数将复制条目...)

但是这里的全局思想是使用[]uint8 byte函数创建beam.Impulse(scope)的单例PCollection并将所有“真实”数据作为边输入传递。

    // Pair each recommendation value with one -> PColl<KV<string, int>>
    pairedWithOne := beam.ParDo(s, func(r models.Review) (string, int) {
        return r.DoRecommend, 1
    }, col)

    // Sum num occurrences of a recommendation k/v pair
    summed := stats.SumPerKey(s, pairedWithOne)

    // Drop keys for latter global count
    droppedKey := beam.DropKey(s, pairedWithOne)

    // Count globally the number of recommendation values -> PColl<int>
    counted := stats.Sum(s, droppedKey)

    // Map to a struct with percentage per ratio
    mapped := beam.ParDo(s,
        func(_ []uint8,
            sideSummed func(k *string, v *int) bool,
            sideCounted int,
            emit func(ratio []models.RecommendRatio)) {

            var k string
            var v int
            var ratios []models.RecommendRatio


            for sideSummed(&k, &v) {
                p := percent.PercentOf(v, sideCounted)

                ratio := models.RecommendRatio{
                    DoRecommend: k,
                    NumVotes:    v,
                    Percent:     p,
                }

                ratios = append(ratios, ratio)
            }

            emit(ratios)

        }, beam.Impulse(s),
        beam.SideInput{Input: summed},
        beam.SideInput{Input: counted})