Using the Go SDK for Apache Beam, I'm trying to pass a PCollection to a ParDo as a side input.
But I'm getting this strange error:
Failed to execute job: on ctx= making side input 0:
singleton side input Singleton for int ill-defined
exit status 1
Here is the code I'm using:
// A PCollection of key/value pairs
pairedWithOne := beam.ParDo(s, func(r models.Review) (string, int) {
    return r.DoRecommend, 1
}, col)
// A PCollection of ints (demo)
pcollInts := beam.CreateList(s, [3]int{
    1, 2, 3,
})
// A PCollection of key/values pairs
summed := stats.SumPerKey(s, pairedWithOne)
// Here is where I'd like to use my side input.
mapped := beam.ParDo(s, func(k string, v int, side int, emit func(ratio models.RecommendRatio)) {
    var ratio = models.RecommendRatio{
        DoRecommend: k,
        NumVotes:    v,
    }
    emit(ratio)
}, summed, beam.SideInput{Input: pcollInts})
I found this example on git:
// Side Inputs
//
// While a ParDo processes elements from a single "main input" PCollection, it
// can take additional "side input" PCollections. These SideInput along with
// the DoFn parameter form express styles of accessing PCollection computed by
// earlier pipeline operations, passed in to the ParDo transform using SideInput
// options, and their contents accessible to each of the DoFn operations. For
// example:
//
//    words := ...
//    cutoff := ...  // Singleton PCollection<int>
//    smallWords := beam.ParDo(s, func (word string, cutoff int, emit func(string)) {
//        if len(word) < cutoff {
//            emit(word)
//        }
//    }, words, beam.SideInput{Input: cutoff})
Update: it seems the Impulse(scope) function plays a role here, but I can't figure out what it is. From the GoDoc:
Impulse emits a single empty []byte into the global window. The resulting PCollection is a singleton of type []byte.
The purpose of Impulse is to trigger another transform, such as ones that take all information as side inputs.
In case it helps, here are my structs:
type Review struct {
    Date        time.Time `csv:"date" json:"date"`
    DoRecommend string    `csv:"doRecommend" json:"doRecommend"`
    NumHelpful  int       `csv:"numHelpful" json:"numHelpful"`
    Rating      int       `csv:"rating" json:"rating"`
    Text        string    `csv:"text" json:"text"`
    Title       string    `csv:"title" json:"title"`
    Username    string    `csv:"username" json:"username"`
}
type RecommendRatio struct {
    DoRecommend string `json:"doRecommend"`
    NumVotes    int    `json:"numVotes"`
}
Is there a way to solve this?
Thanks
Answer 0 (score: 0)
Update:
This can be simplified by removing the beam.Impulse() call (I think the wrong type was causing the trouble here):
mapped := beam.ParDo(s,
    func(k string, v int,
        sideCounted int,
        emit func(ratio models.RecommendRatio)) {
        p := percent.PercentOf(v, sideCounted)
        emit(models.RecommendRatio{
            DoRecommend: k,
            NumVotes:    v,
            Percent:     p,
        })
    }, summed,
    beam.SideInput{Input: counted})
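To check the arithmetic without running a pipeline, the same computation can be sketched in plain Go. This is only an illustration under stated assumptions, not Beam code: the map stands in for the summed PCollection, the local total stands in for the singleton counted side input, percentOf is a hypothetical stand-in for percent.PercentOf (whose signature is not shown in this post), and the float64 type of Percent is assumed.

```go
package main

import (
	"fmt"
	"sort"
)

// RecommendRatio mirrors models.RecommendRatio from the post, with the
// Percent field used in the answer (its float64 type is an assumption).
type RecommendRatio struct {
	DoRecommend string
	NumVotes    int
	Percent     float64
}

// percentOf is a hypothetical stand-in for percent.PercentOf.
// It returns part as a percentage of total, or 0 when total is 0.
func percentOf(part, total int) float64 {
	if total == 0 {
		return 0
	}
	return float64(part) * 100 / float64(total)
}

// ratios takes per-key vote counts (the role of summed), derives the
// global total (the role of counted), and builds one ratio per key.
// Keys are sorted so the result order is deterministic.
func ratios(summed map[string]int) []RecommendRatio {
	counted := 0
	for _, v := range summed {
		counted += v
	}
	keys := make([]string, 0, len(summed))
	for k := range summed {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	out := make([]RecommendRatio, 0, len(keys))
	for _, k := range keys {
		v := summed[k]
		out = append(out, RecommendRatio{
			DoRecommend: k,
			NumVotes:    v,
			Percent:     percentOf(v, counted),
		})
	}
	return out
}

func main() {
	for _, r := range ratios(map[string]int{"true": 3, "false": 1}) {
		fmt.Printf("%s: %d votes (%.1f%%)\n", r.DoRecommend, r.NumVotes, r.Percent)
	}
}
```

With the sample input of three "true" votes and one "false" vote, the two ratios come out at 75 and 25 percent, which is what the ParDo should emit for the same counts.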
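Old: It seems I have found a solution, though it may only be a workaround; a quick review and suggestions for improvement are welcome. (I think the function is not idempotent: if it can be executed more than once across multiple workers, the append() calls would duplicate entries...)
The overall idea here is to use beam.Impulse(scope) to create a singleton PCollection of []uint8 (that is, []byte) as the main input, and to pass all the "real" data as side inputs.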
// Pair each recommendation value with one -> PColl<KV<string, int>>
pairedWithOne := beam.ParDo(s, func(r models.Review) (string, int) {
    return r.DoRecommend, 1
}, col)
// Sum num occurrences of a recommendation k/v pair
summed := stats.SumPerKey(s, pairedWithOne)
// Drop keys for later global count
droppedKey := beam.DropKey(s, pairedWithOne)
// Count globally the number of recommendation values -> PColl<int>
counted := stats.Sum(s, droppedKey)
// Map to a struct with percentage per ratio
mapped := beam.ParDo(s,
    func(_ []uint8,
        sideSummed func(k *string, v *int) bool,
        sideCounted int,
        emit func(ratio []models.RecommendRatio)) {
        var k string
        var v int
        var ratios []models.RecommendRatio
        for sideSummed(&k, &v) {
            p := percent.PercentOf(v, sideCounted)
            ratio := models.RecommendRatio{
                DoRecommend: k,
                NumVotes:    v,
                Percent:     p,
            }
            ratios = append(ratios, ratio)
        }
        emit(ratios)
    }, beam.Impulse(s),
    beam.SideInput{Input: summed},
    beam.SideInput{Input: counted})
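On the idempotence worry: the ratios slice is declared inside the function, so each invocation starts from an empty slice, and in plain Go a second run cannot see entries appended by the first. The sketch below models this without Beam. newIter is a hypothetical helper that returns a function with the same shape as the sideSummed iterator, and collect mirrors the loop in the function above; how a runner handles the output of a retried bundle is a separate matter that this sketch does not cover.

```go
package main

import "fmt"

// kv is a hypothetical key/value pair standing in for one element
// of the summed side input.
type kv struct {
	key string
	val int
}

// newIter returns a function shaped like the sideSummed iterator:
// each call fills k and v with the next element and reports whether
// an element was available.
func newIter(items []kv) func(k *string, v *int) bool {
	i := 0
	return func(k *string, v *int) bool {
		if i >= len(items) {
			return false
		}
		*k, *v = items[i].key, items[i].val
		i++
		return true
	}
}

// collect drains the iterator into a slice that is local to this call,
// as the function in the answer does with ratios.
func collect(next func(k *string, v *int) bool) []kv {
	var k string
	var v int
	var out []kv
	for next(&k, &v) {
		out = append(out, kv{key: k, val: v})
	}
	return out
}

func main() {
	items := []kv{{"true", 3}, {"false", 1}}
	first := collect(newIter(items))
	second := collect(newIter(items)) // simulated re-execution
	fmt.Println(len(first), len(second)) // prints "2 2"
}
```

Both runs return two entries, so the append() calls alone do not duplicate data across executions; duplication would only be possible if the slice lived outside the function.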