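I'm trying to parse a huge Wiktionary dump using goroutines, and I'm hitting a strange bug: the channel the goroutine reads from seems to lose and corrupt data every time the channel blocks.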
func main() {
	inFile, err := os.Open(*srcFile)
	if err != nil {
		log.LogErrorf("Error opening dump: %v", err)
		return
	}
	defer inFile.Close()
	var wg sync.WaitGroup
	input := make(chan []byte, 51)
	go func() {
		wg.Add(1)
		for line := range input {
			log.Printf("Bytes: %s", line)
			// process the line
		}
		wg.Done()
	}()
	scanner := bufio.NewScanner(inFile)
	count := 0
	for scanner.Scan() {
		count++
		log.Printf("Scanned: %d", count)
		if err := scanner.Err(); err != nil {
			log.LogErrorf("Error scanning: %v", err)
		}
		newestBytes := scanner.Bytes()
		log.Printf("Bytes: %s", newestBytes)
		input <- newestBytes
	}
	close(input)
	wg.Wait()
}
When I run it, the lines logged from the scanning loop in main are correct. In particular, note lines 51 and 52.
2014/08/03 17:49:25 Scanned: 42
2014/08/03 17:49:25 Bytes: <namespace key="115" case="case-sensitive">Citations talk</namespace>
2014/08/03 17:49:25 Scanned: 43
2014/08/03 17:49:25 Bytes: <namespace key="116" case="case-sensitive">Sign gloss</namespace>
2014/08/03 17:49:25 Scanned: 44
2014/08/03 17:49:25 Bytes: <namespace key="117" case="case-sensitive">Sign gloss talk</namespace>
2014/08/03 17:49:25 Scanned: 45
2014/08/03 17:49:25 Bytes: <namespace key="828" case="case-sensitive">Module</namespace>
2014/08/03 17:49:25 Scanned: 46
2014/08/03 17:49:25 Bytes: <namespace key="829" case="case-sensitive">Module talk</namespace>
2014/08/03 17:49:25 Scanned: 47
2014/08/03 17:49:25 Bytes: </namespaces>
2014/08/03 17:49:25 Scanned: 48
2014/08/03 17:49:25 Bytes: </siteinfo>
2014/08/03 17:49:25 Scanned: 49
2014/08/03 17:49:25 Bytes: <page>
2014/08/03 17:49:25 Scanned: 50
2014/08/03 17:49:25 Bytes: <title>Wiktionary:Welcome, newcomers</title>
2014/08/03 17:49:25 Scanned: 51
2014/08/03 17:49:25 Bytes: <ns>4</ns>
2014/08/03 17:49:25 Scanned: 52
2014/08/03 17:49:25 Bytes: <id>6</id>
2014/08/03 17:49:25 Scanned: 53
2014/08/03 17:49:25 Bytes: <restrictions>edit=autoconfirmed:move=sysop</restrictions>
2014/08/03 17:49:25 Scanned: 54
2014/08/03 17:49:25 Bytes: <revision>
2014/08/03 17:49:25 Scanned: 55
2014/08/03 17:49:25 Bytes: <id>24557508</id>
2014/08/03 17:49:25 Scanned: 56
2014/08/03 17:49:25 Bytes: <parentid>19020708</parentid>
2014/08/03 17:49:25 Scanned: 57
2014/08/03 17:49:25 Bytes: <timestamp>2013-12-30T13:50:49Z</timestamp>
2014/08/03 17:49:25 Scanned: 58
2014/08/03 17:49:25 Bytes: <contributor>
2014/08/03 17:49:25 Scanned: 59
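However, when I instead print the lines the goroutine is receiving, I get the output below. After line 51 the channel blocks, and main scans and passes 51 more values into the channel. The next lines the goroutine reads are then incorrect, and more than that, they are clearly malformed.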
Bytes: <namespace key="828" case="case-sensitive">Module</namespace>
2014/08/03 17:40:52 Bytes: <namespace key="829" case="case-sensitive">Module talk</namespace>
2014/08/03 17:40:52 Bytes: </namespaces>
2014/08/03 17:40:52 Bytes: </siteinfo>
2014/08/03 17:40:52 Bytes: <page>
2014/08/03 17:40:52 Bytes: <title>Wiktionary:Welcome, newcomers</title>
2014/08/03 17:40:52 Scanned: 52
2014/08/03 17:40:52 Scanned: 53
2014/08/03 17:40:52 Scanned: 54
2014/08/03 17:40:52 Scanned: 55
2014/08/03 17:40:52 Scanned: 56
2014/08/03 17:40:52 Scanned: 57
2014/08/03 17:40:52 Scanned: 58
2014/08/03 17:40:52 Scanned: 59
2014/08/03 17:40:52 Scanned: 60
2014/08/03 17:40:52 Scanned: 61
2014/08/03 17:40:52 Scanned: 62
2014/08/03 17:40:52 Scanned: 63
2014/08/03 17:40:52 Scanned: 64
2014/08/03 17:40:52 Scanned: 65
2014/08/03 17:40:52 Scanned: 66
2014/08/03 17:40:52 Scanned: 67
2014/08/03 17:40:52 Scanned: 68
2014/08/03 17:40:52 Scanned: 69
2014/08/03 17:40:52 Scanned: 70
2014/08/03 17:40:52 Scanned: 71
2014/08/03 17:40:52 Scanned: 72
2014/08/03 17:40:52 Scanned: 73
2014/08/03 17:40:52 Scanned: 74
2014/08/03 17:40:52 Scanned: 75
2014/08/03 17:40:52 Scanned: 76
2014/08/03 17:40:52 Scanned: 77
2014/08/03 17:40:52 Scanned: 78
2014/08/03 17:40:52 Scanned: 79
2014/08/03 17:40:52 Scanned: 80
2014/08/03 17:40:52 Scanned: 81
2014/08/03 17:40:52 Scanned: 82
2014/08/03 17:40:52 Scanned: 83
2014/08/03 17:40:52 Scanned: 84
2014/08/03 17:40:52 Scanned: 85
2014/08/03 17:40:52 Scanned: 86
2014/08/03 17:40:52 Scanned: 87
2014/08/03 17:40:52 Scanned: 88
2014/08/03 17:40:52 Scanned: 89
2014/08/03 17:40:52 Scanned: 90
2014/08/03 17:40:52 Scanned: 91
2014/08/03 17:40:52 Scanned: 92
2014/08/03 17:40:52 Scanned: 93
2014/08/03 17:40:52 Scanned: 94
2014/08/03 17:40:52 Scanned: 95
2014/08/03 17:40:52 Scanned: 96
2014/08/03 17:40:52 Scanned: 97
2014/08/03 17:40:52 Scanned: 98
2014/08/03 17:40:52 Scanned: 99
2014/08/03 17:40:52 Scanned: 100
2014/08/03 17:40:52 Scanned: 101
2014/08/03 17:40:52 Scanned: 102
2014/08/03 17:40:52 Bytes: nd other refer
2014/08/03 17:40:52 Bytes: nce and instru
2014/08/03 17:40:52 Bytes: tional materials. It stipulates that any copy of the material,
2014/08/03 17:40:52 Bytes: even if modifi
2014/08/03 17:40:52 Bytes: d, carry the same licen
2014/08/03 17:40:52 Bytes: e. Those copies may be sold but, if
2014/08/03 17:40:52 Bytes: produced in quantity, have to be made available i
2014/08/03 17:40:52 Bytes: a format which fac
2014/08/03 17:40:52 Bytes: litates further editing.
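I tried to reproduce this on the Go Playground but had no success. It looks like this has something to do with the way slices are passed through channels.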
Answer 0 (score: 8)
The function Scanner.Bytes may return the same slice that the scanner uses internally.
func (s *Scanner) Bytes() []byte
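Bytes returns the most recent token generated by a call to Scan. The underlying array may point to data that will be overwritten by a subsequent call to Scan. It does no allocation.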
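According to the documentation, a subsequent call to Scanner.Scan may overwrite this slice. Your code does not ensure that the slice is no longer in use once Scanner.Scan is called again (in fact, it produces lines and consumes them asynchronously), so the slice may contain garbage by the time the goroutine reads it. This also explains why the corruption shows up only after the channel fills: by then the scanner has run far ahead of the consumer and reused its buffer.
Explicitly copy the slice to make sure the data is not overwritten by a subsequent call to Scanner.Scan.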
input <- append([]byte(nil), newestBytes...)