As far as I know, on both recent AMD and Intel chips, prefetch instructions can retire before the associated data arrives. That is, unlike with loads, their retirement doesn't depend on the arrival of the data in the requested cache level [1].
Suppose I issue a series of prefetch instructions and then want to wait for the data to arrive before proceeding. Is there any way to do that? It doesn't seem like lfence will work, since a prefetch can retire even if its data hasn't arrived.
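For concreteness, here is a minimal sketch of the pattern I have in mind. The helper name `prefetch_then_sum`, the 64-byte line size, and the use of the GCC/Clang builtin `__builtin_prefetch` are all just assumptions for illustration. The fence in the middle is the step I suspect does not actually wait for the prefetched lines.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>              /* _mm_lfence */
#define ORDER_FENCE() _mm_lfence()
#else
#define ORDER_FENCE() ((void)0)     /* no lfence here; sketch still compiles */
#endif

#define CACHE_LINE 64               /* assumed cache line size in bytes */

/* Hypothetical helper: prefetch every line of buf, fence, then read it. */
static uint64_t prefetch_then_sum(const uint8_t *buf, size_t len)
{
    /* Issue one read prefetch per cache line (rw = 0, locality = 3). */
    for (size_t off = 0; off < len; off += CACHE_LINE)
        __builtin_prefetch(buf + off, 0, 3);

    /* The prefetches may already have retired by the time this fence
     * completes, so it presumably does not wait for their data. */
    ORDER_FENCE();

    /* Ordinary loads: these do wait for the data, wherever it is. */
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}
```

The result is the same whether or not the prefetched lines have arrived; only the latency of the final loop would differ, which is why I'm looking for something that explicitly waits.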
[1] There does seem to be a significant difference in how Intel and AMD chips handle the execution of prefetch instructions. Intel will always execute the prefetch instruction, and so will block if resources (such as fill buffers) are not available. AMD chips, on the other hand, seem to execute the prefetch instruction only if resources are available; otherwise, the prefetch may simply be dropped. Both strategies have their merits, depending on the code and access pattern.