Interval 22 post-mortem

Patches · 9 May 2024 17:21

Disclaimer first:

What follows is my interpretation of the events that led to interval 22 producing what social consensus has deemed a bugged tree
Financial impact is so trivial that even if we set up a way to correct it, the gas cost of doing so would greatly exceed the the remuneration
RPL rewards were correct, only the smoothing pool was impacted
Voting against v9 will not fix anything, this issue is present in all previous versions

Interval 22 was just completed and, as @langers mentioned in the protocol channel on discord:

rewards are safe but the calculation doesn’t align with the v8 spec for this interval (22), impact on node operators is insignificant (~0.000001 ETH per minipool opted into the Smoothing Pool) but we are letting you know.

The disparity arises from a few attestations which were included more than 32 slots after their assignment. These attestations arrived late, and the rolling-records implementation of treegen in watchtower treated them as missed, while the non-rolling-records implementation of treegen in watchtower and @ramana 's independent implementation treated them as valid.

Timeline of events

The treegen spec in its original form didn’t establish a deadline for attestations, it established a methodology for counting attestations that set an implicit limit to one epoch.
v6 updated this methodology to make the boundary explicitly inclusive
Joe created the rolling-records implementation to enforce that deadline, but forgot to update the non-rolling-records implementation
Deneb included EIP-7045: Increase max attestation inclusion slot which extended the deadline for attestation inclusion, making the non-rolling-records implementation trivially aligned with the beacon spec, but leaving the rolling-records implementation and ramana’s implementation correct per the rewards spec - albeit tacitly diverging from social consensus and the beacon spec
A preview tree detected the issue, but Ramana nor I realized the spec still specified a 32 slot deadline, so Ramana updated his implementation to match the non-rolling-records implementation
An unrelated bug in treegen forced the oDAO off of rolling records for interval 21, so the interval 21 tree actually diverged from the spec, but matched ‘social consensus’ and the beacon spec
The bug in step 6 was fixed and the oDAO switched back to rolling-records for interval 22 and produced a tree that matched the spec but not social consensus nor the beacon spec

The result of this is that no matter which specification you believe is correct (the one that follows the written language of the v8 RPIP, or the one that follows the unwritten social consensus on how to handle late attestations), at least one interval was finalized with the other methodology.

Updating the spec

The spec says:

Look at the attestations in the subsequent blocks with matching slotIndex, committeeIndex, and position. Start at the block directly after slotIndex, and look up to 1 Epoch away (BeaconConfig.SlotsPerEpoch) from slotIndex.

If one was recorded in any of these blocks, this attestation was successful. Calculate the minipoolScore for this attestation as described below.

If the attestation was not found, it was missed. Add it to a running list of missedAttestations.

The boundary is inclusive, so if an attestation for slot n is found in slot n + BeaconConfig.SlotsPerEpoch then it was successful. If it was found in slot n + BeaconConfig.SlotsPerEpoch + 1, it is too late and should be considered missed.

This is, in my view, overly descriptive, and we can simply amend the spec to say that attestations are considered valid if they are valid for the fork in which they’re assigned, with an explanation that EIP-7045 in Deneb changed the criteria from a 32 slot deadline to a deadline at the end of the epoch following the assignment.

Please opine on how you think this should be corrected in the following sentiment poll:

Clarify attestation inclusion rules as errata in RPIP-51 (v8) without polling the full pDAO.
Write and pass a new RPIP and a new version of the rewards tree through the RPIP process.

0 voters

I am explicitly in favor of amending the RPIPs via errata. We can have this done in a day.

The alternative is:

Correcting treegen to consistently follow the spec in both implementations (quick)
Passing an RPIP to create a new version of the rewards spec that updates attestation criteria (3 weeks)
Releasing a new version of watchtower to the odao (2 to 6 weeks depending on the timing)

Given the financial scope of the change, if we can’t fix this via errata, I would probably prefer correcting treegen to match the spec as-is and wait until a v10 proposal is created for some other reason and piggy-back the attestation eligibility changes into that proposal.

Keep in mind that even if you are an absolute purist, it is too late to amend intervals 22 and 21 for consistency. If we have to go the new-RPIP route, the benefits of a clean history will not be retroactive.

langers · 9 May 2024 23:18

Thank you Patches, great write up.

knoshua · 10 May 2024 08:55

This makes sense to me.

Patches · 17 May 2024 17:53

I’m closing the poll and based on 89% assent I’m not willing to push through an errata change.

Unanimous was a high bar but honestly the “don’t do anything, fix in unrelated v10” is going to be the path of least resistance, and avoids establishing any sort of precedent. Smartnode has already been updated to match the language of the spec (respecting a 32 slot limit for inclusion).

I apologize in advance to whomever I victimize when a v10 proposal rolls around.