[Meta] Proposal for a Watchtower and Rewards Tree Change Process

jcrtp · 21 November 2022 05:47

Hi everybody,

I wanted to formalize a discussion that’s been happening on and off in the Discord into a post here. It kind of blurs the lines between Smartnode releases, project governance, Oracle DAO duties and responsibilities, and protocol DAO voting, so I’m not really sure where to put it, but this seems like a good place to start. As usual, I’ll provide a little background context at the beginning for people who are just catching up to the conversation before going into the details.

Context

Protocol upgrades for Rocket Pool involve upgrading the Smart Contracts on the Execution layer. Generally these upgrades are very large events that take months to coordinate, develop, test, audit, and finally deploy. For example, Redstone was the first such protocol upgrade - development on it began in January, and it was finally deployed to Mainnet in August. That’s seven months start-to-finish. Development on Atlas has reached its first feature freeze and we’re preparing to enter the audit phase; while its start-to-finish time will be shorter than Redstone’s, it will still be on the order of months.

That being said, our node operators typically don’t interact with the contracts directly by hand (barring a few special examples of highly-technical community members). They use our Smartnode software to do that interaction instead. The Smartnode has enjoyed a much different, much faster release cadence than the contracts. As another example: in times of high churn, such as the weeks leading up to the Merge, I would push Smartnode updates once or twice per week to ensure everyone’s configuration was as Merge-ready as possible. For node operators, this cadence is important: they need up-to-date setups and client versions, and they need breaking Smartnode bugs fixed quickly so they can run their nodes correctly.

However, the Smartnode fulfills a second duty that has been brought to the spotlight in recent weeks: it runs the “watchtower”, which is the container that handles all of the Oracle DAO responsibilities such as shuttling information from the Beacon Chain to the Execution Layer, checking for the Withdrawal Credential Exploit, and (more recently) generating and pushing the Redstone Rewards Merkle Tree artifacts.

Process Changes

Now that the Oracle DAO is expanding and is responsible for the calculation of rewards off-chain, the community has expressed a desire to apply the same kind of due diligence to the watchtower’s functionality as the contracts receive. That means:

Governance discussions, proposals, and votes around major changes to the rewards tree’s structure and how it is calculated.
Sufficient time to allow for multiple independent implementations to test those changes and ensure parity with each other.
Longer periods between watchtower update announcement and activation to give the community and Oracle DAO members time to assess changes, submit suggestions or pull requests for adjustments.

We think this is the correct way to go as well. However, we want to balance this in a way that doesn’t conflict with the regular cadence of updates and releases that regular node operators have come to expect. In the medium term it probably makes the most sense to spin the watchtower out of the Smartnode entirely and have a completely separate container / binary just for Oracle DAO duties so we can do both of the above things at the same time. In the short term, while this separation isn’t technically feasible, we’d like to take the following approach:

Smartnode releases will continue as normal.
Watchtower-specific changes, instead of going live immediately with new Smartnode updates, will now take effect only at predefined target slots.
This is analogous to the way the Ethereum core devs specify target blocks for protocol-level hardforks, and client developers provide clients that have both pre-and-post behaviors built-in along with the block / slot to switch over.
This is notably aimed at breaking changes to the rewards specification (and all other watchtower duties / submissions) that would cause the watchtower to disagree with earlier versions, or at fundamental changes in the watchtower duties.
It would not apply to simple bug fixes that don’t cause the generated artifacts / submissions to disagree with previous versions. Those would be delivered with new Smartnode releases as they are done today.
The changes would be first presented in a post here and relayed in our Discord to allow for a proper discussion. If contentious, they should be run through our usual Snapshot voting process.
When due diligence has been done (whether it goes to a vote or not), we will select a target slot for making them go live and apply it to future Smartnode releases accordingly.
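
To make the target-slot idea above concrete, here is a minimal sketch of how a release could carry several behaviors and pick one by slot. It is only an illustration, not the Smartnode's actual code: the `SCHEDULE` table, its slot numbers, and the function name `ruleset_for_slot` are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical schedule: (activation_slot, ruleset_version), sorted by slot.
# The slot numbers are made up for illustration only.
SCHEDULE = [(0, 1), (5_000_000, 2)]


def ruleset_for_slot(slot, schedule=SCHEDULE):
    """Return the ruleset version in force at the given Beacon slot."""
    activation_slots = [activation for activation, _ in schedule]
    # Index of the last entry whose activation slot is <= slot.
    index = bisect_right(activation_slots, slot) - 1
    if index < 0:
        raise ValueError("slot precedes the first scheduled ruleset")
    return schedule[index][1]


# The same binary answers differently before and after the target slot.
print(ruleset_for_slot(4_999_999))  # legacy ruleset (version 1)
print(ruleset_for_slot(5_000_000))  # new ruleset (version 2)
```

Keeping the schedule as data means a later release can add another entry without touching the behavior of slots that have already passed.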

Next Steps

This process is an amalgamation of a few sources, so I’d like to get everyone’s feedback first before making it canon. If it looks like there’s general agreement without any major blockers, I’ll put together the post for the first of these major changes: rewards tree specification v2 (which has admittedly already been developed and tested by @Peteris and myself, but not deployed because we want it to go through the above process first if that’s what everyone lands on).

Note that this post is not about the rewards spec changes; this is about whether or not the process described is sufficient to have a conversation about rewards spec v2. Please keep your thoughts focused on the process for now.

Thanks everyone!

Pieter · 21 November 2022 10:23

At a high level, the process looks good to me. Some questions and clarifications:

So ‘Target slot’ here means a beacon chain slot specifically. At which point will target slots be predefined? Dependent on the discussion? Or more calendar-based where we have X slots per year at regular intervals and we aim for one of those?

I’m somewhat surprised to see ‘multiple independent implementations’ being mentioned. What does this refer to? Is this something actually being planned / expected from the oDAO?

This calls out the rewards specification / generated artifacts specifically. I’d say it should apply to all changes that can affect consensus of automated oDAO duties (as well as fundamental changes in duties.)

I’d like to see a dedicated oDAO discord channel for this. Perhaps even multiple channels, with one specifically for upgrades like these. (I’m not aware of the current structure of the oDAO channel(s) on Discord as they’re not public. Is there any news yet on when these will be opened up?)

ken · 21 November 2022 15:55

Overall I am supportive of the proposed process changes. I am interested in the answers to Pieter’s questions above.

mao · 21 November 2022 17:01

i like it, in favor.

Valdorff · 22 November 2022 02:11

This makes sense to me.

I’ll mostly steal bullets from @Pieter :

Does that imply the code is part of an update before the slot and behavior is based on current slot?
I like that there is time for review and implementation by others. Even if this isn’t done for each feature, allowing for it is important.
Agree that wording should be generalized to be about consensus writ large
Agree that seeing open oDAO comms would be very valuable

And a couple of bullets of my own:

What is medium term here? Are we talking Q1/Q2 2023 to split watchtower out into its own container?
What is a rough idea of “long enough” for review and implementation? I know the high end is effectively uncapped, but we should be able to set a floor and go up with complexity. Two weeks for something “trivial”?

jcrtp · 22 November 2022 05:42

Hey Pieter, I’ll answer inline below:

So ‘Target slot’ here means a beacon chain slot specifically. At which point will target slots be predefined? Dependent on the discussion? Or more calendar-based where we have X slots per year at regular intervals and we aim for one of those?

It’s going to be a bit of both, I think. We know that the rewards intervals happen exactly 28 days apart, so for rewards tree-related changes, that is a pretty natural interval to target when incorporating changes. For non-rewards-tree things like other watchtower duties, we can either target a specific slot on the Beacon Chain or roll them up into a rewards interval boundary too. Either way we will end up targeting a specific slot under the hood. I want to leave room for out-of-band changes that don’t target one of the reward intervals just in case though, and I think those would come out of the conversation around the update in question and its urgency / priority.
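
As a rough sketch of the arithmetic behind targeting interval boundaries: with Mainnet's 12-second slots and 32-slot epochs, 28 days is a whole number of slots and epochs. The helper below assumes boundaries fall exactly one interval's worth of slots apart and takes a hypothetical `first_boundary_slot` as input; how the watchtower actually picks the boundary slot is not specified here, so treat this as illustration only.

```python
SECONDS_PER_SLOT = 12    # Mainnet Beacon Chain slot time
SLOTS_PER_EPOCH = 32     # Mainnet Beacon Chain epoch length
INTERVAL_SECONDS = 28 * 24 * 60 * 60  # one 28-day rewards interval

SLOTS_PER_INTERVAL = INTERVAL_SECONDS // SECONDS_PER_SLOT     # 201600
EPOCHS_PER_INTERVAL = SLOTS_PER_INTERVAL // SLOTS_PER_EPOCH   # 6300


def next_interval_boundary(first_boundary_slot, current_slot):
    """Return the first boundary slot strictly after current_slot.

    first_boundary_slot is a hypothetical input: the slot of some known
    interval boundary. Boundaries are assumed to repeat every
    SLOTS_PER_INTERVAL slots.
    """
    if current_slot < first_boundary_slot:
        return first_boundary_slot
    elapsed_intervals = (current_slot - first_boundary_slot) // SLOTS_PER_INTERVAL
    return first_boundary_slot + (elapsed_intervals + 1) * SLOTS_PER_INTERVAL


print(SLOTS_PER_INTERVAL, EPOCHS_PER_INTERVAL)  # 201600 6300
```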

I’m somewhat surprised to see ‘multiple independent implementations’ being mentioned. What does this refer to? Is this something actually being planned / expected from the oDAO?

This refers to the fact that @Peteris followed the rewards tree / calculation specs posted in the research repo and produced an implementation of the rewards tree generation on his own that is entirely independent from the Smartnode. He uses it for Rocketscan, and it’s an amazing tool that lets us cross-check each other’s implementations. His implementation is how we found the bug in the current watchtower that counts minipools that have staked but haven’t been seen on Beacon yet.
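
For illustration, the kind of cross-check described above can be as simple as diffing the per-node amounts that two implementations produce for the same interval. The data shape here (a dict of node address to an integer amount in wei) and the name `diff_rewards` are assumptions made for this sketch, not the format either implementation actually emits.

```python
def diff_rewards(ours, theirs):
    """Return {address: (our_amount, their_amount)} for every mismatch.

    Both arguments are hypothetical dicts mapping a node address to an
    integer reward amount. An address missing from one side shows up
    with None on that side.
    """
    mismatches = {}
    for address in sorted(set(ours) | set(theirs)):
        our_amount = ours.get(address)
        their_amount = theirs.get(address)
        if our_amount != their_amount:
            mismatches[address] = (our_amount, their_amount)
    return mismatches


# Made-up example: one amount differs and one node appears on one side only.
smartnode_output = {"0xaaa": 100, "0xbbb": 250, "0xccc": 75}
independent_output = {"0xaaa": 100, "0xbbb": 240}
print(diff_rewards(smartnode_output, independent_output))
# {'0xbbb': (250, 240), '0xccc': (75, None)}
```

An empty result means the two outputs agree on every node; any entry is a place to start looking for a spec ambiguity or a bug.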

This calls out the rewards specification / generated artifacts specifically. I’d say it should apply to all changes that can affect consensus of automated oDAO duties (as well as fundamental changes in duties.)

Yep, that’s a totally fair clarification. The intent is for it to apply to all of the watchtower duties, such as rETH balance reporting - I will amend the post to state that.

I’d like to see a dedicated oDAO discord channel for this. Perhaps even multiple channels, with one specifically for upgrades like these. (I’m not aware of the current structure of the oDAO channel(s) on Discord as they’re not public. Is there any news yet on when these will be opened up?)

That’s a Dave / Langers question, so I’d ping them separately about this. I definitely agree that it would be good to have some Oracle DAO member insight into these changes and I’ll try to get them to engage with these forum posts on relevant change proposal topics.

jcrtp · 22 November 2022 05:48

More inline answers below:

Does that imply the code is part of an update before the slot and behavior is based on current slot?

Yes. The code would be included in a given Smartnode release, but would not be activated until a specific Beacon slot was reached (and probably finalized). It would use the legacy behavior until that slot.
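
A minimal sketch of that gate, assuming the switch is keyed on the finalized slot rather than the chain head (the "probably finalized" part above). The function name and the string labels are hypothetical.

```python
def active_behavior(target_slot, finalized_slot):
    """Pick which behavior to run, switching only once the target slot is finalized."""
    return "new" if finalized_slot >= target_slot else "legacy"


# Even if the chain head has passed the target slot, the legacy behavior
# stays active until the finalized slot catches up to it.
print(active_behavior(target_slot=5_000_000, finalized_slot=4_999_936))  # legacy
print(active_behavior(target_slot=5_000_000, finalized_slot=5_000_000))  # new
```

Gating on finality rather than the head means a reorg around the target slot could not cause the watchtower to flip between behaviors.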

What is medium term here? Are we talking Q1/Q2 2023 to split watchtower out into its own container?

My gut feel is something like Q3 or Q4 2023. If this works, it’ll work well enough that I can focus on other higher priorities until I can come back to this as part of a larger Smartnode overhaul. I think of the short term as months, the medium term as a year or so, and the long term as 2+ years.

What is a rough idea of “long enough” for review and implementation? I know the high end is effectively uncapped, but we should be able to set a floor and go up with complexity. Two weeks for something “trivial”?

I think two weeks is acceptable and three is probably preferable. I wouldn’t go longer than that for most changes unless they were absolutely radical implementation changes that really needed special scrutiny. That’s just me though; I defer to you guys to help me decide how long is long enough, and it might just end up depending on the scope of the changes being proposed anyway.

Wander · 22 November 2022 06:07

In general, this is a well considered suggestion, and I agree the process makes sense. It’s important to make sure the off-chain computation scheme remains reviewable yet flexible.

We should probably define this more specifically

jcrtp · 22 November 2022 06:17

We should probably define this more specifically

Fair enough, what I had in mind was basically “a change that wasn’t universally approved by everybody after a few weeks of scrutiny and caused a legitimate disagreement that couldn’t be overcome with discussion alone, so it had to be put to a vote in order to move forward”.

Butta · 22 November 2022 08:17

Just to say it on the forum as well - I support it.
Thanks joe!

Good questions by @Pieter , @Valdorff & @Wander. I like the direction this is heading

yorickdowne · 22 November 2022 09:19

General thumbs-up. An activation slot for changes that affect consensus just makes sense.

a35u · 23 November 2022 04:32

This sounds like a good plan. Thanks

Pieter · 23 November 2022 08:05

Got it. As you say, Peteris’ implementation is external to the smartnode and thus not strictly speaking in the oDAO domain. So it doesn’t actually influence oDAO consensus (and risk of consensus breaking) directly. It does serve as a valuable early warning system though, so I agree it should be taken into account when proposing upgrades.

(In my first reading I was thinking along the lines of the oDAO itself having ‘pluggable’ alternative implementations for the rewards spec / other duties in the Watchtower, like running different beacon consensus clients. But that’s a whole different order of complexity.)

No further comments, I’m in support of moving ahead with this.

yorickdowne · 23 November 2022 10:11

like running different beacon consensus clients.

Which we do. Treegen ahead of time can find issues. For example for my own node, treegen highlighted that I needed to let Lighthouse backfill to an archive node; resync Teku from scratch for Archive; and set the server timeout in haproxy to 120s. Erigon Archive worked well as did the Alchemy failover.

This got done in time for the current rewards interval closing. With a change in watchtower version, I’d want to repeat the test, and ideally repeat it both with the primary Lighthouse / Erigon archive as well as the failover Teku / Geth pruned / Alchemy archive.

Pieter · 23 November 2022 10:29

I didn’t mean to imply the oDAO doesn’t run different beacon chain consensus clients.
I meant interchangeable rewards spec logic in the Smartnode, similar in concept to using different beacon chain clients.

Or do you mean you’re actually using Peteris’ implementation in your testing as well?

yorickdowne · 23 November 2022 11:20

I have only used Joe’s treegen for my testing. Broadly I see two things that can be tested:

Is my setup doing the thing? Treegen is useful here. Every oDAO member should do this.
Is Treegen correctly implemented? That’s where checking against other implementations comes in. That work doesn’t have to be repeated by every oDAO member. I am happy to have Joe do this cross-check, since he is already doing it.

jcrtp · 23 November 2022 18:52

Alright, it looks like this has some good support behind it so I’ll press ahead. I’ll write up a proposal for rewards spec v2 some time tomorrow (got stuck working on solo staker migration today).