Rocket Pool Rescue Node as a service

TL;DR: I propose a permanent “Rescue Node Service” for Rocket Pool users to fall back on during pruning, (re)syncing, or while fixing technical issues in general. I believe the benefits would greatly outweigh the cost to the community.


Some background:

Until recently, I was running Besu/Nimbus on my home Raspberry Pi setup. Post-merge, this was disastrous, resulting in almost exclusively missed attestations for days on end. People with similar setups (and even some with high-powered systems) were suffering from the same problem.

Luckily, I was able to fall back on Poupas’ “Rescue Node” (paid for by himself?), as were many other NOs, while waiting for a Besu fix. Three weeks later, the situation is still troublesome, so I’m switching back to Geth while continuing to fall back on the rescue node.

This led me to the idea that such a setup could prove to be a very valuable community asset:
it would be available for large-scale or individual technical issues, but could also serve as a temporary fallback during pruning/resync for NOs that don’t have a fallback setup.

Ideally, this setup would be funded by the community. I think the rewards recovered from otherwise missed attestations and proposals would definitely cover the cost. Additionally, it might even convince new NOs (profile: home staker without fallback hardware) to join Rocket Pool, as they would have an extra option to minimize downtime and increase returns.

Self-service URLs and time constraints seem helpful.

Variations/requirements are possible: e.g. client combos, part of SP (or not), no incurred penalties, NO for X time, …

To consider - and where I hope the community can pitch in:
Potential security or central-party risks, possible malicious actions by the rescue node operator (e.g. stealing your MEV/tips), a disclaimer for the NO, …

To add some color to this, there are a few limitations to the current setup:

  1. Poupas is funding it, and may decide to stop
  2. To use it, a rocket scientist or Joe has to give you a secret URL
  3. It doesn’t work for Prysm users, and Lighthouse users must disable doppelganger detection
  4. If you aren’t using mev-boost and aren’t in the smoothing pool, you can’t use the rescue node.

Resolving all of these will require enough funding for 3-4 VPS instances, one for each consensus client type (Nimbus pending split mode), as well as a few days of developer time to iron out DDoS protection, a security model that only allows Rocket Pool node operators to use it (without us having to manually dole out URLs), and sufficient monitoring/tooling.

Ultimately, I think around 350-400 RPL per year at today’s prices would be a reasonable amount: roughly half would pay for the VPSes, and the other half would compensate its maintainer(s).

We collectively owe Poupas a few months of backpay (probably 10-15 RPL in terms of costs he paid out of pocket).

I’m planning to write out a tech spec and circulate it privately at first but will publicly share it when I’m happy with it.

I support this; the rescue node has already been very helpful to some NOs. We should take the load off Poupas and have an official rescue node.

I’m for this. I think it’s a great idea that reduces overhead, and it saves the bandwidth of running another fallback node on a home LAN.

Support, this saved me today!

I’m lost here.

What exactly does the rescue node do? How are you able to attest with it?

It’s essentially just a fallback node that anyone can use temporarily while they do maintenance on their primary cc/ec pair.

I am opposed to this idea.

  • It opens up everyone who uses the public node to the risk of priority fees being stolen. This isn’t just our own money, it’s the stakers’ money too. Having a public rescue node smells borderline reckless. See “Exploring Eth2: Stealing Inclusion Fees from Public Beacon Nodes” (Symphonious).

  • It goes directly against the idea of decentralization, and of rewarding node operators for running their nodes reliably and well

Yes, let’s not do this; we just got rid of Infura.

We’re sensitive to both those concerns and plan to mitigate them:

Some form of authentication would be required. Tentatively we’re planning to require a signed message from the smartnode wallet to gain access. Further, we’re inspecting request bodies and rejecting validators who do not set their fee recipients to the correct address.

Finally, the rescue node credentials will be temporary, and there will be a cooldown period between issuing credentials for a given node operator.

Edited to add: if we’re concerned about tip/MEV theft (and we should be!), the best place to start would be helping NOs on a VPS whose ufw rules are simply bypassed by Docker’s own iptables rules… they are much more vulnerable than the rescue node is or will be.

NOs that steal (or use a fallback client that allows someone else to steal) priority fees should be penalized, and stakers should be made whole. At that point it’s up to the NO to decide if it’s worth it to expose themselves to penalty risk in exchange for reduced offline time.

To be clear, the concern would be that you/the maintainers of the rescue node would be able to steal from NOs using the service. I don’t see how you are mitigating against that.

Based on yorick’s wording/link, I think the concern was more centered on the notion that anyone with access to the /eth/v1/validator/prepare_beacon_proposer route could steal tips/MEV, but that is fully mitigated by the reverse proxy for Rocket Pool validators. Otherwise, one rescue node user would be able to steal the tips/MEV of every other rescue node user (or worse, an outsider could).
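To make the reverse-proxy check concrete, here is a rough sketch of what inspecting a prepare_beacon_proposer request body could look like. The body shape (a JSON array of objects with `validator_index` and `fee_recipient`) follows the standard Beacon API; everything else, including the function name `rejected_validators` and the idea of passing in a precomputed set of acceptable addresses, is an assumption for illustration and not taken from the tech spec.

```python
import json
from typing import Iterable, List


def rejected_validators(body: bytes, allowed_recipients: Iterable[str]) -> List[str]:
    """Return the validator indices whose fee recipient is not allowed.

    Hypothetical helper: how the allowed addresses are derived for a
    given node operator is outside the scope of this sketch.
    """
    # Ethereum addresses are compared case-insensitively here.
    allowed = {addr.lower() for addr in allowed_recipients}

    entries = json.loads(body)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of proposer preparations")

    rejected = []
    for entry in entries:
        if not isinstance(entry, dict):
            raise ValueError("each proposer preparation must be a JSON object")
        recipient = str(entry.get("fee_recipient", "")).lower()
        if recipient not in allowed:
            rejected.append(str(entry.get("validator_index")))
    return rejected
```

A proxy built along these lines would forward the request to the beacon node only when the returned list is empty, and refuse it otherwise, so one user cannot redirect another validator's tips through the shared node.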

It is true that the maintainers could steal tips/mev. There will be an explicit trust statement around that, and the tech spec (delivered soon) calls for the maintainers to be publicly listed, to the extent that they are willing to reveal their identities (anything from a discord handle to a CV in my view).

Here is the tech spec.

I believe this strikes an appropriate balance between decentralization, security, convenience, and usefulness.

A few parameters (GRACE_PERIOD_DAYS and TIMEOUT_DAYS) remain unspecified. We will want to pick values for these that give prospective users enough time to fix unforeseen issues with their nodes, but simultaneously don’t encourage them to be negligent.

My personal sense is that GRACE_PERIOD_DAYS should be half of TIMEOUT_DAYS. E.g., if GRACE_PERIOD_DAYS is two weeks, then TIMEOUT_DAYS is four weeks: once those two weeks elapse, you will be unable to use the node for roughly another month.
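As a sketch of how the two parameters might interact, assuming the cooldown starts when the grace period ends (my reading of the example above, not something the spec confirms): the values 14 and 28 are placeholders, and `Access` and `access_state` are made-up names.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum

# Placeholder values only: both parameters are still unspecified.
GRACE_PERIOD_DAYS = 14
TIMEOUT_DAYS = 28


class Access(Enum):
    ACTIVE = "active"      # credential is still usable
    COOLDOWN = "cooldown"  # grace period over, no new credential yet
    ELIGIBLE = "eligible"  # a new credential may be requested


def access_state(issued_at: datetime, now: datetime) -> Access:
    """Classify a node operator's access from the last credential's issue time."""
    grace_end = issued_at + timedelta(days=GRACE_PERIOD_DAYS)
    # Assumption: the timeout is counted from the end of the grace period.
    cooldown_end = grace_end + timedelta(days=TIMEOUT_DAYS)
    if now < grace_end:
        return Access.ACTIVE
    if now < cooldown_end:
        return Access.COOLDOWN
    return Access.ELIGIBLE


issued = datetime(2022, 10, 1, tzinfo=timezone.utc)
print(access_state(issued, issued + timedelta(days=20)))
```

With these placeholder values, a node operator gets two weeks of access and can request a new credential six weeks after the previous one was issued.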

This spec is deliberately flexible: things like GRACE_PERIOD_DAYS, TIMEOUT_DAYS, the list of proposed maintainers, and the cost/benefit analysis will come in a future grant application.

I’ve updated the spec to include new nice-to-have sections for checkpoint sync provider and treegen.