Automatic Challenge Response Rework

Valdorff · 11 April 2023 02:34

This is a proposed improvement to be included in the next significant smart contract release. I’m looking to get a vibe check first, and will then make a formal RPIP if appropriate.

Current state:

Any oDAO member can be challenged by any other oDAO member
Any oDAO member can be challenged by the public at large, but this costs 1 ETH (the ETH is lost to all parties)
The oDAO node has 7 days to respond
The watchtower (the portion of the smartnode stack that’s oDAO specific) automatically responds to any challenge

This essentially makes it a check that either: a computer is running, OR a human is observing and manually responds. The biggest pain point here is something like a very stable computer setup where the admin is MIA - a challenge wouldn’t work. Essentially, it feels like a deadman’s switch with a brick on it.

Proposed state:

A challenge from an oDAO member gets an additional string argument
A challenge from the public can pass in a string argument, but if it’s non-empty the oDAO node has an additional 7 days to respond
A successful response must match the string argument, if non-empty. If the string argument is empty, that check is skipped.
The watchtower will automatically respond with the current version as the string
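
A minimal Python sketch of the proposed rules, for illustration only. It is not the contract code: the names `Challenge`, `response_window` and `response_resolves` are hypothetical, and the 7-day base window is taken from the "Current state" list above.

```python
from dataclasses import dataclass

WEEK = 7 * 24 * 60 * 60  # base response window in seconds (7 days)


@dataclass
class Challenge:
    from_odao: bool  # True if the challenger is an oDAO member (hypothetical field)
    text: str        # the proposed string argument; "" means "skip the match check"


def response_window(ch: Challenge) -> int:
    """Seconds the challenged node has to respond under the proposed rules."""
    # A public challenge with a non-empty string grants an additional 7 days.
    if not ch.from_odao and ch.text != "":
        return 2 * WEEK
    return WEEK


def response_resolves(ch: Challenge, response: str) -> bool:
    """A response resolves the challenge if the string is empty or it matches."""
    return ch.text == "" or response == ch.text
```

Under these assumptions, a public challenge carrying "getRekt" gives the node 14 days, and only a response of exactly "getRekt" resolves it.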

I think this strikes a good balance of having the deadman’s switch function. A nefarious oDAO node can totally rewrite code to defeat this, but I think that’s a different class of issues and we should be able to check that the system works in well-meaning cases.

Let’s get a community sentiment poll

I support this
I don’t support this
Undecided on the general concept
Waiting on a concrete RPIP


FAQ

What if some rando challenges with a random string like “getRekt”?
- The oDAO node would have 2 full weeks to manually respond to the challenge
- If the oDAO judge that there was a good reason for the kicked node to not be able to respond in that time (tough to imagine, but maybe), they can simply vote the oDAO member back in
What if an oDAO member challenges with a random string like “getRekt”?
- oDAO members are significantly trusted already, so I feel it’s reasonable to defend less
- I would 100% consider this a kickable offense for the griefing oDAO member
- If folks are concerned, we could make it add the extra 7 days to any non-empty string call, even if it came from oDAO

Pieter · 12 April 2023 12:54

It’s not entirely clear to me what problem the additional string argument aims to solve. In my view, the challenge feature is a last-resort option for removing a ‘dead’ oDAO member that’s not performing its automated duties. That’s subtly different from checking the node is being actively managed by a person.
The protocol is mostly interested in the automated duties, and even though the RP community may want to verify active management, I’m not sure the automatic challenge is the place to do that. After all, why raise a challenge in the first place if the node is performing its automated duties correctly? Or do you envision this as a challenge for e.g. ‘zombie oDAO members’ who carry out passive duties but never vote on contract upgrades and don’t respond to communication? (Ostensibly, it would be noncontroversial for other oDAO members to vote out such a member.)

As you say, a nefarious oDAO member could trivially defeat the check. But maybe there is still some value for the degenerate case where the node successfully responds to a challenge, but still somehow fails to carry out its automated duties (still no guarantee an active admin would/could fix those duties, but at least it gets someone to look.)

Overall it feels like the expectations for the challenge feature are just not that clear (just automated duties vs. broader, active involvement.) Perhaps this could be a discussion point for the oDAO constitution.

Valdorff · 12 April 2023 13:58

If an oDAO node isn’t voting on upgrades, it’s not doing its job. If it’s performing automated duties with an out of date spec (eg, merkle tree; the “degenerate case” you mentioned happens quite regularly), it’s not doing its job. That said, in such cases the oDAO can still vote to kick someone, so this actually isn’t that big a deal.

However - it is a big deal if the oDAO cannot arrive at consensus to kick someone. Here challenges are the only way to restore a functional oDAO – that is why they exist. As a thought experiment, imagine that 51% of the oDAO falls into an enchanted slumber while their computers continue functioning. The current challenge with an automated response would prevent any recovery.

knoshua · 12 April 2023 18:46

This is the intention according to the source code

// In the event that the majority/all of members go offline permanently and no more proposals could be passed, a current member or a regular node can 'challenge' a DAO members node to respond
// If it does not respond in the given window, it can be removed as a member. The one who removes the member after the challenge isn't met, must be another node other than the proposer to provide some oversight
// This should only be used in an emergency situation to recover the DAO. Members that need removing when consensus is still viable, should be done via the 'kick' method.

The problem here is that a “dead” oDAO member can stop performing its duties while the smartnode keeps on running and responding to challenges. For example, automated duties require software updates from time to time, as we have seen with rETH price submission bugs or new reward tree versions. We’ve also seen an oDAO node be online (and ostensibly able to respond to challenges) but not able to execute automated duties. Contract upgrades would also be impossible with a majority of “dead” oDAO members that keep auto-responding to challenges.

I don’t see a lot of value in the challenge system as is and I see the automated response as the main issue. If there are concerns about requiring active monitoring of the oDAO node or oDAO members being griefed, I’d look at the challenge length and cost. But having to check in on your node every 14 days does not seem unreasonable to me to be honest.

Pieter · 12 April 2023 19:39

Okay, so the idea here actually is emergency consensus recovery and not a fallback check on performing the baseline automated duties expected of an oDAO node. With that goal in mind I agree an automated response does not make sense. An automated check can at most provide proof that there is a node online responding to something. An oDAO node does not vote on upgrade proposals, the operator running that node does.

Then, why not do away with the default automated response entirely and require interactivity (whether it be a string argument or some other flavor requiring operator action?) I don’t see a big griefing risk either. Sure, some extra notification about having to respond to a challenge could be nice, but in practice a lot of communication would have already gone out before a challenge is raised. To compensate, a longer response window would be fine.

Maybe the ‘version check’ automated reply (or something including some sort of proof of successful duty) could be useful in a new, lighter class of challenge, where the aim is not to restore oDAO consensus but to prod for correct execution of automatic duties. Penalties could be something like reduced RPL rewards for a period or a bond slash. This is a different topic though, and there may be better ways to solve that.

Valdorff · 14 April 2023 21:22

Draft RPIP - feedback welcome: RPIPs/RPIP-draft.md at feature/challenge_rework · Valdorff/RPIPs · GitHub

langers · 19 April 2023 06:22

How does the version check work? Not sure I understand how it ensures the node is up-to-date.

You mention that an ODAO member can provide an additional string. Is that the version they are running?
Is the challengee responding with their version, or echoing the version in the challenge?
If the public are passing in a string argument, is the watchtower just echoing that, or sending its version?

Basically how do we know they are running the right version?

It seems to me that we just need greater transparency on versions rather than building it into the challenge system.

Valdorff · 19 April 2023 15:45

Imo, the key thing is the deadman’s switch. I believe the current auto-response system has broken that critical aspect in favor of proof-of-liveness. If we’re willing to lose proof of liveness, we can simply remove the automated response code from the watchtower entirely.

The system I’m proposing gets these 3 uses:

Deadman’s switch challenge ← this is the important one

actionChallengeMake(targetNodeAddress, “Is there anyone home?”)
On the target node, the watchtower responds with their version
- Eg, actionChallengeDecide(ownAddress, “1.9.2”)
- The challenge remains active as the string doesn’t match
A human is required to echo the string in a challenge response. This is what makes it function as a deadman’s switch. This would have a 1-week response time when originating from an oDAO node, or a 2-week response time otherwise.

Proof of liveness (with version query)

actionChallengeMake(targetNodeAddress, “”)
On the target node, the watchtower responds with their version
- Eg, actionChallengeDecide(ownAddress, “1.9.2”)
The challenge is resolved as it got a response when the string was empty
There is a 1-week response time allowed regardless of the originator (because the string is empty).

Version challenge

actionChallengeMake(targetNodeAddress, “1.9.3”)
On the target node, the watchtower responds with their version
- Eg, actionChallengeDecide(ownAddress, “1.9.2”)
- Eg, actionChallengeDecide(ownAddress, “1.9.3”)
In the first example the challenge remains active (ie the node is at risk of being kicked) because they have an outdated version. In the second example, the challenge is resolved as it got a matching version.
This would have a 1-week response time when originating from an oDAO node, or a 2-week response time otherwise.
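
The three uses above differ only in whether the watchtower's automatic reply (its own version) happens to satisfy the challenge string. A small illustrative sketch, assuming the matching rule from the proposal; `auto_reply_resolves` is a hypothetical helper, not a contract function:

```python
def auto_reply_resolves(challenge_text: str, node_version: str) -> bool:
    """True if the watchtower's automatic reply (its version) resolves the challenge."""
    # An empty challenge string skips the match check entirely.
    return challenge_text == "" or challenge_text == node_version


# (label, challenge string, version the target node is running)
scenarios = [
    ("deadman's switch", "Is there anyone home?", "1.9.2"),
    ("proof of liveness", "", "1.9.2"),
    ("version challenge, outdated node", "1.9.3", "1.9.2"),
    ("version challenge, current node", "1.9.3", "1.9.3"),
]

outcomes = {
    label: auto_reply_resolves(text, version)
    for label, text, version in scenarios
}

for label, resolved in outcomes.items():
    status = "resolved automatically" if resolved else "still active"
    print(f"{label}: {status}")
```

In the two "still active" cases the challenge can only be cleared by a human echoing the string or, for the version challenge, by upgrading the node.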

Darkmessage · 20 April 2023 05:18

Any oDAO member can be challenged by the public at large

The watchtower will automatically respond with the current version as the string

actionChallengeMake(targetNodeAddress, “999.999.999”)

Will this be identified as a version challenge that the watchtower can never match, so that the node gets kicked?

How is this prevented?

knoshua · 20 April 2023 09:28

This is the deadman’s switch challenge Val talks about above. The oDAO NO (a human) needs to respond with the correct string.

Darkmessage · 20 April 2023 10:57

I’m saying that since the dead man switch has the same syntax as the version check, the version check cannot be done. They would need different commands.

knoshua · 20 April 2023 11:46

The smartnode would try to respond with the version, creating a transaction and simulating it. If a different string is used like in your example, the simulated transaction responding with version would revert. The smartnode would then not submit this transaction and alert the operator that a manual response is required. If the simulation doesn’t fail, smartnode would submit and the version check is passed.
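
The simulate-then-submit flow described here could look roughly like the sketch below. This is an assumption about how a watchtower might be structured, not smartnode code: `handle_challenge`, `make_simulator` and the callback parameters are all hypothetical, and the simulator simply stands in for a dry run of the response transaction.

```python
from typing import Callable


def make_simulator(challenge_text: str) -> Callable[[str], bool]:
    """Stand-in for simulating the response transaction against the contract."""
    def simulate(response: str) -> bool:
        # The simulated transaction succeeds only if the proposed rule is met.
        return challenge_text == "" or response == challenge_text
    return simulate


def handle_challenge(
    node_version: str,
    simulate: Callable[[str], bool],
    submit: Callable[[str], None],
    alert: Callable[[str], None],
) -> str:
    """Try the automatic version reply; fall back to alerting the operator."""
    if simulate(node_version):
        submit(node_version)  # simulation passed, so send the real transaction
        return "submitted"
    alert("Challenge cannot be answered automatically; manual response required")
    return "manual"


# Example: the "999.999.999" challenge from above never matches the node version.
submitted, alerts = [], []
result = handle_challenge(
    "1.9.2", make_simulator("999.999.999"), submitted.append, alerts.append
)
```

With this challenge string nothing is submitted and the operator is alerted, so the challenge behaves as a deadman's switch rather than an automatic kick.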

Darkmessage · 20 April 2023 13:39

But this is no longer a version check.

Scenario:

I challenge with version “1.9.2”
The DAO node still runs version 1.7.0 so it can’t satisfy the challenge.
The person operating the node would manually reply with “1.9.2” to satisfy the challenge
The node is still on version 1.7.0

So what started as a version check became a deadman’s switch, with no assurance about the version the DAO node runs.

For an actual version check we would need special commands, challengeVersion and respondVersion, which both ensure the version cannot be manually tampered with.

Valdorff · 21 April 2023 00:16

The oDAO members are trusted parties already. If an oDAO member is willing to purposefully defeat transparency systems, we have big fish to fry.

The deadman’s switch is trustless. The version check thing is trustful. If an oDAO member is going out of their way to lie instead of updating, they should immediately be kicked and I don’t think anyone will defend them.

langers · 21 April 2023 01:45

So the version check doesn’t seem that effective and I think it is better to promote greater transparency than incorporate it into the challenge system.

We are now over $1.3 billion TVL, so any change that has a security footprint has to be considered extremely carefully. Having a manual response provides a window for griefing ODAO member/s - I know they pay for it but it is a relatively inexpensive way to grief critical infrastructure.

I maintain that greater transparency and accountability for the ODAO is the way forward. They are all public entities by design and so that is the soft spot for correcting behaviour. The ODAO needs to get better at it, but self-governance will play an important role.

The pDAO can still exercise strong influence over non-performant ODAO members, if given the transparency to do so.

Valdorff · 21 April 2023 04:03

Forget the version part for a second.

The deadman’s switch is an important failsafe in case the oDAO cannot reach consensus. It is not griefing to be able to kick unresponsive members, it is vital. That’s what the code comment says, and I agree. Today, the watchtower disables this failsafe and that should be fixed.

langers · 21 April 2023 04:34

At the time the challenge system was designed the intention was good, particularly as the ODAO had a small number of members.

As we have onboarded more members, the likelihood of the ODAO not reaching quorum because members are AWOL has reduced significantly. It is extremely unlikely that over 51% of the ODAO would go AWOL, especially considering that most ODAO seats have multiple personnel who can access the ODAO nodes to ensure operational effectiveness.

Valdorff · 21 April 2023 05:04

attn: @Darkmessage @knoshua @Pieter

Ok. I think this begins to make sense to me. Essentially you’re saying “the original intent (as commented) is obsolete”.

If we agree with that, then I think we should do away with the challenge entirely rather than having the watchtower mostly remove it. I would suggest setting members.challenge.window to, say, 1e15 seconds (31 million years). [note that this setting is an oDAO setting]
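
A quick arithmetic check of that figure, assuming a 365.25-day year:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds

window_seconds = 1e15                      # suggested members.challenge.window value
window_years = window_seconds / SECONDS_PER_YEAR

print(f"{window_years / 1e6:.1f} million years")
```

This comes out at a little under 32 million years, consistent with the rough "31 million years" above.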

Without the core purpose behind this proposal (the deadman’s switch), the version stuff I described isn’t worth implementing.

I think I’m personally convinced by the context langers provided around having >1 bus factor for most oDAO nodes. I think the risk becomes vanishingly small.

We should get a working deadman’s switch
We should fully disable challenges
We should keep things as they are with a challenge that’s automatically refuted by the watchtower
Something else (see my comment below)


a35u · 22 April 2023 16:54

If we can’t trust the other odao members to remove an unresponsive member then we have bigger problems. I would like to see some kind of liveness check that all nodes participate in such as regularly generating partial trees.

Valdorff · 24 April 2023 03:56

I think this survey shows enough to move forward, but I’m going to suggest not doing so right this moment. My reasoning is that there’s an oDAO constitution being worked on (disclosure: I’m part of that effort), and it may help clarify whether an oDAO settings change is something the pDAO should vote on at all. The original concept was certainly a pDAO vote (SC changes), but it’s not clear if that’s still true.

Since both leading options are similar, with the challenge partly or fully disabled, I don’t think a delay would be harmful.