POST-MORTEM - Security Incident (26/05)

langers · 3 June 2022 00:03

Incident Summary

On 26 May 2022 between 16:33 - 16:43 UTC, two Oracle DAO (ODAO) nodes operated by the Rocket Pool team were compromised. ETH and RPL were stolen from the node accounts.

A team member’s workstation was hijacked using a remote execution exploit. Two unencrypted ssh keys were present on the team member’s workstation and allowed access to the two ODAO nodes.

The attacker gained access to the two ODAO node private keys and drained the accounts of funds.

This is an embarrassing and expensive lesson for the team and we are establishing improved systems and processes to harden operational security. Thankfully, due to the distributed design chosen for the ODAO, the protocol was not put under any risk.

It is also a stark reminder to our community to remain vigilant.

Detection

The first responder, while working with an open telemetry dashboard, noticed that two of the ODAO balances jumped up, then dropped down. On witnessing the balance irregularity, they decided to investigate immediately.

Concurrently, automated alerts fired warning that two of the ODAO node balances had dropped below an ETH threshold.

The first responder checked the ODAO node balances on Etherscan and confirmed that ETH and RPL had been removed from the nodes.

Detection of the incident was quite quick, both from a first responder and automated alerting perspective. A faster detection could be achieved by an intrusion detection system or detecting suspicious transactions. ODAO nodes perform particular transactions and so any other transaction is suspicious. In this case, it may have enabled detection before the ETH was transferred but it is unlikely to have prevented the impact. Ultimately, prevention is the key here.
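As an illustration only, the "any other transaction is suspicious" idea can be sketched as an allowlist check. This is not the team's monitoring code: the `Tx` type, the placeholder addresses in `EXPECTED_TARGETS`, and the assumption that oracle duties are zero-value contract calls are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical allowlist of contract addresses an ODAO node is expected to call.
# These are placeholders, not real Rocket Pool contract addresses.
EXPECTED_TARGETS = frozenset({
    "0x00000000000000000000000000000000000000a1",
    "0x00000000000000000000000000000000000000a2",
})


@dataclass(frozen=True)
class Tx:
    sender: str     # the ODAO node account
    to: str         # destination address
    value_wei: int  # ETH value attached to the transaction


def is_suspicious(tx: Tx, expected_targets: frozenset = EXPECTED_TARGETS) -> bool:
    """Flag anything that is not a zero-value call to an expected contract."""
    target_ok = tx.to.lower() in expected_targets
    # Assumption: routine oracle duties never move ETH out of the node account.
    return (not target_ok) or tx.value_wei > 0
```

Under these assumptions, a plain ETH transfer from a node account to an unknown address would be flagged as soon as it is seen, rather than only after the balance falls below a threshold.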

Response

On confirming the incident, the first responder immediately escalated to the incident manager. Together they triaged the incident, to identify classification but more specifically whether there was a threat to node operators and other ODAO members.

The threat appeared localised because only two of the team’s ODAO nodes were affected but, proceeding with caution, the incident was initially reported to the ODAO. Further evidence was gathered to determine whether node operators were under threat.

Containment measures were put in place to protect the remaining team ODAO nodes. Firewall rules were updated to deny all network connections.

Other team members were alerted and it became apparent that a team member’s workstation was compromised. Containment measures were put in place to isolate the workstation.

The ODAO was informed that the root cause had been discovered and that it was an isolated incident.

Public incident communication was drafted and sent.

A recovery plan was determined and the ODAO was informed that instructions would follow.

There was a delay in putting in place some containment measures because of an unplanned internet outage that affected one of the team members. A backup connection was established but, because of the security IP locking policy, containment was delayed. Given the infrequency of these issues and the positive security benefit of IP restrictions, we consider the trade-off worth it.

A key contact list was available but reaching team members was hampered due to it being very early in the morning local time.

Impact

External Impact

The Rocket Pool protocol was unaffected by this incident.

No node operators were affected
No other ODAO members were affected
The protocol continued running perfectly
The ODAO system requires a 51% consensus on its actions and so is robust under these sorts of isolated impact.
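To make the consensus argument concrete, here is a minimal sketch of the arithmetic. The member count of 10 used in the usage note is hypothetical, and rounding the 51% requirement up to a whole number of votes is an assumption about how the threshold is applied, not a statement about the contract code.

```python
def votes_required(members: int, threshold_pct: int = 51) -> int:
    """Smallest whole number of votes that meets threshold_pct of `members`."""
    if members <= 0:
        raise ValueError("members must be positive")
    # Integer ceiling of threshold_pct * members / 100.
    return (threshold_pct * members + 99) // 100


def can_force_consensus(members: int, compromised: int, threshold_pct: int = 51) -> bool:
    """True if the compromised nodes alone could reach consensus."""
    return compromised >= votes_required(members, threshold_pct)


def honest_can_reach_consensus(members: int, compromised: int, threshold_pct: int = 51) -> bool:
    """True if the remaining nodes can still reach consensus without the compromised ones."""
    return (members - compromised) >= votes_required(members, threshold_pct)
```

For a hypothetical ODAO of 10 members, `votes_required(10)` is 6, so two compromised nodes can neither force an action nor stop the remaining eight from reaching consensus.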

Internal Impact

In total, ~14.75 ETH worth of ETH and RPL was stolen from the affected ODAO nodes.
Two of the team’s ODAO nodes are now not functioning because the accounts are compromised (and so cannot be sent ETH)

Recovery

The incident is still in recovery mode but the initial impact has been contained.

The affected server keys have been rotated
There is no evidence that other servers were exposed but, as a precaution, all other servers have had their keys rotated
A proposal to kick the affected ODAO nodes has been submitted
A plan is being executed to recover the RPL bonds from the affected ODAO nodes

Once the affected ODAO nodes have been replaced, we will consider the incident resolved - this will take a couple of weeks due to how ODAO voting works.

Timeline

All times are UTC

Date (UTC)	Detail
17 May, 12:40	Team member retrieved the ODAO 1 and ODAO 3 ssh keys to diagnose an urgent late-night issue with ODAO consensus. The ODAO were not in consensus for balances and so deposits were not possible for a couple of hours. The ssh keys were retrieved from our 2FA protected shared password manager. Unfortunately, the keys were unencrypted by default (no password) and, due to haste, the team member did not encrypt the keys or remove them once finished.
17 May, 12:49	The team member identified the ODAO consensus issue and resolved it quickly to restore ODAO consensus then worked with other ODAO members and the team to make sure it didn’t happen again.
Unknown	Team member’s workstation infected by remote exploit. The team member is extremely careful and restricts what they install but being a Windows workstation could have been a contributing factor.
27 May, 11:30	Team member noticed that their hard drive was at 100%; they closed some applications and it returned to normal. We believe the attacker was scanning for keys.
27 May, 16:07-16:33	Attacker used the team member’s unlocked Metamask to swap the team member’s tokens for ETH and transferred to an attacker account.
27 May, 16:33	Attacker gained access to ODAO1 node using the unencrypted ssh key, extracted the private key from the wallet file and transferred its ETH balance to the attacker’s account.
27 May, 16:36	Attacker gained access to ODAO3 node using the unencrypted ssh key, extracted the private key from the wallet file, swapped its RPL, and transferred its ETH balance to the attacker’s account.
27 May, 16:43	Attacker found a private key file on the team member’s workstation that was used for a personal bot and transferred its ETH balance to the attacker’s account.
27 May, 16:45	Attacker revisited the first Metamask account it drained to empty the last of the ETH.
27 May, 16:51	Attacker sold some of the ETH for BTC on an exchange
27 May, 16:54	Incident discovered - Automated alert warned that ETH balance on ODAO3 was below threshold
27 May, 16:59	Attacker sold the rest of the ETH for BTC on an exchange
27 May, 17:24	Incident investigated/escalated - First Responder checked the ODAO accounts and realised they had been emptied and immediately elevated to Incident Manager
27 May, 17:36	Automated alert warned that ETH balance on ODAO1 was below threshold
27 May, 18:00	Incident classified - as high; evidence suggested an isolated incident as no other ODAO members were affected. Did not want to rule anything out so gathered evidence to ensure node operators were not under threat.
27 May, 18:37	Incident reported to ODAO - Brought incident to the ODAO’s attention, we are investigating
27 May, 20:22	Incident containment applied - ODAO firewall rules updated, just in case it was a network based exploit
27 May, 20:57	Discovered team member machine compromised, reviewed repercussions
27 May, 21:18	ODAO updated on incident - Root cause discovered, no issue with smart node stack
27 May, 22:53	Rocket Pool Discord community informed of incident
27 May, 23:25	Formulated plan to kick compromised nodes and recover RPL bond
27 May, 23:29	Planned and started key rotation
28 May, 00:06	ODAO updated on incident - Will provide instructions soon about recovery plan
31 May	Incident review conducted
3 June	Incident post-mortem published
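The response intervals implied by the timeline can be worked out directly from the entries above. This is a small sketch using only the times and dates as listed in the table; the variable names are illustrative.

```python
from datetime import datetime


def at(hhmm: str) -> datetime:
    """Parse a 27 May 2022 UTC time as written in the timeline table."""
    return datetime.strptime(f"2022-05-27 {hhmm}", "%Y-%m-%d %H:%M")


def minutes_between(start: datetime, end: datetime) -> int:
    return int((end - start).total_seconds() // 60)


# Times taken from the timeline: when each node was drained and when its alert fired.
drained = {"ODAO1": at("16:33"), "ODAO3": at("16:36")}
alerted = {"ODAO1": at("17:36"), "ODAO3": at("16:54")}

alert_delay = {node: minutes_between(drained[node], alerted[node]) for node in drained}
escalation_delay = minutes_between(drained["ODAO1"], at("17:24"))
containment_delay = minutes_between(drained["ODAO1"], at("20:22"))

print(alert_delay)        # minutes from drain to balance alert, per node
print(escalation_delay)   # minutes from first drain to escalation
print(containment_delay)  # minutes from first drain to firewall containment
```

On these figures the ODAO3 alert fired 18 minutes after that node was drained and the ODAO1 alert 63 minutes after; escalation came 51 minutes after the first drain and firewall containment 229 minutes after it.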

Root Cause

Question	Answer
What happened?	ODAO accounts compromised and drained of ETH and RPL
Why did that happen?	Because the ODAO nodes are hot wallets and have to have the Ethereum keys available to perform their duties.
Why did that happen?	Because the ODAO node needs ETH to function and one of the ODAO node’s withdrawal addresses was not set. So claimed RPL was in the node’s hot wallet.
Why did that happen?	Because an intruder was able to ssh into two of the ODAO nodes
Why did that happen?	Because they were able to access an unencrypted ssh key
Why did that happen?	Because we stored unencrypted ssh keys on a privileged access workstation
Why was this a problem?	Because a remote access exploit was used to compromise a privileged access workstation
What can we learn from this?	Assume workstations can be compromised at any time
What can we learn from this?	SSH keys should have a password by default
What can we learn from this?	Always set the node’s withdrawal address
What can we learn from this?	2FA would have prevented the access
What can we learn from this?	Consider sandboxing development machines using encrypted VMs
What work could we do to make sure the incident does not happen again?	Roll out 2FA sign in to all ODAO nodes
What monitoring can we put in place to identify the issue sooner?	Monitoring was good: ETH balance alert fired, although could have been quicker
Are there proactive measures we can put in place?	ODAO protocol design: separate hot and cold wallet, improves recovery
Are there proactive measures we can put in place?	Expand the ODAO: introduce new members to further reduce risk
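One way to act on "SSH keys should have a password by default" is to check key files for encryption before they are stored or used. The following is a hedged sketch, not a tool the team has described: the function name is hypothetical, and it only inspects the container format (the cipher name field of the OpenSSH `openssh-key-v1` format, the PKCS#8 `ENCRYPTED PRIVATE KEY` header, or the legacy PEM `Proc-Type` header) rather than validating the key itself.

```python
import base64
import struct

OPENSSH_MAGIC = b"openssh-key-v1\x00"


def is_private_key_encrypted(key_text: str) -> bool:
    """Return True if the private key text appears to be passphrase-protected."""
    lines = [ln.strip() for ln in key_text.strip().splitlines()]
    if not lines or not lines[0].startswith("-----BEGIN "):
        raise ValueError("not a PEM-style private key")
    header = lines[0]

    if header == "-----BEGIN ENCRYPTED PRIVATE KEY-----":
        return True  # PKCS#8 encrypted container

    if header == "-----BEGIN OPENSSH PRIVATE KEY-----":
        body = "".join(ln for ln in lines[1:] if not ln.startswith("-----"))
        blob = base64.b64decode(body)
        if not blob.startswith(OPENSSH_MAGIC):
            raise ValueError("unexpected OpenSSH key header")
        offset = len(OPENSSH_MAGIC)
        # The first field after the magic is the cipher name (length-prefixed string).
        (name_len,) = struct.unpack(">I", blob[offset:offset + 4])
        cipher = blob[offset + 4:offset + 4 + name_len]
        return cipher != b"none"

    # Legacy PEM keys (RSA/EC/DSA) mark encryption with a Proc-Type header.
    return any(ln.startswith("Proc-Type:") and "ENCRYPTED" in ln for ln in lines[1:])
```

A check like this could run over a workstation's key directory or as a gate before a key is added to a password manager, rejecting any key for which it returns `False`.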

Lessons Learnt

What went well?

Telemetry and automated alerting were effective
The team reacted quickly and worked effectively under immense pressure
Incident management process preparation proved valuable
Incident communication with the ODAO was prompt and continuous

What areas are there to improve?

Initial communication with the community should have been sooner.
Containment was hampered by an unforeseen situation, but it should have been applied more quickly.

Corrective Actions

Action	Assigned	Complete
Confirm withdrawal address set on all ODAO nodes	Yes	Yes
SSH key rotation	Yes	Yes
SSH keys encrypted (has password) by default in shared password manager	Yes	Yes
Apply 2FA on SSH to ODAO nodes	Yes	No
Expand ODAO	Yes	No

Bertram · 3 June 2022 10:28

Thank you for the detailed postmortem. It’s good to see that you’ve decided on some mitigations. However, I still have some questions.

You have not defined any mitigations with regards to the root cause. (The infected workstation.) Have you discovered how the workstation got infected? What will happen to the workstation now? And how will you prevent such infections in the future?
Why is a shared password manager used? Individual accounts guarded by 2FA (e.g. yubikey) are industry standard. This greatly limits the scope of infected machines. I’d recommend getting rid of the shared password manager altogether. (At the minimum for SSH keys)
You state that you have rotated keys of the infected nodes. Is this the only action with regards to the compromised nodes? I would recommend completely wiping them, since this is the only way to ensure that they are no longer compromised.

With regards to the ODAO. I think that you understate the significance of the attack. It seems that the attacker was only after some quick ETH, and not after disruption of the protocol, which limited the impact.

However, a more sophisticated attacker could have stealthily infected your network (and your other ODAO nodes). This could have resulted in 30% of the ODAO being compromised, which would bring an attacker dangerously close to a majority, just by attacking one entity.

It seems that you understand this significance. One of the corrective actions stated is: Expand the ODAO. You also state that this action has been Assigned.

However, earlier you state:

"Once the affected ODAO nodes have been replaced, we will consider the incident resolved"

It therefore seems that you do not consider this expansion to be in scope of this incident.

Could you please elaborate on this?

If not in scope, then what priority have you given to expanding the ODAO?
What is the current state of this task?

As you probably know, there is discussion on this subject in this thread

langers · 3 June 2022 10:52

Hi Bertram,

Thank you for taking the time to ask those questions, they are good questions.

Unfortunately it is extremely difficult to pinpoint how it exactly got infected. The infected workstation will be rebuilt as a Linux based system rather than a Windows system. Obviously this doesn’t preclude it getting infected again but it may have been a contributing factor and there is a better design surface for mitigations. One such mitigation, highlighted in the post-mortem is to separate activities into multiple VMs. This sort of exploit is sadly quite common and so the only effective strategy is to assume it can be compromised at any time and ensure the process is resilient to it. This is why we are focusing heavily on hardware/2FA for ssh.

Agreed, when we move to a hardware approach we will probably not need the password manager for the ssh keys at all. We rotated the keys as the first step because we want to consider the best hardware/2fa approach.

No, currently they are isolated in case we need them for further investigation, but they will be wiped / terminated shortly.

Bringing on new ODAO members will be a progressive endeavour because it takes time to lock them in, present them to the community, and onboard them. I can assure you that it is a very high priority for us. We are in discussion with several new potential ODAO members and will be stepping up the communications with them from next week to get them on board.

knoshua · 4 June 2022 22:03

Have you considered getting someone to do forensics on this?

langers · 6 June 2022 07:29

This is something we are considering, yes. If we do go down this route and we discover anything that would benefit the community we will let people know.

Fitz · 7 June 2022 00:03

Great root cause analysis and thanks for being so transparent. Better we learn and build stronger now than later down the line when RP is much bigger.
Are there any other parts of the RP infrastructure where you have the same security protocols and thus a similar attack vector (shared password manager, Windows PCs, etc.)? If so, please confirm you will be hardening them also, if not done so already. Thanks

langers · 7 June 2022 01:33

Hey @Fitz, thank you for the support.

We have rotated keys across the entire RP infrastructure, as a precautionary measure. We will apply the new stricter security protocols across all RP infrastructure, not just the ODAO nodes.

We will review the shared password manager but we have to share secrets somehow, these are industry standard platforms but there is still a risk. One thing we can do is split the secrets across multiple providers but we have to evaluate whether this increases the risk or not. We are also looking at having different levels of secret with different protocols associated.

No team member is using a Windows workstation now and we have all beefed up our personal workstation and network security.

BLinc117.eth · 8 June 2022 02:03

@langers - is the need for using a secret manager because, the way the smart node is designed, it requires operators to log in using the account it was installed from? I haven’t played with trying to access it from other accounts yet, but it has been on my mind. If this is the case, it is something we should put on the roadmap for sure, as it’s critical for all NOs who want to have someone who can be their backup if they are unavailable (or no longer among the living, for example). It’s critically important for oDAO nodes. I am actually a bit shocked that using shared credentials didn’t come up in any of the audits, as it’s a huge security no-no.

My other question is on the composition of the oDAO members. I assume there is a policy in place that prevents one organization from operating multiple oDAO nodes? It is understandable that Rocket Pool operates 2, being the maintainer and, I am sure, one of the original oDAO members.

Also most boards generally strive to be an odd number so you can never end up in a 50/50 split. The small number of nodes also makes the protocol easier to attack, as you only need to either compromise a half of them, or get them to go along with your plan. Having a red team try to gain access to oDAO nodes is something that should also be strongly considered. Us humans are the weakest and easiest targets here.

langers · 9 June 2022 06:06

Hi @BLinc117.eth - thank you for the feedback.

The smart node software does work using multiple users. By default it will store some of its configuration in the home directory of the user that installed it - but you can set a different folder and it will work fine. Now that we are moving to a 2FA approach we will not need the secret manager for ssh related secrets.

Yes, generally each ODAO organisation operates 1 ODAO node. The core team is funded by our ODAO nodes and so we have more than 1. The ODAO is currently large enough that any operational issues or influence of our nodes will not affect the protocol but we agree we need to expand the ODAO to reduce the risk further. Enforcing an odd number makes sense, this may not always be possible but we will assess. As long as the number of members is high, the risk is extremely unlikely that half would be compromised.

Totally agree, I would like to implement a regular external security review which may include some red team action.

BLinc117.eth · 10 June 2022 13:56

We should have hard security requirements for oDAO nodes in my opinion. I find it troubling that we have oDAO nodes running without 2FA; even our own documentation recommends NOs run 2FA, and many (most?) do.

I would propose that part of any new oDAO members vetting process be a validation that required security things are in place. We may be able to do this with a script or a proctored audit even.

langers · 15 June 2022 06:25

Agreed @BLinc117.eth

We will add this to the ODAO onboarding process to ensure they have key elements in place, particularly 2FA access.