DAO-curated node registry - 'Verified Operators'

Overview

‘Verified Operators’ is a DAO-curated list of operators that are generally perceived as safer to stake with. Verified operators are in most cases reputable companies with proven experience running PoS staking and blockchain infrastructure services. This proposal aims to define the vetting process and the criteria upon which verified operators will be voted in by the DAO.

The DAO assumes the responsibility of screening and nominating new verified operators. The criteria and standards expected of VOs should be accessible and transparent to both the community and the operators applying to join the verified group.

Motivation

ssv.network is an open protocol in which stakers (validators) are able to freely choose between the network’s nodes (operators). As the network grows, so does the difficulty of choosing between the different service providers in the network.

In addition, operating a validator requires constant monitoring, maintenance, and uptime in order to avoid penalties and slashing. As such, validators require stability and consistency from the operators managing their distributed validator.

To address the above, the DAO will maintain a curated list of ‘Verified Operators’. From a staker’s perspective, the verification adds another dimension to the operator selection process. It allows the staker to limit their choice to a narrower list of verified operators instead of choosing between all of the operators in the network.

From the operator’s perspective, being verified can create a network effect (more stakers choosing their node). In addition, the operator’s commitment to uphold the standards framed by the DAO should ensure reliable performance and service quality.

Mechanics

  1. Operators should submit a proposal in the DAO’s forum (see a suggested format below)
  2. The community should have at least 7 days for comments and modifications
  3. The proposal should then be submitted to a 7 day vote period on Snapshot
  4. Once verified, an operator will be marked as such in the network’s explorer (example), and the community will be officially notified about the newly added operator.

Criteria

  • Experience running blockchain node infrastructure
  • Self-hosted execution (eth1) and consensus (eth2) nodes
  • At least 1 month running an ssv node on testnet
  • Operator score at time of proposal is above X (TBD)
  • Running at least 10 validators in testnet
  • Consistent operator score above X (TBD) for at least 1 week
  • Engage and support your validators (and community) in the ssv.network testnet Discord

Operator Commitments

  • Aspire to maximal uptime and optimized performance
  • Timely updates to the latest ssv node version
  • Alert and notify when issues occur
  • Issue a post mortem in case of prolonged downtime (>5 hours)
  • Promote client diversity, try to support minority clients
  • Transparency: maintain and update your Explorer Operator Page. Notify of any status changes which might affect stakers (e.g., client changes, cloud instance changes, etc.)

Loss of status

  1. DAO vote to remove your status, and/or;
  2. Operator score below 90% for 2 consecutive weeks (automatic loss of status)
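For illustration, the automatic rule in point 2 could be evaluated with a short check over weekly score snapshots. The function name and data shape here are hypothetical; only the 90% threshold and the two-consecutive-week window come from the proposal.

```python
def loses_status(weekly_scores, threshold=0.90, consecutive_weeks=2):
    """Return True if the operator's score stayed below the threshold
    for the required number of consecutive weeks (automatic removal)."""
    streak = 0
    for score in weekly_scores:
        if score < threshold:
            streak += 1
            if streak >= consecutive_weeks:
                return True
        else:
            streak = 0  # a recovered week resets the streak
    return False

# One bad week followed by recovery does not trigger removal:
# loses_status([0.95, 0.88, 0.97, 0.92]) -> False
# Two consecutive bad weeks do:
# loses_status([0.95, 0.88, 0.89, 0.97]) -> True
```

The "consecutive" wording matters: a single outage week with a recovery in between never triggers removal under this reading.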

Appendix - Verified Operator Request Template

Checklist:

  • Set the Title to Verified Operator Request [Company/Requestor Name]
    e.g., Verified Operator Request [BloxStaking]
  • Set ‘Verified Operators’ as the category
  1. Company/requestor name
  2. eth1 client - (Geth, Parity, Besu, etc.)
  3. eth2 client - (Lighthouse, Teku, Prysm, etc.)
  4. Data center information (self hosted/ cloud, location+server name, etc)
  5. Number of mainnet validators at time of proposal submission (add explorer link)
  6. Number of testnet validators at time of proposal submission (add explorer link)
  7. Background - introduction to your service, experience, achievements. Addressing your experience running SSV nodes will be advantageous; how long have you been participating in the network, how many validators are you operating, performance track record etc.
  8. We commit to timely updates of our node
  9. We commit to provide basic support to network validators in ssv.network Discord (or any other channel of your choosing)
  10. We commit to issue a postmortem in case of node downtime of over 12 hours
  11. We commit to notify validators and network participants about a planned downtime at least 48 hours in advance
  12. We commit to maintain and update our operator page to make sure stakers are able to make informed decisions about our service.
  13. Provide an ‘emergency contact’ on your operator page so that stakers are able to reach out in case of issues or downtime. Preferred platforms are Discord and/or Telegram. The respective channels should be owned and monitored by the Operator.

I like it.
Here’s my feedback:

The community should have at least 3 days for comments and modifications

3 days is way too short. We know from experience that we need at least 7 days to discuss it. I don’t see a problem with that, since one of the requirements is to run an operator for at least 1 month on the testnet. Why the rush?

Operator score at time of proposal is above x
Consistent operator score above X(TBD) for at least 1 week

Aren’t those terms redundant in a way? I’d recommend keeping the second term and increasing the time. We should optimize for high quality, instead of high speed. Overall, the proposal is good, but it feels like “we’re in a rush”.

Running at least 10 validators in testnet

Seems a very low number. One could easily set up 10 validators themselves. How do we prevent this?

Promote client diversity, try to support minority clients

Shouldn’t we “enforce” this in a way? What if we add this metric to the operator score? This way we’d incentivize using a minority client, and we give smaller operators a chance to stand out. Do we need a proposal on how the operator score is calculated, other than just the attestation rate?


Agree

Yea we can do just consistent score above X

The idea was some basic number to show performance, more validators will make the barrier higher. What did you have in mind?

For testnet, not sure… for mainnet probably.


I don’t know anything about operator node performance. Still, maybe we pick a number that (from experience) requires a) a relatively expensive/powerful node or b) a multinode/load-balanced setup to maintain a high operator score. I don’t know how many validators a single operator node could potentially handle before performance decreases.
This would also prove that the operator can serve a larger number of validators (which will come in after being ‘verified’).

However, I admit that this seems more complex for now. I just wanted to raise this “concern.”

You’d probably need a few hundred validators to start stressing anything on a basic setup.
I think the bigger (initial) challenge is to set up the node and make sure it performs well even on a basic number of validators. Anything after that is scale.


Understood! Makes sense. Thx for the explanation.

I agree with this proposal. Giving validators some certainty that at least a couple of their operators are verified is a great help!

I think including client choice in the operator score is a great idea as well.

Another way of doing it could be to directly incentivize validators to choose operators with a diverse set of clients through the operator SSV fee.
The base fee could be what the operator wants for their services, and then we could add a fee on top as a “health” fee depending on which operators you’ve already chosen.

First Prysm operator is cheap, second Prysm is more expensive, etc.

This would in turn incentivize operators into spinning up machines with clients that are “in demand” as they are cheaper to the validator.
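The escalating-fee idea could be sketched roughly like this; the surcharge factor, function name, and client names are purely illustrative, not part of any proposal:

```python
def health_fee(base_fee, chosen_clients, new_client, surcharge=0.5):
    """Hypothetical 'health' surcharge: each already-chosen operator
    running the same client makes the next one more expensive."""
    duplicates = chosen_clients.count(new_client)
    return base_fee * (1 + surcharge * duplicates)

# First Prysm operator costs the base fee; the second is pricier:
# health_fee(100, [], "prysm")          -> 100
# health_fee(100, ["prysm"], "prysm")   -> 150.0
# A client not yet in the set stays at the base fee:
# health_fee(100, ["prysm"], "lighthouse") -> 100
```

The staker-facing price difference is what would steer demand toward minority clients; the operator’s own base fee is untouched.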

Operator diversity is another thing I think could be interesting to discuss, but probably for another time.


Ultimately I think it’s the validator’s choice; keep in mind SSV will probably be used by devs more than by individual stakers.

Another point: it’s not easy to validate what operators are actually running. If we create incentives for specific clients, it will be much easier for them to simply “write” that they use other clients than to actually run them.


Makes sense :+1: Maybe in the future it will be easier to validate which client an operator is running with the help of some blockprint-like signature or something; otherwise, yeah, operators will surely just spoof what they are running…

Thinking about security and operations, additional operator commitments may make sense.

  • Commitment to following security best practices. Good question whether this should be spelled out here or kept in a separate document (maintained by whom?). Things such as:

    • Security updates for all components of the infrastructure (OS, node, execution, consensus) installed within 24 hours (this likely means unattended-upgrades for the OS)
    • SSH auth restricted to key or 2FA, no plain user/password
    • REST/WS/RPC APIs restricted to access from the node operator’s own infrastructure, not “public”
    • If consensus/execution are accessed via Internet by SSV node, TLS encryption for that traffic
    • At-rest encryption of the SSV node’s DB storage, if it is with a cloud provider
  • Commitment to deploy hard fork updates to consensus / execution at least three (3) days before hard fork, if sufficient notice was given by EF / client teams

  • Commitment to deploy maintenance releases of all clients (node/consensus/execution) in a timely manner - 1 week?

  • Commitment to have only one storage provider for the SSV DB and slashing protection DB, and to run only one ssv node at a time. If failover is desired, it must be handled by container orchestration and shared stateful storage. If storage is replicated (examples: EFS, OnDat, Ceph), it must prioritize data integrity over data availability - that is, in a split-brain scenario, the storage that “split off” goes offline. The intent is to put guard rails around the risk of running the SSV node twice - so things like “custom replication scripts” should (must?) be avoided, ditto failover modes that don’t use an orchestration framework such as k8s or Docker Swarm mode.

Good question how prescriptive the DAO wants to get. Some level of “you need to be this tall” seems prudent.


I like the proposal and framework, lots of good comments around here too

I’ve added some comments and ideas below. Overall I think we can keep it simple and clear for now and improve later on.

Mechanics
2. The community should have at least 3 days for comments and modifications → a bit short in my mind too. I think 1 week to 10 days is more appropriate

Criteria
→ Should we add that the Verified node operator should have at least a website with basic information and contact details?

Appendix - Verified Operator status request template
→ Join the Discord and share your organization’s Discord IDs? Can be helpful for contact later on
→ Add a security/emergency contact email?


Speaking as a DAppNode operator, I want to push the thinking back in the direction of simplicity.

If we require all the verified operators to run complex databases with replicated storage etc. it’s going to rule out all the smaller distributed operators.

Part of the beauty of the SSV model is that it requires at least 3 of the 4 operators to come to consensus before making an attestation, which means that it should be exceptionally rare for a validator to get slashed if just one of the four operators has a bad database problem.

If the SSV consensus layer works the way I think it should work, it should alleviate a lot of the concerns around a bad operator causing slashing.

But it does suggest to me that we should encourage users to select their operators from 4 totally separate pools. That way they don’t get slashed if one big operator has a database problem that impacts all the SSV operators in their farm.
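A quick probability sketch, assuming independent operator failures, shows why the 3-of-4 threshold makes any single operator’s outage so much less dangerous (the 1% failure rate below is a made-up number for illustration):

```python
from math import comb

def p_validator_down(p_fail, n=4, threshold=3):
    """Probability that fewer than `threshold` of `n` independent
    operators are up, i.e. the validator misses its duties."""
    p_up = 1 - p_fail
    p_live = sum(comb(n, k) * p_up**k * p_fail**(n - k)
                 for k in range(threshold, n + 1))
    return 1 - p_live

# With each operator independently down 1% of the time, the
# validator is down far less often than any single operator:
# p_validator_down(0.01) ≈ 0.00059
```

The catch, as noted above, is the independence assumption: if all four operators share a database setup or a cloud region, the failures are correlated and this math no longer holds, which is the argument for picking operators from separate pools.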


I tend to agree, but you also want to empower users to make optimal choices when it comes to their operators. It’s not only about not getting slashed, but also about optimizing performance and uptime and making sure your chosen operator is professional and responsive.

Changed to 7 days
Website - not sure; there are a lot of great operators from DAppNode, for example, which offer a hardware option (pretty unique) but are not full-blown companies
Discord - I think that some sort of means of communication is in order. However, I don’t want to limit operators to using the ssv.network Discord. I’ll add the emergency contact part


Lose of status

  1. DAO vote to remove your status
  2. Operator score below X(TBD) over a period exceeding 2 weeks

Typo: Needs to be “Loss of status”
How about 90% for the operator score?

Maybe 2. should be “2. Operator score below X% *(TBD) over a period exceeding 2 weeks will trigger a DAO vote to remove status, see 1.”

Should operators on cloud services divulge which service they are on so that you don’t get an AWS outage knocking out 3 of 4 nodes?

Definitely! I think part of being a verified operator is exactly that.

I thought about it a bit. I think there is a difference between testnet and mainnet in that regard.
One of the things I put in my “path to mainnet” blog post was creating a verification framework that should be transparent and comprehensive. Within that, I think there is room for a parameter that limits a verified operator’s assigned shares: a way to quickly verify operators but cap the number of shares they can take on at first, slowly raising the cap the more that operator proves itself. Of course, very well-known operators can start with a higher limit.
This can give smaller operators a way to get verified and gradually make their setup more sophisticated the more shares they have.

For testnet I’d maybe go with a simpler approach.
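A gradual share-limit ramp like the one described could look something like this; the function name and all the numbers are hypothetical, just to illustrate the shape of the idea:

```python
def share_limit(weeks_verified, base_limit=100, step=100, cap=1000):
    """Hypothetical ramp: a newly verified operator starts with a low
    share cap that grows each week it keeps its verified status,
    up to a hard ceiling."""
    return min(base_limit + step * weeks_verified, cap)

# share_limit(0)  -> 100   (just verified)
# share_limit(4)  -> 500
# share_limit(20) -> 1000  (capped)
```

A well-known operator "starting with a higher limit" would just mean a larger `base_limit` (or starting further along the ramp).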

I think the removal should be automatic; if the DAO needs to vote on every downgrade, it might take time and more users might get hurt by continuing to choose a faulty operator.

I added 90%; let’s use that benchmark and see if it’s not contested

Typo fixed


Also worth pointing out that AWS is not a monolithic service. For example, recent us-east-1 and us-west-1 outages didn’t touch my infrastructure in us-east-2.

Divulging service and service location is a good idea, I think.