Co-authored-by: Nate Brown <nbrown.us@gmail.com>
Co-authored-by: Jack Doan <jackdoan@rivian.com>
Co-authored-by: brad-defined <77982333+brad-defined@users.noreply.github.com>
Co-authored-by: Jack Doan <me@jackdoan.com>
A local index collision happens when two tunnels attempt to use the same
random int32 index ID. This is a rare chance, and we have code to deal
with it, but we have a panic because we return the wrong thing in this
case. This change should fix the panic.
We had a rare deadlock in GetOrHandshake because we kept the hostmap
lock when we do the call to StartHandshake. StartHandshake can block
while sending to the lighthouse query worker channel, and that worker
needs to be able to grab the hostmap lock to do its work. Other calls
for StartHandshake don't hold the hostmap lock so we should be able to
drop it here.
This lock was originally added with: https://github.com/slackhq/nebula/pull/954
* avoid deadlock in lighthouse queryWorker
If the lighthouse queryWorker tries to grab to call StartHandshake on
a lighthouse vpnIp, we can deadlock on the handshake_manager lock. This
change drops the handshake_manager lock before we send on the lighthouse
queryChan (which could block), and also avoids sending to the channel if
this is a lighthouse IP itself.
* need to hold lock during cacheCb
If the lighthouse queryWorker tries to grab to call StartHandshake on
a lighthouse vpnIp, we can deadlock on the handshake_manager lock. This
change drops the handshake_manager lock before we send on the lighthouse
queryChan (which could block), and also avoids sending to the channel if
this is a lighthouse IP itself.
* add calculated_remotes
This setting allows us to "guess" what the remote might be for a host
while we wait for the lighthouse response. For networks that hard
designed with in mind, it can help speed up handshake performance, as well as
improve resiliency in the case that all lighthouses are down.
Example:
lighthouse:
# ...
calculated_remotes:
# For any Nebula IPs in 10.0.10.0/24, this will apply the mask and add
# the calculated IP as an initial remote (while we wait for the response
# from the lighthouse). Both CIDRs must have the same mask size.
# For example, Nebula IP 10.0.10.123 will have a calculated remote of
# 192.168.1.123
10.0.10.0/24:
- mask: 192.168.1.0/24
port: 4242
* figure out what is up with this test
* add test
* better logic for sending handshakes
Keep track of the last light of hosts we sent handshakes to. Only log
handshake sent messages if the list has changed.
Remove the test Test_NewHandshakeManagerTrigger because it is faulty and
makes no sense. It relys on the fact that no handshake packets actually
get sent, but with these changes we would send packets now (which it
should!)
* use atomic.Pointer
* cleanup to make it clearer
* fix typo in example
We have a few small race conditions with creating the HostInfo.ConnectionState
since we add the host info to the pendingHostMap before we set this
field. We can make everything a lot easier if we just add an "init"
function so that we can set this field in the hostinfo before we add it
to the hostmap.
* Add more metrics
This change adds the following counter metrics:
Metrics to track packets dropped at the firewall:
firewall.dropped.local_ip
firewall.dropped.remote_ip
firewall.dropped.no_rule
Metrics to track handshakes attempts that have been initiated and ones
that have timed out (ones that have completed are tracked by the
existing "handshakes" histogram).
handshake_manager.initiated
handshake_manager.timed_out
Metrics to track when cached_packets are dropped because we run out of
buffer space, and how many are sent once the handshake completes.
hostinfo.cached_packets.dropped
hostinfo.cached_packets.sent
This change also notes how many cached packets we have when we log the
final "Handshake received" message for either stage1 for stage2.
* separate incoming/outgoing metrics
* remove "allowed" firewall metrics
We don't need this on the hotpath, they aren't worh it.
* don't need pointers here
There are some subtle race conditions with the previous handshake_ix implementation, mostly around collisions with localIndexId. This change refactors it so that we have a "commit" phase during the handshake where we grab the lock for the hostmap and ensure that we have a unique local index before storing it. We also now avoid using the pending hostmap at all for receiving stage1 packets, since we have everything we need to just store the completed handshake.
Co-authored-by: Nate Brown <nbrown.us@gmail.com>
Co-authored-by: Ryan Huber <rhuber@gmail.com>
Co-authored-by: forfuncsake <drussell@slack-corp.com>