Co-authored-by: Nate Brown <nbrown.us@gmail.com>
Co-authored-by: Jack Doan <jackdoan@rivian.com>
Co-authored-by: brad-defined <77982333+brad-defined@users.noreply.github.com>
Co-authored-by: Jack Doan <me@jackdoan.com>
experimental test to see if we can have a test mode that verifies
mutexes lock in the order we want, while having no hit on production
performance. Since this uses a build tag, it should all compile out
during the build process and be a no-op unless the tag is set.
* add calculated_remotes
This setting allows us to "guess" what the remote might be for a host
while we wait for the lighthouse response. For networks that hard
designed with in mind, it can help speed up handshake performance, as well as
improve resiliency in the case that all lighthouses are down.
Example:
lighthouse:
# ...
calculated_remotes:
# For any Nebula IPs in 10.0.10.0/24, this will apply the mask and add
# the calculated IP as an initial remote (while we wait for the response
# from the lighthouse). Both CIDRs must have the same mask size.
# For example, Nebula IP 10.0.10.123 will have a calculated remote of
# 192.168.1.123
10.0.10.0/24:
- mask: 192.168.1.0/24
port: 4242
* figure out what is up with this test
* add test
* better logic for sending handshakes
Keep track of the last light of hosts we sent handshakes to. Only log
handshake sent messages if the list has changed.
Remove the test Test_NewHandshakeManagerTrigger because it is faulty and
makes no sense. It relys on the fact that no handshake packets actually
get sent, but with these changes we would send packets now (which it
should!)
* use atomic.Pointer
* cleanup to make it clearer
* fix typo in example
These new helpers make the code a lot cleaner. I confirmed that the
simple helpers like `atomic.Int64` don't add any extra overhead as they
get inlined by the compiler. `atomic.Pointer` adds an extra method call
as it no longer gets inlined, but we aren't using these on the hot path
so it is probably okay.
* don't set ConnectionState to nil
We might have packets processing in another thread, so we can't safely
just set this to nil. Since we removed it from the hostmaps, the next
packets to process should start the handshake over again.
I believe this comment is outdated or incorrect, since the next
handshake will start over with a new HostInfo, I don't think there is
any way a counter reuse could happen:
> We must null the connectionstate or a counter reuse may happen
Here is a panic we saw that I think is related:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x93a037]
goroutine 59 [running, locked to thread]:
github.com/slackhq/nebula.(*Firewall).Drop(...)
github.com/slackhq/nebula/firewall.go:380
github.com/slackhq/nebula.(*Interface).consumeInsidePacket(...)
github.com/slackhq/nebula/inside.go:59
github.com/slackhq/nebula.(*Interface).listenIn(...)
github.com/slackhq/nebula/interface.go:233
created by github.com/slackhq/nebula.(*Interface).run
github.com/slackhq/nebula/interface.go:191
* use closeTunnel
We have a few small race conditions with creating the HostInfo.ConnectionState
since we add the host info to the pendingHostMap before we set this
field. We can make everything a lot easier if we just add an "init"
function so that we can set this field in the hostinfo before we add it
to the hostmap.
This allows you to configure remote allow lists specific to different
subnets of the inside CIDR. Example:
remote_allow_ranges:
10.42.42.0/24:
192.168.0.0/16: true
This would only allow hosts with a VPN IP in the 10.42.42.0/24 range to
have private IPs (and thus don't connect over public IPs).
The PR also refactors AllowList into RemoteAllowList and LocalAllowList to make it clearer which methods are allowed on which allow list.
If we receive a handshake packet for a tunnel that has already been
completed, check to see if the new remote is preferred. If so, update to
the preferred remote and send a test packet to influence the other side
to do the same.
* Add more metrics
This change adds the following counter metrics:
Metrics to track packets dropped at the firewall:
firewall.dropped.local_ip
firewall.dropped.remote_ip
firewall.dropped.no_rule
Metrics to track handshakes attempts that have been initiated and ones
that have timed out (ones that have completed are tracked by the
existing "handshakes" histogram).
handshake_manager.initiated
handshake_manager.timed_out
Metrics to track when cached_packets are dropped because we run out of
buffer space, and how many are sent once the handshake completes.
hostinfo.cached_packets.dropped
hostinfo.cached_packets.sent
This change also notes how many cached packets we have when we log the
final "Handshake received" message for either stage1 for stage2.
* separate incoming/outgoing metrics
* remove "allowed" firewall metrics
We don't need this on the hotpath, they aren't worh it.
* don't need pointers here
This check was accidentally typo'd in #396 from `%` to `&`. Restore the
correct functionality here (we want to do the check every "PromoteEvery"
count packets).
There are some subtle race conditions with the previous handshake_ix implementation, mostly around collisions with localIndexId. This change refactors it so that we have a "commit" phase during the handshake where we grab the lock for the hostmap and ensure that we have a unique local index before storing it. We also now avoid using the pending hostmap at all for receiving stage1 packets, since we have everything we need to just store the completed handshake.
Co-authored-by: Nate Brown <nbrown.us@gmail.com>
Co-authored-by: Ryan Huber <rhuber@gmail.com>
Co-authored-by: forfuncsake <drussell@slack-corp.com>
This change fixes all of the known data races that `make smoke-docker-race` finds, except for one.
Most of these races are around the handshake phase for a hostinfo, so we add a RWLock to the hostinfo and Lock during each of the handshake stages.
Some of the other races are around consistently using `atomic` around the `messageCounter` field. To make this harder to mess up, I have renamed the field to `atomicMessageCounter` (I also removed the unnecessary extra pointer deference as we can just point directly to the struct field).
The last remaining data race is around reading `ConnectionInfo.ready`, which is a boolean that is only written to once when the handshake has finished. Due to it being in the hot path for packets and the rare case that this could actually be an issue, holding off on fixing that one for now.
here is the results of `make smoke-docker-race`:
before:
lighthouse1: Found 2 data race(s)
host2: Found 36 data race(s)
host3: Found 17 data race(s)
host4: Found 31 data race(s)
after:
host2: Found 1 data race(s)
host4: Found 1 data race(s)
Fixes: #147Fixes: #226Fixes: #283Fixes: #316