Commit Graph

62 Commits

Author SHA1 Message Date
Jack Doan 2e9117da5b fix tunnels that could permanently escape connection-manager monitoring (#1752)
smoke-extra / freebsd-amd64 (push) Failing after 16s
smoke-extra / linux-amd64-ipv6disable (push) Failing after 15s
smoke-extra / netbsd-amd64 (push) Failing after 14s
smoke-extra / openbsd-amd64 (push) Failing after 16s
smoke-extra / linux-386 (push) Failing after 17s
smoke / Run multi node smoke test (push) Failing after 1m25s
Build and test / Static checks (push) Successful in 1m42s
Build and test / Test linux (push) Failing after 2m17s
Build and test / Test linux-boringcrypto (push) Failing after 3m9s
Build and test / Test linux-pkcs11 (push) Failing after 2m54s
Build and test / Cross-build linux-arm (push) Successful in 3m3s
Build and test / Cross-build linux-mips (push) Successful in 3m44s
Build and test / Cross-build linux-other (push) Successful in 3m7s
Build and test / Cross-build windows (push) Successful in 59s
Build and test / Cross-build freebsd (push) Successful in 1m33s
Build and test / Cross-build netbsd (push) Successful in 1m34s
Build and test / Cross-build openbsd (push) Successful in 1m33s
Build and test / Cross-build mobile (push) Successful in 3m15s
smoke-extra / Run windows smoke test (push) Has been cancelled
Build and test / Test macos (push) Has been cancelled
Build and test / Test windows (push) Has been cancelled
Build and test / CI status (push) Has been cancelled
2026-06-10 11:03:23 -05:00
Jack Doan 3db406b8ac fix a race in RelayState.CopyRelayIps (#1753) 2026-06-10 09:27:15 -05:00
Nate Brown d0f02ba873 Switch to slog, remove logrus (#1672) 2026-04-27 09:41:47 -05:00
Nate Brown 2f4532f102 No more dns globals, proper cleanup on shutdown (#1667) 2026-04-21 12:41:10 -05:00
Nate Brown 56067afca2 Stab at better logging when a relay is being used (#1533)
gofmt / Run gofmt (push) Failing after 5s
smoke-extra / Run extra smoke tests (push) Failing after 2s
smoke / Run multi node smoke test (push) Failing after 3s
Build and test / Build all and test on ubuntu-linux (push) Failing after 2s
Build and test / Build and test on linux with boringcrypto (push) Failing after 3s
Build and test / Build and test on linux with pkcs11 (push) Failing after 2s
Build and test / Build and test on macos-latest (push) Has been cancelled
Build and test / Build and test on windows-latest (push) Has been cancelled
2025-12-03 17:48:29 -06:00
Jack Doan a89f95182c Firewall types and cross-stack subnet stuff (#1509)
* firewall can distinguish if the host connecting has an overlapping network, is a VPN peer without an overlapping network, or is a unsafe network

* Cross stack subnet stuff (#1512)

* experiment with not filtering out non-common addresses in hostinfo.networks

* allow handshakes without overlaps

* unsafe network test

* change HostInfo.buildNetworks argument to reference the cert
2025-11-12 13:40:20 -06:00
Gary Guo 634181ba66 Fix incorrect CIDR construction in hostmap (#1493)
* Fix incorrect CIDR construction in hostmap

* Introduce a regression test for incorrect hostmap CIDR
2025-10-08 11:02:36 -05:00
Jack Doan 4bea299265 don't send recv errors for packets outside the connection window anymore (#1463)
* don't send recv errors for packets outside the connection window anymore

* Pull in fix from #1459, add my opinion on maxRecvError

* remove recv_error counter entirely
2025-09-03 11:52:52 -05:00
Nate Brown 52623820c2 Drop inactive tunnels (#1427) 2025-07-03 09:58:37 -05:00
brad-defined b158eb0c4c Use a list for relay IPs instead of a map (#1423)
* Use a list for relay IPs instead of a map

* linter
2025-07-02 08:47:05 -04:00
Wade Simmons b8ea55eb90 optimize usage of bart (#1395)
gofmt / Run gofmt (push) Successful in 9s
smoke-extra / Run extra smoke tests (push) Failing after 19s
smoke / Run multi node smoke test (push) Failing after 1m19s
Build and test / Build all and test on ubuntu-linux (push) Failing after 18m41s
Build and test / Build and test on linux with boringcrypto (push) Failing after 2m47s
Build and test / Build and test on linux with pkcs11 (push) Failing after 2m47s
Build and test / Build and test on macos-latest (push) Has been cancelled
Build and test / Build and test on windows-latest (push) Has been cancelled
Use `bart.Lite` and `.Contains` as suggested by the bart maintainer:

- https://github.com/gaissmai/bart/commit/9455952eedcf59a6e755fc28ed16e906fa4f3066#commitcomment-155362580
2025-04-18 12:37:20 -04:00
Nate Brown d97ed57a19 V2 certificate format (#1216)
Co-authored-by: Nate Brown <nbrown.us@gmail.com>
Co-authored-by: Jack Doan <jackdoan@rivian.com>
Co-authored-by: brad-defined <77982333+brad-defined@users.noreply.github.com>
Co-authored-by: Jack Doan <me@jackdoan.com>
2025-03-06 11:28:26 -06:00
Nate Brown 08ac65362e Cert interface (#1212) 2024-10-10 18:00:22 -05:00
Nate Brown e264a0ff88 Switch most everything to netip in prep for ipv6 in the overlay (#1173) 2024-07-31 10:18:56 -05:00
Nate Brown a390125935 Support reloading preferred_ranges (#1043) 2024-04-03 22:14:51 -05:00
Nate Brown 072edd56b3 Fix re-entrant GetOrHandshake issues (#1044) 2023-12-19 11:58:31 -06:00
Nate Brown 5181cb0474 Use generics for CIDRTrees to avoid casting issues (#1004) 2023-11-02 17:05:08 -05:00
Nate Brown a44e1b8b05 Clean up a hostinfo to reduce memory usage (#955) 2023-11-02 16:53:59 -05:00
Nate Brown 076ebc6c6e Simplify getting a hostinfo or starting a handshake with one (#954) 2023-08-21 18:51:45 -05:00
Nate Brown 223cc6e660 Limit how often a busy tunnel can requery the lighthouse (#940)
Co-authored-by: Wade Simmons <wadey@slack-corp.com>
2023-08-08 13:26:41 -05:00
Nate Brown a10baeee92 Pull hostmap and pending hostmap apart, remove unused functions (#843) 2023-07-24 12:37:52 -05:00
Nate Brown 03e4a7f988 Rehandshaking (#838)
Co-authored-by: Brad Higgins <brad@defined.net>
Co-authored-by: Wade Simmons <wadey@slack-corp.com>
2023-05-04 15:16:37 -05:00
Nate Brown ee8e1348e9 Use connection manager to drive NAT maintenance (#835)
Co-authored-by: brad-defined <77982333+brad-defined@users.noreply.github.com>
2023-03-31 15:45:05 -05:00
brad-defined 2801fb2286 Fix relay (#827)
Co-authored-by: Nate Brown <nbrown.us@gmail.com>
2023-03-30 11:09:20 -05:00
Nate Brown f0ef80500d Remove dead code and re-order transit from pending to main hostmap on stage 2 (#828) 2023-03-17 15:36:24 -05:00
Wade Simmons e1af37e46d add calculated_remotes (#759)
* add calculated_remotes

This setting allows us to "guess" what the remote might be for a host
while we wait for the lighthouse response. For networks that hard
designed with in mind, it can help speed up handshake performance, as well as
improve resiliency in the case that all lighthouses are down.

Example:

    lighthouse:
      # ...

      calculated_remotes:
        # For any Nebula IPs in 10.0.10.0/24, this will apply the mask and add
        # the calculated IP as an initial remote (while we wait for the response
        # from the lighthouse). Both CIDRs must have the same mask size.
        # For example, Nebula IP 10.0.10.123 will have a calculated remote of
        # 192.168.1.123

        10.0.10.0/24:
          - mask: 192.168.1.0/24
            port: 4242

* figure out what is up with this test

* add test

* better logic for sending handshakes

Keep track of the last light of hosts we sent handshakes to. Only log
handshake sent messages if the list has changed.

Remove the test Test_NewHandshakeManagerTrigger because it is faulty and
makes no sense. It relys on the fact that no handshake packets actually
get sent, but with these changes we would send packets now (which it
should!)

* use atomic.Pointer

* cleanup to make it clearer

* fix typo in example
2023-03-13 15:09:08 -04:00
Nate Brown 92cc32f844 Remove handshake race avoidance (#820)
Co-authored-by: Wade Simmons <wadey@slack-corp.com>
2023-03-13 12:35:14 -05:00
Nate Brown a06977bbd5 Track connections by local index id instead of vpn ip (#807) 2023-02-13 14:41:05 -06:00
Wade Simmons 9af242dc47 switch to new sync/atomic helpers in go1.19 (#728)
These new helpers make the code a lot cleaner. I confirmed that the
simple helpers like `atomic.Int64` don't add any extra overhead as they
get inlined by the compiler. `atomic.Pointer` adds an extra method call
as it no longer gets inlined, but we aren't using these on the hot path
so it is probably okay.
2022-10-31 13:37:41 -04:00
brad-defined 1a7c575011 Relay (#678)
Co-authored-by: Wade Simmons <wsimmons@slack-corp.com>
2022-06-21 13:35:23 -05:00
Wade Simmons 949ec78653 don't set ConnectionState to nil (#590)
* don't set ConnectionState to nil

We might have packets processing in another thread, so we can't safely
just set this to nil. Since we removed it from the hostmaps, the next
packets to process should start the handshake over again.

I believe this comment is outdated or incorrect, since the next
handshake will start over with a new HostInfo, I don't think there is
any way a counter reuse could happen:

> We must null the connectionstate or a counter reuse may happen

Here is a panic we saw that I think is related:

    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x93a037]
    goroutine 59 [running, locked to thread]:
    github.com/slackhq/nebula.(*Firewall).Drop(...)
            github.com/slackhq/nebula/firewall.go:380
    github.com/slackhq/nebula.(*Interface).consumeInsidePacket(...)
            github.com/slackhq/nebula/inside.go:59
    github.com/slackhq/nebula.(*Interface).listenIn(...)
            github.com/slackhq/nebula/interface.go:233
    created by github.com/slackhq/nebula.(*Interface).run
            github.com/slackhq/nebula/interface.go:191

* use closeTunnel
2021-12-06 14:09:05 -05:00
Nate Brown 467e605d5e Push route handling into overlay, a few more nits fixed (#581) 2021-11-12 11:19:28 -06:00
Nate Brown e07524a654 Move all of tun into overlay (#577) 2021-11-11 16:37:29 -06:00
Wade Simmons 304b12f63f create ConnectionState before adding to HostMap (#535)
We have a few small race conditions with creating the HostInfo.ConnectionState
since we add the host info to the pendingHostMap before we set this
field. We can make everything a lot easier if we just add an "init"
function so that we can set this field in the hostinfo before we add it
to the hostmap.
2021-11-08 14:46:22 -05:00
Nate Brown bcabcfdaca Rework some things into packages (#489) 2021-11-03 20:54:04 -05:00
brad-defined 6ae8ba26f7 Add a context object in nebula.Main to clean up on error (#550) 2021-11-02 13:14:26 -05:00
Wade Simmons ea2c186a77 remote_allow_ranges: allow inside CIDR specific remote_allow_lists (#540)
This allows you to configure remote allow lists specific to different
subnets of the inside CIDR. Example:

    remote_allow_ranges:
      10.42.42.0/24:
        192.168.0.0/16: true

This would only allow hosts with a VPN IP in the 10.42.42.0/24 range to
have private IPs (and thus don't connect over public IPs).

The PR also refactors AllowList into RemoteAllowList and LocalAllowList to make it clearer which methods are allowed on which allow list.
2021-10-19 10:54:30 -04:00
Wade Simmons ae5505bc74 handshake: update to preferred remote (#532)
If we receive a handshake packet for a tunnel that has already been
completed, check to see if the new remote is preferred. If so, update to
the preferred remote and send a test packet to influence the other side
to do the same.
2021-10-19 10:53:55 -04:00
Nate Brown d004fae4f9 Unlock the hostmap quickly, lock hostinfo instead (#459) 2021-05-05 13:10:55 -05:00
Wade Simmons 44cb697552 Add more metrics (#450)
* Add more metrics

This change adds the following counter metrics:

Metrics to track packets dropped at the firewall:

    firewall.dropped.local_ip
    firewall.dropped.remote_ip
    firewall.dropped.no_rule

Metrics to track handshakes attempts that have been initiated and ones
that have timed out (ones that have completed are tracked by the
existing "handshakes" histogram).

    handshake_manager.initiated
    handshake_manager.timed_out

Metrics to track when cached_packets are dropped because we run out of
buffer space, and how many are sent once the handshake completes.

    hostinfo.cached_packets.dropped
    hostinfo.cached_packets.sent

This change also notes how many cached packets we have when we log the
final "Handshake received" message for either stage1 for stage2.

* separate incoming/outgoing metrics

* remove "allowed" firewall metrics

We don't need this on the hotpath, they aren't worh it.

* don't need pointers here
2021-04-27 22:23:18 -04:00
Nathan Brown db23fdf9bc Dont apply race avoidance to existing handshakes, use the handshake time to determine who wins (#451)
Co-authored-by: Wade Simmons <wadey@slack-corp.com>
2021-04-27 21:15:34 -05:00
Nathan Brown 710df6a876 Refactor remotes and handshaking to give every address a fair shot (#437) 2021-04-14 13:50:09 -05:00
Nathan Brown 480036fbc8 Remove unused structs in hostmap.go (#430) 2021-04-01 22:07:11 -05:00
Nathan Brown 0c2e5973e1 Simple lie test (#427) 2021-03-31 10:26:35 -05:00
Wade Simmons 4603b5b2dd fix PromoteEvery check (#424)
This check was accidentally typo'd in #396 from `%` to `&`. Restore the
correct functionality here (we want to do the check every "PromoteEvery"
count packets).
2021-03-26 15:01:05 -04:00
Nathan Brown 3ea7e1b75f Don't use a global logger (#423) 2021-03-26 09:46:30 -05:00
Nathan Brown 7a9f9dbded Don't craft buffers if we don't need them (#416) 2021-03-22 18:25:06 -05:00
Nathan Brown 7073d204a8 IPv6 support for outside (udp) (#369) 2021-03-18 20:37:24 -05:00
Wade Simmons 6c55d67f18 Refactor handshake_ix (#401)
There are some subtle race conditions with the previous handshake_ix implementation, mostly around collisions with localIndexId. This change refactors it so that we have a "commit" phase during the handshake where we grab the lock for the hostmap and ensure that we have a unique local index before storing it. We also now avoid using the pending hostmap at all for receiving stage1 packets, since we have everything we need to just store the completed handshake.

Co-authored-by: Nate Brown <nbrown.us@gmail.com>
Co-authored-by: Ryan Huber <rhuber@gmail.com>
Co-authored-by: forfuncsake <drussell@slack-corp.com>
2021-03-12 14:16:25 -05:00
Wade Simmons d604270966 Fix most known data races (#396)
This change fixes all of the known data races that `make smoke-docker-race` finds, except for one.

Most of these races are around the handshake phase for a hostinfo, so we add a RWLock to the hostinfo and Lock during each of the handshake stages.

Some of the other races are around consistently using `atomic` around the `messageCounter` field. To make this harder to mess up, I have renamed the field to `atomicMessageCounter` (I also removed the unnecessary extra pointer deference as we can just point directly to the struct field).

The last remaining data race is around reading `ConnectionInfo.ready`, which is a boolean that is only written to once when the handshake has finished. Due to it being in the hot path for packets and the rare case that this could actually be an issue, holding off on fixing that one for now.

here is the results of `make smoke-docker-race`:

before:

    lighthouse1: Found 2 data race(s)
    host2:       Found 36 data race(s)
    host3:       Found 17 data race(s)
    host4:       Found 31 data race(s)

after:

    host2: Found 1 data race(s)
    host4: Found 1 data race(s)

Fixes: #147
Fixes: #226
Fixes: #283
Fixes: #316
2021-03-05 21:18:33 -05:00