Commit Graph

55 Commits

Author SHA1 Message Date
Jay Wren
0827a6f1c5 claude suggests this bound checking optimization
it is about 4.4% better on the bench-cpu-long cpu usage test
2025-10-23 13:27:37 -04:00
Jack Doan
4bea299265 don't send recv errors for packets outside the connection window anymore (#1463)
* don't send recv errors for packets outside the connection window anymore

* Pull in fix from #1459, add my opinion on maxRecvError

* remove recv_error counter entirely
2025-09-03 11:52:52 -05:00
Nate Brown
52623820c2 Drop inactive tunnels (#1427) 2025-07-03 09:58:37 -05:00
brad-defined
442a52879b Fix off by one error in IPv6 packet parser (#1419)
Some checks failed
gofmt / Run gofmt (push) Successful in 39s
smoke-extra / Run extra smoke tests (push) Failing after 21s
smoke / Run multi node smoke test (push) Failing after 1m21s
Build and test / Build all and test on ubuntu-linux (push) Failing after 18m35s
Build and test / Build and test on linux with boringcrypto (push) Failing after 2m36s
Build and test / Build and test on linux with pkcs11 (push) Failing after 2m28s
Build and test / Build and test on macos-latest (push) Has been cancelled
Build and test / Build and test on windows-latest (push) Has been cancelled
2025-06-11 15:15:15 -04:00
Wade Simmons
b8ea55eb90 optimize usage of bart (#1395)
Some checks failed
gofmt / Run gofmt (push) Successful in 9s
smoke-extra / Run extra smoke tests (push) Failing after 19s
smoke / Run multi node smoke test (push) Failing after 1m19s
Build and test / Build all and test on ubuntu-linux (push) Failing after 18m41s
Build and test / Build and test on linux with boringcrypto (push) Failing after 2m47s
Build and test / Build and test on linux with pkcs11 (push) Failing after 2m47s
Build and test / Build and test on macos-latest (push) Has been cancelled
Build and test / Build and test on windows-latest (push) Has been cancelled
Use `bart.Lite` and `.Contains` as suggested by the bart maintainer:

- 9455952eed (commitcomment-155362580)
2025-04-18 12:37:20 -04:00
Nate Brown
d97ed57a19 V2 certificate format (#1216)
Co-authored-by: Nate Brown <nbrown.us@gmail.com>
Co-authored-by: Jack Doan <jackdoan@rivian.com>
Co-authored-by: brad-defined <77982333+brad-defined@users.noreply.github.com>
Co-authored-by: Jack Doan <me@jackdoan.com>
2025-03-06 11:28:26 -06:00
Nate Brown
08ac65362e Cert interface (#1212) 2024-10-10 18:00:22 -05:00
Nate Brown
e264a0ff88 Switch most everything to netip in prep for ipv6 in the overlay (#1173) 2024-07-31 10:18:56 -05:00
Nate Brown
c1711bc9c5 Remove tcp rtt tracking from the firewall (#1114) 2024-04-11 21:44:22 -05:00
Wade Simmons
fe16ea566d firewall reject packets: cleanup error cases (#957) 2023-11-13 12:43:51 -06:00
Nate Brown
a44e1b8b05 Clean up a hostinfo to reduce memory usage (#955) 2023-11-02 16:53:59 -05:00
Nate Brown
5a131b2975 Combine ca, cert, and key handling (#952) 2023-08-14 21:32:40 -05:00
Nate Brown
a10baeee92 Pull hostmap and pending hostmap apart, remove unused functions (#843) 2023-07-24 12:37:52 -05:00
Nate Brown
03e4a7f988 Rehandshaking (#838)
Co-authored-by: Brad Higgins <brad@defined.net>
Co-authored-by: Wade Simmons <wadey@slack-corp.com>
2023-05-04 15:16:37 -05:00
brad-defined
9b03053191 update EncReader and EncWriter interface function args to have concrete types (#844)
* Update LightHouseHandlerFunc to remove EncWriter param.
* Move EncWriter to interface
* EncReader, too
2023-04-07 14:28:37 -04:00
Nate Brown
ee8e1348e9 Use connection manager to drive NAT maintenance (#835)
Co-authored-by: brad-defined <77982333+brad-defined@users.noreply.github.com>
2023-03-31 15:45:05 -05:00
Nate Brown
1a6c657451 Normalize logs (#837) 2023-03-30 15:07:31 -05:00
brad-defined
2801fb2286 Fix relay (#827)
Co-authored-by: Nate Brown <nbrown.us@gmail.com>
2023-03-30 11:09:20 -05:00
Wade Simmons
6e0ae4f9a3 firewall: add option to send REJECT replies (#738)
* firewall: add option to send REJECT replies

This change allows you to configure the firewall to send REJECT packets
when a packet is denied.

    firewall:
      # Action to take when a packet is not allowed by the firewall rules.
      # Can be one of:
      #   `drop` (default): silently drop the packet.
      #   `reject`: send a reject reply.
      #     - For TCP, this will be a RST "Connection Reset" packet.
      #     - For other protocols, this will be an ICMP port unreachable packet.
      outbound_action: drop
      inbound_action: drop

These packets are only sent to established tunnels, and only on the
overlay network (currently IPv4 only).

    $ ping -c1 192.168.100.3
    PING 192.168.100.3 (192.168.100.3) 56(84) bytes of data.
    From 192.168.100.3 icmp_seq=2 Destination Port Unreachable

    --- 192.168.100.3 ping statistics ---
    2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 31ms

    $ nc -nzv 192.168.100.3 22
    (UNKNOWN) [192.168.100.3] 22 (?) : Connection refused

This change also modifies the smoke test to capture tcpdump pcaps from
both the inside and outside to inspect what is going on over the wire.
It also now does TCP and UDP packet tests using the Nmap version of
ncat.

* calculate seq and ack the same was as the kernel

The logic a bit confusing, so we copy it straight from how the kernel
does iptables `--reject-with tcp-reset`:

- https://github.com/torvalds/linux/blob/v5.19/net/ipv4/netfilter/nf_reject_ipv4.c#L193-L221

* cleanup
2023-03-13 15:08:40 -04:00
Nate Brown
92cc32f844 Remove handshake race avoidance (#820)
Co-authored-by: Wade Simmons <wadey@slack-corp.com>
2023-03-13 12:35:14 -05:00
Nate Brown
a06977bbd5 Track connections by local index id instead of vpn ip (#807) 2023-02-13 14:41:05 -06:00
Caleb Jasik
12dbbd3dd3 Fix typos found by https://github.com/crate-ci/typos (#735) 2022-12-19 11:28:27 -06:00
Nate Brown
4c0ae3df5e Refuse to process double encrypted packets (#741) 2022-09-19 12:47:48 -05:00
Wade Simmons
7b9287709c add listen.send_recv_error config option (#670)
By default, Nebula replies to packets it has no tunnel for with a `recv_error` packet. This packet helps speed up re-connection
in the case that Nebula on either side did not shut down cleanly. This response can be abused as a way to discover if Nebula is running
on a host though. This option lets you configure if you want to send `recv_error` packets always, never, or only to private network remotes.
valid values: always, never, private

This setting is reloadable with SIGHUP.
2022-06-27 12:37:54 -04:00
brad-defined
1a7c575011 Relay (#678)
Co-authored-by: Wade Simmons <wsimmons@slack-corp.com>
2022-06-21 13:35:23 -05:00
Wade Simmons
45d1d2b6c6 Update dependencies - 2022-04 (#664)
Updated  github.com/kardianos/service         https://github.com/kardianos/service/compare/v1.2.0...v1.2.1
    Updated  github.com/miekg/dns                 https://github.com/miekg/dns/compare/v1.1.43...v1.1.48
    Updated  github.com/prometheus/client_golang  https://github.com/prometheus/client_golang/compare/v1.11.0...v1.12.1
    Updated  github.com/prometheus/common         https://github.com/prometheus/common/compare/v0.32.1...v0.33.0
    Updated  github.com/stretchr/testify          https://github.com/stretchr/testify/compare/v1.7.0...v1.7.1
    Updated  golang.org/x/crypto                  5770296d90...ae2d96664a
    Updated  golang.org/x/net                     69e39bad7d...749bd193bc
    Updated  golang.org/x/sys                     7861aae155...289d7a0edf
    Updated  golang.zx2c4.com/wireguard/windows   v0.5.1...v0.5.3
    Updated  google.golang.org/protobuf           v1.27.1...v1.28.0
2022-04-18 12:12:25 -04:00
Nate Brown
312a01dc09 Lighthouse reload support (#649)
Co-authored-by: John Maguire <contact@johnmaguire.me>
2022-03-14 12:35:13 -05:00
Wade Simmons
949ec78653 don't set ConnectionState to nil (#590)
* don't set ConnectionState to nil

We might have packets processing in another thread, so we can't safely
just set this to nil. Since we removed it from the hostmaps, the next
packets to process should start the handshake over again.

I believe this comment is outdated or incorrect, since the next
handshake will start over with a new HostInfo, I don't think there is
any way a counter reuse could happen:

> We must null the connectionstate or a counter reuse may happen

Here is a panic we saw that I think is related:

    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x93a037]
    goroutine 59 [running, locked to thread]:
    github.com/slackhq/nebula.(*Firewall).Drop(...)
            github.com/slackhq/nebula/firewall.go:380
    github.com/slackhq/nebula.(*Interface).consumeInsidePacket(...)
            github.com/slackhq/nebula/inside.go:59
    github.com/slackhq/nebula.(*Interface).listenIn(...)
            github.com/slackhq/nebula/interface.go:233
    created by github.com/slackhq/nebula.(*Interface).run
            github.com/slackhq/nebula/interface.go:191

* use closeTunnel
2021-12-06 14:09:05 -05:00
Nate Brown
bcabcfdaca Rework some things into packages (#489) 2021-11-03 20:54:04 -05:00
Wade Simmons
ea2c186a77 remote_allow_ranges: allow inside CIDR specific remote_allow_lists (#540)
This allows you to configure remote allow lists specific to different
subnets of the inside CIDR. Example:

    remote_allow_ranges:
      10.42.42.0/24:
        192.168.0.0/16: true

This would only allow hosts with a VPN IP in the 10.42.42.0/24 range to
have private IPs (and thus don't connect over public IPs).

The PR also refactors AllowList into RemoteAllowList and LocalAllowList to make it clearer which methods are allowed on which allow list.
2021-10-19 10:54:30 -04:00
Andrii Chubatiuk
d13f4b5948 fixed recv_errors spoofing condition (#482)
Hi @nbrownus
Fixed a small bug that was introduced in
df7c7ee#diff-5d05d02296a1953fd5fbcb3f4ab486bc5f7c34b14c3bdedb068008ec8ff5beb4
having problems due to it
2021-06-03 13:04:04 -04:00
Nathan Brown
df7c7eec4a Get out faster on nil udpAddr (#449) 2021-04-26 20:21:47 -05:00
Nathan Brown
6f37280e8e Fully close tunnels when CloseAllTunnels is called (#448) 2021-04-26 10:42:24 -05:00
Nathan Brown
710df6a876 Refactor remotes and handshaking to give every address a fair shot (#437) 2021-04-14 13:50:09 -05:00
Nathan Brown
75f7bda0a4 Lighthouse performance pass (#418) 2021-03-31 17:32:02 -05:00
Nathan Brown
883e09a392 Don't use a global ca pool (#426) 2021-03-29 12:10:19 -05:00
Nathan Brown
3ea7e1b75f Don't use a global logger (#423) 2021-03-26 09:46:30 -05:00
Nathan Brown
7073d204a8 IPv6 support for outside (udp) (#369) 2021-03-18 20:37:24 -05:00
Wade Simmons
6c55d67f18 Refactor handshake_ix (#401)
There are some subtle race conditions with the previous handshake_ix implementation, mostly around collisions with localIndexId. This change refactors it so that we have a "commit" phase during the handshake where we grab the lock for the hostmap and ensure that we have a unique local index before storing it. We also now avoid using the pending hostmap at all for receiving stage1 packets, since we have everything we need to just store the completed handshake.

Co-authored-by: Nate Brown <nbrown.us@gmail.com>
Co-authored-by: Ryan Huber <rhuber@gmail.com>
Co-authored-by: forfuncsake <drussell@slack-corp.com>
2021-03-12 14:16:25 -05:00
Wade Simmons
64d8035d09 fix race in getOrHandshake (#400)
We missed this race with #396 (and I think this is also the crash in
issue #226). We need to lock a little higher in the getOrHandshake
method, before we reset hostinfo.ConnectionInfo. Previously, two
routines could enter this section and confuse the handshake process.

This could result in the other side sending a recv_error that also has
a race with setting hostinfo.ConnectionInfo back to nil. So we make sure
to grab the lock in handleRecvError as well.

Neither of these code paths are in the hot path (handling packets
between two hosts over an active tunnel) so there should be no
performance concerns.
2021-03-09 09:27:02 -05:00
Wade Simmons
2a4beb41b9 Routine-local conntrack cache (#391)
Previously, every packet we see gets a lock on the conntrack table and updates it. When running with multiple routines, this can cause heavy lock contention and limit our ability for the threads to run independently. This change caches reads from the conntrack table for a very short period of time to reduce this lock contention. This cache will currently default to disabled unless you are running with multiple routines, in which case the default cache delay will be 1 second. This means that entries in the conntrack table may be up to 1 second out of date and remain in a routine local cache for up to 1 second longer than the global table.

Instead of calling time.Now() for every packet, this cache system relies on a tick thread that updates the current cache "version" each tick. Every packet we check if the cache version is out of date, and reset the cache if so.
2021-03-01 19:52:17 -05:00
Wade Simmons
1bae5b2550 more validation in pending hostmap deletes (#344)
We are currently seeing some cases where we are not deleting entries
correctly from the pending hostmap. I believe this is a case of
an inbound timer tick firing and deleting the Hosts map entry for
a newer handshake attempt than intended, thus leaving the old Indexes
entry orphaned. This change adds some extra checking when deleteing from
the Indexes and Hosts maps to ensure we clean everything up correctly.
2021-03-01 12:40:46 -05:00
Tim Rots
e7e6a23cde fix a few typos (#302) 2021-03-01 11:14:34 -05:00
Wade Simmons
27d9a67dda Proper multiqueue support for tun devices (#382)
This change is for Linux only.

Previously, when running with multiple tun.routines, we would only have one file descriptor. This change instead sets IFF_MULTI_QUEUE and opens a file descriptor for each routine. This allows us to process with multiple threads while preventing out of order packet reception issues.

To attempt to distribute the flows across the queues, we try to write to the tun/UDP queue that corresponds with the one we read from. So if we read a packet from tun queue "2", we will write the outgoing encrypted packet to UDP queue "2". Because of the nature of how multi queue works with flows, a given host tunnel will be sticky to a given routine (so if you try to performance benchmark by only using one tunnel between two hosts, you are only going to be using a max of one thread for each direction).

Because this system works much better when we can correlate flows between the tun and udp routines, we are deprecating the undocumented "tun.routines" and "listen.routines" parameters and introducing a new "routines" parameter that sets the value for both. If you use the old undocumented parameters, the max of the values will be used and a warning logged.

Co-authored-by: Nate Brown <nbrown.us@gmail.com>
2021-02-25 15:01:14 -05:00
Wade Simmons
ee7c27093c add HostMap.RemoteIndexes (#329)
This change adds an index based on HostInfo.remoteIndexId. This allows
us to use HostMap.QueryReverseIndex without having to loop over all
entries in the map (this can be a bottleneck under high traffic
lighthouses).

Without this patch, a high traffic lighthouse server receiving recv_error
packets and lots of handshakes, cpu pprof trace can look like this:

      flat  flat%   sum%        cum   cum%
    2000ms 32.26% 32.26%     3040ms 49.03%  github.com/slackhq/nebula.(*HostMap).QueryReverseIndex
     870ms 14.03% 46.29%     1060ms 17.10%  runtime.mapiternext

Which shows 50% of total cpu time is being spent in QueryReverseIndex.
2020-11-23 14:51:16 -05:00
Wade Simmons
2e7ca027a4 Lighthouse handler optimizations (#320)
We noticed that the number of memory allocations LightHouse.HandleRequest creates for each call can seriously impact performance for high traffic lighthouses. This PR introduces a benchmark in the first commit and then optimizes memory usage by creating a LightHouseHandler struct. This struct allows us to re-use memory between each lighthouse request (one instance per UDP listener go-routine).
2020-11-23 14:50:01 -05:00
Wade Simmons
aba42f9fa6 enforce the use of goimports (#248)
* enforce the use of goimports

Instead of enforcing `gofmt`, enforce `goimports`, which also asserts
a separate section for non-builtin packages.

* run `goimports` everywhere

* exclude generated .pb.go files
2020-06-30 18:53:30 -04:00
Wade Simmons
b37a91cfbc add meta packet statistics (#230)
This change add more metrics around "meta" (non "message" type packets).
For lighthouse packets, we also record statistics around the specific
lighthouse meta type.

We don't keep statistics for the "message" type so that we don't slow
down the fast path (and you can just look at metrics on the tun
interface to find that information).
2020-06-26 13:45:48 -04:00
Patrick Bogen
ecf0e5a9f6 drop packets even if we aren't going to emit Debug logs about it (#239)
* drop packets even if we aren't going to emit Debug logs about it

* smallify change
2020-06-10 16:55:49 -05:00
Patrick Bogen
363c836422 log the reason for fw drops (#220)
* log the reason for fw drops

* only prepare log if we will end up sending it
2020-04-10 10:57:21 -07:00