The Nagle algorithm was created back in the day of multi-point networking. Multiple hosts were all tied to the same communications (Ethernet) channel, so they would use CSMA (https://en.wikipedia.org/wiki/Carrier-sense_multiple_access_...) to avoid collisions. CSMA is no longer necessary on Ethernet today because all modern connections are point-to-point with only two "hosts" per channel. (Each host can have any number of "users.") In fact, most modern (copper) (Gigabit+) Ethernet connections have both ends both transmitting and receiving AT THE SAME TIME ON THE SAME WIRES. A hybrid is used on the PHY at each end to subtract what is being transmitted from what is being received. Older (10/100 Base-T) can do the same thing because each end has dedicated TX/RX pairs. Fiber optic Ethernet can use either the same fiber with different wavelengths, or separate TX/RX fibers. I haven't seen a 10Base-2 Ethernet/DECnet interface for more than 25 years. If any are still operating somewhere, they are still using CSMA. CSMA is also still used for digital radio systems (WiFi and others). CSMA includes a "random exponential backoff timer" which does the (poor) job of managing congestion. (More modern congestion control methods exist today.) Back in the day, disabling the random backoff timer was somewhat equivalent to setting TCP_NODELAY.
Dumping the Nagle algorithm (by setting TCP_NODELAY) almost always makes sense, and TCP_NODELAY should be enabled by default.
False. It really was just intended to coalesce packets.
I’ll be nice and not attack the feature. But making that the default is one of the biggest mistakes in the history of networking (second only to TCP’s boneheaded congestion control that was designed imagining 56kbit links)
What did you still need to connect with 10mbit half duplex in 2014? I had gigabit to the desktop for a relatively small company in 2007, by 2014 10mb was pretty dead unless you had something Really Interesting connected....
If you worked in an industrial setting, legacy tech abounds due to the capital costs of replacing the equipment it supports (this includes manufacturing, older hospitals, power plants, etc.). Many of these even still use token ring, coax, etc.
One co-op job at a manufacturing plant I worked at ~20 years ago involved replacing the backend core networking equipment with more modern ethernet kit, but we had to set up media converters (in that case token ring to ethernet) as close as possible to the manufacturing equipment (so that token ring only ran between the equipment and the media converter for a few meters at most).
They were "lucky" in that:
1) the networking protocol that was supported by the manufacturing equipment was IPX/SPX, so at least that worked cleanly on ethernet and newer upstream control software running on an OS (HP-UX at the time)
2) there were no lives at stake (eg nuclear safety/hospital), so they had minimal regulatory issues.
Technical debt goes hard. I had a discussion with a facilities guy about why they never got around to ditching the last remnants of token ring in an office park. Fortunately, in 2020 they had plenty of time to rip that stuff out without disturbing facility operation. Building automation, security and so on often live way longer than you'd dare plan for.
Everyone is forgetting that no-delay is set per application and is not a system configuration. Yep, old things will still be old and that’s ok. That newfangled packet farter will need to set no-delay, which is already the default in many scenarios. This article reminds us it is a thing, and that is especially true for home-grown applications.
There is always some legacy device which does weird/old connections. I distinctly remember the debit card terminals in the late '00s required a 10mbit capable ethernet connection which allowed x25 to be transmitted over the network. It is not a stretch to add 5 to 10 more years to those kinds of devices.
There's plenty of use cases for small things which don't need any sorts of speeds, where you might as well have used a 115200 baud serial connection but ethernet is more useful. Designing electronics for 10Mbit/s is infinitely easier and cheaper than designing electronics for 100Mbit/s, so if you don't need 100Mbit/s, why would you spend the extra effort and expense?
There is also power consumption and reliability. I have part of my home network on 100Mbps. It eats about 60% less energy compared to Gb Ethernet. Less prone to interference from PoE.
Some old DEC devices used to connect console ports of servers. Didn't need it per se but also didn't need to spend $3k on multiple new console routers.
Was an old isp/mobile carrier so could find all kinds of old stuff. Even the first SMSC from the 80s (also DEC, 386 or similar cpu?) was still in its racks because they didn't need the rack space, as 2 modern racks used up all the power for that room. It was also far down in a mountain, so it was annoying to remove equipment.
Thanks for the clarification. They're so close to being the same thing that I always call it CSMA/CD. Avoiding a collision is far more preferable than just detecting one.
Yeah, many enterprise switches don't even support 100Base-T or 10Base-T anymore. I've had to daisy chain an old switch that supports 100Base-T onto a modern one a few times myself. If you drop 10/100 support, you can also drop HD (half-duplex) support. In my junk drawer, I still have a few old 10/100 hubs (not switches), which are by definition always HD.
Is avoiding a collision always preferable? CSMA/CA has significant overhead (backoff period) for every single frame sent; on a less congested line, CSMA/CD has less overhead.
Nagle is quite sensible when your application isn't taking any care to create sensibly-sized packets, and isn't so sensitive to latency. It avoids creating stupidly small packets unless your network is fast enough to handle them.
At this point, this is an application level problem and not something the kernel should be silently doing for you IMO. An option for legacy systems or known problematic hosts fine, but off by default and probably not a per SOCKOPT.
Every modern language has buffers in their stdlib. Anyone writing character at a time to the wire lazily or unintentionally should fix their application.
>> TCP_NODELAY can also make fingerprinting easier in various ways which is a reason to make it something you have to ask for
> Only because it's on by default for no real reason. I'm saying the default should be off.
This is wrong.
I'm assuming here that you mean that Nagle's algorithm is on by default, i.e TCP_NODELAY is off by default. It seems you think the only extra fingerprinting info TCP_NODELAY gives you is the single bit "TCP_NODELAY is on vs off". But it's more than that.
In a world where every application's traffic goes through Nagle's algorithm, lots of applications will just be seen to transmit a packet every 300ms or whatever as their transmissions are buffered up by the kernel to be sent in large packets. In a world where Nagle's algorithm is off by default, those applications could have very different packet sizes and timings.
With something like Telnet or SSH, you might even be able to detect who exactly is typing at the keyboard by analyzing their key press rhythm!
To be clear, this is not an argument in favor of Nagle's algorithm being on by default. I'm relatively neutral on that matter.
Nagle's algorithm does really well when you're on shitty wifi.
Applications also don't know the MTU (the size of packets) on the interface they're using. Hell, they probably don't even know which interface they're using! This is all abstracted away. So, if you're on a network with a 14xx MTU (such as a VPN), assuming an MTU of 1500 means you'll send one full packet and then a tiny little packet after that. For every one packet you think you're sending!
Nagle's algorithm lets you just send data; no problem. Let the kernel batch up packets. If you control the protocol, just use a design that prevents Delayed ACK from causing the latency, i.e. something like the "OK" reply from Redis.
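To put rough numbers on the MTU point above, here is a small sketch of how a write sized for one MTU gets cut up on a path with a smaller one. The 40 bytes of IPv4 + TCP headers (no options) and the 1400-byte VPN MTU are illustrative assumptions, and `mss_for_mtu` / `segment_sizes` are hypothetical helpers, not a real API.

```python
# Assumption: 20-byte IPv4 header + 20-byte TCP header, no options.
HEADER_BYTES = 40


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one packet for the given MTU."""
    return mtu - HEADER_BYTES


def segment_sizes(write_len: int, mtu: int) -> list[int]:
    """Payload sizes of the segments an unbatched write would be cut into."""
    mss = mss_for_mtu(mtu)
    full, rest = divmod(write_len, mss)
    return [mss] * full + ([rest] if rest else [])


# An application that sizes each write for a 1500-byte MTU...
write_len = mss_for_mtu(1500)
print(segment_sizes(write_len, 1500))  # one full segment
print(segment_sizes(write_len, 1400))  # one full segment plus a small tail
```

With batching in the kernel, the small tail of one write can ride along with the start of the next one instead of going out alone.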
If nobody is maintaining them, do we really need them? In which case, does it really matter?
If we need them, and they’re not being maintained, then maybe that’s the kind of “scream test” wake up we need for them to either be properly deprecated, or updated.
> If nobody is maintaining them, do we really need them?
Given how often issues can be traced back to open source projects barely scraping along? Yes, and they are probably doing something important. Hell, if you create enough pointless busywork you can probably get a few more "helpful" hackers into projects like xz.
So to be clear, you believe every program that outputs a bulk stream to stdout should be written to check if stdout is a socket and enable Nagle's algorithm if so? That's not just busywork - it's also an abstraction violation. By explicitly turning off Nagle's, you specify that you understand TCP performance and don't need the abstraction, and this is a reasonable way to do things. Imagine if the kernel pinned threads to cores by default and you had to ask to unpin them...
No, the program should take care to enable TCP_NODELAY when creating the socket. If the program gets passed a FD from outside it's on the outside program to ensure this. If somehow the program very often gets outside FDs from an oblivious source that could be a TCP socket, then it might indeed have to manually check if it really wants Nagle's algorithm.
If by "latency" you mean a hundred milliseconds or so, that's one thing, but I've seen Nagle delay packets by several seconds. Which is just goofy, and should never have been enabled by default, given the lack of an explicit flush function.
A smarter implementation would have been to call it TCP_MAX_DELAY_MS, and have it take an integer value with a well-documented (and reasonably low) default.
It delays one RTT, so if you have seen seconds of delays, that means your TCP ACK packets were received seconds later for whatever reason (high load?). Decreasing latency in that situation would WORSEN the situation.
Reminds me of trying to do IoT stuff in hospitals before IoT was a thing.
Send exactly one 205 byte packet. How do you really know? I can see it go out on a scope. And the other end receives a packet with bytes 0-56. Then another packet with bytes 142-204. Finally, 200ms later, a packet with bytes 57-141.
I think you are confusing network layers and their functionality.
"CSMA is no longer necessary on Ethernet today because all modern connections are point-to-point with only two "hosts" per channel."
Ethernet really isn't ptp. You will have a switch at home (perhaps in your router) with more than two ports on it. At layer 1 or 2 how do you mediate your traffic, without CSMA? Take a single switch with n ports on it, where n>2. How do you mediate ethernet traffic without CSMA - its how the actual electrical signals are mediated?
"Ethernet connections have both ends both transmitting and receiving AT THE SAME TIME ON THE SAME WIRES."
That's full duplex as opposed to half duplex.
Nagle's algo has nothing to do with all that messy layer 1/2 stuff but is at the TCP layer and is an attempt to batch small packets into fewer larger ones for a small gain in efficiency. It is one of many optimisations at the TCP layer, such as Jumbo Frames and mini Jumbo Frames and much more.
> You will have a switch at home (perhaps in your router) with more than two ports on it. At layer 1 or 2 how do you mediate your traffic, without CSMA? Take a single switch with n ports on it, where n>2. How do you mediate ethernet traffic without CSMA - its how the actual electrical signals are mediated?
CSMA/CD is specifically for a shared medium (shared collision domain in Ethernet terminology), putting a switch in it makes every port its own collision domain that are (in practice these days) always point-to-point. Especially for gigabit Ethernet, there was some info in the spec allowing for half-duplex operation with hubs but it was basically abandoned.
As others have said, different mechanisms are used to manage trying to send more data than a switch port can handle but not CSMA (because it's not doing any of it using Carrier Sense, and it's technically not Multiple Access on the individual segment, so CSMA isn't the mechanism being used).
> That's full duplex as opposed to half duplex.
No actually they're talking about something more complex, 100Mbps Ethernet had full duplex with separate transmit and receive pairs, but with 1000Base-T (and 10GBase-T etc.) the four pairs all simultaneously transmit and receive 250 Mbps (to add up to 1Gbps in each direction). Not that it's really relevant to the discussion but it is really cool and much more interesting than just being full duplex.
It's P2P as far as the physical layer (L1) is concerned.
Usually, full duplex requires two separate channels. The introduction of a hybrid on each end allows the use of the same channel at the same time.
Some progress has been made in doing the same thing with radio links, but it's harder.
Nagle's algorithm is somewhat intertwined with the backoff timer in the sense that it prevents transmitting a packet until some condition is met. IIRC, setting the TCP_NODELAY flag will also disable the backoff timer, at least this is true in the case of TCP/IP over AX25.
> It's P2P as far as the physical layer (L1) is concerned.
Only in the sense that the L1 "peer" is the switch. As soon as the switch goes to forward the packet, if ports 2 and 3 are both sending to port 1 at 1Gbps and port 1 is a 1Gbps port, 2Gbps won't fit and something's got to give.
Right but the switch has internal buffers and ability to queue those packets or apply backpressure. Resolving at that level is a very different matter from an electrical collision at L1.
Not as far as TCP is concerned it isn't. You sent the network a packet and it had to throw it away because something else sent packets at the same time. It doesn't care whether the reason was an electrical collision or not. A buffer is just a funny looking wire.
Ethernet has had the concept of full duplex for several decades and I have no idea what you mean by: "hybrid on each end allows the use of the same channel at the same time."
The physical electrical connections between a series of ethernet network ports (switch or end point - it doesn't matter) are mediated by CSMA.
No idea why you are mentioning radios. That's another medium.
My understanding is that no one uses hubs anymore, so your collision domain goes from a number of machines on a hub to a dedicated channel between the switch and the machine. There obviously won’t be collisions if you’re the only one talking and you’re able to do full duplex communications without issue.
> Ethernet has had the concept of full duplex for several decades and I have no idea what you mean by: "hybrid on each end allows the use of the same channel at the same time."
Gigabit (and faster) is able to do full duplex without needing separate wires in each direction. That's the distinction they're making.
> The physical electrical connections between a series of ethernet network ports (switch or end point - it doesn't matter) are mediated by CSMA.
Not in a modern network, where there's no such thing as a wired collision.
> Take a single switch with n ports on it, where n>2. How do you mediate ethernet traffic without CSMA - its how the actual electrical signals are mediated?
Switches are not hubs. Switches have a separate receiver for each port, and each receiver is attached to one sender.
In modern ethernet, there is also flow-control via the PAUSE frame. This is not for collisions at the media level, but you might think of it as preventing collisions at the buffer level. It allows the receiver to inform the sender to slow down, rather than just dropping frames when its buffers are full.
At least in networks I've used, it's better for buffers to overflow than to use PAUSE.
Too many switches will get a PAUSE frame from port X and send it to all the ports that send packets destined for port X. Then those ports stop sending all traffic for a while.
About the only useful thing is if you can see PAUSE counters from your switch, you can tell a host is unhealthy from the switch whereas inbound packet overflows on the host might not be monitored... or whatever is making the host slow to handle packets might also delay monitoring.
I found this article while debugging some networking delays for a game that I'm working on.
It turns out that in my case it wasn't TCP_NODELAY - my backend is written in go, and go sets TCP_NODELAY by default!
But I still found the article - and in particular Nagle's acknowledgement of the issues! - to be interesting.
There's a discussion from two years ago here: https://news.ycombinator.com/item?id=40310896 - but I figured it'd been long enough that others might be interested in giving this a read too.
There is also a good write-up [0] by Julia Evans. We ran into this with DICOM storescp, which is a chatty protocol and TCP_NODELAY=1 makes the throughput significantly better. Since DICOM is often used in a LAN, that default just makes it unnecessarily worse.
Any details on the game you’ve been working on? I’ve been really enjoying Ebitengine and Golang for game dev so would love to read about what you’ve been up to!
I've been playing with multiplayer games that run over SSH; right now I'm trying to push the framerate on the games as high as I can, which is what got me thinking about my networking stack.
I mostly use go these days for the backend for my multiplayer games, and in this case there's also some good tooling for terminal rendering and SSH stuff in go, so it's a nice choice.
(my games are often pretty weird, I understand that "high framerate multiplayer game over SSH" is not a uhhh good idea, that's the point!)
I swear, it seems like I’ve seen some variation of this 50 times on HN in the past 15 years.
The core issue with Nagle’s algorithm (TCP_NODELAY off) is its interaction with TCP Delayed ACK. Nagle prevents sending small packets if an ACK is outstanding, while the receiver delays that ACK to piggyback it on a response. When both are active, you get a 200ms "deadlock" where the sender waits for an ACK and the receiver waits for more data. This is catastrophic for latency-sensitive applications like gaming, SSH, or high-frequency RPCs.
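As a rough illustration of the sender-side rule described above, here is a simplified model of the Nagle decision. It ignores window limits, FIN handling and everything else a real stack does; `nagle_allows_send` and the 1460-byte MSS are assumptions made up for this sketch.

```python
def nagle_allows_send(queued_bytes: int, unacked_bytes: int, mss: int = 1460) -> bool:
    """Simplified Nagle rule.

    A full-sized segment may always go out. A smaller one may go out
    only when no previously sent data is still waiting for an ACK.
    """
    if queued_bytes >= mss:
        return True
    return unacked_bytes == 0


# Write-write-read pattern: two small writes, then wait for the reply.
first = nagle_allows_send(queued_bytes=100, unacked_bytes=0)
second = nagle_allows_send(queued_bytes=50, unacked_bytes=100)
full = nagle_allows_send(queued_bytes=1460, unacked_bytes=100)
print(first, second, full)
```

The second small write is the one that gets stuck: it cannot leave until the first is acknowledged, and a receiver using delayed ACK is holding that ACK back in the hope of piggybacking it on a reply.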
In modern times, the bandwidth saved by Nagle is rarely worth the latency cost. You should almost always set TCP_NODELAY = 1 for any interactive or request-response protocol. The "problem" only shifts to the application layer: if you disable Nagle and then perform many small write() calls (like writing a single byte at a time), you will flood the network with tiny, inefficient packets.
Proper usage means disabling Nagle at the socket level but managing your own buffering in user-space. Use a buffered writer to assemble a logical message into a single memory buffer, then send it with one system call. This ensures your data is dispatched immediately without the overhead of thousands of tiny headers. Check the Linux tcp(7) man page for implementation details; it is the definitive reference for these behaviors.
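A minimal sketch of that pattern in Python, assuming a plain TCP client socket: disable Nagle on the socket, assemble the whole logical message in user space, and hand it to the kernel in one call. The helper names (`make_nodelay_socket`, `build_message`, `send_message`) are made up for the example.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def build_message(parts: list[bytes]) -> bytes:
    """Assemble one logical message in a single user-space buffer."""
    return b"".join(parts)


def send_message(sock: socket.socket, parts: list[bytes]) -> None:
    """Send the assembled message with one call instead of one per part."""
    sock.sendall(build_message(parts))
```

Usage would look like `send_message(sock, [b"SET ", key, b" ", value, b"\r\n"])` after connecting. The point is that the application, not the kernel, decides where the message ends.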
Wildly, the Polish word "nagle" (pronounced differently) means "suddenly" or "all at once", which is just astonishingly apropos for what I'm almost certain is pure coincidence.
Strangely, the Polish word seems to encode a superposition of both settings: with NODELAY on, TCP sends messages suddenly, whereas with NODELAY off it sends tiny messages all at once, in one TCP packet.
I’m no expert by any means, but this makes sense to me. Plus, I can’t come up with many modern workloads where delayed ACK would result in significant improvement. That said, I feel the same about Nagle’s algorithm - if most packets are big, it seems to me that both features solve problems that hardly exist anymore.
Wouldn't the modern http-dominated best practice be to turn both off?
> Unfortunately, it’s not just delayed ACK. Even without delayed ack and that stupid fixed timer, the behavior of Nagle’s algorithm probably isn’t what we want in distributed systems. A single in-datacenter RTT is typically around 500μs, then a couple of milliseconds between datacenters in the same region, and up to hundreds of milliseconds going around the globe. Given the vast amount of work a modern server can do in even a few hundred microseconds, delaying sending data for even one RTT isn’t clearly a win.
I'm surprised the article didn't also mention MSG_MORE. On Linux it hints to the kernel that "more is to follow" (when sending data on a socket) so it shouldn't send it just yet. Maybe you need to send a header followed by some data. You could copy them into one buffer and use a single sendmsg call, but it's easier to send the header with MSG_MORE and the data in separate calls.
(io_uring is another method that helps a lot here, and it can be combined with MSG_MORE or with preallocated buffers shared with the kernel.)
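For anyone who hasn't used MSG_MORE, the header-then-data case above might look roughly like this in Python. `send_header_then_body` is a hypothetical helper. MSG_MORE is Linux-specific, so the sketch falls back to a flag value of 0 (no hint at all) where the constant is missing.

```python
import socket

# Linux-only flag; 0 means "no hint" on platforms that lack it.
MSG_MORE = getattr(socket, "MSG_MORE", 0)


def send_header_then_body(sock: socket.socket, header: bytes, body: bytes) -> None:
    """Send header and body as separate calls, hinting that more data follows the header."""
    view = memoryview(header)
    # send() may accept only part of the buffer, so loop until it is all queued.
    while len(view) > 0:
        sent = sock.send(view, MSG_MORE)
        view = view[sent:]
    # No MSG_MORE on the final piece, so the kernel is free to transmit now.
    sock.sendall(body)
```

This keeps the two pieces in separate buffers (no copy into a combined one) while still giving the kernel the chance to put them in the same packet.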
Indeed you can, but we've found it useful to use MSG_MORE when using state machines, where different states are responsible for different parts of the reply. (Plenty of examples in states*.c here: https://gitlab.com/nbdkit/libnbd/-/tree/master/generator?ref...)
Doing more system calls isn't really a good idea for performance.
Also, if you're doing asynchronous writes you typically can only have one write in-flight at any time, so you should aggregate all other buffers while that happens.
Though arguably asynchronous writes are often undesired due to the complexity of doing flow-control with them.
I've always found Nagle's algorithm being a kernel-level default quite silly. It should be up to the application to decide when to send and when to buffer and defer.
I've always thought a problem with Nagle's algorithm is that the socket API does not (really) have a function to flush the buffers and send everything out instantly, so you can use that after messages that require a timely answer.
For stuff where no answer is required, Nagle's algorithm works very well for me, but many TCP channels are mixed use these days. They send messages that expect a fast answer and others that are more asynchronous (from a user's point of view, not a programmer's).
Wouldn't it be nice if all operating systems, (home-)routers, firewalls and programming languages would have high quality implementations of something like SCTP...
> the socket API does not (really) have a function to flush the buffers and send everything out instantly, so you can use that after messages that require a timely answer.
I never thought about that but I think you're absolutely right! In hindsight it's a glaring oversight to offer a stream API without the ability to flush the buffer.
Yeah, I’ve always felt that the stream API is a leaky abstraction for providing access to networking. I understand the attraction of making network I/O look like local file access given the philosophy of UNIX.
The API should have been message oriented from the start. This would avoid having the network stack try to compensate for the behavior of the application layer. Then Nagle’s or something like it would just be a library available for applications that might need it.
The stream API is as annoying on the receiving end especially when wrapping (like TLS) is involved. Basically you have to code your layers as if the underlying network is handing you a byte at a time - and the application has to try to figure out where the message boundaries are - adding a great deal of complexity.
the whole point of TCP is that it is a stream of bytes, not of messages.
The problem is that this is not in practice quite what most applications need, but the Internet evolved towards UDP and TCP only.
So you can have message-based if you want, but then you have to do sequencing, gap filling or flow control yourself, or you can have the overkill reliable byte stream with limited control or visibility at the application level.
For me, the “whole point” of TCP is to add various delivery guarantees on top of IP. It does not mandate or require a particular API. Of course, you can provide a stream API over TCP which suits many applications but it does not suit all and by forcing this abstraction over TCP you end up making message oriented applications (e.g request /response type protocols) more complex to implement than if you had simply exposed the message oriented reality of TCP via an API.
Well, it also has the advantage of providing pretty decent encryption for free through WSS.
But yeah, where that's unnecessary, it's probably just as easy to have a 4-byte length prefix, since TCP handles the checksum and retransmit and everything for you.
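A sketch of what that 4-byte length prefix might look like in Python, assuming a big-endian unsigned length. The function names are made up for the example. Note the receive side has to loop, since a stream socket can hand back any number of bytes per call.

```python
import socket
import struct

# 4-byte unsigned length in network byte order (an assumption for this sketch).
LENGTH_PREFIX = struct.Struct("!I")


def send_frame(sock: socket.socket, payload: bytes) -> None:
    """Send one length-prefixed message with a single call."""
    sock.sendall(LENGTH_PREFIX.pack(len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, however the stream happens to be chunked."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-frame")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_frame(sock: socket.socket) -> bytes:
    """Read one length-prefixed message."""
    (length,) = LENGTH_PREFIX.unpack(recv_exact(sock, LENGTH_PREFIX.size))
    return recv_exact(sock, length)
```

A real protocol would also want an upper bound on `length` before allocating, which is left out here.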
It's just a standard TLS layer, works with any TCP protocol, nothing WebSocket-specific in it.
You should ideally design your messages to fit within a single Ethernet packet, so 2 bytes is more than enough for the size. Though I have sadly seen an increasing amount of developers send arbitrarily large network messages and not care about proper design.
The socket API is all kinds of bad. The way streams should work is that, when sending data, you set a bit indicating whether it’s okay to buffer the data locally before sending. So a large send could be done as a series of okay-to-buffer writes and then a flush-immediately write.
TCP_CORK is a rather kludgey alternative.
The same issue exists with file IO. Writing via an in-process buffer (default behavior of stdio and quite a few programming languages) is not interchangeable with unbuffered writes: with a buffer, it’s okay to do many small writes, but you cannot assume that the data will ever actually be written until you flush.
I’m a bit disappointed that Zig’s fancy new IO system pretends that buffered and unbuffered IO are two implementations of the same thing.
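For reference, the TCP_CORK approach mentioned above can be wrapped so that it behaves a bit like the "buffer, buffer, then flush" API being asked for. This is only a sketch: `corked` is a hypothetical helper, TCP_CORK is Linux-only, and on other platforms the wrapper silently does nothing.

```python
import contextlib
import socket

# Linux-only option; None where the platform does not define it.
TCP_CORK = getattr(socket, "TCP_CORK", None)


@contextlib.contextmanager
def corked(sock: socket.socket):
    """Hold back partial segments inside the block, release them on exit."""
    if TCP_CORK is None:
        # No cork available: writes go out under the socket's normal rules.
        yield sock
        return
    sock.setsockopt(socket.IPPROTO_TCP, TCP_CORK, 1)
    try:
        yield sock
    finally:
        # Clearing the option acts as the flush.
        sock.setsockopt(socket.IPPROTO_TCP, TCP_CORK, 0)
```

Usage would be `with corked(conn): conn.sendall(header); conn.sendall(body)`, with the uncork on exit standing in for the missing flush call.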
Very on brand, oxide's core proposition is to actually invent a new (server) os+hardware, so they question/polish many of the traditional protocols and standards from the golden era.
I've always thought that Nagle's algorithm is putting policy in the kernel where it doesn't really belong.
If userspace applications want to make latency/throughput tradeoffs they can already do that with full awareness and control using their own buffers, which will also often mean fewer syscalls too.
The actual algorithm (which is pretty sensible in the absence of delayed ack) is fundamentally a feature of the TCP stack, which in most cases lives in the kernel. To implement the direct equivalent in userspace against the sockets API would require an API to find out about unacked data and would be clumsy at best.
With that said, I'm pretty sure it is a feature of the TCP stack only because the TCP stack is the layer they were trying to solve this problem at, and it isn't clear at all that "unacked data" is particularly better than a timer -- and of course if you actually do want to implement application layer Nagle directly, delayed acks mean that application level acking is a lot less likely to require an extra packet.
Technically yes, but practically, userspace apps are mostly written by people who either don't care, or don't want to care, about lower levels. There is plenty of badly written userspace code that will stay badly written.
And it would be the right choice if it worked. Hell, a simple 20ms flush timer would've made it work just fine.
It's kind of in User Space though - right? When an application opens a socket - it decides whether to open it with TCP_NODELAY or not. There isn't any kernel/os setting - it's done on a socket by socket basis, no?
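A quick way to check the per-socket part of this, sketched in Python: set the option on one socket and read it back from two. `nodelay_enabled` is a made-up helper; the only real calls are `setsockopt` and `getsockopt`.

```python
import socket


def nodelay_enabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on this socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
a.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Only the socket that asked for it is affected; b keeps the system default.
print(nodelay_enabled(a), nodelay_enabled(b))

a.close()
b.close()
```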
The tradeoff made by one program can affect another program that needs the opposite tradeoff. Thus we need an arbiter in the kernel to decide what is more important for the whole system. That's my guess, anyway.
> The bigger problem is that TCP_QUICKACK doesn’t fix the fundamental problem of the kernel hanging on to data longer than my program wants it to.
Well, of course not; it tries to reduce the problem of your kernel hanging on to an ack (or generating an ack) longer than you would like. That pertains to received data. If the remote end is sending you data, and is paused due to filling its buffers due to not getting an ack from you, it behooves you to send an ack ASAP.
The original Berkeley Unix implementation of TCP/IP, I seem to recall, had a single global 500 ms timer for sending out acks. So when your TCP connection received new data eligible for acking, it could be as long as 500 ms before the ack was sent. If we reframe that in modern realities, we can imagine every other delay is negligible, and data is coming at the line rate of a multi gigabit connection, 500 ms represents a lot of unacknowledged bits.
Delayed acks are similar to Nagle in spirit in that they promote coalescing at the possible cost of performance. Under the assumption that the TCP connection is bidirectional and "chatty" (so that even when the bulk of the data transfer is happening in one direction, there are application-level messages in the other direction) the delayed ack creates opportunities for the TCP ACK to be piggy backed on a data transfer. A TCP segment carrying no data, only an ACK, is prevented.
As far as portability of TCP_QUICKACK goes, in C code it is as simple as #ifdef TCP_QUICKACK. If the constant exists, use it. Otherwise out of luck. If you're in another language, you have to go through some hoops depending on whether the network-related run time exposes nonportable options in a way you can test, or whether you are on your own.
What if occasional latency is fine, and latency on terrible networks with high packet loss is fine, but you want the happy case to have little latency? Both many (non-competitive) games and SSH fall into this: reliability is more important than achieving the absolute lowest latency possible, but lower latency is still better than higher latency.
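In Python, for instance, the equivalent of the #ifdef test is checking whether the `socket` module exposes the constant. A sketch, with `try_enable_quickack` being a made-up helper name:

```python
import socket


def try_enable_quickack(sock: socket.socket) -> bool:
    """Enable TCP_QUICKACK if this platform exposes it; report whether it did."""
    option = getattr(socket, "TCP_QUICKACK", None)
    if option is None:
        # Constant not defined on this platform: out of luck.
        return False
    sock.setsockopt(socket.IPPROTO_TCP, option, 1)
    return True
```

As far as I understand the Linux tcp(7) page, the flag is not permanent, so code that relies on it may need to set it again after later socket operations; treat that as something to verify for your own kernel.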
Ha ha, well that's a relief. I thought the article was going to say that enabling TCP_NODELAY is causing problems in distributed systems. I am one of those people who just turn on TCP_NODELAY and never look back because it solves problems instantly and the downsides seem minimal. Fortunately, the article is on my side. Just enable TCP_NODELAY if you think it's a good idea. It apparently doesn't break anything in general.
The problem is actually that nobody uses the generic solution to these classes of problems and then everybody complains that the special-case for one set of parameters works poorly for a different set of parameters.
Nagle’s algorithm is just a special case solution of the generic problem of choosing when and how long to batch. We want to batch because batching usually allows for more efficient batched algorithms, locality, less overhead etc. You do not want to batch because that increases latency, both when collecting enough data to batch and because you need to process the whole batch.
One class of solution is “Work or Time”. You batch up to a certain amount of work or up to a certain amount of time, whichever comes first. You choose your amount of time as your desired worst case latency. You choose your amount of work as your efficient batch size (it should be less than max throughput * latency, otherwise you will always hit your timer first).
Nagle’s algorithm is “Work” being one packet (~1.5 KB) with “Time” being the time until all data gets an ack (you might already see how this degree of dynamism in your timeout might pose a problem), which results in the fallback timer of 500 ms when delayed ack is on. It should be obvious that is a terrible set of parameters for modern connections. The problem is that Nagle’s algorithm only deals with the “Work” component, but punts on the “Time” component, allowing for nonsense like delayed ack helpfully “configuring” your effective “Time” component to an eternity, resulting in “stuck” buffers, which is what the timeout is supposed to avoid. I will decline to discuss the other aspect, which is choosing when to buffer and how much, of which Nagle’s algorithm is again a special case.
Delayed ack is, funnily enough, basically the exact same problem but done on the receive side. So both sides set timeouts based on the other side going first which is obviously a recipe for disaster. They both set fixed “Work”, but no fixed “Time” resulting in the situation where both drivers are too polite to go first.
What should be done is to use the generic solutions, parameterized by your system and channel properties, which solve these problems holistically. Describing them in depth would take too long here.
Then at a lower level and smaller latencies it's often interrupt moderation that must be disabled. Conceptually similar idea to the Nagle algo - coalesce overheads by waiting, but on the receiving end in hardware.
I first ran into this years ago after working on a database client library as an intern. Having not heard of this option beforehand, I didn't think to enable it in the connections the library opened, and in practice that often led to messages in the wire protocol being entirely ready for sending without actually getting sent immediately. I only found out about it later when someone using it investigated why the latency was much higher than they expected, and I guess either they had run into this before or were able to figure out that it might be the culprit, and it turned out that pretty much all of the existing clients in other languages set NODELAY unconditionally.
OK, I suppose I should say something. I've already written on this before, and that was linked above.
You never want TCP_NODELAY off at the sending end, and delayed ACKs on at the receiving end. But there's no way to set that from one end. Hence the problem.
Is TCP_NODELAY off still necessary? Try sending one-byte TCP sends in a tight loop and see what it does to other traffic on the same path, for, say, a cellular link. Today's links may be able to tolerate the 40x extra traffic. It was originally put in as a protection device against badly behaved senders.
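For anyone wondering where a figure like "40x" comes from, a rough back-of-the-envelope calculation is below. It assumes minimum IPv4 and TCP headers (20 bytes each, no options) and ignores link-layer framing, so treat the numbers as illustrative rather than exact.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def wire_bytes(payload_len, segments):
    """Bytes on the wire for payload_len bytes split across `segments` packets."""
    return payload_len + segments * (IPV4_HEADER + TCP_HEADER)


payload = 1000
one_byte_writes = wire_bytes(payload, segments=payload)  # one packet per byte
coalesced = wire_bytes(payload, segments=1)              # one packet total
overhead_ratio = one_byte_writes / coalesced
```

Each one-byte send costs 41 bytes on the wire, so the ratio lands just under 40.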
A delayed ACK should be thought of as a bet on the behavior of the listening application. If the listening application usually responds fast, within the ACK delay interval, the delayed ACK is coalesced into the reply and you save a packet. If the listening application does not respond immediately, a delayed ACK has to actually be sent, and nothing was gained by delaying it. It would be useful for TCP implementations to tally, for each socket, the number of delayed ACKs actually sent vs. the number coalesced. If many delayed ACKs are being sent, ACK delay should be turned off, rather than repeating a losing bet.
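The tally idea could be sketched roughly like this. To be clear, this is a toy model, not an API of any real TCP stack; the class name, the sample threshold and the 50% cutoff are all assumptions picked for illustration.

```python
class DelayedAckBet:
    """Hypothetical per-socket tally of delayed-ACK outcomes."""

    def __init__(self, min_samples=20, max_standalone_ratio=0.5):
        self.coalesced = 0     # ACK rode along on a reply: bet won
        self.standalone = 0    # timer fired, bare ACK sent: bet lost
        self.min_samples = min_samples
        self.max_standalone_ratio = max_standalone_ratio

    def record(self, piggybacked):
        if piggybacked:
            self.coalesced += 1
        else:
            self.standalone += 1

    def should_delay_acks(self):
        total = self.coalesced + self.standalone
        if total < self.min_samples:
            return True  # not enough evidence yet, keep the default
        # Stop delaying once too many bets have been lost.
        return self.standalone / total <= self.max_standalone_ratio
```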
This should have been fixed forty years ago. But I was out of networking by the time this conflict appeared. I worked for an aerospace company, and they wanted to move all networking work from Palo Alto to Colorado Springs, Colorado. Colorado Springs was building a router based on the Zilog Z8000, purely for military applications. That turned out to be a dead end. The other people in networking in Palo Alto went off to form a startup to make a "PC LAN" (a forgotten 1980s concept), and for about six months, they led that industry. I ended up leaving and doing things for Autodesk, which worked out well.
It's a bit tricky in that browsers may be using TCP_NODELAY anyway or use QUIC (UDP) and whatnots BUT, in doubt, I've got a wrapper script around my browsers launcher script that does LD_PRELOAD with TCP_NODELAY correctly configured.
Dunno if it helps but it helps me feel better.
What speeds up browsing the most though IMO is running your own DNS resolver, null routing a big part of the Internet, firewalling off entire countries (no really I don't need anything from North Korea, China or Russia for example), and then on top of that running dnsmasq locally.
I run the unbound DNS (on a little Pi so it's on 24/7) with gigantic killfiles, then I use 1.1.1.3 on top of that (CloudFlare's DNS that filters out known porn and known malware: yes, it's CloudFlare and, yes, I own shares of NET).
Some sites complain I use an "ad blocker" but it's really just null routing a big chunk of the interwebz.
That and LD_PRELOAD a lib with TCP_NODELAY: life is fast and good. Very low latency.
This is true for simple UDP, but reliable transports are often built over UDP.
As with anything in computing, there are trade-offs between the approaches. One example is QUIC now widespread in browsers.
MoldUDP64 is used by various exchanges (that's NASDAQ's name, others do something close). It's a simple UDP protocol with sequence numbers; works great on quality networks with well-tuned receivers (or FPGAs). This is an old-school blog article about the earlier MoldUDP:
Another is Aeron.io, which is a high-performance messaging system that includes a reliable unicast/multicast transport. There is so much cool stuff in this project and it is useful to study. I saw this deep-dive into the Aeron reliable multicast protocol live and it is quite good, albeit behind a sign-up.
Strictly speaking, you can put any protocol on top of UDP, including a copy of TCP...
But I took parent's question as "should I be using UDP sockets instead of TCP sockets". Once you invent your new protocol instead of UDP or on top of it, you can have any features you want.
I fondly remember a simple simulation project we had to do with a group of 5 students in a second-year class, which had a simulation and some kind of scheduler that communicated via TCP. I was appalled at the performance we were getting. Even on the same machine it was way too slow for what it was doing. After hours of debugging it turned out it was indeed Nagle's algorithm causing the slowness, which I had never heard about at the time. Fixed instantly with TCP_NODELAY. It was one of the first times it was made abundantly clear to me that the teachers at that institution didn't know what they were teaching. Apparently we were the only group that had noticed the slow performance, and the teachers had never even heard of TCP_NODELAY.
> , suggesting that the default behavior is wrong, and perhaps that the whole concept is outmoded
While outmoded might be the case, wrong is probably not the case.
There are some features of the network protocols that are designed to improve the network, not the individual connection. It's not novel that you can improve your connection by disabling "good neighbour" features.
Dumping the Nagle algorithm (by setting TCP_NODELAY) almost always makes sense and should be enabled by default.
I’ll be nice and not attack the feature. But making that the default is one of the biggest mistakes in the history of networking (second only to TCP’s boneheaded congestion control that was designed imagining 56kbit links)
What would you change here?
Upgraded our DC switches to new ones around 2014 and needed to keep a few old ones because the new ones didn't support 10Mbit half duplex.
One co-op job at a manufacturing plant I worked at ~20 years ago involved replacing the backend core networking equipment with more modern Ethernet kit, but we had to set up media converters (in that case token ring to Ethernet) as close as possible to the manufacturing equipment (so that token ring only ran between the equipment and the media converter for a few meters at most).
They were "lucky" in that:
1) the networking protocol that was supported by the manufacturing equipment was IPX/SPX, so at least that worked cleanly on ethernet and newer upstream control software running on an OS (HP-UX at the time)
2) there were no lives at stake (eg nuclear safety/hospital), so they had minimal regulatory issues.
Was an old ISP/mobile carrier so could find all kinds of old stuff. Even the first SMSC from the 80s (also DEC, 386 or similar CPU?) was still in its racks because they didn't need the rack space, as 2 modern racks used up all the power for that room. It was also far down in a mountain, so it was annoying to remove equipment.
Yeah, many enterprise switches don't even support 100Base-T or 10Base-T anymore. I've had to daisy chain an old switch that supports 100Base-T onto a modern one a few times myself. If you drop 10/100 support, you can also drop HD (half-duplex) support. In my junk drawer, I still have a few old 10/100 hubs (not switches), which are by definition always HD.
Every modern language has buffers in their stdlib. Anyone writing character at a time to the wire lazily or unintentionally should fix their application.
TCP_NODELAY can also make fingerprinting easier in various ways which is a reason to make it something you have to ask for.
Yes, as I mentioned, it should be kept around for this but off by default. Make it a sysctl param, done.
> TCP_NODELAY can also make fingerprinting easier in various ways which is a reason to make it something you have to ask for
Only because it's on by default for no real reason. I'm saying the default should be off.
> Only because it's on by default for no real reason. I'm saying the default should be off.
This is wrong.
I'm assuming here that you mean that Nagle's algorithm is on by default, i.e TCP_NODELAY is off by default. It seems you think the only extra fingerprinting info TCP_NODELAY gives you is the single bit "TCP_NODELAY is on vs off". But it's more than that.
In a world where every application's traffic goes through Nagle's algorithm, lots of applications will just be seen to transmit a packet every 300ms or whatever as their transmissions are buffered up by the kernel to be sent in large packets. In a world where Nagle's algorithm is off by default, those applications could have very different packet sizes and timings.
With something like Telnet or SSH, you might even be able to detect who exactly is typing at the keyboard by analyzing their key press rhythm!
To be clear, this is not an argument in favor of Nagle's algorithm being on by default. I'm relatively neutral on that matter.
Applications also don't know the MTU (the size of packets) on the interface they're using. Hell, they probably don't even know which interface they're using! This is all abstracted away. So, if you're on a network with a 14xx MTU (such as a VPN), assuming an MTU of 1500 means you'll send one full packet and then a tiny little packet after that. For every one packet you think you're sending!
Nagle's algorithm lets you just send data; no problem. Let the kernel batch up packets. If you control the protocol, just use a design that prevents Delayed ACK from causing the latency. IE, the "OK" from Redis.
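A quick worked example of the MTU mismatch described above. It assumes 40 bytes of IPv4 + TCP headers with no options and that every write goes out immediately (Nagle off); the 1400-byte figure is just a stand-in for a "14xx" VPN MTU, and the helper name is made up.

```python
def segments_for_write(write_len, mtu, header_bytes=40):
    """Payload sizes of the packets one write turns into when each write
    is sent immediately. header_bytes=40 assumes IPv4 + TCP, no options."""
    mss = mtu - header_bytes
    full, rest = divmod(write_len, mss)
    return [mss] * full + ([rest] if rest else [])


# Application sizes its writes for a 1500-byte MTU (1460 bytes of payload).
on_ethernet = segments_for_write(1460, mtu=1500)
on_vpn = segments_for_write(1460, mtu=1400)
```

On the 1500-byte path each write is one full packet; on the 1400-byte path the same write becomes a 1360-byte packet followed by a 100-byte runt.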
If we need them, and they’re not being maintained, then maybe that’s the kind of “scream test” wake up we need for them to either be properly deprecated, or updated.
Given how often issues can be traced back to open source projects barely scraping along? Yes, and they are probably doing something important. Hell, if you create enough pointless busywork you can probably get a few more "helpful" hackers into projects like xz.
A smarter implementation would have been to call it TCP_MAX_DELAY_MS, and have it take an integer value with a well-documented (and reasonably low) default.
Send exactly one 205-byte packet. How do you really know? I can see it go out on a scope. And the other end receives a packet with bytes 0-56. Then another packet with bytes 142-204. Finally, a packet 200ms later with bytes 57-141.
FfffFFFFffff You!
"CSMA is no longer necessary on Ethernet today because all modern connections are point-to-point with only two "hosts" per channel."
Ethernet really isn't ptp. You will have a switch at home (perhaps in your router) with more than two ports on it. At layer 1 or 2 how do you mediate your traffic, without CSMA? Take a single switch with n ports on it, where n>2. How do you mediate Ethernet traffic without CSMA - it's how the actual electrical signals are mediated?
"Ethernet connections have both ends both transmitting and receiving AT THE SAME TIME ON THE SAME WIRES."
That's full duplex as opposed to half duplex.
Nagle's algo has nothing to do with all that messy layer 1/2 stuff but is at the TCP layer and is an attempt to batch small packets into fewer larger ones for a small gain in efficiency. It is one of many optimisations at the TCP layer, such as Jumbo Frames and mini Jumbo Frames and much more.
CSMA/CD is specifically for a shared medium (shared collision domain in Ethernet terminology), putting a switch in it makes every port its own collision domain that are (in practice these days) always point-to-point. Especially for gigabit Ethernet, there was some info in the spec allowing for half-duplex operation with hubs but it was basically abandoned.
As others have said, different mechanisms are used to manage trying to send more data than a switch port can handle but not CSMA (because it's not doing any of it using Carrier Sense, and it's technically not Multiple Access on the individual segment, so CSMA isn't the mechanism being used).
> That's full duplex as opposed to half duplex.
No actually they're talking about something more complex, 100Mbps Ethernet had full duplex with separate transmit and receive pairs, but with 1000Base-T (and 10GBase-T etc.) the four pairs all simultaneously transmit and receive 250 Mbps (to add up to 1Gbps in each direction). Not that it's really relevant to the discussion but it is really cool and much more interesting than just being full duplex.
Usually, full duplex requires two separate channels. The introduction of a hybrid on each end allows the use of the same channel at the same time.
Some progress has been made in doing the same thing with radio links, but it's harder.
Nagle's algorithm is somewhat intertwined with the backoff timer in the sense that it prevents transmitting a packet until some condition is met. IIRC, setting the TCP_NODELAY flag will also disable the backoff timer, at least this is true in the case of TCP/IP over AX25.
Only in the sense that the L1 "peer" is the switch. As soon as the switch goes to forward the packet, if ports 2 and 3 are both sending to port 1 at 1Gbps and port 1 is a 1Gbps port, 2Gbps won't fit and something's got to give.
Ethernet has had the concept of full duplex for several decades and I have no idea what you mean by: "hybrid on each end allows the use of the same channel at the same time."
The physical electrical connections between a series of ethernet network ports (switch or end point - it doesn't matter) are mediated by CSMA.
No idea why you are mentioning radios. That's another medium.
Gigabit (and faster) is able to do full duplex without needing separate wires in each direction. That's the distinction they're making.
> The physical electrical connections between a series of ethernet network ports (switch or end point - it doesn't matter) are mediated by CSMA.
Not in a modern network, where there's no such thing as a wired collision.
> Take a single switch with n ports on it, where n>2. How do you mediate ethernet traffic without CSMA - its how the actual electrical signals are mediated?
Switches are not hubs. Switches have a separate receiver for each port, and each receiver is attached to one sender.
Too many switches will get a PAUSE frame from port X and send it to all the ports that send packets destined for port X. Then those ports stop sending all traffic for a while.
About the only useful thing is if you can see PAUSE counters from your switch, you can tell a host is unhealthy from the switch whereas inbound packet overflows on the host might not be monitored... or whatever is making the host slow to handle packets might also delay monitoring.
It turns out that in my case it wasn't TCP_NODELAY - my backend is written in go, and go sets TCP_NODELAY by default!
But I still found the article - and in particular Nagle's acknowledgement of the issues! - to be interesting.
There's a discussion from two years ago here: https://news.ycombinator.com/item?id=40310896 - but I figured it'd been long enough that others might be interested in giving this a read too.
[0]: https://jvns.ca/blog/2015/11/21/why-you-should-understand-a-...
[1]: https://news.ycombinator.com/item?id=10607422
I mostly use go these days for the backend for my multiplayer games, and in this case there's also some good tooling for terminal rendering and SSH stuff in go, so it's a nice choice.
(my games are often pretty weird, I understand that "high framerate multiplayer game over SSH" is not a, uhhh, good idea, that's the point!)
The core issue with Nagle’s algorithm (TCP_NODELAY off) is its interaction with TCP Delayed ACK. Nagle prevents sending small packets if an ACK is outstanding, while the receiver delays that ACK to piggyback it on a response. When both are active, you get a 200ms "deadlock" where the sender waits for an ACK and the receiver waits for more data. This is catastrophic for latency-sensitive applications like gaming, SSH, or high-frequency RPCs.
In modern times, the bandwidth saved by Nagle is rarely worth the latency cost. You should almost always set TCP_NODELAY = 1 for any interactive or request-response protocol. The "problem" only shifts to the application layer: if you disable Nagle and then perform many small write() calls (like writing a single byte at a time), you will flood the network with tiny, inefficient packets.
Proper usage means disabling Nagle at the socket level but managing your own buffering in user-space. Use a buffered writer to assemble a logical message into a single memory buffer, then send it with one system call. This ensures your data is dispatched immediately without the overhead of thousands of tiny headers. Check the Linux tcp(7) man page for implementation details; it is the definitive reference for these behaviors.
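A minimal sketch of that pattern using Python's standard `socket` module. `enable_nodelay` and `send_message` are just illustrative helper names, and the message parts are made up.

```python
import socket


def enable_nodelay(sock):
    """Disable Nagle's algorithm on a TCP socket."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


def send_message(sock, parts):
    """Assemble one logical message in user space, then hand it to the
    kernel in a single call instead of one send per part."""
    message = b"".join(parts)
    sock.sendall(message)
    return len(message)
```

Usage would be something like calling `enable_nodelay(conn)` once after connecting and then `send_message(conn, [header, body])` per request, instead of calling `conn.send()` on each piece separately.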
https://en.wikipedia.org/wiki/Nominative_determinism
I’m no expert by any means, but this makes sense to me. Plus, I can’t come up with many modern workloads where delayed ACK would result in significant improvement. That said, I feel the same about Nagle’s algorithm - if most packets are big, it seems to me that both features solve problems that hardly exist anymore.
Wouldn't the modern http-dominated best practice be to turn both off?
> Unfortunately, it’s not just delayed ACK. Even without delayed ack and that stupid fixed timer, the behavior of Nagle’s algorithm probably isn’t what we want in distributed systems. A single in-datacenter RTT is typically around 500μs, then a couple of milliseconds between datacenters in the same region, and up to hundreds of milliseconds going around the globe. Given the vast amount of work a modern server can do in even a few hundred microseconds, delaying sending data for even one RTT isn’t clearly a win.
Then we can make TCP become UDP.
And then we solved everything.
Both linux and Windows have this config but it's buggy so we're back to TCP and UDP.
(io_uring is another method that helps a lot here, and it can be combined with MSG_MORE or with preallocated buffers shared with the kernel.)
Also if you're doing asynchronous writes you typically can only have one write in-flight at any time, you should aggregate all other buffers while that happens.
Though arguably asynchronous writes are often undesired due to the complexity of doing flow-control with them.
For stuff where no answer is required, Nagle's algorithm works very well for me, but many TCP channels are mixed use these days. They send messages that expect a fast answer and others that are more asynchronous (from a user's point of view, not a programmer's).
Wouldn't it be nice if all operating systems, (home-)routers, firewalls and programming languages would have high quality implementations of something like SCTP...
I never thought about that but I think you're absolutely right! In hindsight it's a glaring oversight to offer a stream API without the ability to flush the buffer.
The API should have been message oriented from the start. This would avoid having the network stack try to compensate for the behavior of the application layer. Then Nagle’s or something like it would just be a library available for applications that might need it.
The stream API is as annoying on the receiving end especially when wrapping (like TLS) is involved. Basically you have to code your layers as if the underlying network is handing you a byte at a time - and the application has to try to figure out where the message boundaries are - adding a great deal of complexity.
The problem is that this is not in practice quite what most applications need, but the Internet evolved towards UDP and TCP only.
So you can have message-based if you want, but then you have to do sequencing, gap filling or flow control yourself, or you can have the overkill reliable byte stream with limited control or visibility at the application level.
Very well said. I think there is enormous complexity in many layers because we don't have that building block easily available.
But yeah, where that's unnecessary, it's probably just as easy to have a 4-byte length prefix, since TCP handles the checksum and retransmit and everything for you.
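For reference, length-prefix framing over a byte stream looks roughly like this. It is a sketch with made-up names, using a 4-byte big-endian length, and the decoder is written to cope with data arriving in arbitrary chunks, even one byte at a time.

```python
import struct


def encode_frame(payload):
    """Prefix the payload with its length as a 4-byte big-endian integer."""
    return struct.pack("!I", len(payload)) + payload


class FrameDecoder:
    """Reassembles frames from a stream delivered in arbitrary chunks."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data):
        """Add received bytes; return any frames that are now complete."""
        self._buf += data
        frames = []
        while len(self._buf) >= 4:
            (length,) = struct.unpack_from("!I", self._buf)
            if len(self._buf) < 4 + length:
                break  # body not fully received yet
            frames.append(bytes(self._buf[4:4 + length]))
            del self._buf[:4 + length]
        return frames
```

A real protocol would also want a sanity cap on `length` so a corrupt or hostile prefix can't make the receiver buffer gigabytes.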
You should ideally design your messages to fit within a single Ethernet packet, so 2 bytes is more than enough for the size. Though I have sadly seen an increasing number of developers send arbitrarily large network messages and not care about proper design.
TCP_CORK is a rather kludgey alternative.
The same issue exists with file IO. Writing via an in-process buffer (the default behavior of stdio and quite a few programming languages) is not interchangeable with unbuffered writes — with a buffer, it’s okay to do many small writes, but you cannot assume that the data will ever actually be written until you flush.
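A small demonstration of the file IO version, assuming CPython's default buffered file objects (a 5-byte write is far smaller than the default buffer, so it stays in the process until flushed). The function name is made up.

```python
import os
import tempfile


def buffered_write_sizes():
    """Return the on-disk file size before and after flush()."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:      # default mode: buffered
            f.write(b"hello")            # sits in the user-space buffer
            before = os.path.getsize(path)
            f.flush()                    # now handed to the OS
            after = os.path.getsize(path)
        return before, after
    finally:
        os.remove(path)
```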
I’m a bit disappointed that Zig’s fancy new IO system pretends that buffered and unbuffered IO are two implementations of the same thing.
Seems like there's been a disconnect between users and kernel developers here?
oxide and friends episode on it! It's quite good
If userspace applications want to make latency/throughput tradeoffs they can already do that with full awareness and control using their own buffers, which will also often mean fewer syscalls too.
With that said, I'm pretty sure it is a feature of the TCP stack only because the TCP stack is the layer they were trying to solve this problem at, and it isn't clear at all that "unacked data" is particularly better than a timer -- and of course if you actually do want to implement application layer Nagle directly, delayed acks mean that application level acking is a lot less likely to require an extra packet.
BTW, hardware-based TCP offload engines exist... Don't think they are widely used nowadays though
Widely used in low latency fields like trading
And it would be the right choice if it worked. Hell, a simple 20ms flush timer would've made it work just fine.
Well, of course not; it tries to reduce the problem of your kernel hanging on to an ack (or generating an ack) longer than you would like. That pertains to received data. If the remote end is sending you data, and is paused due to filling its buffers due to not getting an ack from you, it behooves you to send an ack ASAP.
The original Berkeley Unix implementation of TCP/IP, I seem to recall, had a single global 500 ms timer for sending out acks. So when your TCP connection received new data eligible for acking, it could be as long as 500 ms before the ack was sent. If we reframe that in modern realities, we can imagine every other delay is negligible, and data is coming at the line rate of a multi gigabit connection, 500 ms represents a lot of unacknowledged bits.
Delayed acks are similar to Nagle in spirit in that they promote coalescing at the possible cost of performance. Under the assumption that the TCP connection is bidirectional and "chatty" (so that even when the bulk of the data transfer is happening in one direction, there are application-level messages in the other direction) the delayed ack creates opportunities for the TCP ACK to be piggy backed on a data transfer. A TCP segment carrying no data, only an ACK, is prevented.
As far as portability of TCP_QUICKACK goes, in C code it is as simple as #ifdef TCP_QUICKACK. If the constant exists, use it. Otherwise out of luck. If you're in another language, you have to go through some hoops depending on whether the network-related run time exposes nonportable options in a way you can test, or whether you are on your own.
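In Python, for example, the equivalent of the #ifdef is checking whether the `socket` module exposes the constant on this platform. A sketch (the helper name is made up):

```python
import socket


def try_quickack(sock):
    """Request immediate ACKs if the platform exposes TCP_QUICKACK.

    Returns True if the option was set, False if it is unavailable.
    """
    opt = getattr(socket, "TCP_QUICKACK", None)  # Linux-only constant
    if opt is None:
        return False
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, 1)
    except OSError:
        return False
    return True
```

Note that on Linux the tcp(7) man page describes this flag as not permanent, so code that relies on it generally has to set it again, typically after each receive.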
If you care about latency, you should consider something datagram oriented like UDP or SCTP.
"Golang disables Nagle's Algorithm by default"
1. https://news.ycombinator.com/item?id=34179426
Disabling Nagle's algorithm should be done as a matter of principle, there's simply no modern network configuration where it's beneficial.
As with anything in computing, there are trade-offs between the approaches. One example is QUIC now widespread in browsers.
MoldUDP64 is used by various exchanges (that's NASDAQ's name, others do something close). It's a simple UDP protocol with sequence numbers; works great on quality networks with well-tuned receivers (or FPGAs). This is an old-school blog article about the earlier MoldUDP:
https://www.fragmentationneeded.net/2012/01/dispatches-from-...
Another is Aeron.io, which is a high-performance messaging system that includes a reliable unicast/multicast transport. There is so much cool stuff in this project and it is useful to study. I saw this deep-dive into the Aeron reliable multicast protocol live and it is quite good, albeit behind a sign-up.
https://aeron.io/other/handling-data-loss-with-aeron/
https://enet.bespin.org