Building web apps from scratch - TCP - Part 3

In this post, we'll talk about the all mighty TCP! The protocol which powers the web and more. This article ended up quite beefy, but I assure you, knowing this stuff really is worth it! It will greatly enhance your usage of the socket interface. Knowing how TCP works really is a superpower :D

Layer 4: The Transport Layer

As mentioned earlier, L4 protocols work to add varying levels of reliability to the network and to address processes within a host. The most widespread transport protocol is the Transmission Control Protocol (TCP), which is used by the web. TCP solves the problem of packet loss and out-of-order reception, creating the abstraction of a reliable network from the point of view of applications. The way it works is by basically retrying until the message is received. In TCP, there is no concept of a message. The data being transferred is treated as an ordered sequence of bytes. Nonetheless, these bytes are sent over the networks in batches, which we call segments. The other L4 protocol is UDP, which is very minimal in the guarantees it offers. You can think of it as a L4 version of the IP protocol with the addition of ports and a checksum over the content (the IP checksum only covers its header). The UDP is very useful for applications that already manage communication unreliability internally, or just don't care: sometimes it's better to have fewer packets reaching the destination on time, than having all of them reach it a bit late. The web uses mostly TCP, so we will put UDP on the side and focus on TCP.

TCP Segment Layout

To learn about TCP, a good starting point is its segment layout. First, I want to give you an overview of all the fields, and then dive into the details of how they work with the following paragraphs.

0           1           2           3           4
+-----+-----+-----+-----+-----+-----+-----+-----+
|      source port      |    destination port   |
+-----+-----+-----+-----+-----+-----+-----+-----+
|                sequence number                |
+-----+-----+-----+-----+-----+-----+-----+-----+
|             acknowledgement number            |
+-----+-----+-----+-----+-----+-----+-----+-----+
|   offset  |   flags   |      window size      |
+-----+-----+-----+-----+-----+-----+-----+-----+
|        checksum       |     urgent pointer    |
+-----+-----+-----+-----+-----+-----+-----+-----+
|                                               |
|                ... options ...                |
|                                               |
+-----+-----+-----+-----+-----+-----+-----+-----+
|                                               |
|                ... payload ...                |
|                                               |
|                                               |
|                                               |
+-----+-----+-----+-----+-----+-----+-----+-----+

NOTE: each row is 4 bytes.

The port fields allow the addressing of specific processes within a host. When a TCP connection is initiated by a process, both input and output ports must be specified.

The sequence number identifies which bytes are contained by the segment. If a segment of size 100 has sequence number of 333, this means it's holding bytes with sequence numbers from 333 to 100+333-1=432. Sequence numbers are essential to ensure that data is delivered in order. I remember thinking of sequence number as something like a "segment ID", but that's not quite correct. If for some reason the receiver only acknowledges part of the segment (more on ACKs later), the sender will have to send a new segment which overlaps with the previous one. When a connection is established, each host chooses its own starting sequence number. From that point on, sequence numbers are increased by the number of bytes sent. Note that if the first sequence number were zero, the sequence number of a byte would correspond to its position in the stream relative to the start. Unfortunately, for security reasons hosts choose a random-ish sequence number.

The acknowledgement number is only used when the ACK flag is set and contains the next sequence number expected by the sender of the segment. With this, hosts know how many bytes the peer received and, by comparing it with the last sequence number, can infer how many are in transit or were lost in the network.

The offset contains the number of 32 bit words that make up the header. This is necessary since segment headers have variable length due to options.

The flags field contains a number of flags which gives a special meaning to the segment. The flags we will mostly talk about are SYN, ACK, FIN. The other important flag is RST, which is used to abort a connection on the spot.

The window size field contains the number of bytes the host is willing to accept. It's required for flow control.

The checksum contains a code which is calculated over the entire segment and is used to detect corrupted bits.

If you were to ask me which TCP segment field is my favorite, I would certainly tell you it's the urgent pointer! Apparently, TCP allows sending two streams of data over a single connection: one for regular data, and one for urgent/high priority data. When the URG flag is set, the current segment is holding urgent data and the length of that data is the urgent pointer value minus one. The urgent payload spans from the start of the segment up to the byte with sequence number sequence number + urgent pointer. Ahh.. so much to say. I'll have to write a post specifically about this!

The options region of the header may contain configurations which change the default behavior of the protocol.

ACKing and Retransmission

For now, let's skip the connection establishment step to see how a TCP connection behaves when sending regular data.

A TCP connection is made of two unidirectional streams of data, each with its own sequence numbers. Since each host has its own output channel, it's pretty much free to send data whenever it wants. Any synchronization between input and output is handled by higher level protocols.

Each connection is associated with an input buffer and an output buffer. The input buffer holds data received by the network until the application accepts it, and the output buffer contains data sent by the application that is waiting to be forwarded to the network. The knowledge of these buffers is probably the single most important thing for a network programmer.

When a host sends data to the network, it doesn't immediately remove it from the output buffer. Instead, the host assumes some of it will be lost and will need to be sent again. Upon receiving that data, the peer will respond with an ACK segment (a segment with the ACK flag set) telling the sender that the data has been received. Bytes from the output buffer can be freed only after an ACK for them has been received. If no ACK is received by a certain time, the data is assumed to be lost and is retransmitted. When a number of segments are lost, the sender may start assuming the network is congested or the peer is not available at all.

If a host sending an ACK is also about to send a data segment, the two can be merged into one data+ACK segment. This should be pretty intuitive. You just set the ACK flag on the data segment. We call this piggybacking.

Cumulative ACKs and Pipelining

Every ACK segment sent by a host contains an acknowledgement number, which corresponds to the sequence number of the next byte expected by the host. When everything goes well, the acknowledgement number will match the peer's current sequence number. However, the receiver may acknowledge only part of the previously received segment. For instance, the segment may contain more data than the available free space. In this case, the ACK number will refer to the first discarded byte in the data segment and the sender will have to retransmit those bytes by sending a segment which overlaps with the previous one. The ACK number mechanism makes it so the relation between data and ACK segments is not 1-to-1: every ACK refers to all bytes up to a certain point. This is why we call ACKs in TCP "cumulative".

A consequence of cumulative ACKing is that each ACK implies all previous ones. If a host receives two segments, the ACK for the second segment will also ACK the first segment! This completely changes the relation between data segments and ACKs. We don't need to ACK every segment, we just need to send an occasional ACK to let the sender know everything up to this point was received. The ability to send multiple segments before the ACK for the previous ones is received, which we call pipelining, greatly improves performance.

Flow Control

TCP employs a mechanism to regulate data transmission to avoid overwhelming the receiver, which we call flow control. When sending segments, hosts set the window size field to the available free space in their input buffer. This tells the sender that any data over this limit will likely be discarded. With this, senders know how much data is safe to send.

The Three-way Handshake

To use TCP, hosts must first establish a connection. We call this process the three-way handshake.

The host initiating the connection sends a SYN segment (a segment with the SYN flag set). This segment contains its first sequence number. If the peer is open to communicate, it will respond with an ACK+SYN segment, which includes its own first sequence number. The initial host then replies with an ACK segment, which concludes the handshake. If no ACK is sent in response to a SYN, the SYN is retransmitted. Although uncommon, SYNs can piggyback on data segments, in which case the data must be delivered after the handshake has been completed.

In the previous paragraphs we talked about how ACKs use sequence numbers to specify what was received, which implies only things with a sequence number can be ACKed. But then, how do we ACK SYNs? The answer is SYNs are ACKed as any other byte, except there is no byte for it. We call SYN a "ghost byte". Of course this only works since SYN is sent before any data. If we added random ghost bytes in the middle of our stream it would mess up the entire sequence. Similarly, the connection termination process uses the FIN flag which also behave as ghost bytes.

Here's a diagram of the handshake:

Host A                       Host B
  |                            |
  | ---------- SYN ----------> |
  |                            |
  | <------- SYN+ACK --------- |
  |                            |
  | ---------- ACK ----------> |
  |                            |

However, this process is flexible enough that it handles peers simultaneously sending connection requests to each other:

Host A                       Host B
  |                            |
  | ---------- SYN ----------> |
  |                            |
  | <--------- SYN ----------- |
  |                            |
  | <--------- ACK ----------- |
  |                            |
  | ---------- ACK ----------> |
  |                            |

So the handshake is not strictly three-way. We just call it that because it requires three steps most of the time. We can also interpret the SYN+ACK segment as the ACK piggybacking on the SYN segment. Hosts responding to a SYN could just decide to send their own SYN and ACK in different segments.

The Four-Way Termination

Terminating a TCP connection is a similar process to creating one, where we use FIN segments instead of SYN segments. If a host decides it will not send any more data, it must send a FIN segment. The peer must then ACK the FIN segment. FINs only terminate the output channel of a host, which means communication is free to continue in the direction of the host which sent the FIN. When both peers have sent (and acknowledged) FINs, the connection is closed.

The four-way termination looks like this:

Host A                       Host B
  |                            |
  | ---------- FIN ----------> |
  |                            |
  | <--------- ACK ----------- |
  |                            |
  | <----- optional data ----- |
  |                            |
  | <--------- FIN ----------- |
  |                            |
  | ---------- ACK ----------> |
  |                            |

but similarly to handshakes, it can be expressed as a three-way termination:

Host A                       Host B
  |                            |
  | ---------- FIN ----------> |
  |                            |
  | <------- FIN+ACK --------- |
  |                            |
  | ---------- ACK ----------> |
  |                            |

although this version is less common.

Congestion Control

At times, especially due to pipelining, it's possible for networks to get congested. This happens when routers are overwhelmed by the amount of packets in transit, causing them to drop some of them. To avoid this, TCP implements a congestion control mechanism. This consists of keeping track of the "congestion window", which is the maximum number of bytes that the network is able to handle at once. This is not a static value since it depends on the current load of the network, therefore TCP must infer it dynamically based on the behavior of the network. The congestion window value starts small and is increased any time an ACK is received. When three duplicate ACKs are received or there is an ACK timeout, a congestion is assumed and the congestion window is reduced. The specifics change based on the implemented algorithm (like TCP Reno, TCP Tahoe, and many more).

What's next

And that's it! The world's most famous protocol explained in a measly 60-minute read :D

In the next post I'll just give an overview of HTTP. I won't go into much detail since we'll talk about it extensively throughout the entire series.

Join the Discussion!

Have questions or feedback for me? Feel free to pop in my discord

Join the Discord Server