Linux has a pluggable TCP congestion control architecture: the IPv4 and IPv6 implementations both call a common set of functions that implement congestion control. The congestion-control algorithm can be changed system-wide for new connections, or set for individual sockets using setsockopt with the TCP_CONGESTION option. Here, we look at how the TCP implementation interacts with the congestion control algorithms and …
The analysis is based on kernel v4.4.3.
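Before diving into the kernel side, here is a minimal user-space sketch of per-socket selection via TCP_CONGESTION (error handling trimmed; the algorithm name must appear in /proc/sys/net/ipv4/tcp_available_congestion_control):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Select a congestion control algorithm for one socket.
 * Returns 0 on success, -1 (with errno set) on failure,
 * e.g., ENOENT if the algorithm is not available. */
static int set_cc(int fd, const char *algo)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
                      algo, strlen(algo));
}

The system-wide default for new connections is controlled by the net.ipv4.tcp_congestion_control sysctl.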
The congestion control interface
Linux’s congestion control algorithm interface is defined in struct tcp_congestion_ops.
struct tcp_congestion_ops {
    struct list_head list;
    u32 key;
    u32 flags;

    /* initialize private data (optional) */
    void (*init)(struct sock *sk);
    /* cleanup private data (optional) */
    void (*release)(struct sock *sk);

    /* return slow start threshold (required) */
    u32 (*ssthresh)(struct sock *sk);
    /* do new cwnd calculation (required) */
    void (*cong_avoid)(struct sock *sk, u32 ack, u32 acked);
    /* call before changing ca_state (optional) */
    void (*set_state)(struct sock *sk, u8 new_state);
    /* call when cwnd event occurs (optional) */
    void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev);
    /* call when ack arrives (optional) */
    void (*in_ack_event)(struct sock *sk, u32 flags);
    /* new value of cwnd after loss (optional) */
    u32 (*undo_cwnd)(struct sock *sk);
    /* hook for packet ack accounting (optional) */
    void (*pkts_acked)(struct sock *sk, u32 num_acked, s32 rtt_us);
    /* get info for inet_diag (optional) */
    size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
                       union tcp_cc_info *info);

    char name[TCP_CA_NAME_MAX];
    struct module *owner;
};
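To make the contract concrete, here is a minimal sketch of a registerable algorithm. It fills in only the two required callbacks by reusing the Reno helpers the kernel exports; the module name cc_sketch is mine, not something from the kernel tree:

#include <linux/module.h>
#include <net/tcp.h>

static struct tcp_congestion_ops cc_sketch __read_mostly = {
    /* ssthresh and cong_avoid are the only required hooks;
     * the kernel exports Reno's implementations for reuse. */
    .ssthresh   = tcp_reno_ssthresh,
    .cong_avoid = tcp_reno_cong_avoid,
    .name       = "cc_sketch",
    .owner      = THIS_MODULE,
};

static int __init cc_sketch_register(void)
{
    /* Rejects the ops if a required callback is missing. */
    return tcp_register_congestion_control(&cc_sketch);
}

static void __exit cc_sketch_unregister(void)
{
    tcp_unregister_congestion_control(&cc_sketch);
}

module_init(cc_sketch_register);
module_exit(cc_sketch_unregister);
MODULE_LICENSE("GPL");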
Where congestion control state is kept
Connection-oriented sockets (struct inet_connection_sock) contain a fixed amount of space for congestion control state in icsk_ca_priv:
struct inet_connection_sock {
    /* I removed some lines - JP */
    u64 icsk_ca_priv[64 / sizeof(u64)];
#define ICSK_CA_PRIV_SIZE (8 * sizeof(u64))
};
inet_csk_ca(sk) gets a pointer to this space given a socket:
static inline void *inet_csk_ca(const struct sock *sk)
{
    return (void *)inet_csk(sk)->icsk_ca_priv;
}
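A common pattern, sketched here with a hypothetical state struct of my own, is to overlay the algorithm's per-connection state on icsk_ca_priv and assert at compile time that it fits:

/* Hypothetical per-connection state, for illustration only. */
struct cc_sketch_state {
    u32 epoch_start;    /* when the current cwnd epoch began */
    u32 last_max_cwnd;  /* cwnd just before the last loss */
};

static void cc_sketch_init(struct sock *sk)
{
    struct cc_sketch_state *ca = inet_csk_ca(sk);

    /* Private state must fit in icsk_ca_priv (64 bytes here). */
    BUILD_BUG_ON(sizeof(struct cc_sketch_state) > ICSK_CA_PRIV_SIZE);

    ca->epoch_start = 0;
    ca->last_max_cwnd = 0;
}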
init and release
Initialization is triggered as part of the three-way handshake (SYN -> SYN+ACK -> ACK):
- On a listening socket when a TCP fastopen “SYN” packet arrives (tcp_rcv_state_process -> icsk->icsk_af_ops->conn_request -> tcp_{v4,v6}_conn_request -> tcp_conn_request -> tcp_try_fastopen -> tcp_fastopen_create_child).
- In a TCP client, on receiving the “SYN+ACK” (called through tcp_rcv_state_process -> tcp_rcv_synsent_state_process -> tcp_finish_connect).
- In a TCP server, on receiving the final “ACK” (step 5 in tcp_rcv_state_process).
Initialization also happens when changing the congestion control algorithm (do_tcp_setsockopt -> tcp_set_congestion_control -> tcp_reinit_congestion_control).
Release is called when setsockopt changes algorithms and when the socket is closed (tcp_v4_destroy_sock, also called for IPv6 sockets from tcp_v6_destroy_sock).
set_state and the different TCP states
The calls to congestion control algorithms depend on the current state of the TCP connection and the transitions between states. TCP state is captured in enum tcp_ca_state:
enum tcp_ca_state {
    TCP_CA_Open = 0,
#define TCPF_CA_Open (1<<TCP_CA_Open)
    TCP_CA_Disorder = 1,
#define TCPF_CA_Disorder (1<<TCP_CA_Disorder)
    TCP_CA_CWR = 2,
#define TCPF_CA_CWR (1<<TCP_CA_CWR)
    TCP_CA_Recovery = 3,
#define TCPF_CA_Recovery (1<<TCP_CA_Recovery)
    TCP_CA_Loss = 4
#define TCPF_CA_Loss (1<<TCP_CA_Loss)
};
The function tcp_set_ca_state switches between states by first calling the congestion control algorithm’s set_state, then setting icsk->icsk_ca_state:
static inline void tcp_set_ca_state(struct sock *sk, const u8 ca_state)
{
    struct inet_connection_sock *icsk = inet_csk(sk);

    if (icsk->icsk_ca_ops->set_state)
        icsk->icsk_ca_ops->set_state(sk, ca_state);
    icsk->icsk_ca_state = ca_state;
}
Arranging the states as both enum values and bit masks (TCPF_*) allows a bitwise check of whether a socket's state is within a set of states. For example,
static inline bool tcp_in_cwnd_reduction(const struct sock *sk)
{
    return (TCPF_CA_CWR | TCPF_CA_Recovery) &
           (1 << inet_csk(sk)->icsk_ca_state);
}
And since the states are ordered by severity of disruption, a range of states can be captured with a numerical comparison:
if (icsk->icsk_ca_state <= TCP_CA_Disorder)
    tcp_try_undo_dsack(sk);
TCP socket state
State follows the definitions in RFC 793:
Send Sequence Variables
    SND.UNA - send unacknowledged
    SND.NXT - send next
    SND.WND - send window
    SND.UP  - send urgent pointer
    SND.WL1 - segment sequence number used for last window update
    SND.WL2 - segment acknowledgment number used for last window update
    ISS     - initial send sequence number

Receive Sequence Variables
    RCV.NXT - receive next
    RCV.WND - receive window
    RCV.UP  - receive urgent pointer
    IRS     - initial receive sequence number
Below are some important state variables. tp-> designates struct tcp_sock (in include/linux/tcp.h). The variables from the RFC are named in lowercase:
- tp->snd_nxt: the next sequence number to be sent (one past the highest seq sent on the wire)
- tp->snd_wnd: the peer’s window
- tp->snd_wl1: the seq number when peer last updated the window
- tp->rcv_nxt: the next sequence number to be received. This is used in the ACK field of outgoing packets.
- tp->rcv_wnd: the window size advertised to peer
Other important variables:
- tp->write_seq: the highest seq written from the user process
- tp->copied_seq: sequence number that will next be copied to the user.
- tp->rcv_wup: the sequence number when window was advertised to peer (see more below)
- tp->tlp_high_seq: zero when no TLP probe is outstanding; set to tp->snd_nxt when a TLP probe is sent.
- tp->prior_ssthresh: saves the previous ssthresh when going into window reduction, for cwnd undoing, if undoing is allowed.
- tp->window_clamp: the maximum rcv window that will be advertised.
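As a sanity check on these definitions, here is a sketch (helper names are mine) of how two common quantities fall out of them; note that this is modular u32 sequence arithmetic, so wrap-around is handled for free:

/* Bytes sent on the wire but not yet acknowledged by the peer. */
static inline u32 bytes_in_flight(const struct tcp_sock *tp)
{
    return tp->snd_nxt - tp->snd_una;
}

/* Bytes written by the user process that TCP has not yet sent. */
static inline u32 bytes_unsent(const struct tcp_sock *tp)
{
    return tp->write_seq - tp->snd_nxt;
}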
The flow-control window – what’s tp->rcv_wup (a.k.a. RCV.WUP)?
On outgoing packets, a TCP sender includes an ACK and th->window, the number of bytes beyond the acknowledged sequence that the peer is allowed to send. Special care is taken not to shrink the window when changing it, so that if the other end has already sent some bytes, they still fit in the new window.
The kernel keeps the information from the last window update in two variables: tp->rcv_wnd holds the advertised window, and tp->rcv_wup holds the ACK value (tp->rcv_nxt) that was sent with the last window update. This means the other end may send up to sequence number tp->rcv_wup + tp->rcv_wnd. So, in subsequent window advertisements, the code makes sure the last permitted seq does not fall below that sequence number.
- tcp_receive_window() returns the number of bytes after tp->rcv_nxt that the previous window advertisement allowed.
- __tcp_select_window() chooses the window size considering free space, trying to change the window as little as possible (for header prediction) while avoiding silly window syndrome (SWS).
- tcp_select_window() makes sure that the selected window doesn’t shrink.
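The first of these is small enough to quote; in v4.4 it is essentially the following, which makes the tp->rcv_wup + tp->rcv_wnd bound explicit:

/* Bytes the previous advertisement still allows the peer to send
 * beyond tp->rcv_nxt. */
static inline u32 tcp_receive_window(const struct tcp_sock *tp)
{
    s32 win = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt;

    if (win < 0)
        win = 0;
    return (u32) win;
}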
Incoming packet handling — tcp_rcv_established
Packets usually flow from the NIC to tcp_v{4,6}_do_rcv(), and when the TCP connection is in ESTABLISHED state, to tcp_rcv_established(). Here, the code follows a Van Jacobson-inspired scheme to quickly process “fast-path” packets, i.e., TCP packets that have no special circumstances: they have the next expected SEQ, carry an ACK in the correct range, and have the anticipated flags. The fast-path code optimizes two cases: ACK-only and data packets.
We now survey the functions that are called in each processing path.
ACK-only packet processing calls tcp_ack(), frees the skb, then calls tcp_data_snd_check(). For data packets, if the user process is currently waiting for data on this socket, tcp_copy_to_iovec() tries to copy the data directly to user buffers; otherwise tcp_queue_rcv() enqueues it to socket buffers. The data fast-path ends with the following processing:
tcp_event_data_recv(sk, skb);

if (TCP_SKB_CB(skb)->ack_seq != tp->snd_una) {
    /* Well, only one small jumplet in fast path... */
    tcp_ack(sk, skb, FLAG_DATA);
    tcp_data_snd_check(sk);
    if (!inet_csk_ack_scheduled(sk))
        goto no_ack;
}

__tcp_ack_snd_check(sk, 0);
no_ack:
if (eaten)
    kfree_skb_partial(skb, fragstolen);
sk->sk_data_ready(sk);
return;
The slow path, if successful, calls tcp_ack(), tcp_urg(), tcp_data_queue(), tcp_data_snd_check(), and tcp_ack_snd_check().
So in these three code paths (fast-ACK, fast-data, slow):
- all three handle the incoming ACK in tcp_ack()
- only the slow path calls tcp_urg()
- if there is data, it is queued via tcp_copy_to_iovec(), tcp_queue_rcv(), or tcp_data_queue()
- all call tcp_data_snd_check()
- if there is data, __tcp_ack_snd_check() is called on the fast path and tcp_ack_snd_check() on the slow path
Let’s discuss some of these functions.
tcp_data_snd_check() and triggering of data sends
A call to tcp_data_snd_check() triggers the following:
- A call to tcp_push_pending_frames(), which bails out if tcp_send_head() is NULL, i.e., there is no pending packet in the write_queue that can be transmitted.
- This calls __tcp_push_pending_frames(), which bails out if the socket is in TCP_CLOSE state.
- That in turn calls tcp_write_xmit(). At this point we know tcp_send_head() != NULL, so the function goes through several checks:
- tcp_cwnd_test: how many packets does the congestion window allow? To a first approximation, it returns max(0, tp->snd_cwnd - tcp_packets_in_flight(tp)); see the sketch after this list.
- tcp_snd_wnd_test: the first segment of the skb must fit into the send window (i.e., its end seq must not be beyond tcp_wnd_end()).
- if only one segment, perform Nagle check in tcp_nagle_test()
- if more than one segment, try to defer to minimize TSO splitting, tcp_tso_should_defer() calls this “a kind of TSO Nagle test”.
- Do some TSO accounting, and make sure there is no buffer-bloat in the kernel’s qdiscs (a.k.a. TCP Small Queues or TSQ)
- tcp_transmit_skb() actually transmits the packet.
- after all allowed packets have been sent, tcp_cwnd_validate() updates accounting of whether cwnd is being fully used
- If tcp_write_xmit() decided not to send anything, tcp_check_probe_timer() is called to arm the zero-window probe timer
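For a feel of the first check, here is a simplified sketch of tcp_cwnd_test(); the real v4.4 function additionally lets a final FIN through regardless of cwnd and caps the result to half the window to smooth bursts:

/* Roughly: how many more segments does cwnd currently allow? */
static unsigned int cwnd_test_sketch(const struct tcp_sock *tp)
{
    u32 in_flight = tcp_packets_in_flight(tp);

    if (in_flight >= tp->snd_cwnd)
        return 0;               /* window is full, send nothing */

    return tp->snd_cwnd - in_flight;
}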
Acknowledgement handling in tcp_ack()
tcp_ack() first performs a few checks of the ack against socket state. Is it acking previously-acked data (an “old ack”)? Is it acking data that was never sent (an “invalid ack”)? If the ack is valid, it proceeds to re-arm the RTO timer if the timer is currently set for EARLY_RETRANS or LOSS_PROBE.
The flags at the start of tcp_ack() can contain:
- FLAG_DATA: the packet contains data (not a pure ack) in the fast path
- FLAG_SLOWPATH: header prediction didn’t work on the packet
- FLAG_UPDATE_TS_RECENT: after verifying validity, tcp_ack() should call tcp_replace_ts_recent() to update the timestamp (tp->rx_opt.ts_recent and tp->rx_opt.ts_recent_stamp).
The flags variable is updated throughout the run:
- FLAG_SND_UNA_ADVANCED: if this ack increases tp->snd_una, i.e., it acks previously unacked packets.
- FLAG_DATA: added in tcp_ack() if in the slow path and packet contains data.
- FLAG_WIN_UPDATE: if the right boundary of the peer’s advertised window might have moved (note there could be false positives in the slow path).
- FLAG_ECE: ECN Echo bit was marked on the packet.
tcp_clean_rtx_queue() sets the following flags:
- FLAG_RETRANS_DATA_ACKED: “This ACK acknowledged new data some of which was retransmitted”
- FLAG_ORIG_SACK_ACKED: “Never retransmitted data are (s)acked”
- FLAG_DATA_ACKED: previously unacknowledged data is acknowledged by the packet
- FLAG_SYN_ACKED: the ACK acknowledges a SYN.
- FLAG_SACK_RENEGING: the ACK acknowledges up to some packet, but the subsequent SEQ has been SACKed; this means the peer must have dropped a packet it has SACKed (otherwise the ACK would have included that packet too).
If header prediction succeeds, the packet enters the fast path, and the prediction guarantees that the peer’s advertised window has not changed. So, in the fast path only tp->snd_wl1 and tp->snd_una need to be updated. In the slow path, tcp_ack_update_window() checks that the packet contains fresher information and, if so, updates these variables and also tp->snd_wnd, and recomputes the fast-path prediction for the next packet. Slow-path packets might carry SACKs, so these are processed next, and then the ECN Echo bit is checked. Finally, tcp_clean_rtx_queue() frees acked packets from the write queue, since they have arrived at the destination.
A dubious ack (see tcp_ack_is_dubious()) is an ack that:
- carries the FLAG_CA_ALERT flag, or
- carries none of the flags in FLAG_NOT_DUP, i.e., none of FLAG_DATA, FLAG_WIN_UPDATE, FLAG_DATA_ACKED, or FLAG_SYN_ACKED, or
- arrives on a connection not in TCP_CA_Open state
When acks are considered “dubious”, tcp_ack() calls tcp_fastretrans_alert() which we cover in a later section.
After processing dubious acks, tcp_ack() might call cong_avoid(), discussed next.
cong_avoid
Called from tcp_ack() if tcp_may_raise_cwnd() returns true. This requires:
- The socket must not be in CWR or Recovery states
- Some progress has been made. Usually this means that some packets were acknowledged so FLAG_DATA_ACKED is set: tcp_clean_rtx_queue() counts how many packets were fully acknowledged by the ACK and removes them from the retransmit queue; if packets are removed, it sets FLAG_DATA_ACKED. When reordering in the network exceeds a threshold, a wider definition of progress is used (see FLAG_FORWARD_PROGRESS, basically this also counts new SACK’d packets).
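Put together, the v4.4 logic is approximately this sketch (the in-reduction check is the CWR/Recovery condition above; note the FLAG_* macros are private to net/ipv4/tcp_input.c):

/* Sketch of tcp_may_raise_cwnd() as of v4.4. */
static inline bool may_raise_cwnd_sketch(const struct sock *sk, const int flag)
{
    if (tcp_in_cwnd_reduction(sk))  /* CWR or Recovery */
        return false;

    /* Under heavy reordering any forward progress (including new
     * SACKs) counts; otherwise require newly acked data. */
    if (tcp_sk(sk)->reordering > sysctl_tcp_reordering)
        return flag & FLAG_FORWARD_PROGRESS;

    return flag & FLAG_DATA_ACKED;
}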
tcp_fastretrans_alert()
- On incoming packets with ECE marking, disable cwnd undoing.
- If SACK reneging was detected (FLAG_SACK_RENEGING set in tcp_clean_rtx_queue()), sets a short retransmission timeout to give ACKs for those packets time to arrive; if it expires, the timeout handling resets SACK state. While reneging is suspected, no further processing is done in tcp_fastretrans_alert().
- If in CWR state and the ACK is above tp->high_seq, calls tcp_end_cwnd_reduction(), which resets tp->snd_cwnd to tp->snd_ssthresh, then moves to TCP_CA_Open state.
- If in Recovery state and the ACK is at or above tp->high_seq, calls tcp_try_undo_recovery().
cwnd_event and TCP events
The TCP stack reports some events with tcp_ca_event(), which are then propagated to the congestion control’s cwnd_event:
static inline void tcp_ca_event(struct sock *sk, const enum tcp_ca_event event)
{
    const struct inet_connection_sock *icsk = inet_csk(sk);

    if (icsk->icsk_ca_ops->cwnd_event)
        icsk->icsk_ca_ops->cwnd_event(sk, event);
}
The event definitions also carry short descriptions:
/* Events passed to congestion control interface */
enum tcp_ca_event {
    CA_EVENT_TX_START,       /* first transmit when no packets in flight */
    CA_EVENT_CWND_RESTART,   /* congestion window restart */
    CA_EVENT_COMPLETE_CWR,   /* end of congestion recovery */
    CA_EVENT_LOSS,           /* loss timeout */
    CA_EVENT_ECN_NO_CE,      /* ECT set, but not CE marked */
    CA_EVENT_ECN_IS_CE,      /* received CE marked IP packet */
    CA_EVENT_DELAYED_ACK,    /* Delayed ack is sent */
    CA_EVENT_NON_DELAYED_ACK,
};
In v4.4.3, calls to tcp_ca_event() appear in 8 source lines, one for each of the event types.
- CA_EVENT_TX_START from tcp_event_data_sent()
- CA_EVENT_CWND_RESTART from tcp_cwnd_restart()
- CA_EVENT_COMPLETE_CWR from tcp_end_cwnd_reduction()
- CA_EVENT_LOSS from tcp_enter_loss()
- CA_EVENT_ECN_NO_CE, CA_EVENT_ECN_IS_CE from __tcp_ecn_check_ce()
- CA_EVENT_DELAYED_ACK from tcp_send_delayed_ack()
- CA_EVENT_NON_DELAYED_ACK from tcp_send_ack()
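A congestion control module typically consumes these in a switch statement. A hypothetical handler, continuing the cc_sketch_state example from above (the reactions shown are illustrative, not any particular algorithm's):

static void cc_sketch_cwnd_event(struct sock *sk, enum tcp_ca_event ev)
{
    struct cc_sketch_state *ca = inet_csk_ca(sk);

    switch (ev) {
    case CA_EVENT_LOSS:
        /* RTO fired: discard algorithm-private history. */
        ca->epoch_start = 0;
        break;
    case CA_EVENT_TX_START:
        /* First transmit with nothing in flight: new epoch. */
        ca->epoch_start = tcp_time_stamp;
        break;
    default:
        break;
    }
}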
ssthresh
ssthresh is called from two functions: tcp_init_cwnd_reduction and tcp_enter_loss. The return value sets tp->snd_ssthresh, where tp points to struct tcp_sock.
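Reno's exported implementation is representative: halve cwnd, with a floor of two segments:

/* From net/ipv4/tcp_cong.c (v4.4). */
u32 tcp_reno_ssthresh(struct sock *sk)
{
    const struct tcp_sock *tp = tcp_sk(sk);

    return max(tp->snd_cwnd >> 1U, 2U);
}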
in_ack_event
Called from tcp_in_ack_event(). In the v4.4.3 kernel, only DCTCP and Westwood implement this callback. This is where DCTCP maintains counters for all acknowledged bytes and all ECN-marked bytes, and updates its alpha estimate each RTT (see dctcp_update_alpha()).
Whereas cong_avoid() is called only when tcp_may_raise_cwnd() is true, in_ack_event() is called for every ACK. The call comes earlier in ACK processing, so congestion control algorithms that implement both can expect a call to in_ack_event() before every cong_avoid().
The flag argument can have these flags:
/* Information about inbound ACK, passed to cong_ops->in_ack_event() */
enum tcp_ca_ack_event_flags {
    CA_ACK_SLOWPATH   = (1 << 0), /* In slow path processing */
    CA_ACK_WIN_UPDATE = (1 << 1), /* ACK updated window */
    CA_ACK_ECE        = (1 << 2), /* ECE bit is set on ack */
};
In the fast-path where the ACK acknowledges a new segment, the callback will receive only CA_ACK_WIN_UPDATE. Any other case will have at least CA_ACK_SLOWPATH.
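To illustrate the callback's shape, here is a hypothetical DCTCP-style byte counter; it assumes the sketch's private state struct also carries prior_snd_una and the two counters, which are my additions:

static void cc_sketch_in_ack_event(struct sock *sk, u32 flags)
{
    const struct tcp_sock *tp = tcp_sk(sk);
    struct cc_sketch_state *ca = inet_csk_ca(sk);
    u32 acked_bytes = tp->snd_una - ca->prior_snd_una;

    /* Count every newly acked byte; separately count bytes the
     * receiver echoed congestion for (ECE), DCTCP-style. */
    ca->acked_bytes_total += acked_bytes;
    if (flags & CA_ACK_ECE)
        ca->acked_bytes_ecn += acked_bytes;

    ca->prior_snd_una = tp->snd_una;
}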
undo_cwnd
The call path goes through tcp_undo_cwnd_reduction(), which is called from:
- tcp_try_undo_recovery
- tcp_try_undo_dsack: “Try to undo cwnd reduction, because D-SACKs acked all retransmitted data”
- tcp_try_undo_loss: “Undo during loss recovery after partial ACK or using F-RTO.”
- tcp_try_undo_partial from tcp_fastretrans_alert() : “Undo during fast recovery after partial ACK.”
The callback is implemented by the bic, cdg, cubic, and htcp algorithms.
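The typical shape of the callback is to restore the larger of the current cwnd and a remembered pre-loss value; bic and cubic keep such a loss_cwnd field, assumed here on the sketch's state struct. (When the callback is NULL, the stack's default is max(tp->snd_cwnd, tp->snd_ssthresh << 1).)

static u32 cc_sketch_undo_cwnd(struct sock *sk)
{
    const struct tcp_sock *tp = tcp_sk(sk);
    const struct cc_sketch_state *ca = inet_csk_ca(sk);

    /* loss_cwnd: cwnd remembered at the last loss (assumed field). */
    return max(tp->snd_cwnd, ca->loss_cwnd);
}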
pkts_acked
Called from tcp_ack() -> tcp_clean_rtx_queue(). This appears to be called for every ACK-only packet and for every valid slow-path packet with a valid ack value, even when no new packets have been acked. For fast-path packets with data, tcp_ack() is only called if ACK_SEQ != SND_UNA, i.e., the packet is either an old ack or acks more bytes.
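A minimal RTT-sampling handler (hypothetical, with an assumed min_rtt_us field on the state struct) shows the arguments; rtt_us is negative when there is no valid sample, e.g., when the ACK covered only retransmitted data:

static void cc_sketch_pkts_acked(struct sock *sk, u32 num_acked, s32 rtt_us)
{
    struct cc_sketch_state *ca = inet_csk_ca(sk);

    if (rtt_us <= 0)
        return;         /* no usable RTT sample on this ACK */

    /* Track a simple minimum-RTT estimate. */
    if (ca->min_rtt_us == 0 || (u32)rtt_us < ca->min_rtt_us)
        ca->min_rtt_us = rtt_us;
}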
get_info
This is used when dumping diagnostics, i.e., dump() and dump_one() in tcp_diag_handler.
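A sketch modeled on how tcp_vegas fills the INET_DIAG_VEGASINFO extension (the values reported here reuse the hypothetical min_rtt_us from the pkts_acked sketch):

static size_t cc_sketch_get_info(struct sock *sk, u32 ext, int *attr,
                                 union tcp_cc_info *info)
{
    const struct cc_sketch_state *ca = inet_csk_ca(sk);

    /* Only respond if the diag request asked for this extension. */
    if (ext & (1 << (INET_DIAG_VEGASINFO - 1))) {
        info->vegas.tcpv_enabled = 1;
        info->vegas.tcpv_rttcnt  = 0;
        info->vegas.tcpv_rtt     = ca->min_rtt_us;
        info->vegas.tcpv_minrtt  = ca->min_rtt_us;
        *attr = INET_DIAG_VEGASINFO;
        return sizeof(info->vegas);
    }
    return 0;
}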
Resources
- Arianfar’s “TCP’s Congestion Control Implementation in Linux Kernel” focuses on describing the TCP mechanisms and important functions, then looks at how Cubic is implemented.
Comments
Great work.
Some questions:
1. How did you sort out the invocation chain, e.g., that “undo_cwnd” is called by “try_undo_recovery”, etc.?
2. What tool do you use to browse the code base?
3. Any great reference/book you recommend on TCP in the kernel?
TIA.
Hi TIA
1. I read the code. See answer to 2.
2. Eclipse can index kernel code to allow you to browse more freely (see my previous blog post).
3. I’m not familiar with a book that addresses TCP particularly, however two books I found extremely useful are Daniel Bovet and Marco Cesati’s “Understanding the Linux Kernel” and Christian Benvenuti’s “Understanding Linux Network Internals”.
Jonathan
You have done fantastic work here.
I am currently working on congestion control in SDN. I am looking into how to write applications on the POX controller that modify some of the mechanisms you have described above, and I am hoping you can offer some assistance.
I anticipate your speedy response.
Kind Regards,
Sorry, wasn’t speedy here… Please reach out via email if there’s anything specific!