Linux TCP congestion control internals

Linux has a pluggable TCP congestion control architecture: the IPv4 and IPv6 implementations both call a set of functions that implement congestion control. The congestion-control algorithm can be changed system-wide for new connections, or set for individual sockets using setsockopt (more info here). Here, we look at how the TCP implementation interacts with the congestion control algorithms and …

Research is based on kernel v4.4.3

The congestion control interface

Linux’s congestion control algorithm interface is defined in struct tcp_congestion_ops.

Where congestion control state is kept

Connection-oriented sockets (struct inet_connection_sock) contain a fixed amount of space for congestion control state in icsk_ca_priv:

inet_csk_ca(sk) gets a pointer to this space given a socket:

init and release

Initialization is triggered as part of a three way handshake (SYN-> SYN+ACK ->ACK) when:

  • On a listening socket when a TCP fastopen “SYN” packet arrives (tcp_rcv_state_process -> icsk->icsk_af_ops->conn_request -> tcp_{v4,v6}_conn_request -> tcp_conn_request -> tcp_try_fastopen -> tcp_fastopen_create_child).
  • In a TCP client, getting the “SYN+ACK” (called through tcp_rcv_state_process -> tcp_rcv_synsent_state_process -> tcp_finish_connect).
  • In a TCP server, getting the “ACK” (step 5 in tcp_rcv_state_process).

Or when changing the congestion control algorithm (do_tcp_setsockopt -> tcp_set_congestion_control -> tcp_reinit_congestion_control).

Release is called when setsockopt changes algorithms and when the socket is closed (tcp_v4_destroy_sock, also called for IPv6 sockets from tcp_v6_destroy_sock).

set_state and the different TCP states

The calls to congestion control algorithms depend on the current state of the TCP connection and the transitions between states. TCP state is captured in enum tcp_ca_state:

The function tcp_set_ca_state switches between states by first calling the congestion control algorithm’s set_state, then setting icsk->icsk_ca_state:

This way states are arranged in both the enum and masks (TCPF_*) above allows for bitwise checking if a socket’s state is within a set of states. For example,

And to capture all states by severity of disruption it is possible to compare by numerical value:

TCP socket state

State follows definitions in RFC793:

Below are some important state variables. tp-> designates struct tcp_sock (in include/linux/tcp.h). The variables from the RFC are named in lowercase:

  • tp->snd_nxt: the highest seq that was sent on the wire
  • tp->snd_wnd: the peer’s window
  • tp->snd_wl1: the seq number when peer last updated the window
  • tp->rcv_nxt: the next sequence number to be received. This is used in the ACK field of outgoing packets.
  • tp->rcv_wnd: the window size advertised to peer

Other important variables:

  • tp->write_seq: the highest seq written from the user process
  • tp->copied_seq: sequence number that will next be copied to the user.
  • tp->rcv_wup: the sequence number when window was advertised to peer (see more below)
  • tp->tlp_high_seq: zero when no TLP probe has been sent, is set to tp->snd_nxt when sending a TLP probe.
  • tp->prior_ssthresh: saves the previous ssthresh when going into window reduction, for cwnd undoing,  if undoing is allowed.
  • tp->window_clamp: the maximum rcv window that will be advertised.

The flow-control window – what’s tp->rcv_wup (a.k.a RCV.WUP)?

On outgoing packets, a TCP sender includes an ACK, and th->window, the number of bytes after the ACK the sender is allowed to send. Special care is taken not to shrink the window when changing it, so if the other end has already sent some bytes they will fit in the new window.

The kernel keeps the information on the last window update in two variables. tp->rcv_wnd keeps the advertised window, and tp->rcv_wup keeps the last ACK that carried the update. This means that the other end may send until sequence number tp->rcv_wup + tp->rcv_wnd. So, in subsequent window advertisements, the code makes sure that the last permitted seq is not less than that sequence number.

  • tcp_receive_window() returns the number of bytes after tp->rcv_next that the previous window advertisement allowed
  • __tcp_select_window() chooses the window size considering free space,  and trying to change the window as little as possible (for header prediction) while avoiding SWS.
  • tcp_select_window() makes sure that the selected window doesn’t shrink.

Incoming packet handling — tcp_rcv_established

Packets usually flow from the NIC to tcp_v{4,6}_do_rcv(), and when the TCP connection is in ESTABLISHED state, to tcp_rcv_established(). Here, the code follows a Van Jaconson-inspired scheme to quickly process “fast-path” packets, i.e., TCP packets that have no special circumstances: they have the next expected SEQ, carry an ACK in the correct range, and have the anticipated flags. The fast-path code optimizes two cases: ACK-only and data packets.

We now survey the functions that are called in each processing path.

ACK-only packet processing will call tcp_ack(), free the skb, then call tcp_data_snd_check(). For data packets, if the user process is waiting on the application, tcp_copy_to_iovec() tries to copy the data directly to user buffers, otherwise it calls tcp_queue_rcv() to enqueue to socket buffers. The data fast-path ends with the following processing:

The slow path, if successful, calls tcp_ack(), tcp_urg()tcp_data_queue(), tcp_data_snd_check(), and tcp_ack_snd_check().

So in these three code paths (fast-ACK, fast-data, slow):

  • all handle incoming ACK data in tcp_ack()
  • slow path calls tcp_urg()
  • if there is data, need to queue it call tcp_copy_to_iovec(), tcp_queue_rcv() or tcp_data_queue().
  • all call tcp_data_snd_check().
  • if there is data, call __tcp_ack_snd_check() on fast path or tcp_ack_snd_check() on slow path.

Let’s discuss some of these functions.

tcp_data_snd_check() and triggering of data sends

A call to tcp_data_snd_check() triggers the following:

  • A call to tcp_push_pending_frames(), which breaks if tcp_send_head() is NULL, i.e., there is no pending packet in the write_queue that can be transmitted.
  • Calls __tcp_push_pending_frames(), which breaks if the socket is in TCP_CLOSE state
  • Call tcp_write_xmit(). We know tcp_send_head() != NULL, so the function goes through several checks:
    • tcp_cwnd_test: how many packets does the congestion window allow. To first approximation, returns max(0,tp->snd_cwnd – tcp_packets_in_flight(tp))
    • tcp_snd_wnd_test: the first segment in the skb must fit into the send window (i.e., its last seq should be < tcp_wnd_end()).
    • if only one segment, perform Nagle check in tcp_nagle_test()
    • if more than one segment, try to defer to minimize TSO splitting, tcp_tso_should_defer() calls this “a kind of TSO Nagle test”.
    • Do some TSO accounting, and make sure there is no buffer-bloat in the kernel’s qdiscs (a.k.a. TCP Small Queues or TSQ)
  • tcp_trasmit_skb() actually transmits the packet.
  • after all allowed packets have been sent:
    • if socket is in CWR or Recovery state, adjust PRR state.
    • schedule a loss probe (for TLP)
  • If tcp_write_xmit() decided not to send, calls tcp_check_probe_timer()

Acknowledgement handling in tcp_ack()

tcp_ack() first performs a few checks of the ack against socket state. Is it acking previously-acked data (“old ack”)? Is it acking unsent data (“invalid_ack”)? If it is valid it proceeds re-arming the RTO if it is set for EARLY_RETRANS or LOSS_PROBE.

The flags at the start of tcp_ack() can contain:

  • FLAG_DATA: the packet contains data (not a pure ack) in the fast path
  • FLAG_SLOWPATH: header prediction didn’t work on the packet
  • FLAG_UPDATE_TS_RECENT: after verifying validity, ack_tcp() should call tcp_replace_ts_recent() to update the timestamp (tp->rx_opt.ts_recent and tp->rx_opt.ts_recent_stamp).

The flags variable is updates throughput the run:

  • FLAG_SND_UNA_ADVANCED: if this ack increases tp->snd_una, i.e., it acks previously unacked packets.
  • FLAG_DATA: added in tcp_ack() if in the slow path and packet contains data.
  • FLAG_WIN_UPDATE: if the right boundary of the peer’s advertised window might have moved (note there could be false positives in the slow path).
  • FLAG_ECE: ECN Echo bit was marked on the packet.

tcp_clean_rtx_queue() sets the following flags:

  • FLAG_RETRANS_DATA_ACKED: “This ACK acknowledged new data some of which was retransmitted”
  • FLAG_ORIG_SACK_ACKED: “Never retransmitted data are (s)acked”
  • FLAG_DATA_ACKED: previously unacknowledged data is acknowledged by the packet
  • FLAG_SYN_ACKED: the ACK acknowledges a SYN.
  • FLAG_SACK_RENEGING: the ACK acknowledges up to some packet, but the subsequent SEQ has been SACKed; this means the peer must have dropped a packet it has SACKed (otherwise the ACK would have included that packet too).

If header prediction succeeds, the packets enters the fast path, and the prediction guarantees the peer’s advertised window has not changed. So, in the fast path only tp->snd_wl1 and tp->snd_una need to be updated. In the slow path, tcp_ack_update_window() checks that the packet contains fresher information, and if so updates these variables and also tp->snd_wnd, and recomputes the fast-path prediction for the next packet. Slow-path packets might have SACKs, so these are processed next, then check for ECN Echo bit. tcp_clean_rtx_queue() frees packets from the write queue that are acked, so have arrived at the destination.

A dubious ack (see tcp_ack_is_dubious()) is an ack where packets:

  • carry the FLAG_CA_ALERT flag, or
  • do not carry any of the flags specified by FLAG_NOT_DUP, i.e., contain none of  FLAG_DATA, FLAG_WIN_UPDATE, FLAG_DATA_ACKED, or FLAG_SYN_ACKED. or,
  • are in any connection not in TCP_CA_Open state

When acks are considered “dubious”, tcp_ack() calls tcp_fastretrans_alert() which we cover in a later section.

After processing dubious acks, tcp_ack() might call cong_avoid(), discussed next.

cong_avoid

Called from tcp_ack() if tcp_may_raise_cwnd() returns true. This requires:

  • The socket must not to be in CWR or Recovery states
  • Some  progress has been made. Usually this means that some packets were acknowledged so FLAG_DATA_ACKED is set: tcp_clean_rtx_queue() counts how many packets were fully acknowledged by the ACK and removes them from the retransmit queue; if packets are removed, it sets FLAG_DATA_ACKED. When reordering in the network exceeds a threshold, a wider definition of progress is used (see FLAG_FORWARD_PROGRESS, basically this also counts new SACK’d packets).

 

tcp_fastretrans_alert()

  • On incoming packets with ECE marking, disable cwnd undoing.
  • If SACK reneging was detected (FLAG_SACK_RENEGING was set in tcp_clean_rtx_queue()), sets a short retransmission timeout to allow for ACKs to arrive for these packets, or otherwise the timeout will reset SACK state. If reneging is suspected, no further processing is done in tcp_fastretrans_alert().
  • If in CWR state and ACK is above tp->high_seq, will call tcp_end_cwnd_reduction() which resets tp->snd_cwnd to tp->snd_ssthresh, then move to TCP_CA_Open state.
  • If in Recovery state, and ACK is equal or above tp->high_seq, will call tcp_try_undo_recovery()

cwnd_event and TCP events

The TCP stack reports some events with tcp_ca_event(), which are then propagated to the congestion control’s cwnd_event:

The event definitions also carries short descriptions:

In v4.4.3, calls to tcp_ca_event() appear in 8 source lines, one for each of the event types.

  • CA_EVENT_TX_START from tcp_event_data_sent()
  • CA_EVENT_CWND_RESTART from tcp_cwnd_restart()
  • CA_EVENT_COMPLETE_CWR from tcp_end_cwnd_reduction()
  • CA_EVENT_LOSS from tcp_enter_loss()
  • CA_EVENT_ECN_NO_CE, CA_EVENT_ECN_IS_CE from __tcp_ecn_check_ce()
  • CA_EVENT_DELAYED_ACK from tcp_send_delayed_ack()
  • CA_EVENT_NON_DELAYED_ACK from tcp_send_ack()

ssthresh

ssthresh is called from two functions: tcp_init_cwnd_reduction and tcp_enter_loss. The return value sets tp->snd_ssthresh, where tp points to struct tcp_sock.

in_ack_event

Called from tcp_in_ack_event. In the v4.4.3 kernel, only DCTCP and Westwood implement this callback. This is where DCTCP maintains counters for all acknowledged bytes and all ECN-marked bytes, and updates is alpha estimate each RTT (see dctcp_update_alpha()).

Whereas cong_avoid() is called only when tcp_may_raise_cwnd() is true, in_ack_event() is called for every ACK, and the call is before cong_avoid() in ACK processing so congestion control algorithms that implement both can expect a call to in_ack_event() before every cong_avoid().

The flag argument can have these flags:

In the fast-path where the ACK acknowledges a new segment, the callback will receive only CA_ACK_WIN_UPDATE. Any other case will have at least CA_ACK_SLOWPATH.

undo_cwnd

Call path is tcp_undo_cwnd_reduction, called from:

  • tcp_try_undo_recovery
  • tcp_try_undo_dsack: “Try to undo cwnd reduction, because D-SACKs acked all retransmitted data”
  • tcp_try_undo_loss: “Undo during loss recovery after partial ACK or using F-RTO.”
  • tcp_try_undo_partial from tcp_fastretrans_alert() : “Undo during fast recovery after partial ACK.”

Used in the bic, cdg, cubic, and htcp algorithms.

pkts_acked

Called from tcp_ack -> tcp_clean_rtx_queue. This seems to be called for every ACK-only packet, and every valid slow-path packet with a valid ack value, even when no new packets have been acked. For fast-path packets with data, tcp_ack is only called if ACK_SEQ != SND_UNA, i.e., it is either an old ack or acks more bytes.

get_info

This is used when dumping diagnostics, i.e., dump() and dump_one() in tcp_diag_handler.

Resources

Posted in Tech Blog.

Leave a Reply

Your email address will not be published. Required fields are marked *