TCP protocol
============

Last updated: 3 June 2017

Contents
========

- Congestion control
- How the new TCP output machine [nyi] works

Congestion control
==================

The following variables are used in the tcp_sock for congestion control:
snd_cwnd          The size of the congestion window.
snd_ssthresh      Slow start threshold. We are in slow start if
                  snd_cwnd is less than this.
snd_cwnd_cnt      A counter used to slow down the rate of increase
                  once we exceed the slow start threshold.
snd_cwnd_clamp    The maximum size that snd_cwnd can grow to.
snd_cwnd_stamp    Timestamp of when the congestion window was last
                  validated.
snd_cwnd_used     Used as a high-water mark for how much of the
                  congestion window is in use. It is used to adjust
                  snd_cwnd down when the link is limited by the
                  application rather than the network.
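
To make the roles of these fields concrete, here is a minimal userspace
sketch of how they could interact on each incoming ACK. It is not the
kernel's code: struct cwnd_state and both function names are hypothetical,
and the Reno-style growth policy is an assumption chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of the fields listed above; not struct tcp_sock. */
struct cwnd_state {
	uint32_t snd_cwnd;       /* congestion window, in packets */
	uint32_t snd_ssthresh;   /* slow start threshold */
	uint32_t snd_cwnd_cnt;   /* ACKs counted toward the next increase */
	uint32_t snd_cwnd_clamp; /* upper bound for snd_cwnd */
};

/* We are in slow start while snd_cwnd is below snd_ssthresh. */
static int in_slow_start(const struct cwnd_state *s)
{
	return s->snd_cwnd < s->snd_ssthresh;
}

/* Process one ACK covering one packet (assumed Reno-style policy). */
static void on_ack(struct cwnd_state *s)
{
	assert(s->snd_cwnd > 0);

	if (in_slow_start(s)) {
		/* Slow start: grow by one packet per ACK. */
		s->snd_cwnd++;
	} else if (++s->snd_cwnd_cnt >= s->snd_cwnd) {
		/* Past the threshold: one packet per full window of ACKs. */
		s->snd_cwnd_cnt = 0;
		s->snd_cwnd++;
	}

	/* Never grow beyond the clamp. */
	if (s->snd_cwnd > s->snd_cwnd_clamp)
		s->snd_cwnd = s->snd_cwnd_clamp;
}
```

Starting from snd_cwnd = 2 and snd_ssthresh = 4, two ACKs bring the window
to 4, after which it takes four further ACKs to reach 5.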

As of 2.6.13, Linux supports pluggable congestion control algorithms.
A congestion control mechanism can be registered through functions in
tcp_cong.c. The functions used by the congestion control mechanism are
registered by passing a tcp_congestion_ops struct to
tcp_register_congestion_control. As a minimum, the congestion control
mechanism must provide a valid name and must implement either the ssthresh,
cong_avoid and undo_cwnd hooks, or the "omnipotent" cong_control hook.
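
The minimum requirements above can be expressed as a small check. The
sketch below is a userspace stand-in, not the kernel structure: struct
cc_ops, its simplified hook signatures and cc_ops_valid() are all
hypothetical names used only to illustrate the rule.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for tcp_congestion_ops. Only the hooks named in
 * the text are modelled, and their signatures are simplified. */
struct cc_ops {
	const char *name;
	unsigned int (*ssthresh)(void *sk);
	void (*cong_avoid)(void *sk, unsigned int ack, unsigned int acked);
	unsigned int (*undo_cwnd)(void *sk);
	void (*cong_control)(void *sk);
};

/* Return 1 if ops meets the minimum requirements, 0 otherwise. */
static int cc_ops_valid(const struct cc_ops *ops)
{
	int classic;

	/* A valid name is always required. */
	if (!ops || !ops->name || ops->name[0] == '\0')
		return 0;

	/* Either all three classic hooks, or cong_control alone. */
	classic = ops->ssthresh && ops->cong_avoid && ops->undo_cwnd;
	return classic || ops->cong_control != NULL;
}

/* Trivial hook bodies so that the check can be exercised. */
static unsigned int demo_ssthresh(void *sk) { (void)sk; return 2; }
static void demo_cong_avoid(void *sk, unsigned int ack, unsigned int acked)
{
	(void)sk; (void)ack; (void)acked;
}
static unsigned int demo_undo_cwnd(void *sk) { (void)sk; return 10; }
static void demo_cong_control(void *sk) { (void)sk; }
```

A table with only a name and cong_control passes the check, while one that
provides ssthresh and cong_avoid but omits undo_cwnd does not.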

Private data for a congestion control mechanism is stored in tp->ca_priv.
tcp_ca(tp) returns a pointer to this space. This is preallocated space, so
it is important to check that your private data fits in it. Alternatively,
space could be allocated elsewhere and a pointer to it stored here.
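
One way to enforce the size constraint is a compile-time assertion, so that
an oversized private structure breaks the build rather than corrupting
memory at run time. The sketch below is a userspace model under stated
assumptions: CA_PRIV_SIZE, struct conn, struct demo_ca and the load/store
helpers are hypothetical. Unlike tcp_ca(tp), which returns a pointer, the
helpers here copy the state in and out to stay within standard C aliasing
rules.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CA_PRIV_SIZE 64	/* hypothetical size of the preallocated area */

/* Stand-in for the socket; only the private area is modelled. */
struct conn {
	unsigned char ca_priv[CA_PRIV_SIZE];
};

/* Hypothetical per-connection state of an example algorithm. */
struct demo_ca {
	uint32_t min_rtt_us;
	uint32_t ack_count;
};

/* Fail at build time if the private state does not fit. */
static_assert(sizeof(struct demo_ca) <= CA_PRIV_SIZE,
	      "demo_ca does not fit in ca_priv");

/* Copy the algorithm state out of the preallocated area. */
static void demo_ca_load(const struct conn *c, struct demo_ca *out)
{
	memcpy(out, c->ca_priv, sizeof(*out));
}

/* Copy the algorithm state back into the preallocated area. */
static void demo_ca_store(struct conn *c, const struct demo_ca *in)
{
	memcpy(c->ca_priv, in, sizeof(*in));
}
```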

There are three kinds of congestion control algorithms currently: the
simplest ones are derived from TCP Reno (highspeed, scalable) and just
provide an alternative congestion window calculation. More complex
ones like BIC try to look at other events to provide better
heuristics. There are also round trip time based algorithms like
Vegas and Westwood+.

Good TCP congestion control is a complex problem because the algorithm
needs to maintain fairness and performance. Please review current
research and RFCs before developing new modules.

The default congestion control mechanism is chosen based on the
DEFAULT_TCP_CONG Kconfig parameter. If you want a different default at run
time, you can set it using sysctl net.ipv4.tcp_congestion_control. The
module will be autoloaded if needed and you will get the expected protocol.
If you ask for an unknown congestion method, then the sysctl attempt will
fail.
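
The current default can also be inspected from a program, because the
sysctl is exposed as a file under /proc/sys. The helper below is a minimal
sketch; read_cc_name() and the CC_SYSCTL_PATH macro are names chosen here
for illustration. The path is taken as a parameter so the function can be
pointed at any file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* File backing the net.ipv4.tcp_congestion_control sysctl. */
#define CC_SYSCTL_PATH "/proc/sys/net/ipv4/tcp_congestion_control"

/* Read the algorithm name stored in path into buf.
 * Returns 0 on success and -1 if the file cannot be opened or read. */
static int read_cc_name(const char *path, char *buf, size_t len)
{
	FILE *f;

	assert(buf != NULL && len > 0);

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Drop the trailing newline, if there is one. */
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

On a Linux system, read_cc_name(CC_SYSCTL_PATH, buf, sizeof(buf)) fills buf
with the name of the default algorithm. Changing the value means writing to
the same file, which normally requires root privileges.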

If you remove a TCP congestion control module, then you will get the next
available one. Since reno cannot be built as a module, and cannot be
removed, it will always be available.

How the new TCP output machine [nyi] works
==========================================

Data is kept on a single queue. The skb->users flag tells us if the frame is
one that has been queued already. To add a frame we throw it on the end. Ack
walks down the list from the start.
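
The queue discipline described above (append at the end, acknowledge from
the start) can be sketched with a singly linked list. This is a userspace
illustration only: struct frame, struct txq and both functions are
hypothetical, and the skb->users flag is not modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct frame {
	uint32_t end_seq;	/* sequence number just past this frame */
	struct frame *next;
};

struct txq {
	struct frame *head;	/* oldest unacknowledged frame */
	struct frame *tail;	/* where new frames are added */
};

/* Add a frame at the end of the queue.
 * Returns 0 on success, -1 if allocation fails. */
static int txq_append(struct txq *q, uint32_t end_seq)
{
	struct frame *f = malloc(sizeof(*f));

	if (!f)
		return -1;
	f->end_seq = end_seq;
	f->next = NULL;
	if (q->tail)
		q->tail->next = f;
	else
		q->head = f;
	q->tail = f;
	return 0;
}

/* Walk from the start, freeing every frame fully covered by ack.
 * Returns the number of frames released. */
static int txq_ack(struct txq *q, uint32_t ack)
{
	int freed = 0;

	/* The signed difference keeps the test valid across wraparound. */
	while (q->head && (int32_t)(ack - q->head->end_seq) >= 0) {
		struct frame *f = q->head;

		q->head = f->next;
		if (!q->head)
			q->tail = NULL;
		free(f);
		freed++;
	}
	return freed;
}
```

Because frames are only ever appended, the list stays ordered by sequence
number and the walk can stop at the first frame that is not fully
acknowledged.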

We keep a set of control flags:

    sk->tcp_pend_event

        TCP_PEND_ACK        Ack needed
        TCP_ACK_NOW         Needed now
        TCP_WINDOW          Window update check
        TCP_WINZERO         Zero probing

    sk->transmit_queue      Start of the transmission queue
    sk->transmit_new        First new frame pointer
    sk->transmit_end        Where to add frames

    sk->tcp_last_tx_ack     Last ack seen
    sk->tcp_dup_ack         Dup ack count for fast retransmit
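
The last two fields work as a pair. As a rough sketch, the duplicate ack
counter could be maintained as shown below. struct ack_state and ack_seen()
are hypothetical, the threshold of three duplicates is the conventional
fast retransmit value rather than something this design states, and a real
implementation would also check that the segment carries no data and does
not change the window before counting it as a duplicate.

```c
#include <assert.h>
#include <stdint.h>

#define DUP_ACK_THRESHOLD 3	/* conventional fast retransmit threshold */

struct ack_state {
	uint32_t tcp_last_tx_ack;	/* last ack seen */
	unsigned int tcp_dup_ack;	/* consecutive duplicates of it */
};

/* Record an incoming ack number.
 * Returns 1 exactly when the duplicate count reaches the threshold,
 * which is the point where a fast retransmit would be triggered. */
static int ack_seen(struct ack_state *s, uint32_t ack)
{
	if (ack == s->tcp_last_tx_ack) {
		s->tcp_dup_ack++;
		return s->tcp_dup_ack == DUP_ACK_THRESHOLD;
	}

	/* A new ack number resets the duplicate count. */
	s->tcp_last_tx_ack = ack;
	s->tcp_dup_ack = 0;
	return 0;
}
```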


Frames are queued for output by tcp_write. We do our best to send the frames
off immediately if possible, but otherwise queue and compute the body
checksum in the copy.
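
Computing the body checksum "in the copy" means summing the data during the
same pass that copies it into the queued frame, so the payload is only
touched once. The function below is a portable sketch of that idea using
the 16-bit one's complement sum of the Internet checksum. copy_and_sum() is
a hypothetical name, and a real implementation would use an optimised,
architecture-specific routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes from src to dst and return the 16-bit one's complement
 * sum of the data, treating it as big-endian 16-bit words. The result is
 * not inverted, so header words can still be added to it later. */
static uint16_t copy_and_sum(uint8_t *dst, const uint8_t *src, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		dst[i] = src[i];
		dst[i + 1] = src[i + 1];
		sum += ((uint32_t)src[i] << 8) | src[i + 1];
		sum = (sum & 0xffff) + (sum >> 16);	/* fold the carry */
	}

	/* An odd trailing byte is padded with a zero byte. */
	if (i < len) {
		dst[i] = src[i];
		sum += (uint32_t)src[i] << 8;
		sum = (sum & 0xffff) + (sum >> 16);
	}

	return (uint16_t)sum;
}
```

Keeping the sum uninverted lets the header contribution be folded in just
before the frame is sent.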

When a write is done we try to clear any pending events and piggyback them.
If the window is full we queue full sized frames. On the first timeout in
zero window we split this.
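
As an illustration of piggybacking, the sketch below clears the events that
an outgoing data frame satisfies. The flag names are the ones used in this
design, but the bit values, struct sock_model and piggyback_on_write() are
hypothetical, and the choice of which events a data frame clears is an
assumption.

```c
#include <assert.h>

/* Flag names follow the design; the bit values are assumptions. */
enum {
	TCP_PEND_ACK = 1 << 0,	/* ack needed */
	TCP_ACK_NOW  = 1 << 1,	/* needed now */
	TCP_WINDOW   = 1 << 2,	/* window update check */
	TCP_WINZERO  = 1 << 3	/* zero probing */
};

/* Stand-in for the socket; only the pending event word is modelled. */
struct sock_model {
	unsigned int tcp_pend_event;
};

/* Called when a data frame is about to be sent.
 * Returns 1 if a pending ack rode along with the frame, 0 otherwise. */
static int piggyback_on_write(struct sock_model *sk)
{
	const unsigned int ack_bits = TCP_PEND_ACK | TCP_ACK_NOW;
	int carried_ack = (sk->tcp_pend_event & ack_bits) != 0;

	/* The frame carries the current ack and window, so those events
	 * are satisfied. Zero window probing is left untouched. */
	sk->tcp_pend_event &= ~(ack_bits | TCP_WINDOW);
	return carried_ack;
}
```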

On a timer we walk the retransmit list to send any retransmits, update the
backoff timers etc. A change in the route table stamp causes the header to
be changed and recomputed. We add any new TCP level headers and refinish
the checksum before sending.
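
The text does not say how the backoff timers are updated. A common choice,
assumed here purely for illustration, is to double the retransmit timeout
on each expiry up to a fixed ceiling. backoff_rto() and RTO_MAX_MS are
hypothetical names and values.

```c
#include <assert.h>
#include <stdint.h>

#define RTO_MAX_MS 120000u	/* assumed ceiling for this sketch */

/* Return the timeout to use after the timer fires: twice the current
 * value, capped at RTO_MAX_MS. */
static uint32_t backoff_rto(uint32_t rto_ms)
{
	assert(rto_ms > 0);

	/* Comparing against half the ceiling avoids overflow on doubling. */
	if (rto_ms >= RTO_MAX_MS / 2)
		return RTO_MAX_MS;
	return rto_ms * 2;
}
```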