Commit | Line | Data |
---|---|---|
cc8889ae WB |
1 | |
2 | ============ | |
3 | MSG_ZEROCOPY | |
4 | ============ | |
5 | ||
6 | Intro | |
7 | ===== | |
8 | ||
9 | The MSG_ZEROCOPY flag enables copy avoidance for socket send calls. | |
31a1b8d5 | 10 | The feature is currently implemented for TCP and UDP sockets. |
cc8889ae WB |
11 | |
12 | ||
13 | Opportunity and Caveats | |
14 | ----------------------- | |
15 | ||
16 | Copying large buffers between user process and kernel can be | |
17 | expensive. Linux supports various interfaces that eschew copying, | |
18 | such as sendpage and splice. The MSG_ZEROCOPY flag extends the | |
19 | underlying copy avoidance mechanism to common socket send calls. | |
20 | ||
21 | Copy avoidance is not a free lunch. As implemented, with page pinning, | |
22 | it replaces per byte copy cost with page accounting and completion | |
23 | notification overhead. As a result, MSG_ZEROCOPY is generally only | |
24 | effective at writes over around 10 KB. | |
25 | ||
26 | Page pinning also changes system call semantics. It temporarily shares | |
27 | the buffer between process and network stack. Unlike with copying, the | |
28 | process cannot immediately overwrite the buffer after system call | |
29 | return without possibly modifying the data in flight. Kernel integrity | |
30 | is not affected, but a buggy program can possibly corrupt its own data | |
31 | stream. | |
32 | ||
33 | The kernel returns a notification when it is safe to modify data. | |
34 | Converting an existing application to MSG_ZEROCOPY is not always as | |
35 | trivial as just passing the flag, then. | |
36 | ||
37 | ||
38 | More Info | |
39 | --------- | |
40 | ||
41 | Much of this document was derived from a longer paper presented at | |
42 | netdev 2.1. For more in-depth information see that paper and talk, | |
43 | the excellent reporting over at LWN.net or read the original code. | |
44 | ||
45 | paper, slides, video | |
46 | https://netdevconf.org/2.1/session.html?debruijn | |
47 | ||
48 | LWN article | |
49 | https://lwn.net/Articles/726917/ | |
50 | ||
51 | patchset | |
52 | [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY | |
25208dd8 | 53 | https://lkml.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com |
cc8889ae WB |
54 | |
55 | ||
56 | Interface | |
57 | ========= | |
58 | ||
59 | Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy | |
60 | avoidance, but not the only one. | |
61 | ||
62 | Socket Setup | |
63 | ------------ | |
64 | ||
65 | The kernel is permissive when applications pass undefined flags to the | |
66 | send system call. By default it simply ignores these. To avoid enabling | |
67 | copy avoidance mode for legacy processes that accidentally already pass | |
68 | this flag, a process must first signal intent by setting a socket option: | |
69 | ||
70 | :: | |
71 | ||
72 | if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one))) | |
73 | error(1, errno, "setsockopt zerocopy"); | |
74 | ||
cc8889ae WB |
75 | Transmission |
76 | ------------ | |
77 | ||
78 | The change to send (or sendto, sendmsg, sendmmsg) itself is trivial. | |
79 | Pass the new flag. | |
80 | ||
81 | :: | |
82 | ||
83 | ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY); | |
84 | ||
85 | A zerocopy failure will return -1 with errno ENOBUFS. This happens if | |
86 | the socket option was not set, the socket exceeds its optmem limit or | |
87 | the user exceeds its ulimit on locked pages. | |
88 | ||
89 | ||
90 | Mixing copy avoidance and copying | |
91 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
92 | ||
93 | Many workloads have a mixture of large and small buffers. Because copy | |
94 | avoidance is more expensive than copying for small packets, the | |
95 | feature is implemented as a flag. It is safe to mix calls with the flag | |
96 | with those without. | |
97 | ||
98 | ||
99 | Notifications | |
100 | ------------- | |
101 | ||
102 | The kernel has to notify the process when it is safe to reuse a | |
103 | previously passed buffer. It queues completion notifications on the | |
104 | socket error queue, akin to the transmit timestamping interface. | |
105 | ||
106 | The notification itself is a simple scalar value. Each socket | |
107 | maintains an internal unsigned 32-bit counter. Each send call with | |
108 | MSG_ZEROCOPY that successfully sends data increments the counter. The | |
109 | counter is not incremented on failure or if called with length zero. | |
110 | The counter counts system call invocations, not bytes. It wraps after | |
111 | UINT_MAX calls. | |
112 | ||
113 | ||
114 | Notification Reception | |
115 | ~~~~~~~~~~~~~~~~~~~~~~ | |
116 | ||
117 | The below snippet demonstrates the API. In the simplest case, each | |
118 | send syscall is followed by a poll and recvmsg on the error queue. | |
119 | ||
120 | Reading from the error queue is always a non-blocking operation. The | |
121 | poll call is there to block until an error is outstanding. It will set | |
122 | POLLERR in its output flags. That flag does not have to be set in the | |
123 | events field. Errors are signaled unconditionally. | |
124 | ||
125 | :: | |
126 | ||
127 | pfd.fd = fd; | |
128 | pfd.events = 0; | |
129 | if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0) | |
130 | error(1, errno, "poll"); | |
131 | ||
132 | ret = recvmsg(fd, &msg, MSG_ERRQUEUE); | |
133 | if (ret == -1) | |
134 | error(1, errno, "recvmsg"); | |
135 | ||
136 | read_notification(&msg); | |
137 | ||
138 | The example is for demonstration purposes only. In practice, it is more | |
139 | efficient to not wait for notifications, but read without blocking | |
140 | every couple of send calls. | |
141 | ||
142 | Notifications can be processed out of order with other operations on | |
143 | the socket. A socket that has an error queued would normally block | |
144 | other operations until the error is read. Zerocopy notifications have | |
145 | a zero error code, however, to not block send and recv calls. | |
146 | ||
147 | ||
148 | Notification Batching | |
149 | ~~~~~~~~~~~~~~~~~~~~~ | |
150 | ||
151 | Multiple outstanding packets can be read at once using the recvmmsg | |
152 | call. This is often not needed. In each message the kernel returns not | |
153 | a single value, but a range. It coalesces consecutive notifications | |
154 | while one is outstanding for reception on the error queue. | |
155 | ||
156 | When a new notification is about to be queued, it checks whether the | |
157 | new value extends the range of the notification at the tail of the | |
158 | queue. If so, it drops the new notification packet and instead increases | |
159 | the range upper value of the outstanding notification. | |
160 | ||
161 | For protocols that acknowledge data in-order, like TCP, each | |
162 | notification can be squashed into the previous one, so that no more | |
163 | than one notification is outstanding at any one point. | |
164 | ||
165 | Ordered delivery is the common case, but not guaranteed. Notifications | |
166 | may arrive out of order on retransmission and socket teardown. | |
167 | ||
168 | ||
169 | Notification Parsing | |
170 | ~~~~~~~~~~~~~~~~~~~~ | |
171 | ||
172 | The below snippet demonstrates how to parse the control message: the | |
173 | read_notification() call in the previous snippet. A notification | |
174 | is encoded in the standard error format, sock_extended_err. | |
175 | ||
176 | The level and type fields in the control data are protocol family | |
177 | specific, IP_RECVERR or IPV6_RECVERR. | |
178 | ||
179 | Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero, | |
180 | as explained before, to avoid blocking read and write system calls on | |
181 | the socket. | |
182 | ||
183 | The 32-bit notification range is encoded as [ee_info, ee_data]. This | |
184 | range is inclusive. Other fields in the struct must be treated as | |
185 | undefined, except for ee_code, as discussed below. | |
186 | ||
187 | :: | |
188 | ||
189 | struct sock_extended_err *serr; | |
190 | struct cmsghdr *cm; | |
191 | ||
192 | cm = CMSG_FIRSTHDR(msg); | |
193 | if (cm->cmsg_level != SOL_IP || | |
194 | cm->cmsg_type != IP_RECVERR) | |
195 | error(1, 0, "cmsg"); | |
196 | ||
197 | serr = (void *) CMSG_DATA(cm); | |
198 | if (serr->ee_errno != 0 || | |
199 | serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) | |
200 | error(1, 0, "serr"); | |
201 | ||
202 | printf("completed: %u..%u\n", serr->ee_info, serr->ee_data); | |
203 | ||
204 | ||
205 | Deferred copies | |
206 | ~~~~~~~~~~~~~~~ | |
207 | ||
208 | Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy | |
209 | avoidance, and a contract that the kernel will queue a completion | |
210 | notification. It is not a guarantee that the copy is elided. | |
211 | ||
212 | Copy avoidance is not always feasible. Devices that do not support | |
213 | scatter-gather I/O cannot send packets made up of kernel generated | |
214 | protocol headers plus zerocopy user data. A packet may need to be | |
215 | converted to a private copy of data deep in the stack, say to compute | |
216 | a checksum. | |
217 | ||
218 | In all these cases, the kernel returns a completion notification when | |
219 | it releases its hold on the shared pages. That notification may arrive | |
220 | before the (copied) data is fully transmitted. A zerocopy completion | |
221 | notification is not a transmit completion notification, therefore. | |
222 | ||
223 | Deferred copies can be more expensive than a copy immediately in the | |
224 | system call, if the data is no longer warm in the cache. The process | |
225 | also incurs notification processing cost for no benefit. For this | |
226 | reason, the kernel signals if data was completed with a copy, by | |
227 | setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return. | |
228 | A process may use this signal to stop passing flag MSG_ZEROCOPY on | |
229 | subsequent requests on the same socket. | |
230 | ||
231 | ||
232 | Implementation | |
233 | ============== | |
234 | ||
235 | Loopback | |
236 | -------- | |
237 | ||
238 | Data sent to local sockets can be queued indefinitely if the receive | |
239 | process does not read its socket. Unbounded notification latency is not | |
240 | acceptable. For this reason all packets generated with MSG_ZEROCOPY | |
241 | that are looped to a local socket will incur a deferred copy. This | |
242 | includes looping onto packet sockets (e.g., tcpdump) and tun devices. | |
243 | ||
244 | ||
245 | Testing | |
246 | ======= | |
247 | ||
248 | More realistic example code can be found in the kernel source under | |
249 | tools/testing/selftests/net/msg_zerocopy.c. | |
250 | ||
251 | Be cognizant of the loopback constraint. The test can be run between | |
252 | a pair of hosts. But if run between a local pair of processes, for | |
253 | instance when run with msg_zerocopy.sh between a veth pair across | |
254 | namespaces, the test will not show any improvement. For testing, the | |
255 | loopback restriction can be temporarily relaxed by making | |
256 | skb_orphan_frags_rx identical to skb_orphan_frags. |
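
The notification range described under Notification Parsing is inclusive and is built from a 32-bit counter that wraps, so range arithmetic needs some care. The helpers below are a minimal sketch of one way to handle it in user space. They are not taken from the kernel or the selftest, and the names zc_range_len and zc_range_contains are hypothetical. The sketch assumes the process records the counter value of each of its own MSG_ZEROCOPY send calls.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Number of send calls covered by the inclusive range [lo, hi], as
 * reported in ee_info and ee_data. Unsigned arithmetic modulo 2^32
 * keeps the result correct when the counter wraps between lo and hi.
 * A range that covers all 2^32 values would yield 0; this sketch
 * assumes that never happens in practice.
 */
static inline uint32_t zc_range_len(uint32_t lo, uint32_t hi)
{
	return (uint32_t)(hi - lo + 1u);
}

/*
 * True if send number id lies inside the inclusive range [lo, hi],
 * including ranges that wrap past UINT32_MAX.
 */
static inline bool zc_range_contains(uint32_t lo, uint32_t hi, uint32_t id)
{
	return (uint32_t)(id - lo) <= (uint32_t)(hi - lo);
}
```

A process could, for instance, call zc_range_contains(serr->ee_info, serr->ee_data, id) after parsing a notification to decide whether the buffer passed in send call number id may be reused.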