QEMU Virtual NVDIMM
===================

This document explains how to use the virtual NVDIMM (vNVDIMM) feature,
which has been available since QEMU v2.6.0.

QEMU currently implements only the persistent memory mode of the vNVDIMM
device, not the block window mode.

Basic Usage
-----------

The storage of a vNVDIMM device in QEMU is provided by a memory
backend (i.e. memory-backend-file or memory-backend-ram). A simple
way to create a vNVDIMM device at startup time is to use the
following command line options:

    -machine pc,nvdimm
    -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
    -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE,readonly=off
    -device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off

Where,

 - the "nvdimm" machine option enables the vNVDIMM feature.

 - "slots=$N" should be equal to or larger than the total number of
   normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here.

 - "maxmem=$MAX_SIZE" should be equal to or larger than the total size
   of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be
   >= $RAM_SIZE + $NVDIMM_SIZE here.

 - "object memory-backend-file,id=mem1,share=on,mem-path=$PATH,
   size=$NVDIMM_SIZE,readonly=off" creates a backend storage of size
   $NVDIMM_SIZE on a file $PATH. All accesses to the virtual NVDIMM device go
   to the file $PATH.

   "share=on/off" controls the visibility of guest writes. If
   "share=on", then guest writes will be applied to the backend
   file. If another guest uses the same backend file with option
   "share=on", then the above writes will be visible to it as well. If
   "share=off", then guest writes won't be applied to the backend
   file and thus will be invisible to other guests.

   "readonly=on/off" controls whether the file $PATH is opened read-only or
   read/write (default).

 - "device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off" creates a read/write
   virtual NVDIMM device whose storage is provided by the above memory backend
   device.

   "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
   State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
   persistent writes. Linux guest drivers set the device to read-only when this
   bit is present. Set unarmed to on when the memdev has readonly=on.

Multiple vNVDIMM devices can be created if multiple pairs of "-object"
and "-device" are provided.

For the above command line options, if the guest OS has the proper NVDIMM
driver (e.g. "CONFIG_ACPI_NFIT=y" under Linux), it should be able to
detect an NVDIMM device which is in the persistent memory mode and whose
size is $NVDIMM_SIZE.

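As an illustration of the "multiple pairs" rule and of the slots/maxmem
constraints, the following shell sketch prints the options for one RAM
device plus several equally sized vNVDIMM devices. The helper name
nvdimm_opts, the file paths and the use of MiB units are assumptions made
for this example only; they are not part of QEMU.

```shell
#!/bin/sh
# Hypothetical helper: print QEMU options for normal RAM plus N vNVDIMMs.
# Usage: nvdimm_opts RAM_MIB NVDIMM_MIB FILE...
# Sizes are in MiB; every vNVDIMM gets the same size to keep the sketch short.
nvdimm_opts() {
    ram_mib=$1
    nvdimm_mib=$2
    shift 2

    count=$#
    # One slot for normal RAM plus one per vNVDIMM device.
    slots=$((count + 1))
    # maxmem must cover normal RAM plus all vNVDIMM devices.
    maxmem_mib=$((ram_mib + count * nvdimm_mib))

    printf '%s\n' "-machine pc,nvdimm"
    printf '%s\n' "-m ${ram_mib}M,slots=${slots},maxmem=${maxmem_mib}M"

    # One "-object"/"-device" pair per backend file.
    i=1
    for file in "$@"; do
        printf '%s\n' "-object memory-backend-file,id=mem${i},share=on,mem-path=${file},size=${nvdimm_mib}M"
        printf '%s\n' "-device nvdimm,id=nvdimm${i},memdev=mem${i}"
        i=$((i + 1))
    done
}

# Example paths only: 4 GiB of RAM and two 1 GiB vNVDIMM devices.
nvdimm_opts 4096 1024 /tmp/nvdimm1.img /tmp/nvdimm2.img
```

The printed values are the minimum that satisfies the rules above; larger
"slots" and "maxmem" values leave room for later hotplug.
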
Note:

1. Prior to QEMU v2.8.0, if memory-backend-file is used and the actual
   backend file size is not equal to the size given by the "size" option,
   QEMU will truncate the backend file by ftruncate(2), which will
   corrupt the existing data in the backend file, especially in the
   shrink case.

   QEMU v2.8.0 and later check the backend file size against the "size"
   option. If they do not match, QEMU will report an error and abort in
   order to avoid data corruption.

2. QEMU v2.6.0 only puts a basic alignment requirement on the "size"
   option of memory-backend-file, e.g. 4KB alignment on x86. However,
   QEMU v2.7.0 puts an additional alignment requirement, which may
   require a larger value than the basic one, e.g. 2MB on x86. This
   change breaks the usage of memory-backend-file that only satisfies
   the basic alignment.

   QEMU v2.8.0 and later remove the additional alignment on non-s390x
   architectures, so the broken memory-backend-file can work again.

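Both notes can be checked before QEMU is started. The following shell
sketch is one possible pre-flight check for a regular backend file; the
function names are hypothetical, and the 2 MiB value is only the x86
example quoted above, not a value queried from QEMU.

```shell
#!/bin/sh
# Hypothetical check: succeed only if FILE is a regular file with exactly
# SIZE_BYTES bytes, i.e. the value that will be passed as "size=".
backend_size_matches() {
    file=$1
    size_bytes=$2
    [ -f "$file" ] || return 2
    # wc -c is used because stat(1) flags differ between systems.
    actual=$(wc -c < "$file")
    [ "$((actual))" -eq "$size_bytes" ]
}

# Hypothetical check: succeed if SIZE_BYTES is a multiple of ALIGN_BYTES.
size_is_aligned() {
    [ $(($1 % $2)) -eq 0 ]
}

# Demonstration on a throwaway 2 MiB file.
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=4096 count=512 2>/dev/null
two_mib=$((2 * 1024 * 1024))
if backend_size_matches "$tmp" "$two_mib" && size_is_aligned "$two_mib" "$two_mib"; then
    echo "backend file is safe to use with size=$two_mib"
fi
rm -f "$tmp"
```

Running such a check matters most with QEMU versions before v2.8.0, where
a mismatch leads to silent truncation rather than an error.
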
Label
-----

QEMU v2.7.0 and later implement label support for vNVDIMM devices.
To enable labels on a vNVDIMM device, users can simply add the
"label-size=$SZ" option to "-device nvdimm", e.g.

    -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K

Note:

1. The minimal label size is 128KB.

2. QEMU v2.7.0 and later store labels at the end of backend storage.
   If a memory backend file, which was previously used as the backend
   of a vNVDIMM device without labels, is now used for a vNVDIMM
   device with labels, the data in the label area at the end of the file
   will be inaccessible to the guest. If any useful data (e.g. the
   meta-data of the file system) was stored there, the latter usage
   may result in guest data corruption (e.g. breakage of the guest file
   system).

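Because labels are stored at the end of the backend storage, the area
left for guest data shrinks by the label size. The arithmetic sketch
below assumes that the data area is simply the backend size minus the
label size; the helper name is hypothetical, and any further rounding
that QEMU may apply is not modeled.

```shell
#!/bin/sh
# Hypothetical helper: print the number of bytes left for guest data when a
# label area of LABEL_BYTES is placed at the end of a SIZE_BYTES backend.
# The printed value is also the offset at which the label area starts.
label_data_bytes() {
    size_bytes=$1
    label_bytes=$2
    min_label=$((128 * 1024))   # minimal label size stated above

    if [ "$label_bytes" -lt "$min_label" ]; then
        echo "label-size must be at least 128K" >&2
        return 1
    fi
    if [ "$label_bytes" -ge "$size_bytes" ]; then
        echo "label area does not fit into the backend" >&2
        return 1
    fi
    echo $((size_bytes - label_bytes))
}

# 4 GiB backend with the minimal 128 KiB label area.
label_data_bytes $((4 * 1024 * 1024 * 1024)) $((128 * 1024))
```

Anything an older, label-less guest stored beyond the printed offset is
the data that becomes inaccessible once labels are enabled.
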
Hotplug
-------

QEMU v2.8.0 and later implement hotplug support for vNVDIMM
devices. Similarly to RAM hotplug, vNVDIMM hotplug is
accomplished by two monitor commands, "object_add" and "device_add".

For example, the following commands add another 4GB vNVDIMM device to
the guest:

    (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=new_nvdimm.img,size=4G
    (qemu) device_add nvdimm,id=nvdimm2,memdev=mem2

Note:

1. Each hotplugged vNVDIMM device consumes one memory slot. Users
   should always ensure the memory option "-m ...,slots=N" specifies
   enough slots, i.e.
     N >= number of RAM devices +
          number of statically plugged vNVDIMM devices +
          number of hotplugged vNVDIMM devices

2. A similar constraint applies to the memory option "-m ...,maxmem=M", i.e.
     M >= size of RAM devices +
          size of statically plugged vNVDIMM devices +
          size of hotplugged vNVDIMM devices

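The two inequalities can be turned into a quick budget check before
issuing "object_add" and "device_add". The sketch below is only an
illustration: can_hotplug is a hypothetical helper, sizes are assumed to
be given in MiB, and the caller is assumed to know how many slots and how
much memory are already in use.

```shell
#!/bin/sh
# Hypothetical check: does one more vNVDIMM of NEW_MIB fit into the limits
# given by "-m ...,slots=SLOTS,maxmem=MAXMEM_MIB"?
# USED_SLOTS and USED_MIB count the RAM devices and vNVDIMMs already present.
can_hotplug() {
    slots=$1
    used_slots=$2
    maxmem_mib=$3
    used_mib=$4
    new_mib=$5
    [ $((used_slots + 1)) -le "$slots" ] &&
        [ $((used_mib + new_mib)) -le "$maxmem_mib" ]
}

# Guest started with "-m 4G,slots=3,maxmem=12G" and one 4G vNVDIMM:
# 2 slots and 8192 MiB are in use, so one more 4G device still fits.
if can_hotplug 3 2 12288 8192 4096; then
    echo "a 4G vNVDIMM can still be hotplugged"
fi
```
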
Alignment
---------

QEMU uses mmap(2) to map vNVDIMM backends and aligns the mapping
address to the page size (getpagesize(2)) by default. However, some
types of backends may require an alignment different from the page
size. In that case, QEMU v2.12.0 and later provide the 'align' option of
memory-backend-file to allow users to specify the proper alignment.
For device dax (e.g., /dev/dax0.0), this alignment needs to match the
alignment requirement of the device dax. The NUM of the 'align=NUM' option
must be larger than or equal to the 'align' of the device dax.
One of the following commands can be used to show the 'align' of a device dax:

    ndctl list -X
    daxctl list -R

In order to get the proper 'align' of device dax, you need to install
the library 'libdaxctl'.

For example, if a device dax requires 2 MB alignment, the
following QEMU command line options use it (/dev/dax0.0) as the
backend of a vNVDIMM:

    -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M
    -device nvdimm,id=nvdimm1,memdev=mem1

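The rule that 'align=NUM' must not be smaller than the 'align' of the
device dax can be enforced when the option string is generated. In the
sketch below, dax_backend_opt is a hypothetical helper and the device
alignment is passed in by the caller (for example after reading it from
the ndctl or daxctl output); all numeric arguments are assumed to be in
bytes.

```shell
#!/bin/sh
# Hypothetical helper: print the -object option for a device dax backend,
# refusing an 'align' that is smaller than the alignment of the device dax.
# Usage: dax_backend_opt DEV SIZE_BYTES ALIGN_BYTES DEV_ALIGN_BYTES
dax_backend_opt() {
    dev=$1
    size_bytes=$2
    align_bytes=$3
    dev_align_bytes=$4
    if [ "$align_bytes" -lt "$dev_align_bytes" ]; then
        echo "align=$align_bytes is smaller than device align $dev_align_bytes" >&2
        return 1
    fi
    printf '%s\n' "-object memory-backend-file,id=mem1,share=on,mem-path=$dev,size=$size_bytes,align=$align_bytes"
}

# 4 GiB backend on a device dax that reports a 2 MiB alignment.
dax_backend_opt /dev/dax0.0 4294967296 2097152 2097152
```
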
Guest Data Persistence
----------------------

Though QEMU supports multiple types of vNVDIMM backends on Linux,
the only backends that can guarantee guest write persistence are:

A. DAX device (e.g., /dev/dax0.0), or
B. DAX file (on a file system mounted with the dax option)

When using B (a file supporting direct mapping of persistent memory)
as a backend, write persistence is guaranteed if the host kernel has
support for the MAP_SYNC flag in the mmap system call (available
since Linux 4.15 and on certain distro kernels) and additionally
both 'pmem' and 'share' flags are set to 'on' on the backend.

If these conditions are not satisfied, i.e. if either 'pmem' or 'share'
is not set, if the backend file does not support DAX, or if MAP_SYNC
is not supported by the host kernel, write persistence is not
guaranteed after a system crash. For compatibility reasons, QEMU does
not treat unsatisfied conditions as an error and starts anyway.
Currently, no way is provided to test for them.
For more details, see the mmap(2) man page:
http://man7.org/linux/man-pages/man2/mmap.2.html.

When using other types of backends, it's suggested to set the 'unarmed'
option of '-device nvdimm' to 'on', which sets the unarmed flag of the
guest NVDIMM region mapping structure. This unarmed flag indicates to
guest software that this vNVDIMM device contains a region that cannot
accept persistent writes. As a result, for example, the guest Linux
NVDIMM driver marks such a vNVDIMM device as read-only.

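The recommendations in this section can be summarized as a small option
selector. The sketch below is illustrative only: nvdimm_persist_opts and
its backend kinds ("devdax", "fsdax", "other") are names invented for
this example, and the caller is assumed to know which kind of backend a
path refers to. Note that 'pmem=on' additionally needs a QEMU built with
libpmem support.

```shell
#!/bin/sh
# Hypothetical helper: print backend and device options for a backend kind.
#   devdax - DAX device, e.g. /dev/dax0.0
#   fsdax  - file on a file system mounted with the dax option
#   other  - any other backend, which cannot guarantee write persistence
# Usage: nvdimm_persist_opts KIND PATH SIZE
nvdimm_persist_opts() {
    kind=$1
    path=$2
    size=$3
    backend="memory-backend-file,id=mem1,share=on,mem-path=$path,size=$size"
    device="nvdimm,id=nvdimm1,memdev=mem1"
    case $kind in
        devdax) ;;
        # A DAX file needs both pmem=on and share=on (share=on is set above).
        fsdax)  backend="$backend,pmem=on" ;;
        # No persistence guarantee: tell the guest via the unarmed flag.
        other)  device="$device,unarmed=on" ;;
        *)      echo "unknown backend kind: $kind" >&2
                return 1 ;;
    esac
    printf '%s\n' "-object $backend" "-device $device"
}

# Example path only: a regular file that is not on a DAX file system.
nvdimm_persist_opts other /var/tmp/nvdimm.img 4G
```
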
Backend File Setup Example
--------------------------

Here are two examples showing how to set up these persistent backends on
Linux using the tool ndctl [3].

A. DAX device

Use the following command to set up /dev/dax0.0 so that the entirety of
namespace0.0 can be exposed as an emulated NVDIMM to the guest:

    ndctl create-namespace -f -e namespace0.0 -m devdax

/dev/dax0.0 can then be used directly in the "mem-path" option.

B. DAX file

Individual files on a DAX host file system can be exposed as emulated
NVDIMMs. First an fsdax block device is created, partitioned, and then
mounted with the "dax" mount option:

    ndctl create-namespace -f -e namespace0.0 -m fsdax
    (partition /dev/pmem0 with name pmem0p1)
    mount -o dax /dev/pmem0p1 /mnt
    (create or copy a disk image file with qemu-img(1), cp(1), or dd(1)
     in /mnt)

Then the new file in /mnt can be used in the "mem-path" option.

NVDIMM Persistence
------------------

ACPI 6.2 Errata A added support for a new Platform Capabilities Structure
which allows the platform to communicate what features it supports related to
NVDIMM data persistence. Users can provide a persistence value to a guest via
the optional "nvdimm-persistence" machine command line option:

    -machine pc,accel=kvm,nvdimm,nvdimm-persistence=cpu

There are currently two valid values for this option:

"mem-ctrl" - The platform supports flushing dirty data from the memory
             controller to the NVDIMMs in the event of power loss.

"cpu"      - The platform supports flushing dirty data from the CPU cache to
             the NVDIMMs in the event of power loss. This implies that the
             platform also supports flushing dirty data through the memory
             controller on power loss.

If the vNVDIMM backend is in host persistent memory that can be accessed
through the SNIA NVM Programming Model [1] (e.g., Intel NVDIMM), it's
suggested to set the 'pmem' option of memory-backend-file to 'on'. When
'pmem' is 'on' and QEMU is built with libpmem [2] support (configured with
--enable-libpmem), QEMU will take the necessary operations to guarantee the
persistence of its own writes to the vNVDIMM backend (e.g., in vNVDIMM label
emulation and live migration).
If 'pmem' is 'on' while there is no libpmem support, QEMU will exit and report
a "lack of libpmem support" message, so that missing persistence support is
not silently ignored.
For example, to ensure persistence for some backend file,
use the QEMU command line:

    -object memory-backend-file,id=nv_mem,mem-path=/XXX/yyy,size=4G,pmem=on

References
----------

[1] NVM Programming Model (NPM)
    Version 1.2
    https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf
[2] Persistent Memory Development Kit (PMDK), formerly known as NVML project, home page:
    http://pmem.io/pmdk/
[3] ndctl-create-namespace - provision or reconfigure a namespace
    http://pmem.io/ndctl/ndctl-create-namespace.html