=======================
 Zoned Storage Support
=======================

http://zonedstorage.io

Zoned Storage is a class of storage devices that enables host and storage
devices to cooperate to achieve higher storage capacities, increased
throughput, and lower latencies. The zoned storage interface is available
through the SCSI Zoned Block Commands (ZBC) and Zoned Device ATA Command Set
(ZAC) standards on Shingled Magnetic Recording (SMR) hard disks today and is
also being adopted for NVMe Solid State Disks with the upcoming NVMe Zoned
Namespaces (ZNS) standard.

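On Linux, applications reach this interface through the kernel's zoned block
device ioctls. As a rough illustration only (this is not Ceph code), the
sketch below reports the first few zones of a host-managed drive with the
BLKREPORTZONE ioctl; the device path and the number of zones requested are
placeholders.

::

  // Rough illustration only (not Ceph code): list the first few zones of a
  // zoned block device using the Linux BLKREPORTZONE ioctl.  The device
  // path and zone count below are placeholders.
  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <linux/blkzoned.h>

  #include <cstdio>
  #include <cstdlib>

  int main(int argc, char **argv) {
    const char *dev = argc > 1 ? argv[1] : "/dev/sdc";
    int fd = open(dev, O_RDONLY);
    if (fd < 0) {
      perror("open");
      return 1;
    }

    // The report buffer is a blk_zone_report header followed by an array of
    // blk_zone descriptors; ask for the first eight zones.
    const unsigned int nr_zones = 8;
    size_t len = sizeof(blk_zone_report) + nr_zones * sizeof(blk_zone);
    auto *report = static_cast<blk_zone_report *>(calloc(1, len));
    report->sector = 0;           // start reporting at the beginning of the disk
    report->nr_zones = nr_zones;  // capacity of the descriptor array

    if (ioctl(fd, BLKREPORTZONE, report) < 0) {
      perror("BLKREPORTZONE");
      return 1;
    }

    // All values are in 512-byte sectors.  On a sequential-write-required
    // zone, writes must land exactly at the write pointer (wp).
    for (unsigned int i = 0; i < report->nr_zones; ++i) {
      const blk_zone &z = report->zones[i];
      printf("zone %u: start=%llu len=%llu wp=%llu cond=0x%x\n", i,
             (unsigned long long)z.start, (unsigned long long)z.len,
             (unsigned long long)z.wp, (unsigned int)z.cond);
    }

    free(report);
    close(fd);
    return 0;
  }
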
This project aims to enable Ceph to work on zoned storage drives and, at the
same time, to explore research problems related to adopting this new
interface. The first target is to enable non-overwrite workloads (e.g. RGW) on
host-managed SMR (HM-SMR) drives and to explore cleaning (garbage collection)
policies. HM-SMR drives are high-capacity hard drives with the ZBC/ZAC
interface. The longer-term goal is to support ZNS SSDs, as they become
available, as well as overwrite workloads.

The first patch in this series enabled writing data to HM-SMR drives. This
patch introduces ZonedFreelistManager, a FreelistManager implementation that
passes enough information to ZonedAllocator to correctly initialize the state
of zones by tracking the write pointer and the number of dead bytes per zone.
We have to introduce a new FreelistManager implementation because, on zoned
devices, a region of the disk can be in one of three states (empty, used, and
dead), whereas the current BitmapFreelistManager tracks only two states (empty
and used). It is not possible to accurately initialize the state of zones in
ZonedAllocator by tracking only two states. The third planned patch will
introduce a rudimentary cleaner to form a baseline for further research.

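To make the three-state accounting concrete, the sketch below shows the kind
of per-zone bookkeeping described above. It is an illustration only (the class
and member names are invented, not the actual ZonedFreelistManager or
ZonedAllocator code): tracking the write pointer and the number of dead bytes
per zone is enough to derive the empty, used, and dead byte counts, which a
two-state bitmap cannot represent.

::

  // Illustrative sketch only; names are invented and this is not the actual
  // ZonedFreelistManager/ZonedAllocator code.  Tracking the write pointer and
  // the number of dead bytes per zone is enough to recover all three region
  // states (empty, used, dead), which a two-state bitmap cannot express.
  #include <cstdint>
  #include <vector>

  struct ZoneState {
    uint64_t write_pointer = 0;   // bytes appended to the zone so far
    uint64_t num_dead_bytes = 0;  // bytes written but no longer referenced
  };

  class ZoneTracker {
   public:
    ZoneTracker(uint64_t num_zones, uint64_t zone_size)
        : zone_size_(zone_size), zones_(num_zones) {}

    // Writes to a zoned device always land at the zone's write pointer.
    void note_write(uint64_t zone, uint64_t length) {
      zones_[zone].write_pointer += length;
    }

    // Deleting or overwriting data turns previously used bytes into dead
    // bytes; they are reclaimed only when the zone is cleaned and reset.
    void note_release(uint64_t zone, uint64_t length) {
      zones_[zone].num_dead_bytes += length;
    }

    // The three states are derived from the two tracked quantities.
    uint64_t empty_bytes(uint64_t zone) const {
      return zone_size_ - zones_[zone].write_pointer;
    }
    uint64_t used_bytes(uint64_t zone) const {
      return zones_[zone].write_pointer - zones_[zone].num_dead_bytes;
    }
    uint64_t dead_bytes(uint64_t zone) const {
      return zones_[zone].num_dead_bytes;
    }

   private:
    uint64_t zone_size_;
    std::vector<ZoneState> zones_;
  };
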
Currently we can perform basic RADOS benchmarks on an OSD running on an
HM-SMR drive, restart the OSD, read back the written data, and write new
data, as shown below.

Please contact Abutalib Aghayev <agayev@psu.edu> for questions.

::

  $ sudo zbd report -i -n /dev/sdc
  Device /dev/sdc:
      Vendor ID: ATA HGST HSH721414AL T240
      Zone model: host-managed
      Capacity: 14000.520 GB (27344764928 512-bytes sectors)
      Logical blocks: 3418095616 blocks of 4096 B
      Physical blocks: 3418095616 blocks of 4096 B
      Zones: 52156 zones of 256.0 MB
      Maximum number of open zones: no limit
      Maximum number of active zones: no limit
  52156 / 52156 zones
  $ MON=1 OSD=1 MDS=0 sudo ../src/vstart.sh --new --localhost --bluestore --bluestore-devs /dev/sdc --bluestore-zoned
  <snipped verbose output>
  $ sudo ./bin/ceph osd pool create bench 32 32
  pool 'bench' created
  $ sudo ./bin/rados bench -p bench 10 write --no-cleanup
  hints = 1
  Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
  Object prefix: benchmark_data_h0.cc.journaling712.narwhal.p_29846
    sec Cur ops  started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
      0       0        0         0         0         0            -           0
      1      16       45        29   115.943       116     0.384175    0.407806
      2      16       86        70   139.949       164     0.259845    0.391488
      3      16      125       109   145.286       156      0.31727    0.404727
      4      16      162       146   145.953       148     0.826671    0.409003
      5      16      203       187   149.553       164      0.44815    0.404303
      6      16      242       226   150.621       156     0.227488    0.409872
      7      16      281       265   151.384       156     0.411896    0.408686
      8      16      320       304   151.956       156     0.435135    0.411473
      9      16      359       343   152.401       156     0.463699    0.408658
     10      15      396       381   152.356       152     0.409554    0.410851
  Total time run:         10.3305
  Total writes made:      396
  Write size:             4194304
  Object size:            4194304
  Bandwidth (MB/sec):     153.333
  Stddev Bandwidth:       13.6561
  Max bandwidth (MB/sec): 164
  Min bandwidth (MB/sec): 116
  Average IOPS:           38
  Stddev IOPS:            3.41402
  Max IOPS:               41
  Min IOPS:               29
  Average Latency(s):     0.411226
  Stddev Latency(s):      0.180238
  Max latency(s):         1.00844
  Min latency(s):         0.108616
  $ sudo ../src/stop.sh
  $ # Notice the lack of "--new" parameter to vstart.sh
  $ MON=1 OSD=1 MDS=0 sudo ../src/vstart.sh --localhost --bluestore --bluestore-devs /dev/sdc --bluestore-zoned
  <snipped verbose output>
  $ sudo ./bin/rados bench -p bench 10 rand
  hints = 1
    sec Cur ops  started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
      0       0        0         0         0         0            -           0
      1      16       61        45   179.903       180     0.117329    0.244067
      2      16      116       100   199.918       220     0.144162    0.292305
      3      16      174       158   210.589       232     0.170941    0.285481
      4      16      251       235   234.918       308     0.241175    0.256543
      5      16      316       300   239.914       260     0.206044    0.255882
      6      15      392       377   251.206       308     0.137972    0.247426
      7      15      458       443   252.984       264    0.0800146    0.245138
      8      16      529       513   256.346       280     0.103529    0.239888
      9      16      587       571   253.634       232     0.145535      0.2453
     10      15      646       631   252.254       240     0.837727    0.246019
  Total time run:         10.272
  Total reads made:       646
  Read size:              4194304
  Object size:            4194304
  Bandwidth (MB/sec):     251.558
  Average IOPS:           62
  Stddev IOPS:            10.005
  Max IOPS:               77
  Min IOPS:               45
  Average Latency(s):     0.249385
  Max latency(s):         0.888654
  Min latency(s):         0.0103208
  $ sudo ./bin/rados bench -p bench 10 write --no-cleanup
  hints = 1
  Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
  Object prefix: benchmark_data_h0.aa.journaling712.narwhal.p_64416
    sec Cur ops  started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
      0       0        0         0         0         0            -           0
      1      16       46        30   119.949       120      0.52627    0.396166
      2      16       82        66   131.955       144      0.48087    0.427311
      3      16      123       107   142.627       164       0.3287    0.420614
      4      16      158       142   141.964       140     0.405177    0.425993
      5      16      192       176   140.766       136     0.514565    0.425175
      6      16      224       208   138.635       128      0.69184    0.436672
      7      16      261       245   139.967       148     0.459929    0.439502
      8      16      301       285   142.468       160     0.250846    0.434799
      9      16      336       320   142.189       140     0.621686    0.435457
     10      16      374       358   143.166       152     0.460593    0.436384