======================
Capabilities in CephFS
======================
When a client wants to operate on an inode, it queries the MDS in various
ways, and the MDS then grants the client a set of **capabilities**. These
give the client permission to operate on the inode in various ways. One
of the major differences from other network file systems (e.g., NFS or SMB) is
that the capabilities granted are quite granular, and multiple clients can
hold different capabilities on the same inode.

Types of Capabilities
---------------------
There are several "generic" capability bits. These denote what sort of ability
the capability grants.

::

    /* generic cap bits */
    #define CEPH_CAP_GSHARED     1  /* (metadata) client can read (s) */
    #define CEPH_CAP_GEXCL       2  /* (metadata) client can read and update (x) */
    #define CEPH_CAP_GCACHE      4  /* (file) client can cache reads (c) */
    #define CEPH_CAP_GRD         8  /* (file) client can read (r) */
    #define CEPH_CAP_GWR        16  /* (file) client can write (w) */
    #define CEPH_CAP_GBUFFER    32  /* (file) client can buffer writes (b) */
    #define CEPH_CAP_GWREXTEND  64  /* (file) client can extend EOF (a) */
    #define CEPH_CAP_GLAZYIO   128  /* (file) client can perform lazy io (l) */

These are then shifted by a particular number of bits. The shift selects the
part of the inode's data or metadata on which the capability is being granted:

::

    /* per-lock shift */
    #define CEPH_CAP_SAUTH      2 /* A */
    #define CEPH_CAP_SLINK      4 /* L */
    #define CEPH_CAP_SXATTR     6 /* X */
    #define CEPH_CAP_SFILE      8 /* F */

Only certain generic cap types are ever granted for some of those "shifts",
however. In particular, only the FILE shift ever has more than the first two
bits.

::

    | AUTH | LINK | XATTR | FILE
    2      4      6       8
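
The field layout above can be read back out of a capability word by shifting
and masking. The following is a minimal sketch, not code from the Ceph tree:
the ``EXAMPLE_*`` masks and the two helper functions are hypothetical names
introduced here, and the field widths (two bits for AUTH, LINK and XATTR,
eight bits for FILE) are inferred from the shift values listed above.

```c
#include <assert.h>

/* per-lock shifts, copied from the listing above */
#define CEPH_CAP_SAUTH   2
#define CEPH_CAP_SLINK   4
#define CEPH_CAP_SXATTR  6
#define CEPH_CAP_SFILE   8

/* Hypothetical helper masks (not from the Ceph headers). */
#define EXAMPLE_META_FIELD_MASK 0x3u   /* AUTH, LINK, XATTR: two bits each */
#define EXAMPLE_FILE_FIELD_MASK 0xffu  /* FILE: eight bits */

/* The two-bit width assumed above follows from the gap between shifts. */
static_assert(CEPH_CAP_SLINK - CEPH_CAP_SAUTH == 2, "AUTH field is two bits");
static_assert(CEPH_CAP_SXATTR - CEPH_CAP_SLINK == 2, "LINK field is two bits");
static_assert(CEPH_CAP_SFILE - CEPH_CAP_SXATTR == 2, "XATTR field is two bits");

/* Return the generic cap bits held in one of the two-bit metadata fields.
 * 'shift' is expected to be CEPH_CAP_SAUTH, CEPH_CAP_SLINK or CEPH_CAP_SXATTR. */
static unsigned example_meta_bits(unsigned caps, unsigned shift)
{
    return (caps >> shift) & EXAMPLE_META_FIELD_MASK;
}

/* Return the generic cap bits held in the FILE field. */
static unsigned example_file_bits(unsigned caps)
{
    return (caps >> CEPH_CAP_SFILE) & EXAMPLE_FILE_FIELD_MASK;
}
```

For instance, a word with generic bit 1 set in the AUTH field and generic
bit 8 set in the FILE field yields 1 from
``example_meta_bits(caps, CEPH_CAP_SAUTH)`` and 8 from ``example_file_bits(caps)``.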

From the above, we get a number of constants, which are generated by taking
each bit value and shifting it to the correct position in the word:

::

    #define CEPH_CAP_AUTH_SHARED  (CEPH_CAP_GSHARED  << CEPH_CAP_SAUTH)

These bits can then be OR'ed together to make a bitmask denoting a set of
capabilities.
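
As a minimal sketch of how such a bitmask is built and tested, the block
below forms three FILE constants the same way as ``CEPH_CAP_AUTH_SHARED`` and
ORs them into one set. ``EXAMPLE_READER_CAPS`` and ``example_has_caps`` are
hypothetical names used only for illustration, and the ``CEPH_CAP_FILE_*``
names are assumed to follow the naming pattern of the constant shown above.

```c
#include <assert.h>

/* generic bits and shifts, copied from the listings above */
#define CEPH_CAP_GSHARED  1
#define CEPH_CAP_GCACHE   4
#define CEPH_CAP_GRD      8
#define CEPH_CAP_GWR     16
#define CEPH_CAP_SAUTH    2
#define CEPH_CAP_SFILE    8

#define CEPH_CAP_AUTH_SHARED (CEPH_CAP_GSHARED << CEPH_CAP_SAUTH)
/* Built the same way; names assumed by analogy with the line above. */
#define CEPH_CAP_FILE_SHARED (CEPH_CAP_GSHARED << CEPH_CAP_SFILE)
#define CEPH_CAP_FILE_CACHE  (CEPH_CAP_GCACHE  << CEPH_CAP_SFILE)
#define CEPH_CAP_FILE_RD     (CEPH_CAP_GRD     << CEPH_CAP_SFILE)

/* Compile-time check of the shift arithmetic: 1 << 2 == 4, 8 << 8 == 2048. */
static_assert(CEPH_CAP_AUTH_SHARED == 4, "AUTH shared is bit 2");
static_assert(CEPH_CAP_FILE_RD == 2048, "FILE read is bit 11");

/* Hypothetical example set: As plus Fs, Fc and Fr. */
#define EXAMPLE_READER_CAPS \
    (CEPH_CAP_AUTH_SHARED | CEPH_CAP_FILE_SHARED | \
     CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_RD)

/* Return nonzero only if every bit in 'wanted' is present in 'held'. */
static int example_has_caps(unsigned held, unsigned wanted)
{
    return (held & wanted) == wanted;
}
```

Testing with ``(held & wanted) == wanted`` rather than ``held & wanted``
matters when ``wanted`` contains more than one bit: the second form would
succeed if any single bit matched.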

There is one exception:

::

    #define CEPH_CAP_PIN  1  /* no specific capabilities beyond the pin */

The "pin" just pins the inode into memory, without granting any other caps.

Graphically:

::

    +---+---+---+---+---+---+---+---+
    | p | _ |As   x |Ls   x |Xs   x |
    +---+---+---+---+---+---+---+---+
    |Fs   x   c   r   w   b   a   l |
    +---+---+---+---+---+---+---+---+

The second bit is currently unused.

Abilities granted by each cap
-----------------------------
While that is how capabilities are granted (and communicated), the important
part is what they actually allow the client to do:

* **PIN**: this just pins the inode into memory. This is sufficient to allow
  the client to get to the inode number, as well as other immutable things like
  major or minor numbers in a device inode, or symlink contents.

* **AUTH**: this grants the ability to get to the authentication-related
  metadata, in particular the owner, group and mode. Note that doing a full
  permission check may require getting at ACLs as well, which are stored in
  xattrs.

* **LINK**: this grants access to the link count of the inode.

* **XATTR**: this grants the ability to access or manipulate xattrs. Note that
  since ACLs are stored in xattrs, it's also sometimes necessary to access them
  when checking permissions.

* **FILE**: this is the big one. It allows the client to access and manipulate
  file data. It also covers certain metadata relating to file data, in
  particular the size, mtime, atime and ctime.

Shorthand
---------
Note that the client logging can also present a compact representation of the
capabilities. For example:

::

    pAsLsXsFs

The 'p' represents the pin. Each capital letter corresponds to one of the shift
values, and the lowercase letters after it are the actual capabilities granted
in that shift.
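
The shorthand can be produced from a capability word by walking the fields in
order. The following is a minimal sketch under stated assumptions, not the
routine the Ceph client actually uses: ``example_put_field``,
``example_cap_string`` and ``EXAMPLE_CAP_STR_MAX`` are hypothetical names, the
lowercase letters are taken from the comments in the generic cap bit listing,
and a field with no bits set is assumed to print nothing.

```c
#include <assert.h>

#define CEPH_CAP_PIN     1
#define CEPH_CAP_SAUTH   2
#define CEPH_CAP_SLINK   4
#define CEPH_CAP_SXATTR  6
#define CEPH_CAP_SFILE   8

/* Longest output: 'p' + "Asx" + "Lsx" + "Xsx" + "Fsxcrwbal" + NUL = 20. */
#define EXAMPLE_CAP_STR_MAX 20

/* Append the capital tag for one field, then one lowercase letter per set
 * generic bit (letters[i] names bit i). Emits nothing for an empty field. */
static char *example_put_field(char *p, unsigned bits, char tag,
                               const char *letters)
{
    if (bits == 0)
        return p;
    *p++ = tag;
    for (unsigned i = 0; letters[i] != '\0'; i++)
        if (bits & (1u << i))
            *p++ = letters[i];
    return p;
}

/* Render 'caps' in the compact form; buf must hold EXAMPLE_CAP_STR_MAX bytes. */
static void example_cap_string(unsigned caps, char *buf)
{
    char *p = buf;

    if (caps & CEPH_CAP_PIN)
        *p++ = 'p';
    p = example_put_field(p, (caps >> CEPH_CAP_SAUTH) & 0x3u, 'A', "sx");
    p = example_put_field(p, (caps >> CEPH_CAP_SLINK) & 0x3u, 'L', "sx");
    p = example_put_field(p, (caps >> CEPH_CAP_SXATTR) & 0x3u, 'X', "sx");
    p = example_put_field(p, (caps >> CEPH_CAP_SFILE) & 0xffu, 'F', "sxcrwbal");
    *p = '\0';
    assert(p - buf < EXAMPLE_CAP_STR_MAX);
}
```

Under these assumptions, a word holding the pin plus the shared bit in each of
the four fields (1 + 4 + 16 + 64 + 256 = 341) renders as ``pAsLsXsFs``, the
string shown above.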

The relation between the lock states and the capabilities
---------------------------------------------------------
In the MDS there are four different locks for each inode: simplelock,
scatterlock, filelock and locallock. Each lock has several different lock
states, and the MDS issues capabilities to clients based on the lock
state.

In each state the MDS Locker always tries to issue all the capabilities that
the clients are allowed, even if some of them are not needed or wanted by the
clients, as pre-issuing capabilities can reduce latency in some cases.

If there is only one client, it is usually the loner client for all the inodes.
With multiple clients, the MDS tries to pick a loner client for each inode
based on the capabilities the clients (needed | wanted), but this usually
fails. The loner client always gets all the capabilities.

The filelock controls access to part of a file's metadata and to the file
contents. The metadata includes **mtime**, **atime**, **size**, etc.

* **Fs**: Once a client has it, all other clients are denied **Fw**.

* **Fx**: Only the loner client is allowed this capability. Once the lock state
  transitions to LOCK_EXCL, the loner client is granted this along with all other
  file capabilities except **Fl**.

* **Fr**: Once a client has it, the **Fb** capability will already have been
  revoked from all the other clients.

  If clients only request to read the file, the lock state transitions
  directly to the LOCK_SYNC stable state. All the clients can be granted
  **Fscrl** capabilities from the auth MDS and **Fscr** capabilities from the
  replica MDSes.

  If multiple clients read from and write to the same file, the lock state
  eventually transitions to the LOCK_MIX stable state, and all the clients can
  have the **Frwl** capabilities from the auth MDS and **Fr** from the replica
  MDSes. The **Fcb** capabilities are not granted to any of the clients, and
  the clients do synchronous reads and writes.

* **Fw**: If there is no loner client, then once a client has this capability,
  the **Fsxcb** capabilities are not granted to other clients.

  If multiple clients read from and write to the same file, the lock state
  eventually transitions to the LOCK_MIX stable state, and all the clients can
  have the **Frwl** capabilities from the auth MDS and **Fr** from the replica
  MDSes. The **Fcb** capabilities are not granted to any of the clients, and
  the clients do synchronous reads and writes.

* **Fc**: This capability means the clients can cache file reads. It should be
  issued together with the **Fr** capability, as it only makes sense in that
  case.

  In practice, however, the MDS tends to keep **Fc** allowed in some stable or
  interim transitional states even when **Fr** is not granted, as this avoids
  forcing clients to drop their full caches, for example on a simple file size
  extension or truncation.

* **Fb**: This capability means the clients can buffer file writes. It should
  be issued together with the **Fw** capability, as it only makes sense in that
  case.

  In practice, however, the MDS tends to keep **Fb** allowed in some stable or
  interim transitional states even when **Fw** is not granted, as this avoids
  forcing clients to drop dirty buffers, for example on a simple file size
  extension or truncation.

* **Fl**: This capability means the clients can perform lazy IO. LazyIO relaxes
  POSIX semantics: buffered reads and writes are allowed even when a file is
  opened by multiple applications on multiple clients. Applications are
  responsible for managing cache coherency themselves.
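
The stable-state rules in the list above can be summarized as a small lookup.
The following is a deliberately simplified sketch, not the MDS Locker logic:
the enum, the function name and its parameters are hypothetical, only the
three stable states named in the text are covered, and transitional states
are ignored. The text does not say what a non-loner client may hold in
LOCK_EXCL, so the sketch returns an empty set there as a placeholder.

```c
#include <assert.h>
#include <stdbool.h>

/* generic cap bits, copied from the listing earlier in this document */
#define CEPH_CAP_GSHARED     1  /* s */
#define CEPH_CAP_GEXCL       2  /* x */
#define CEPH_CAP_GCACHE      4  /* c */
#define CEPH_CAP_GRD         8  /* r */
#define CEPH_CAP_GWR        16  /* w */
#define CEPH_CAP_GBUFFER    32  /* b */
#define CEPH_CAP_GWREXTEND  64  /* a */
#define CEPH_CAP_GLAZYIO   128  /* l */

/* Hypothetical names for the three stable filelock states in the text. */
enum example_lock_state {
    EXAMPLE_LOCK_SYNC,
    EXAMPLE_LOCK_MIX,
    EXAMPLE_LOCK_EXCL
};

#define EXAMPLE_ALL_FILE_BITS 0xffu
static_assert((CEPH_CAP_GSHARED | CEPH_CAP_GEXCL | CEPH_CAP_GCACHE |
               CEPH_CAP_GRD | CEPH_CAP_GWR | CEPH_CAP_GBUFFER |
               CEPH_CAP_GWREXTEND | CEPH_CAP_GLAZYIO) == EXAMPLE_ALL_FILE_BITS,
              "the eight generic bits fill the FILE field");

/* Return the generic FILE bits (unshifted) a client may be granted. */
static unsigned example_file_caps_allowed(enum example_lock_state state,
                                          bool from_auth_mds, bool is_loner)
{
    switch (state) {
    case EXAMPLE_LOCK_SYNC:
        /* Fscrl from the auth MDS, Fscr from a replica. */
        return CEPH_CAP_GSHARED | CEPH_CAP_GCACHE | CEPH_CAP_GRD |
               (from_auth_mds ? CEPH_CAP_GLAZYIO : 0);
    case EXAMPLE_LOCK_MIX:
        /* Frwl from the auth MDS, Fr from a replica; never Fc or Fb. */
        return CEPH_CAP_GRD |
               (from_auth_mds ? (CEPH_CAP_GWR | CEPH_CAP_GLAZYIO) : 0);
    case EXAMPLE_LOCK_EXCL:
        /* The loner gets every file cap except Fl; others: placeholder. */
        return is_loner ? (EXAMPLE_ALL_FILE_BITS & ~(unsigned)CEPH_CAP_GLAZYIO)
                        : 0;
    }
    return 0;
}
```

Reading the table this way makes one property from the text easy to check:
in the MIX state neither the cache bit nor the buffer bit is ever present, so
clients fall back to synchronous reads and writes.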