The Linux Journalling API
=========================

Overview
--------

Details
~~~~~~~

The journalling layer is easy to use. You first need to create a
journal_t data structure. There are two calls to do this, depending on
how you decide to allocate the physical media on which the journal
resides. The jbd2_journal_init_inode() call is for journals stored in
filesystem inodes, while jbd2_journal_init_dev() is for a journal
stored on a raw device (in a contiguous range of blocks). Both return a
pointer to a journal_t (a typedef for struct journal_s), so when you are
finally finished make sure you call jbd2_journal_destroy() on it to free
up any used kernel memory.

Once you have your journal_t object you need to 'mount' or load the
journal file. The journalling layer expects the space for the journal
to have been allocated and initialized properly by the userspace tools.
When loading the journal you must call jbd2_journal_load() to process
the journal contents. If the client file system detects that the journal
contents do not need to be processed (or need not even be valid), it
may call jbd2_journal_wipe() to clear the journal contents before
calling jbd2_journal_load().

Note that jbd2_journal_wipe(..,0) calls
jbd2_journal_skip_recovery() for you if it detects any outstanding
transactions in the journal, and similarly jbd2_journal_load() will
call jbd2_journal_recover() if necessary. I would advise reading
ext4_load_journal() in fs/ext4/super.c for an example of this stage.

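The load sequence described above can be sketched roughly as follows.
This is a userspace model, not kernel code: the struct definitions and
function bodies are stand-ins written only so that the sketch compiles
outside the kernel, and ``open_journal()`` with its ``discard_log`` flag
is hypothetical. The signatures follow what include/linux/jbd2.h is
believed to declare and should be checked against your kernel version
(for instance, some kernels may report an init failure as an ERR_PTR
value rather than NULL).

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins; the real definitions live in the kernel. */
struct inode { int stub_unused; };
typedef struct journal_s {
	int needs_recovery;	/* stub: log holds outstanding transactions */
	int loaded;		/* stub: jbd2_journal_load() has run */
	int recovered;		/* stub: load had to replay the log */
} journal_t;

static journal_t stub_journal;

static journal_t *jbd2_journal_init_inode(struct inode *inode)
{
	(void)inode;
	stub_journal.loaded = 0;
	stub_journal.recovered = 0;
	return &stub_journal;
}

static int jbd2_journal_wipe(journal_t *journal, int write)
{
	(void)write;
	journal->needs_recovery = 0;	/* outstanding transactions are skipped */
	return 0;
}

static int jbd2_journal_load(journal_t *journal)
{
	assert(journal != NULL);
	if (journal->needs_recovery) {	/* real call: jbd2_journal_recover() */
		journal->recovered = 1;
		journal->needs_recovery = 0;
	}
	journal->loaded = 1;
	return 0;
}

static int jbd2_journal_destroy(journal_t *journal)
{
	journal->loaded = 0;
	return 0;
}

/* Hypothetical helper: discard_log says the old log contents are unwanted. */
journal_t *open_journal(struct inode *journal_inode, int discard_log)
{
	journal_t *journal = jbd2_journal_init_inode(journal_inode);
	int err = 0;

	if (!journal)
		return NULL;
	if (discard_log)
		err = jbd2_journal_wipe(journal, 0);	/* wipe before load */
	if (!err)
		err = jbd2_journal_load(journal);
	if (err) {
		jbd2_journal_destroy(journal);		/* free the in-core object */
		return NULL;
	}
	return journal;
}
```

The only point of the sketch is the ordering: an optional wipe comes
before the load, and a failed load is followed by a destroy.
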
Now you can go ahead and start modifying the underlying filesystem.
Almost.

You still need to actually journal your filesystem changes; this is done
by wrapping them into transactions. You also need to wrap the
modification of each buffer with calls to the journal layer, so that it
knows which modifications you are actually making. To do this, use
jbd2_journal_start(), which returns a transaction handle.

jbd2_journal_start() and its counterpart jbd2_journal_stop(),
which indicates the end of a transaction, are nestable calls, so you can
re-enter a transaction if necessary. Remember that you must call
jbd2_journal_stop() the same number of times as
jbd2_journal_start() before the transaction is completed (or, more
accurately, leaves the update phase). Ext4/VFS makes use of this feature
to simplify handling of inode dirtying, quota support, etc.

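The nesting rule can be illustrated with a small userspace model. It is
a toy, not the kernel implementation: every ``toy_*`` name is
hypothetical, and the model only mimics the idea that a nested start
reuses the task's current handle and bumps a reference count, so that
only the outermost stop completes the handle.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model; all toy_* names are hypothetical stand-ins. */
struct toy_handle {
	int h_ref;			/* start calls not yet matched by a stop */
};

struct toy_task {
	struct toy_handle *journal_info;	/* handle the task holds, if any */
	struct toy_handle storage;		/* avoids malloc in the sketch */
	int completed;				/* handles that left the update phase */
};

/* Start, or re-enter, the task's transaction. */
static struct toy_handle *toy_start(struct toy_task *t)
{
	if (t->journal_info) {		/* nested call: reuse the handle */
		t->journal_info->h_ref++;
		return t->journal_info;
	}
	t->storage.h_ref = 1;
	t->journal_info = &t->storage;
	return t->journal_info;
}

/* Drop one reference; only the outermost stop completes the handle. */
static void toy_stop(struct toy_task *t, struct toy_handle *h)
{
	assert(h == t->journal_info && h->h_ref > 0);
	if (--h->h_ref > 0)
		return;			/* still inside an outer start */
	t->journal_info = NULL;
	t->completed++;
}

/* Runs start, start, stop, stop and returns the completed count. */
int toy_nested_demo(void)
{
	struct toy_task task = { NULL, { 0 }, 0 };
	struct toy_handle *outer = toy_start(&task);
	struct toy_handle *inner = toy_start(&task);

	assert(outer == inner);		/* the nested start reused the handle */
	toy_stop(&task, inner);
	assert(task.completed == 0);	/* the outer start is still open */
	toy_stop(&task, outer);
	return task.completed;
}
```

Calling ``toy_nested_demo()`` walks through one nested pair; nothing is
counted as completed until the second, outermost stop.
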
Inside each transaction you need to wrap the modifications to the
individual buffers (blocks). Before you start to modify a buffer you
need to call jbd2_journal_get_create_access() /
jbd2_journal_get_write_access() /
jbd2_journal_get_undo_access() as appropriate; this allows the
journalling layer to copy the unmodified data if it needs to. After all,
the buffer may be part of a previous, still uncommitted transaction. At
this point you are at last ready to modify a buffer, and once you have
done so you need to call jbd2_journal_dirty_metadata(). Or, if you've
asked for access to a buffer that you now know no longer needs to be
pushed back to the device, you can call jbd2_journal_forget() in much
the same way as you might have used bforget() in the past.

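Putting the steps above together, a single-buffer update might look
roughly like the sketch below. It is written so that it compiles in
userspace: the type definitions, the ``stub_*`` fields, the ``IS_ERR()``
and ``PTR_ERR()`` macros and the function bodies are stand-ins, and
``update_one_block()`` is a hypothetical helper. The four jbd2 call
signatures mirror what include/linux/jbd2.h is believed to declare;
verify them against your kernel before relying on them.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins; the stub_* fields only record what was called. */
typedef struct { int stub_open; } handle_t;
typedef struct { handle_t stub_handle; } journal_t;
struct buffer_head {
	char *b_data;
	size_t b_size;
	int stub_access;
	int stub_dirty;
};

#define IS_ERR(p)  ((uintptr_t)(p) >= (uintptr_t)-4095)
#define PTR_ERR(p) ((long)(intptr_t)(p))

static handle_t *jbd2_journal_start(journal_t *journal, int nblocks)
{
	(void)nblocks;
	journal->stub_handle.stub_open = 1;
	return &journal->stub_handle;
}

static int jbd2_journal_get_write_access(handle_t *handle,
					 struct buffer_head *bh)
{
	assert(handle->stub_open);
	bh->stub_access = 1;
	return 0;
}

static int jbd2_journal_dirty_metadata(handle_t *handle,
				       struct buffer_head *bh)
{
	assert(handle->stub_open && bh->stub_access);
	bh->stub_dirty = 1;
	return 0;
}

static int jbd2_journal_stop(handle_t *handle)
{
	handle->stub_open = 0;
	return 0;
}

/* Hypothetical helper: journal one metadata block update. */
int update_one_block(journal_t *journal, struct buffer_head *bh,
		     const void *src, size_t len)
{
	handle_t *handle;
	int err, err2;

	if (len > bh->b_size)
		return -1;

	handle = jbd2_journal_start(journal, 1);	/* we touch one block */
	if (IS_ERR(handle))
		return (int)PTR_ERR(handle);

	/* Tell the journal before modifying the buffer... */
	err = jbd2_journal_get_write_access(handle, bh);
	if (!err) {
		memcpy(bh->b_data, src, len);
		/* ...and again once the modification is done. */
		err = jbd2_journal_dirty_metadata(handle, bh);
	}

	err2 = jbd2_journal_stop(handle);	/* always pair with start */
	return err ? err : err2;
}
```

The sketch always calls jbd2_journal_stop(), even when an earlier step
fails, so that start and stop stay balanced.
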
A jbd2_journal_flush() may be called at any time to commit and
checkpoint all your transactions.

Then at umount time, in your put_super(), you can call
jbd2_journal_destroy() to clean up your in-core journal object.

Unfortunately there are a couple of ways the journal layer can cause a
deadlock. The first thing to note is that each task can only have a
single outstanding transaction at any one time; remember that nothing
commits until the outermost jbd2_journal_stop(). This means you must
complete the transaction at the end of each file/inode/address etc.
operation you perform, so that the journalling system isn't re-entered
on another journal. This matters because transactions can't be
nested/batched across differing journals, and a filesystem other than
yours (say ext4) may be modified in a later syscall.

The second case to bear in mind is that jbd2_journal_start() can block
if there isn't enough space in the journal for your transaction (based
on the passed nblocks param). When it blocks it merely(!) needs to wait
for transactions from other tasks to complete and be committed, so
essentially we are waiting for jbd2_journal_stop(). So to avoid
deadlocks you must treat jbd2_journal_start() /
jbd2_journal_stop() as if they were semaphores and include them in
your semaphore ordering rules. Note that jbd2_journal_extend() has
similar blocking behaviour to jbd2_journal_start(), so you can deadlock
here just as easily as on jbd2_journal_start().

Try to reserve the right number of blocks the first time. ;-). This will
be the maximum number of blocks you are going to touch in this
transaction. I advise having a look at least at ext4_jbd.h to see the
basis on which ext4 makes these decisions.

Another wriggle to watch out for is your on-disk block allocation
strategy. Why? Because, if you do a delete, you need to ensure you
haven't reused any of the freed blocks until the transaction freeing
these blocks commits. If you reuse these blocks and a crash happens,
there is no way to restore the contents of the reallocated blocks to
their state at the end of the last fully committed transaction. One
simple way of doing this is to mark blocks as free in the internal
in-memory block allocation structures only after the transaction freeing
them commits. Ext4 uses a journal commit callback for this purpose.

With journal commit callbacks you can ask the journalling layer to call
a callback function when the transaction is finally committed to disk,
so that you can do some of your own management. You ask the journalling
layer to call the callback by simply setting the
``journal->j_commit_callback`` function pointer; that function is then
called after each transaction commit. You can also use
``transaction->t_private_list`` for attaching entries to a transaction
that need processing when the transaction commits.

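The idea of holding freed blocks back until the commit can be sketched
with a toy userspace model. Nothing here is jbd2 or ext4 code: all
``toy_*`` names, the fixed-size arrays and the first-fit allocator are
assumptions made for illustration, and ``toy_commit_callback()`` merely
plays the role that a function installed in
``journal->j_commit_callback`` would play.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_NBLOCKS	32
#define TOY_MAX_PENDING	8

/* Toy model; every toy_* name is hypothetical. */
struct toy_transaction {
	unsigned long pending[TOY_MAX_PENDING];	/* blocks freed in this transaction */
	size_t npending;
};

struct toy_allocator {
	unsigned char in_use[TOY_NBLOCKS];	/* 1 = allocated, 0 = reusable */
};

/* Record a freed block on the transaction; it is NOT reusable yet. */
int toy_free_block(struct toy_transaction *t, unsigned long blk)
{
	if (blk >= TOY_NBLOCKS || t->npending == TOY_MAX_PENDING)
		return -1;
	t->pending[t->npending++] = blk;
	return 0;
}

/* Stand-in for the commit callback: the free is durable, allow reuse. */
void toy_commit_callback(struct toy_allocator *a, struct toy_transaction *t)
{
	for (size_t i = 0; i < t->npending; i++)
		a->in_use[t->pending[i]] = 0;
	t->npending = 0;
}

/* First-fit allocation from the in-memory map; -1 when nothing is free. */
long toy_alloc_block(struct toy_allocator *a)
{
	for (size_t i = 0; i < TOY_NBLOCKS; i++) {
		if (!a->in_use[i]) {
			a->in_use[i] = 1;
			return (long)i;
		}
	}
	return -1;
}

/* Mark every block as allocated, e.g. to set up a full filesystem. */
void toy_fill(struct toy_allocator *a)
{
	memset(a->in_use, 1, sizeof(a->in_use));
}
```

With a full allocator, a block passed to ``toy_free_block()`` cannot be
handed out again by ``toy_alloc_block()`` until
``toy_commit_callback()`` has run, which is the property the text above
asks for.
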
JBD2 also provides a way to block all transaction updates via
jbd2_journal_lock_updates() /
jbd2_journal_unlock_updates(). Ext4 uses this when it wants a
window with a clean and stable fs for a moment. E.g.

::

    jbd2_journal_lock_updates()   // stop new stuff happening..
    jbd2_journal_flush()          // checkpoint everything.
    ..do stuff on stable fs
    jbd2_journal_unlock_updates() // carry on with filesystem use.

The opportunities for abuse and DoS attacks with this should be obvious,
if you allow unprivileged userspace to trigger codepaths containing
these calls.

Summary
~~~~~~~

Using the journal is a matter of wrapping the different context changes
(each mount, each modification (transaction) and each changed buffer)
to tell the journalling layer about them.

Data Types
----------

The journalling layer uses typedefs to 'hide' the concrete definitions
of the structures used. As a client of the JBD2 layer you can just rely
on using the pointer as a magic cookie of some sort. Obviously the
hiding is not enforced, as this is 'C'.

Structures
~~~~~~~~~~

.. kernel-doc:: include/linux/jbd2.h
   :internal:

Functions
---------

The functions here are split into two groups: those that affect a
journal as a whole, and those which are used to manage transactions.

Journal Level
~~~~~~~~~~~~~

.. kernel-doc:: fs/jbd2/journal.c
   :export:

.. kernel-doc:: fs/jbd2/recovery.c
   :internal:

Transaction Level
~~~~~~~~~~~~~~~~~

.. kernel-doc:: fs/jbd2/transaction.c

See also
--------

`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen
Tweedie <http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz>`__

`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen
Tweedie <http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html>`__