========================
Using Hadoop with CephFS
========================

The Ceph file system can be used as a drop-in replacement for the Hadoop File
System (HDFS). This page describes the installation and configuration process
of using Ceph with Hadoop.

Dependencies
============

* CephFS Java Interface
* Hadoop CephFS Plugin

.. important:: Currently requires Hadoop 1.1.X stable series

Installation
============

There are three requirements for using CephFS with Hadoop. First, a running
Ceph installation is required. The details of setting up a Ceph cluster and
the file system are beyond the scope of this document. Please refer to the
Ceph documentation for installing Ceph.

The remaining two requirements are a Hadoop installation and the Ceph file
system Java packages, including the Java CephFS Hadoop plugin. The high-level
steps are to add the dependencies to the Hadoop installation ``CLASSPATH``,
and to configure Hadoop to use the Ceph file system.

CephFS Java Packages
--------------------

* CephFS Hadoop plugin (`hadoop-cephfs.jar <http://ceph.com/download/hadoop-cephfs.jar>`_)

Adding these dependencies to a Hadoop installation will depend on your
particular deployment. In general, the dependencies must be present on each
node in the system that will be part of the Hadoop cluster, and must be in the
``CLASSPATH`` searched for by Hadoop. Typical approaches are to place the
additional ``jar`` files into the ``hadoop/lib`` directory, or to edit the
``HADOOP_CLASSPATH`` variable in ``hadoop-env.sh``.
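
For the ``hadoop-env.sh`` approach, the edit might look like the following
sketch. The jar locations under ``/usr/share/java`` are assumptions for
illustration; adjust them to wherever the packages are installed on your
nodes. ::

    # hadoop-env.sh -- make the CephFS jars visible to Hadoop.
    # The /usr/share/java paths below are assumptions; use the actual
    # install locations of the CephFS Java packages on each node.
    export HADOOP_CLASSPATH=/usr/share/java/hadoop-cephfs.jar:/usr/share/java/libcephfs.jar:$HADOOP_CLASSPATH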

The native Ceph file system client must be installed on each participating
node in the Hadoop cluster.

Hadoop Configuration
====================

This section describes the Hadoop configuration options used to control Ceph.
These options are intended to be set in the Hadoop configuration file
``conf/core-site.xml``.

+---------------------+--------------------------+----------------------------+
|Property             |Value                     |Notes                       |
+=====================+==========================+============================+
|fs.default.name      |Ceph URI                  |ceph://[monaddr:port]/      |
+---------------------+--------------------------+----------------------------+
|ceph.conf.file       |Local path to ceph.conf   |/etc/ceph/ceph.conf         |
+---------------------+--------------------------+----------------------------+
|ceph.conf.options    |Comma separated list of   |opt1=val1,opt2=val2         |
|                     |Ceph configuration        |                            |
|                     |key/value pairs           |                            |
+---------------------+--------------------------+----------------------------+
|ceph.root.dir        |Mount root directory      |Default value: /            |
+---------------------+--------------------------+----------------------------+
|ceph.mon.address     |Monitor address           |host:port                   |
+---------------------+--------------------------+----------------------------+
|ceph.auth.id         |Ceph user id              |Example: admin              |
+---------------------+--------------------------+----------------------------+
|ceph.auth.keyfile    |Ceph key file             |                            |
+---------------------+--------------------------+----------------------------+
|ceph.auth.keyring    |Ceph keyring file         |                            |
+---------------------+--------------------------+----------------------------+
|ceph.object.size     |Default file object size  |Default value (64MB):       |
|                     |in bytes                  |67108864                    |
+---------------------+--------------------------+----------------------------+
|ceph.data.pools      |List of Ceph data pools   |Default value: default Ceph |
|                     |for storing files         |pool                        |
+---------------------+--------------------------+----------------------------+
|ceph.localize.reads  |Allow reading from file   |Default value: true         |
|                     |replica objects           |                            |
+---------------------+--------------------------+----------------------------+
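
Putting the core options together, a minimal ``conf/core-site.xml`` might look
like the following sketch. The monitor address and user id here are
placeholder assumptions; substitute values from your cluster. ::

    <configuration>
      <property>
        <name>fs.default.name</name>
        <!-- placeholder monitor address; use one of your cluster's monitors -->
        <value>ceph://mon-host:6789/</value>
      </property>
      <property>
        <name>ceph.conf.file</name>
        <value>/etc/ceph/ceph.conf</value>
      </property>
      <property>
        <name>ceph.auth.id</name>
        <!-- example user id, as in the table above -->
        <value>admin</value>
      </property>
    </configuration>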

Support For Per-file Custom Replication
---------------------------------------

The Hadoop file system interface allows users to specify a custom replication
factor (e.g. 3 copies of each block) when creating a file. However, object
replication factors in the Ceph file system are controlled on a per-pool
basis, and by default a Ceph file system will contain only a single
pre-configured pool. Thus, in order to support per-file replication with
Hadoop over Ceph, additional storage pools with non-default replication
factors must be created, and Hadoop must be configured to choose from these
additional pools.

Additional data pools can be specified using the ``ceph.data.pools``
configuration option. The value of the option is a comma separated list of
pool names. The default Ceph pool will be used automatically if this
configuration option is omitted or the value is empty. For example, the
following configuration setting will consider the pools ``pool1``, ``pool2``, and
``pool5`` when selecting a target pool to store a file. ::

    <property>
      <name>ceph.data.pools</name>
      <value>pool1,pool2,pool5</value>
    </property>

Hadoop will not create pools automatically. In order to create a new pool with
a specific replication factor use the ``ceph osd pool create`` command, and then
set the ``size`` property on the pool using the ``ceph osd pool set`` command. For
more information on creating and configuring pools see the `RADOS Pool
documentation`_.

.. _RADOS Pool documentation: ../../rados/operations/pools

Once a pool has been created and configured the metadata service must be told
that the new pool may be used to store file data. A pool is made available for
storing file system data using the ``ceph fs add_data_pool`` command.

First, create the pool. In this example we create the ``hadoop1`` pool with
replication factor 1. ::

    ceph osd pool create hadoop1 100
    ceph osd pool set hadoop1 size 1

Next, determine the pool id. This can be done by examining the output of the
``ceph osd dump`` command. For example, we can look for the newly created
``hadoop1`` pool. ::

    ceph osd dump | grep hadoop1

The output should resemble::

    pool 3 'hadoop1' rep size 1 min_size 1 crush_rule 0...

where ``3`` is the pool id. Next we will use the pool id reference to register
the pool as a data pool for storing file system data. ::

    ceph fs add_data_pool cephfs 3

The final step is to configure Hadoop to consider this data pool when
selecting the target pool for new files. ::

    <property>
      <name>ceph.data.pools</name>
      <value>hadoop1</value>
    </property>

Pool Selection Rules
~~~~~~~~~~~~~~~~~~~~

The following rules describe how Hadoop chooses a pool given a desired
replication factor and the set of pools specified using the
``ceph.data.pools`` configuration option.

1. When no custom pools are specified the default Ceph data pool is used.
2. A custom pool with the same replication factor as the default Ceph data
   pool will override the default.
3. A pool with a replication factor that matches the desired replication will
   be chosen if it exists.
4. Otherwise, a pool with at least the desired replication factor will be
   chosen, or the maximum possible.
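
The rules above can be sketched in Python as follows. This is an illustrative
model of the selection logic, not the plugin's actual implementation; the
function name and the choice of the *smallest* pool meeting the desired
replication in rule 4 are assumptions made for this sketch. ::

    def select_data_pool(pools, default_repl, wanted):
        """Model of the pool selection rules above.

        ``pools`` maps each configured pool name (from ``ceph.data.pools``)
        to its replication factor; ``default_repl`` is the replication of
        the default Ceph data pool; ``wanted`` is the replication requested
        for the new file. Returns the chosen pool name, or None meaning the
        default Ceph data pool.
        """
        # Rule 1: no custom pools configured -> use the default pool.
        if not pools:
            return None
        # Rule 3: a pool exactly matching the desired replication wins.
        # Rule 2 falls out naturally: a custom pool whose replication
        # equals the default's is preferred over the default, because
        # custom pools are always considered first when present.
        for name, repl in sorted(pools.items()):
            if repl == wanted:
                return name
        # Rule 4: otherwise take a pool with at least the desired
        # replication (this sketch picks the smallest such pool), or
        # failing that, the pool with the maximum replication available.
        at_least = [(repl, name) for name, repl in pools.items() if repl >= wanted]
        if at_least:
            return min(at_least)[1]
        return max((repl, name) for name, repl in pools.items())[1]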

Debugging Pool Selection
~~~~~~~~~~~~~~~~~~~~~~~~

Hadoop will produce a log file entry when it cannot determine the replication
factor of a pool (e.g. it is not configured as a data pool). The log message
will appear as follows::

    Error looking up replication of pool: <pool name>

Hadoop will also produce a log entry when it is unable to select an exact
match for replication. This log entry will appear as follows::

    selectDataPool path=<path> pool:repl=<name>:<value> wanted=<value>