========================
Using Hadoop with CephFS
========================

The Ceph file system can be used as a drop-in replacement for the Hadoop File
System (HDFS). This page describes the installation and configuration process
of using Ceph with Hadoop.

Dependencies
============

* CephFS Java Interface
* Hadoop CephFS Plugin

.. important:: Currently requires Hadoop 1.1.X stable series

Installation
============

There are three requirements for using CephFS with Hadoop. First, a running
Ceph installation is required. The details of setting up a Ceph cluster and
the file system are beyond the scope of this document. Please refer to the
Ceph documentation for installing Ceph.

The remaining two requirements are a Hadoop installation and the Ceph file
system Java packages, including the Java CephFS Hadoop plugin. The high-level
steps are to add the dependencies to the Hadoop installation ``CLASSPATH``,
and to configure Hadoop to use the Ceph file system.

CephFS Java Packages
--------------------

* CephFS Hadoop plugin (`hadoop-cephfs.jar <http://ceph.com/download/hadoop-cephfs.jar>`_)

How these dependencies are added to a Hadoop installation will depend on your
particular deployment. In general, the dependencies must be present on each
node in the system that will be part of the Hadoop cluster, and must be in the
``CLASSPATH`` searched for by Hadoop. Typical approaches are to place the
additional ``jar`` files into the ``hadoop/lib`` directory, or to edit the
``HADOOP_CLASSPATH`` variable in ``hadoop-env.sh``.
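For example, assuming the plugin and the CephFS Java bindings have been copied
to ``/usr/local/lib`` (an illustrative path; your jar locations will differ),
``hadoop-env.sh`` might be extended as follows::

    # hadoop-env.sh: make the CephFS jars visible to every Hadoop daemon.
    # The paths below are examples only; adjust them to your deployment.
    export HADOOP_CLASSPATH=/usr/local/lib/hadoop-cephfs.jar:/usr/local/lib/libcephfs.jar:$HADOOP_CLASSPATH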

The native Ceph file system client must be installed on each participating
node in the Hadoop cluster.

Hadoop Configuration
====================

This section describes the Hadoop configuration options used to control Ceph.
These options are intended to be set in the Hadoop configuration file
``conf/core-site.xml``.

52+---------------------+--------------------------+----------------------------+
53|Property |Value |Notes |
54| | | |
55+=====================+==========================+============================+
56|fs.default.name |Ceph URI |ceph://[monaddr:port]/ |
57| | | |
58| | | |
59+---------------------+--------------------------+----------------------------+
60|ceph.conf.file |Local path to ceph.conf |/etc/ceph/ceph.conf |
61| | | |
62| | | |
63| | | |
64+---------------------+--------------------------+----------------------------+
65|ceph.conf.options |Comma separated list of |opt1=val1,opt2=val2 |
66| |Ceph configuration | |
67| |key/value pairs | |
68| | | |
69+---------------------+--------------------------+----------------------------+
70|ceph.root.dir |Mount root directory |Default value: / |
71| | | |
72| | | |
73+---------------------+--------------------------+----------------------------+
74|ceph.mon.address |Monitor address |host:port |
75| | | |
76| | | |
77| | | |
78+---------------------+--------------------------+----------------------------+
79|ceph.auth.id |Ceph user id |Example: admin |
80| | | |
81| | | |
82| | | |
83+---------------------+--------------------------+----------------------------+
84|ceph.auth.keyfile |Ceph key file | |
85| | | |
86| | | |
87| | | |
88+---------------------+--------------------------+----------------------------+
89|ceph.auth.keyring |Ceph keyring file | |
90| | | |
91| | | |
92| | | |
93+---------------------+--------------------------+----------------------------+
94|ceph.object.size |Default file object size |Default value (64MB): |
95| |in bytes |67108864 |
96| | | |
97| | | |
98+---------------------+--------------------------+----------------------------+
99|ceph.data.pools |List of Ceph data pools |Default value: default Ceph |
100| |for storing file. |pool. |
101| | | |
102| | | |
103+---------------------+--------------------------+----------------------------+
104|ceph.localize.reads |Allow reading from file |Default value: true |
105| |replica objects | |
106| | | |
107| | | |
108+---------------------+--------------------------+----------------------------+
109
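Putting a few of these options together, a minimal ``conf/core-site.xml``
might look like the following. The monitor address and user id shown here are
examples only; substitute the values for your cluster::

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>ceph://192.168.0.1:6789/</value>
      </property>
      <property>
        <name>ceph.conf.file</name>
        <value>/etc/ceph/ceph.conf</value>
      </property>
      <property>
        <name>ceph.auth.id</name>
        <value>admin</value>
      </property>
    </configuration>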
Support For Per-file Custom Replication
---------------------------------------

The Hadoop file system interface allows users to specify a custom replication
factor (e.g. 3 copies of each block) when creating a file. However, object
replication factors in the Ceph file system are controlled on a per-pool
basis, and by default a Ceph file system will contain only a single
pre-configured pool. Thus, in order to support per-file replication with
Hadoop over Ceph, additional storage pools with non-default replication
factors must be created, and Hadoop must be configured to choose from these
additional pools.

Additional data pools can be specified using the ``ceph.data.pools``
configuration option. The value of the option is a comma separated list of
pool names. The default Ceph pool will be used automatically if this
configuration option is omitted or the value is empty. For example, the
following configuration setting will consider the pools ``pool1``, ``pool2``, and
``pool5`` when selecting a target pool to store a file. ::

    <property>
      <name>ceph.data.pools</name>
      <value>pool1,pool2,pool5</value>
    </property>

Hadoop will not create pools automatically. In order to create a new pool with
a specific replication factor use the ``ceph osd pool create`` command, and then
set the ``size`` property on the pool using the ``ceph osd pool set`` command. For
more information on creating and configuring pools see the `RADOS Pool
documentation`_.

.. _RADOS Pool documentation: ../../rados/operations/pools

Once a pool has been created and configured, the metadata service must be told
that the new pool may be used to store file data. A pool is made available for
storing file system data using the ``ceph fs add_data_pool`` command.

First, create the pool. In this example we create the ``hadoop1`` pool with
replication factor 1. ::

    ceph osd pool create hadoop1 100
    ceph osd pool set hadoop1 size 1

Next, determine the pool id. This can be done by examining the output of the
``ceph osd dump`` command. For example, we can look for the newly created
``hadoop1`` pool. ::

    ceph osd dump | grep hadoop1

The output should resemble::

    pool 3 'hadoop1' rep size 1 min_size 1 crush_rule 0...

where ``3`` is the pool id. Next we will use the pool id reference to register
the pool as a data pool for storing file system data. ::

    ceph fs add_data_pool cephfs 3

The final step is to configure Hadoop to consider this data pool when
selecting the target pool for new files. ::

    <property>
      <name>ceph.data.pools</name>
      <value>hadoop1</value>
    </property>

Pool Selection Rules
~~~~~~~~~~~~~~~~~~~~

The following rules describe how Hadoop chooses a pool given a desired
replication factor and the set of pools specified using the
``ceph.data.pools`` configuration option.

1. When no custom pools are specified the default Ceph data pool is used.
2. A custom pool with the same replication factor as the default Ceph data
   pool will override the default.
3. A pool with a replication factor that matches the desired replication will
   be chosen if it exists.
4. Otherwise, a pool with at least the desired replication factor will be
   chosen, or, failing that, the pool with the maximum available replication
   factor.
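
As a rough illustration, the rules above can be sketched in Python. This is a
simplified model for exposition only, not the plugin's actual Java
implementation; the pool names and replication factors used are invented::

    # Simplified model of the pool selection rules above. Illustrative
    # sketch only -- not the CephFS Hadoop plugin's actual code.
    def select_pool(wanted, pools, default_pool):
        """Pick a pool name. ``pools`` maps each configured pool name to
        its replication factor; ``wanted`` is the desired replication."""
        if not pools:
            return default_pool                     # rule 1
        exact = [n for n, r in pools.items() if r == wanted]
        if exact:
            return exact[0]                         # rule 3 (covers rule 2)
        higher = {n: r for n, r in pools.items() if r >= wanted}
        if higher:
            return min(higher, key=higher.get)      # rule 4: at least wanted
        return max(pools, key=pools.get)            # rule 4: maximum possible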

Debugging Pool Selection
~~~~~~~~~~~~~~~~~~~~~~~~

Hadoop will produce a log entry when it cannot determine the replication
factor of a pool (e.g. it is not configured as a data pool). The log message
will appear as follows::

    Error looking up replication of pool: <pool name>

Hadoop will also produce a log entry when it is unable to select an exact
replication match. This log entry will appear as follows::

    selectDataPool path=<path> pool:repl=<name>:<value> wanted=<value>