git.proxmox.com Git - cargo.git/blob - src/cargo/sources/registry/mod.rs
//! A `Source` for registry-based packages.
//!
//! # What's a Registry?
//!
//! Registries are central locations where packages can be uploaded,
//! discovered, and searched for. The purpose of a registry is to have a
//! location that serves as permanent storage for versions of a crate over time.
//!
//! Compared to git sources, a registry provides many packages as well as many
//! versions simultaneously. Git sources can also have commits deleted through
//! rebasing, whereas registries cannot have their versions deleted.
//!
//! # The Index of a Registry
//!
//! One of the major difficulties with a registry is that hosting so many
//! packages may quickly run into performance problems when dealing with
//! dependency graphs. It's infeasible for cargo to download the entire contents
//! of the registry just to resolve one package's dependencies, for example. As
//! a result, cargo needs some efficient method of querying what packages are
//! available on a registry, what versions are available, and what the
//! dependencies for each version are.
//!
//! One method of doing so would be having the registry expose an HTTP endpoint
//! which can be queried with a list of packages and which responds with their
//! dependencies and versions. This is somewhat inefficient, however, as we may
//! have to hit the endpoint many times and we may already have much of the
//! data locally (for other packages, for example). This also involves
//! inventing a transport format between the registry and Cargo itself, so this
//! route was not taken.
//!
//! Instead, Cargo communicates with registries through a git repository
//! referred to as the Index. The Index of a registry is essentially an easily
//! queryable version of the registry's database for a list of versions of a
//! package as well as a list of dependencies for each version.
//!
//! Using git to host this index provides a number of benefits:
//!
//! * The entire index can be stored efficiently locally on disk. This means
//!   that all queries of a registry can happen locally and don't need to touch
//!   the network.
//!
//! * Updates of the index are quite efficient. Using git buys incremental
//!   updates, compressed transmission, etc. for free. The index must be updated
//!   each time we need fresh information from a registry, but this is one
//!   update of a git repository that probably hasn't changed a whole lot so
//!   it shouldn't be too expensive.
//!
//!   Additionally, each modification to the index is just appending a line at
//!   the end of a file (the exact format is described later). This means that
//!   the commits for an index are quite small and easily applied/compressible.
//!
//! ## The format of the Index
//!
//! The index is a store for the list of versions for all packages known, so its
//! format on disk is optimized slightly to ensure that `ls registry` doesn't
//! produce a list of all packages ever known. The index also wants to ensure
//! that there aren't a million files, which may actually end up hitting
//! filesystem limits at some point. To this end, a few decisions were made
//! about the format of the registry:
//!
//! 1. Each crate will have one file corresponding to it. Each version for a
//!    crate will just be a line in this file.
//! 2. There will be two tiers of directories for crate names, under which
//!    crates corresponding to those tiers will be located.
//!
//! For example, the hierarchy of an index looks like this:
//!
//! ```notrust
//! .
//! ├── 3
//! │   └── u
//! │       └── url
//! ├── bz
//! │   └── ip
//! │       └── bzip2
//! ├── config.json
//! ├── en
//! │   └── co
//! │       └── encoding
//! └── li
//!     ├── bg
//!     │   └── libgit2
//!     └── nk
//!         └── link-config
//! ```
//!
//! The root of the index contains a `config.json` file with a few entries
//! corresponding to the registry (see `RegistryConfig` below).
//!
//! Beyond that, there are three numbered directories (1, 2, 3) for crates with
//! names 1, 2, and 3 characters in length. The 1/2 directories simply have the
//! crate files underneath them, while the 3 directory is sharded by the first
//! letter of the crate name.
//!
//! For all longer names, the top-level directory contains many two-letter
//! directory names, each of which has many sub-folders with two letters. At the
//! end of all these are the actual crate files themselves.
//!
//! The purpose of this layout is to cut down on `ls` sizes and to allow
//! efficient lookup based on the crate name itself.
//!
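/*!
As a minimal sketch of the layout described above, the path of a crate file
relative to the index root could be derived as follows. The helper name
`index_relative_path` is hypothetical and not part of Cargo's API, and the
sketch assumes non-empty, plain ASCII crate names.

```rust
/// Hypothetical helper (an assumption for illustration, not Cargo's API):
/// returns the path of a crate's file relative to the index root, or `None`
/// if the name is empty or not ASCII.
fn index_relative_path(name: &str) -> Option<String> {
    if name.is_empty() || !name.is_ascii() {
        return None;
    }
    let path = match name.len() {
        // One- and two-character names live directly under `1/` and `2/`.
        1 => format!("1/{}", name),
        2 => format!("2/{}", name),
        // Three-character names are sharded by their first letter.
        3 => format!("3/{}/{}", &name[..1], name),
        // Everything else uses two tiers of two-letter directories.
        _ => format!("{}/{}/{}", &name[..2], &name[2..4], name),
    };
    Some(path)
}
```

For the hierarchy shown above, `index_relative_path("bzip2")` yields
`Some("bz/ip/bzip2".to_string())`.
*/
//!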
//! ## Crate files
//!
//! Each file in the index is the history of one crate over time. Each line in
//! the file corresponds to one version of a crate, stored in JSON format (see
//! the `RegistryPackage` structure below).
//!
//! As new versions are published, new lines are appended to this file. The only
//! modifications to this file that should happen over time are yanks of a
//! particular version.
//!
//! # Downloading Packages
//!
//! The purpose of the Index is to provide an efficient method to resolve the
//! dependency graph for a package. So far we only required one network
//! interaction to update the registry's repository (yay!). After resolution has
//! been performed, however, we need to download the contents of packages so we
//! can read the full manifest and build the source code.
//!
//! To accomplish this, this source's `download` method will make one HTTP
//! request per requested package to download tarballs into a local cache. These
//! tarballs will then be unpacked into a destination folder.
//!
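/*!
While unpacking, `unpack_package` below rejects any archive entry that does
not sit under the expected `<pkg>-<version>` directory. A minimal sketch of
that check, where the helper name `entry_is_under_prefix` is hypothetical and
only the standard library is assumed:

```rust
use std::path::Path;

/// Hypothetical helper (not part of Cargo's API): returns true if
/// `entry_path` lies under the `<pkg>-<version>` directory `prefix`.
fn entry_is_under_prefix(entry_path: &Path, prefix: &str) -> bool {
    // `Path::starts_with` compares whole path components, so a sibling
    // directory such as `foo-1.0.0-evil/` does not match `foo-1.0.0`.
    entry_path.starts_with(prefix)
}
```
*/
//!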
//! Note that because versions uploaded to the registry are frozen forever,
//! the HTTP download and unpacking can all be skipped if the version has
//! already been downloaded and unpacked. This caching allows us to only
//! download a package when absolutely necessary.
//!
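/*!
The URL of each tarball comes from the `dl` template of `RegistryConfig`
(documented below). As a rough sketch of the expansion rule stated there,
assuming a hypothetical helper named `download_url` rather than Cargo's
actual implementation:

```rust
/// Hypothetical helper (not part of Cargo's API): expands a `dl` template
/// into the download URL for one version of a crate.
fn download_url(dl: &str, krate: &str, version: &str) -> String {
    let mut url = dl.to_string();
    // Backwards compatibility: a template with neither marker gets the
    // `/{crate}/{version}/download` suffix appended before substitution.
    if !url.contains("{crate}") && !url.contains("{version}") {
        url.push_str("/{crate}/{version}/download");
    }
    url.replace("{crate}", krate).replace("{version}", version)
}
```

With a marker-free template such as `https://example.com/api/v1/crates`
(an invented URL), the sketch appends the suffix before substituting the
crate name and version.
*/
//!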
//! # Filesystem Hierarchy
//!
//! Overall, the layout of `$HOME/.cargo` looks like this for registries:
//!
//! ```notrust
//! # A folder under which all registry metadata is hosted (similar to
//! # $HOME/.cargo/git)
//! $HOME/.cargo/registry/
//!
//!     # For each registry that cargo knows about (keyed by hostname + hash)
//!     # there is a folder which is the checked out version of the index for
//!     # the registry in this location. Note that this is done so cargo can
//!     # support multiple registries simultaneously
//!     index/
//!         registry1-<hash>/
//!         registry2-<hash>/
//!         ...
//!
//!     # This folder is a cache for all downloaded tarballs from a registry.
//!     # Once downloaded and verified, a tarball never changes.
//!     cache/
//!         registry1-<hash>/<pkg>-<version>.crate
//!         ...
//!
//!     # Location in which all tarballs are unpacked. Each tarball is known to
//!     # be frozen after downloading, so transitively this folder is also
//!     # frozen once it's unpacked (it's never unpacked again)
//!     src/
//!         registry1-<hash>/<pkg>-<version>/...
//!         ...
//! ```
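//!
/*!
The hierarchy above can be sketched as two path builders. The function names
and the `registry_dir` parameter (standing in for a `registry1-<hash>` folder
name) are assumptions made for illustration, not Cargo's API.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: where a downloaded tarball is cached.
fn cached_crate_path(cargo_home: &Path, registry_dir: &str, pkg: &str, version: &str) -> PathBuf {
    cargo_home
        .join("registry")
        .join("cache")
        .join(registry_dir)
        .join(format!("{}-{}.crate", pkg, version))
}

/// Hypothetical helper: where that tarball is unpacked.
fn unpacked_src_path(cargo_home: &Path, registry_dir: &str, pkg: &str, version: &str) -> PathBuf {
    cargo_home
        .join("registry")
        .join("src")
        .join(registry_dir)
        .join(format!("{}-{}", pkg, version))
}
```
*/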

use std::borrow::Cow;
use std::collections::BTreeMap;
use std::collections::HashSet;
use std::io::Write;
use std::path::{Path, PathBuf};

use flate2::read::GzDecoder;
use log::debug;
use semver::Version;
use serde::Deserialize;
use tar::Archive;

use crate::core::dependency::{Dependency, Kind};
use crate::core::source::MaybePackage;
use crate::core::{Package, PackageId, Source, SourceId, Summary};
use crate::sources::PathSource;
use crate::util::errors::CargoResultExt;
use crate::util::hex;
use crate::util::to_url::ToUrl;
use crate::util::{internal, CargoResult, Config, FileLock, Filesystem};

const INDEX_LOCK: &str = ".cargo-index-lock";
const PACKAGE_SOURCE_LOCK: &str = ".cargo-ok";
pub const CRATES_IO_INDEX: &str = "https://github.com/rust-lang/crates.io-index";
pub const CRATES_IO_REGISTRY: &str = "crates-io";
const CRATE_TEMPLATE: &str = "{crate}";
const VERSION_TEMPLATE: &str = "{version}";

pub struct RegistrySource<'cfg> {
    source_id: SourceId,
    src_path: Filesystem,
    config: &'cfg Config,
    updated: bool,
    ops: Box<dyn RegistryData + 'cfg>,
    index: index::RegistryIndex<'cfg>,
    yanked_whitelist: HashSet<PackageId>,
    index_locked: bool,
}

#[derive(Deserialize)]
pub struct RegistryConfig {
    /// Download endpoint for all crates.
    ///
    /// The string is a template which will generate the download URL for the
    /// tarball of a specific version of a crate. The substrings `{crate}` and
    /// `{version}` will be replaced with the crate's name and version
    /// respectively.
    ///
    /// For backwards compatibility, if the string does not contain `{crate}` or
    /// `{version}`, it will be extended with `/{crate}/{version}/download` to
    /// support registries like crates.io which were created before the
    /// templating setup was created.
    pub dl: String,

    /// API endpoint for the registry. This is what's actually hit to perform
    /// operations like yanks, owner modifications, publish new crates, etc.
    /// If this is None, the registry does not support API commands.
    pub api: Option<String>,
}

#[derive(Deserialize)]
pub struct RegistryPackage<'a> {
    name: Cow<'a, str>,
    vers: Version,
    deps: Vec<RegistryDependency<'a>>,
    features: BTreeMap<Cow<'a, str>, Vec<Cow<'a, str>>>,
    cksum: String,
    yanked: Option<bool>,
    links: Option<Cow<'a, str>>,
}

#[test]
fn escaped_char_in_json() {
    let _: RegistryPackage<'_> = serde_json::from_str(
        r#"{"name":"a","vers":"0.0.1","deps":[],"cksum":"bae3","features":{}}"#,
    )
    .unwrap();
    let _: RegistryPackage<'_> = serde_json::from_str(
        r#"{"name":"a","vers":"0.0.1","deps":[],"cksum":"bae3","features":{"test":["k","q"]},"links":"a-sys"}"#
    ).unwrap();

    // Now we add escaped chars in all the places they can go.
    // These are not valid, but they should error later than JSON parsing.
    let _: RegistryPackage<'_> = serde_json::from_str(
        r#"{
        "name":"This name has an escaped char in it \n\t\" ",
        "vers":"0.0.1",
        "deps":[{
            "name": " \n\t\" ",
            "req": " \n\t\" ",
            "features": [" \n\t\" "],
            "optional": true,
            "default_features": true,
            "target": " \n\t\" ",
            "kind": " \n\t\" ",
            "registry": " \n\t\" "
        }],
        "cksum":"bae3",
        "features":{"test \n\t\" ":["k \n\t\" ","q \n\t\" "]},
        "links":" \n\t\" "}"#,
    )
    .unwrap();
}

#[derive(Deserialize)]
#[serde(field_identifier, rename_all = "lowercase")]
enum Field {
    Name,
    Vers,
    Deps,
    Features,
    Cksum,
    Yanked,
    Links,
}

#[derive(Deserialize)]
struct RegistryDependency<'a> {
    name: Cow<'a, str>,
    req: Cow<'a, str>,
    features: Vec<Cow<'a, str>>,
    optional: bool,
    default_features: bool,
    target: Option<Cow<'a, str>>,
    kind: Option<Cow<'a, str>>,
    registry: Option<Cow<'a, str>>,
    package: Option<Cow<'a, str>>,
    public: Option<bool>,
}

impl<'a> RegistryDependency<'a> {
    /// Converts an encoded dependency in the registry to a cargo dependency
    pub fn into_dep(self, default: SourceId) -> CargoResult<Dependency> {
        let RegistryDependency {
            name,
            req,
            mut features,
            optional,
            default_features,
            target,
            kind,
            registry,
            package,
            public,
        } = self;

        let id = if let Some(registry) = &registry {
            SourceId::for_registry(&registry.to_url()?)?
        } else {
            default
        };

        let mut dep =
            Dependency::parse_no_deprecated(package.as_ref().unwrap_or(&name), Some(&req), id)?;
        if package.is_some() {
            dep.set_explicit_name_in_toml(&name);
        }
        let kind = match kind.as_ref().map(|s| &s[..]).unwrap_or("") {
            "dev" => Kind::Development,
            "build" => Kind::Build,
            _ => Kind::Normal,
        };

        let platform = match target {
            Some(target) => Some(target.parse()?),
            None => None,
        };

        // All dependencies are private by default
        let public = public.unwrap_or(false);

        // Unfortunately older versions of cargo and/or the registry ended up
        // publishing lots of entries where the features array contained the
        // empty feature, "", inside. This confuses the resolution process much
        // later on and these features aren't actually valid, so filter them all
        // out here.
        features.retain(|s| !s.is_empty());

        // In index, "registry" is null if it is from the same index.
        // In Cargo.toml, "registry" is None if it is from the default
        if !id.is_default_registry() {
            dep.set_registry_id(id);
        }

        dep.set_optional(optional)
            .set_default_features(default_features)
            .set_features(features)
            .set_platform(platform)
            .set_kind(kind)
            .set_public(public);

        Ok(dep)
    }
}

pub trait RegistryData {
    fn prepare(&self) -> CargoResult<()>;
    fn index_path(&self) -> &Filesystem;
    fn load(
        &self,
        _root: &Path,
        path: &Path,
        data: &mut dyn FnMut(&[u8]) -> CargoResult<()>,
    ) -> CargoResult<()>;
    fn config(&mut self) -> CargoResult<Option<RegistryConfig>>;
    fn update_index(&mut self) -> CargoResult<()>;
    fn download(&mut self, pkg: PackageId, checksum: &str) -> CargoResult<MaybeLock>;
    fn finish_download(
        &mut self,
        pkg: PackageId,
        checksum: &str,
        data: &[u8],
    ) -> CargoResult<FileLock>;

    fn is_crate_downloaded(&self, _pkg: PackageId) -> bool {
        true
    }
}

pub enum MaybeLock {
    Ready(FileLock),
    Download { url: String, descriptor: String },
}

mod index;
mod local;
mod remote;

fn short_name(id: SourceId) -> String {
    let hash = hex::short_hash(&id);
    let ident = id.url().host_str().unwrap_or("").to_string();
    format!("{}-{}", ident, hash)
}

impl<'cfg> RegistrySource<'cfg> {
    pub fn remote(
        source_id: SourceId,
        yanked_whitelist: &HashSet<PackageId>,
        config: &'cfg Config,
    ) -> RegistrySource<'cfg> {
        let name = short_name(source_id);
        let ops = remote::RemoteRegistry::new(source_id, config, &name);
        RegistrySource::new(
            source_id,
            config,
            &name,
            Box::new(ops),
            yanked_whitelist,
            true,
        )
    }

    pub fn local(
        source_id: SourceId,
        path: &Path,
        yanked_whitelist: &HashSet<PackageId>,
        config: &'cfg Config,
    ) -> RegistrySource<'cfg> {
        let name = short_name(source_id);
        let ops = local::LocalRegistry::new(path, config, &name);
        RegistrySource::new(
            source_id,
            config,
            &name,
            Box::new(ops),
            yanked_whitelist,
            false,
        )
    }

    fn new(
        source_id: SourceId,
        config: &'cfg Config,
        name: &str,
        ops: Box<dyn RegistryData + 'cfg>,
        yanked_whitelist: &HashSet<PackageId>,
        index_locked: bool,
    ) -> RegistrySource<'cfg> {
        RegistrySource {
            src_path: config.registry_source_path().join(name),
            config,
            source_id,
            updated: false,
            index: index::RegistryIndex::new(source_id, ops.index_path(), config, index_locked),
            yanked_whitelist: yanked_whitelist.clone(),
            index_locked,
            ops,
        }
    }

    /// Decode the configuration stored within the registry.
    ///
    /// This requires that the index has been at least checked out.
    pub fn config(&mut self) -> CargoResult<Option<RegistryConfig>> {
        self.ops.config()
    }

    /// Unpacks a downloaded package into a location where it's ready to be
    /// compiled.
    ///
    /// No action is taken if the source looks like it's already unpacked.
    fn unpack_package(&self, pkg: PackageId, tarball: &FileLock) -> CargoResult<PathBuf> {
        // The `.cargo-ok` file is used to track if the source is already
        // unpacked and to lock the directory for unpacking.
        let mut ok = {
            let package_dir = format!("{}-{}", pkg.name(), pkg.version());
            let dst = self.src_path.join(&package_dir);
            dst.create_dir()?;

            // Attempt to open a read-only copy first to avoid an exclusive write
            // lock and also work with read-only filesystems. If the file has
            // any data, assume the source is already unpacked.
            if let Ok(ok) = dst.open_ro(PACKAGE_SOURCE_LOCK, self.config, &package_dir) {
                let meta = ok.file().metadata()?;
                if meta.len() > 0 {
                    let unpack_dir = ok.parent().to_path_buf();
                    return Ok(unpack_dir);
                }
            }

            dst.open_rw(PACKAGE_SOURCE_LOCK, self.config, &package_dir)?
        };
        let unpack_dir = ok.parent().to_path_buf();

        // If the file has any data, assume the source is already unpacked.
        let meta = ok.file().metadata()?;
        if meta.len() > 0 {
            return Ok(unpack_dir);
        }

        let gz = GzDecoder::new(tarball.file());
        let mut tar = Archive::new(gz);
        let prefix = unpack_dir.file_name().unwrap();
        let parent = unpack_dir.parent().unwrap();
        for entry in tar.entries()? {
            let mut entry = entry.chain_err(|| "failed to iterate over archive")?;
            let entry_path = entry
                .path()
                .chain_err(|| "failed to read entry path")?
                .into_owned();

            // We're going to unpack this tarball into the global source
            // directory, but we want to make sure that it doesn't accidentally
            // (or maliciously) overwrite source code from other crates. Cargo
            // itself should never generate a tarball that hits this error, and
            // crates.io should also block uploads with these sorts of tarballs,
            // but be extra sure by adding a check here as well.
            if !entry_path.starts_with(prefix) {
                failure::bail!(
                    "invalid tarball downloaded, contains \
                     a file at {:?} which isn't under {:?}",
                    entry_path,
                    prefix
                )
            }

            // Once that's verified, unpack the entry as usual.
            entry
                .unpack_in(parent)
                .chain_err(|| format!("failed to unpack entry at `{}`", entry_path.display()))?;
        }

        // Write to the lock file to indicate that unpacking was successful.
        write!(ok, "ok")?;

        Ok(unpack_dir)
    }

    fn do_update(&mut self) -> CargoResult<()> {
        self.ops.update_index()?;
        let path = self.ops.index_path();
        self.index =
            index::RegistryIndex::new(self.source_id, path, self.config, self.index_locked);
        self.updated = true;
        Ok(())
    }

    fn get_pkg(&mut self, package: PackageId, path: &FileLock) -> CargoResult<Package> {
        let path = self
            .unpack_package(package, path)
            .chain_err(|| internal(format!("failed to unpack package `{}`", package)))?;
        let mut src = PathSource::new(&path, self.source_id, self.config);
        src.update()?;
        let mut pkg = match src.download(package)? {
            MaybePackage::Ready(pkg) => pkg,
            MaybePackage::Download { .. } => unreachable!(),
        };

        // After we've loaded the package, configure its summary's `checksum`
        // field with the checksum we know for this `PackageId`.
        let summaries = self
            .index
            .summaries(package.name().as_str(), &mut *self.ops)?;
        let summary_with_cksum = summaries
            .iter()
            .map(|s| &s.0)
            .find(|s| s.package_id() == package)
            .expect("summary not found");
        if let Some(cksum) = summary_with_cksum.checksum() {
            pkg.manifest_mut()
                .summary_mut()
                .set_checksum(cksum.to_string());
        }

        Ok(pkg)
    }
}

impl<'cfg> Source for RegistrySource<'cfg> {
    fn query(&mut self, dep: &Dependency, f: &mut dyn FnMut(Summary)) -> CargoResult<()> {
        // If this is a precise dependency, then it came from a lock file and in
        // theory the registry is known to contain this version. If, however, we
        // come back with no summaries, then our registry may need to be
        // updated, so we fall back to performing a lazy update.
        if dep.source_id().precise().is_some() && !self.updated {
            debug!("attempting query without update");
            let mut called = false;
            self.index
                .query_inner(dep, &mut *self.ops, &self.yanked_whitelist, &mut |s| {
                    if dep.matches(&s) {
                        called = true;
                        f(s);
                    }
                })?;
            if called {
                return Ok(());
            } else {
                debug!("falling back to an update");
                self.do_update()?;
            }
        }

        self.index
            .query_inner(dep, &mut *self.ops, &self.yanked_whitelist, &mut |s| {
                if dep.matches(&s) {
                    f(s);
                }
            })
    }

    fn fuzzy_query(&mut self, dep: &Dependency, f: &mut dyn FnMut(Summary)) -> CargoResult<()> {
        self.index
            .query_inner(dep, &mut *self.ops, &self.yanked_whitelist, f)
    }

    fn supports_checksums(&self) -> bool {
        true
    }

    fn requires_precise(&self) -> bool {
        false
    }

    fn source_id(&self) -> SourceId {
        self.source_id
    }

    fn update(&mut self) -> CargoResult<()> {
        // If we have an imprecise version then we don't know what we're going
        // to look for, so we always attempt to perform an update here.
        //
        // If we have a precise version, then we'll update lazily during the
        // querying phase. Note that precise in this case is only
        // `Some("locked")` as other `Some` values indicate a `cargo update
        // --precise` request
        if self.source_id.precise() != Some("locked") {
            self.do_update()?;
        } else {
            debug!("skipping update due to locked registry");
        }
        Ok(())
    }

    fn download(&mut self, package: PackageId) -> CargoResult<MaybePackage> {
        let hash = self.index.hash(package, &mut *self.ops)?;
        match self.ops.download(package, &hash)? {
            MaybeLock::Ready(file) => self.get_pkg(package, &file).map(MaybePackage::Ready),
            MaybeLock::Download { url, descriptor } => {
                Ok(MaybePackage::Download { url, descriptor })
            }
        }
    }

    fn finish_download(&mut self, package: PackageId, data: Vec<u8>) -> CargoResult<Package> {
        let hash = self.index.hash(package, &mut *self.ops)?;
        let file = self.ops.finish_download(package, &hash, &data)?;
        self.get_pkg(package, &file)
    }

    fn fingerprint(&self, pkg: &Package) -> CargoResult<String> {
        Ok(pkg.package_id().version().to_string())
    }

    fn describe(&self) -> String {
        self.source_id.display_index()
    }

    fn add_to_yanked_whitelist(&mut self, pkgs: &[PackageId]) {
        self.yanked_whitelist.extend(pkgs);
    }

    fn is_yanked(&mut self, pkg: PackageId) -> CargoResult<bool> {
        if !self.updated {
            self.do_update()?;
        }
        self.index.is_yanked(pkg, &mut *self.ops)
    }
}
668 }