Centos – squid and caching of dnf/yum downloads

centos fedora squid

Sorry if this is a newbie question. I will describe the situation first; the squid question comes after.

Current Fedora/CentOS installations ship repo configuration files in /etc/yum.repos.d that contain a metalink which looks like this:

metalink=https://mirrors.fedoraproject.org/metalink?repo=fedora-$releasever&arch=$basearch

This metalink makes yum/dnf pick a "random" mirror: the metalink server returns a list of mirrors chosen by world region according to the client's location, in varying order.
If a download is slow, the same list is used to switch to the next-best mirror.

Our docker builds cause a lot of repeated downloads, which is why I am considering a squid proxy that all machines must use. But this "random" mirror strategy of yum/dnf worries me. I do understand that Fedora/CentOS intend to distribute the load across these free repositories, so I do not want to undermine that strategy.

Can squid somehow detect that the client is just using "another Fedora/CentOS repo URL" and cache the file only once? The metalink list itself seems to be pretty stable (the order changes between requests, but the set of mirrors seems to stay the same).

Intention: do not store 1000 copies of the same file just because each copy came from a different server.

How would I do that with squid?

EDIT: Does somebody have experience using this http://wiki.squid-cache.org/Features/StoreID for caching of dnf/yum?

Best Answer

Answering my own question: squid can handle this kind of problem with the storeid_file_rewrite helper. The only tricky part is getting a valid list of URLs that represent the same repository. It seems to work fine so far.

I added the following to squid.conf:

store_id_program /usr/lib64/squid/storeid_file_rewrite /etc/squid/fedora.db
store_id_access allow localnet
store_id_access deny all

Getting the content for fedora.db (I am caching Fedora 25 at this point in time) takes some trickery, because the URLs have to be extracted from the mirrorlist:

basearch="x86_64"
releasever=25
mirrorlist="https://mirrors.fedoraproject.org/metalink?repo=fedora-$releasever&arch=$basearch"
curl -s "$mirrorlist" >tmp.db

You need to convert each "url" entry in the "tmp.db" result into the format explained here: http://wiki.squid-cache.org/Features/StoreID/DB. This could probably be automated (any volunteers?).

Then you get something like this as "fedora.db", which is used in squid.conf above.

^http:\/\/ftp\.halifax\.rwth-aachen\.de\/fedora\/linux\/releases\/25\/Everything\/(x86_64\/[a-zA-Z0-9\-\_\.\/]+rpm)$    http://repo.mirrors.squid.internal/fedora/25/$1
^http:\/\/mirror2\.hs-esslingen\.de\/fedora\/linux\/releases\/25\/Everything\/(x86_64\/[a-zA-Z0-9\-\_\.\/]+rpm)$        http://repo.mirrors.squid.internal/fedora/25/$1
^http:\/\/fedora\.tu-chemnitz\.de\/pub\/linux\/fedora\/linux\/releases\/25\/Everything\/(x86_64\/[a-zA-Z0-9\-\_\.\/]+rpm)$      http://repo.mirrors.squid.internal/fedora/25/$1

... much more
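The conversion described above could be automated roughly as follows. This is only a sketch, not a tested production script: the function name metalink_to_storeid is made up for illustration, it assumes the metalink lists each mirror as a plain http:// URL ending in $basearch/os/repodata/repomd.xml, and it assumes the helper wants the pattern and the replacement separated by a TAB. https and rsync entries are skipped.

```shell
#!/bin/sh
# Sketch: read metalink XML on stdin, write StoreID DB rules on stdout.
# Assumption: mirror URLs end in "$basearch/os/repodata/repomd.xml".
basearch="x86_64"
releasever=25

metalink_to_storeid() {
    # 1. extract every plain-http repomd.xml URL, one per line
    # 2. cut off the arch-specific tail so only the mirror prefix remains
    # 3. escape '.' and '/' so the prefix is a literal regex
    grep -o 'http://[^<"]*/repodata/repomd\.xml' |
    sed -e "s|${basearch}/os/repodata/repomd\.xml\$||" \
        -e 's|[./]|\\&|g' |
    while IFS= read -r prefix; do
        # 4. append the capture group, a TAB, and the shared store ID
        printf '^%s(%s\\/[a-zA-Z0-9\\-\\_\\.\\/]+rpm)$\t%s\n' \
            "$prefix" "$basearch" \
            "http://repo.mirrors.squid.internal/fedora/${releasever}/\$1"
    done
}
```

With the variables from above, something like curl -s "$mirrorlist" | metalink_to_storeid > fedora.db would then produce the file in one step. Review the result by hand before pointing squid at it.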

EDIT: An alternative, which is more dangerous but may also be sufficient, is a more global pattern match like this:

\/fedora\/linux\/releases\/([0-9]+)\/Everything/x86_64\/(.*)$   http://repo.mirrors.squid.internal/fedora/releases/$1/$2
\/fedora\/linux\/updates\/([0-9]+)\/x86_64\/(.*)$       http://repo.mirrors.squid.internal/fedora/updates/$1/$2
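To get a feeling for what these global patterns do, they can be emulated with sed before touching squid. This is a rough sketch only: the real helper is a Perl script, store_id_for is a name invented here, and the package file name and the host mirror.example.invalid are made up. It shows that two different mirrors end up with the same store ID, and also why the approach is risky: the patterns are not anchored to a host, so any server serving a matching path is mapped to the same cache entry.

```shell
#!/bin/sh
# Sketch: emulate the two global rules with sed -E (not the real helper).
store_id_for() {
    printf '%s\n' "$1" | sed -E \
        -e 's#^.*/fedora/linux/releases/([0-9]+)/Everything/x86_64/(.*)$#http://repo.mirrors.squid.internal/fedora/releases/\1/\2#' \
        -e 's#^.*/fedora/linux/updates/([0-9]+)/x86_64/(.*)$#http://repo.mirrors.squid.internal/fedora/updates/\1/\2#'
}

# Hypothetical package path, used only for illustration.
pkg="os/Packages/b/bash-4.3.43-4.fc25.x86_64.rpm"

# Two real mirrors from the list above map to one store ID ...
store_id_for "http://ftp.halifax.rwth-aachen.de/fedora/linux/releases/25/Everything/x86_64/$pkg"
store_id_for "http://fedora.tu-chemnitz.de/pub/linux/fedora/linux/releases/25/Everything/x86_64/$pkg"

# ... but so does any other host that serves a matching path.
store_id_for "http://mirror.example.invalid/fedora/linux/releases/25/Everything/x86_64/$pkg"
```

All three calls should print the same URL under repo.mirrors.squid.internal, which is the point of the shared store ID and also the reason the per-mirror rules are the safer choice.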
