Best practice for proxying package repositories

Tags: package-management, proxy, repository, squid, varnish

I have a collection of CentOS servers in my corporate network. For security reasons, most servers do not have general outbound internet access unless it is a core functional requirement for the server.

This creates a challenge when I need to update packages. For yum repositories, I currently mirror all needed repos from the internet, and make the mirrors available inside the intranet. I keep copies of each repo in each of our five environments: dev, QA, staging, and two production datacenters.
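For context, a full mirror like this is commonly maintained with `reposync` and `createrepo` from yum-utils. A sketch, where the repo id `base` and the web root `/var/www/mirrors` are placeholders for your own values:

```
# Sync one repo into the mirror tree (repo id "base" is an assumption;
# check `yum repolist` for the real ids in your environment)
reposync --repoid=base --download_path=/var/www/mirrors

# Regenerate the repo metadata so clients can consume the mirror directly
createrepo /var/www/mirrors/base
```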

I don't currently have a solution for language-specific package repos. When servers need an update from rubygems, PyPI, PECL, CPAN, or npm, they have to acquire temporary outbound internet access to fetch the packages. I've been asked to start mirroring rubygems and PyPI, and the rest will probably follow.

All of this is clunky and doesn't work well. I'd like to replace it with a single caching proxy in one environment and four daisy-chained proxies in my other environments, to eliminate the complexity and disk overhead of full mirrors. Additionally:

  • It can be either a forward or reverse proxy; each package manager supports a proxy server or a custom repository endpoint, which could be either a local mirror or a reverse proxy.
  • It needs granular access control, so I can limit which client IPs can connect to which repo domains.
  • Clients need to be able to follow redirects to unknown domains. The original request might be limited to rubygems.org, but if that server returns a 302 to an arbitrary CDN, the client should still be able to follow it.
  • It should support HTTPS backends. I don't necessarily need to impersonate other SSL servers, but I should be able to re-expose an HTTPS site over HTTP, or terminate and re-encrypt with a different certificate.
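For the access-control requirement, Squid's ACLs can express "these client IPs may reach these repo domains" directly. A minimal sketch, where the subnet ranges and domain lists are hypothetical placeholders for your own network:

```
# Client subnets (hypothetical -- substitute your own ranges)
acl dev_hosts   src 10.1.0.0/16
acl prod_hosts  src 10.2.0.0/16

# Repository domains, grouped per ecosystem
acl yum_repos   dstdomain .centos.org .fedoraproject.org
acl gem_repos   dstdomain .rubygems.org

# dev can reach everything; prod only the yum repos
http_access allow dev_hosts yum_repos
http_access allow dev_hosts gem_repos
http_access allow prod_hosts yum_repos
http_access deny all
```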

I was initially looking at reverse proxies, and Varnish seems to be the only one that would allow me to internally resolve 302 redirects within the proxy. However, the free version of Varnish does not support HTTPS backends. I'm now evaluating Squid as a forward proxy option.
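If Squid works out, the daisy-chaining part is handled by `cache_peer`: a downstream proxy in one environment forwards all cache misses to the upstream proxy rather than going direct. A sketch, with a hypothetical upstream hostname:

```
# Forward all misses to the upstream proxy (hostname is hypothetical)
cache_peer upstream-squid.example.com parent 3128 0 no-query default

# Never contact origin servers directly from this tier
never_direct allow all
```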

This seems like something that ought to be a relatively common problem within enterprise networks, but I'm having trouble finding examples of how other people have solved this. Has anyone implemented something similar or have thoughts on how best to do so?

Thanks!

Best Answer

We use Squid for this. The nice thing about Squid is that you can easily set individual expiry times for objects based on a pattern match, which allows the metadata from the yum repo to be purged fairly quickly while packages are cached for a long time. The config we use to implement this:

# Fields: regex  min(minutes)  percent  max(minutes)
# Repo metadata: expire quickly (2880 min = 48 h)
refresh_pattern (Release|Packages(\.gz)?)$   0        20%   2880
refresh_pattern \.xml(\.gz)?$                0        20%   2880
refresh_pattern \.sqlite\.bz2$               0        20%   2880
# Packages are immutable: cache for a long time (1296000 min = 900 days)
refresh_pattern (\.deb|\.udeb)$              1296000  100%  1296000
refresh_pattern (\.rpm|\.srpm)$              1296000  100%  1296000
# Default for everything else
refresh_pattern .                            0        20%   4320

http://www.squid-cache.org/Doc/config/refresh_pattern/
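On the client side, each package manager then just needs to be pointed at the proxy. Hostname and port below are hypothetical:

```
# /etc/yum.conf
proxy=http://squid.example.com:3128

# pip, per invocation
pip install --proxy http://squid.example.com:3128 somepackage

# rubygems, per invocation
gem install --http-proxy http://squid.example.com:3128 somegem
```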