Saturday 10 August 2013

Squid Tutorial

I wrote this tutorial back in 2004 for a high street Linux computing magazine.

Squid is a stable, widely used and highly flexible web proxy cache. Its main purpose is to provide Internet browsing for machines that are not directly connected to the web. This can be usefully subdivided into two main elements. Firstly, the Squid server accepts requests for web pages from clients and retrieves them on their behalf. Therefore, it acts as a proxy. Secondly, the Squid server stores retrieved pages locally so that after the first request for a web page has been fulfilled, subsequent requests can be met without having to go onto the Internet at all. This is the caching function.

The National Laboratory for Applied Network Research, based in the United States, looks after Squid, with a lot of help from unpaid volunteers across the globe. The lead programmer is a guy called Duane Wessels, and he has done an excellent job of ensuring the efforts of a large team are suitably coordinated. Squid is released under the GNU GPL and is one of the great Open Source apps, although not as high profile as Apache, for example, or Samba. Nevertheless, it deserves just as much respect and is an essential component of many a network, large and small.

Proxy vs. NAT

Assuming that you do not have an Internet IP address for each of your desktops and servers, getting everybody onto the web will involve some trickery. The most popular solutions are either some kind of proxy or alternatively a system of Network Address Translation (NAT). In the Linux world, NAT is often referred to as masquerading. Which you will go for depends on your policies on certain configuration and security issues. It also depends, of course, on the type of network you have.

The twin headaches of security and performance tend to scale up so that organizations with lots of users, perhaps on a variety of platforms, are more in need of a web proxy cache like Squid than smaller outfits who may find it easier to get away without one. Of course, at the top of the market there are solutions available which can be integrated into an existing NAT-based network to address security concerns. These tend to be commercial products, and deep pockets are usually required. For most scenarios, Squid is a perfectly good method of allowing Internet access in an inexpensive, reasonably safe and generally manageable way.

A common complaint with the use of proxies is that you need to add them into the browser settings as well as configure any other web apps (RealPlayer, GAIM, etc.) that might be required. It can be frustrating when some web-based programs either don't support HTTP proxies, or, when they do, only support them without authentication. However, the situation is improving and the drawbacks of having to use a proxy, as opposed to getting the appearance of a direct connection, are not really that severe.

Installing Squid

Before you start, there's a minimal amount of preparation to do on your system. You will need GCC and Perl installed, or you won't get very far. In all likelihood, however, both these pieces of software will already be in place. It is also a good idea, although not absolutely essential, to have Apache up and running as well. Again, this program is probably included with, or at least easily available for, your platform. Better still, it can be readily compiled from source. You can verify that you have everything with the following commands:

# type gcc
gcc is /usr/bin/gcc
# type perl
perl is /usr/bin/perl
# type httpd
httpd is /usr/local/apache/bin/httpd

As always, the first thing to do when installing new software is to get a copy of the latest stable release. Squid lives at www.squid-cache.org, where the source code and a lot of useful information can be obtained. The present version 2 strain has been live since 1998 and had reached 2.5 at the time of writing, with the next edition, 2.6, scheduled for release in January 2003. Development tends to be slow and steady, so even though 3.0 is being worked on, it may not arrive for some time.

RPMs are widely available, with many distros including Squid in their server builds by default. Whilst these are generally fine, as with any RPM they can lag behind the latest source code release, put files where you don't want them and not always have the features you want compiled in. For these reasons we are going to install from source. However, if you prefer to use a binary it doesn't matter, as the rest of the article will still be relevant, even if file locations are slightly different.

So, once we have obtained the source code from the Squid website or one of its mirrors, and ensured it is in a suitable location, it can be uncompressed and extracted like this:

# tar xvzf squid-2.5.STABLE1.tar.gz

The next step is to run the configure script. This prepares the installation and controls which features of Squid are included or excluded. There are many possible options to this script, but for the moment we shall ignore the majority of them and go for a default setup. The only thing explicitly specified here is the location of our installation. All Squid files and directories will be placed in a structure beneath this point.

# cd squid-2.5.STABLE1
# ./configure --prefix=/usr/local/squid

Some of the other possible configure options will be covered below. At the moment though, we're ready to finish the installation and this is done by typing:

# make all
# make install

If everything has worked successfully, the main Squid binary, helpfully called "squid", will be placed in "/usr/local/squid/sbin". To make things nice and simple, this directory should be added to your PATH environment variable in the file "~/.bash_profile".
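
For example, assuming the default Bash shell, a line like this at the end of "~/.bash_profile" will do the trick:

export PATH=$PATH:/usr/local/squid/sbin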

Configuration

Now it's time to get Squid working the way we want it. This is done by editing the squid.conf file which, in our case, is found in the "/usr/local/squid/etc" directory. It is rather large, but very well laid out, and follows a similar syntax to Apache. You shouldn't have many problems doing what you need to do by pointing your text editor of choice at "squid.conf". However, for those who prefer something a bit more graphical, the Squid module that comes as part of the standard Webmin installation is really very good.

For the most part, the system defaults will be fine to be getting on with. Even at this stage, though, we can make a few well-chosen modifications at the correct points within squid.conf:

http_port 8080

This is the port that your users' browsers will connect to. The default is 3128, but 8080 is also frequently used. We could also specify a hostname or IP address here, which would be useful on servers with more than one network card.
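
For instance, on a server with a hypothetical internal address of 192.168.1.1, the proxy could be bound to that interface alone:

http_port 192.168.1.1:8080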

visible_hostname brasseye.mydomain.com

I have found it necessary to add the fully qualified domain name of my Squid server here. Although the program should be able to work this out for itself, it can fail to do so and this will cause Squid to abort at startup. In my view, it's best to be on the safe side and add the entry in.

cache_dir ufs /usr/local/squid/var/cache 500 16 256

The cache directory is the home of the web pages that Squid stores, and as such it is very important that it is configured appropriately for your system. The first of the three numbers is the cache size in megabytes, and should ideally be as big as possible. The default is 100MB, but it's not unusual to see this changed to several gigabytes or more. If you add multiple "cache_dir" entries, Squid will use all of them, spreading the load across multiple disks.

The directory itself, on a server that's likely to be heavily used, should really be a file-system in its own right, if not actually on a separate disk with its own controller. The final two numbers set the number of first and second level subdirectories of the cache. With a very large cache it might be a good idea to increase these too.
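
As a sketch, assuming two dedicated file-systems mounted at the hypothetical points "/cache1" and "/cache2", each on its own disk, the entries might read:

cache_dir ufs /cache1 2000 16 256
cache_dir ufs /cache2 2000 16 256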

Well, with those changes saved away, we should have a workable implementation of Squid more or less ready to go. Before we kick the program off, though, there are a couple of final steps. Permissions can cause a problem because, by default, Squid's effective user when executed by root is nobody (as in the system account of that name). That account will need write permissions to the log and cache directories before Squid will start properly. This can quickly be achieved by doing the following:

# cd /usr/local/squid/var
# mkdir -m 777 cache logs
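
If world-writable directories seem a bit too generous, a slightly tighter alternative (assuming the effective user on your system really is nobody, and that a group of the same name exists) would be:

# mkdir cache logs
# chown nobody:nobody cache logs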

Almost there now, but before we can start using Squid the cache needs to be initialised:

# squid -z

That may take a few minutes, but once it has finished we can finally run Squid, and this is done by simply executing the program without any options:

# squid

By doing this, Squid automatically starts in daemon mode, so you should get your prompt right back. Then it's time to point a browser at your new web proxy cache and start surfing. If you have any problems, the logfile cache.log should be the first place you look; it will usually give you human-readable error messages that you can use for debugging. The other main logfiles are "access.log", which keeps a record of each HTTP page request received, and "store.log", which tracks each item retained in the cache.
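
While testing, it can be handy to watch requests arriving in real time; with our chosen prefix, the logs live under "/usr/local/squid/var/logs":

# tail -f /usr/local/squid/var/logs/access.log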

Authentication

Before we get too pleased with ourselves, it's worth remembering that the Squid server we've just set up has absolutely no security features whatsoever. Anyone who's connected to our LAN/WAN can use it, as long as they know the hostname and port number of our server. There is logging taking place, but requests can only be traced to the client PC, not to the user doing the surfing. No URL filtering is taking place either, so these users could be visiting any number of disreputable sites, and we can neither stop them doing it nor reliably track them down after they've done it.

In fact, since what we've done is little better than give every machine on our network a potential unmonitored connection, it would be a really good idea to turn it off again ASAP:

# squid -k shutdown

For peace of mind we need to get some sort of authentication process in place. This basically means restricting use of the proxy to properly validated individuals. There are quite a few different methods for doing this; in fact, Squid version 2.5 has completely overhauled the way it handles authentication, offering much improved functionality. You could go for LDAP, for instance, or maybe use Linux's own /etc/passwd file. Squid has now been integrated with Samba's excellent Winbind daemon, so if you're running a Windows network, users can be recognised from their domain accounts without even needing to enter a username or password.

The method we're going to use here, though, is the traditional NCSA form of authentication. Basically, this involves creating and maintaining a simple encrypted password file for Squid's exclusive use. To get this working we need first to install it from the source code:

# cd squid-2.5.STABLE1/helpers/basic_auth/NCSA
# make
# make install

In fact, we could have avoided this step if, when running the Squid configure script, we'd specified that we wanted NCSA, like so:

# ./configure --prefix=/usr/local/squid --enable-basic-auth-helpers=NCSA

After this is done, the squid.conf needs to be amended to reflect our new aims. We have to tell it about the authentication parameters we want to set:

auth_param basic program /usr/local/squid/libexec/ncsa_auth /usr/local/squid/etc/passwd
auth_param basic realm Our brand new web proxy cache

The first line is the important one, as it tells Squid which authentication program to use and where the corresponding password file is kept. The second line is more or less a comment field, and it will appear on the clients' login box whenever they need to start using the proxy. The only other thing that we need to do in squid.conf is add the relevant access control list (acl) and set up a corresponding http access rule:

acl password proxy_auth REQUIRED
http_access allow password
http_access deny all

The purpose of those entries is to ensure that anyone who correctly provides a username and password is allowed to use the HTTP proxy; anyone who doesn't is denied access. To create the password file and put some users into it we need to use the htpasswd command. This comes as part of Apache, but will work on its own if you really don't want to permanently install a webserver alongside Squid. The password file is created when you add the first user, which should be done like so:

# htpasswd -c /usr/local/squid/etc/passwd chris

You will be prompted to enter and then confirm the desired password. Subsequently, users can be added and have their passwords modified using the same command but without the "-c" option. To delete a user you will have to manually remove the relevant line from the password file. If that sounds forbiddingly unfriendly, you could just use Webmin, as Squid users can be perfectly well managed from there. To make things even easier, there's a CGI script available from www.squid-cache.org which enables users to change their own passwords from a web page. It's called chpasswd.cgi and is well worth a look.
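
Incidentally, whenever squid.conf has been amended, the running daemon can be told to re-read it without a full restart:

# squid -k reconfigure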

URL Filtering

Something else you might want to do, especially if you're operating in a business environment, is to disallow access to certain sites. This is fairly easy to configure using an acl and then an http_access statement:

acl bansites url_regex "/usr/local/squid/etc/bannedsites.txt"
http_access deny bansites
http_access allow password
http_access deny all

The file specified in the acl is a simple text file listing all the URLs that are out of bounds. It should be noted that banning is performed by regular expression pattern matching, so you don't have to put complete URLs on the list, but it is a good idea not to be too general. Banning the word "sex", for instance, may seem like a sensible move, but that would also take out www.sussex.com and www.davidessex.co.uk, to name but two. The order of the http_access statements is important, as they are applied in sequence until a match is found, so something cannot be rejected if a previous rule has already approved it.
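
To illustrate, a hypothetical "bannedsites.txt" might contain entries like these, each treated as a regular expression and matched against the requested URL:

unwantedsite.example.com
gambling
casino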

If we want to, we can get a bit more sophisticated with our filtering. It is possible to only ban sites at certain times, say during normal business hours, and allow the traffic to go through on evenings and weekends. Here's a basic example of how this could be achieved:

acl restrictedsites url_regex "/usr/local/squid/etc/restrictedsites.txt"
acl restrictedtimes time MTWHF 09:00-17:00
http_access deny restrictedsites restrictedtimes
http_access allow password
http_access deny all

It is possible to combine the entries for restricted and banned sites and implement both policies simultaneously (see the sketch below), and indeed any others along similar lines, if so desired. Of course, you may not wish to trawl the Internet building up your own list of dodgy webpages. Pre-prepared files are obtainable from various sources, such as www.squidblock.com or www.squidguard.org/blacklist. Even these, though, are by no means exhaustive, and with the speed at which domains come and go on the web it's unlikely that they ever will be.
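
As a sketch, implementing the outright bans and the time-based restrictions together would look something like this:

acl bansites url_regex "/usr/local/squid/etc/bannedsites.txt"
acl restrictedsites url_regex "/usr/local/squid/etc/restrictedsites.txt"
acl restrictedtimes time MTWHF 09:00-17:00
http_access deny bansites
http_access deny restrictedsites restrictedtimes
http_access allow password
http_access deny all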

More Things to do with Squid

One possible further modification that might be required is the activation of SNMP monitoring for your proxy. This could be very useful in conjunction with a decent monitoring tool such as Cricket or MRTG. It would also be a neat way of sidestepping Squid's own native analysis program, the distinctly uninspiring Cache Manager. SNMP support needs to be compiled in, so to get it you should return to the source directory and:

# ./configure --prefix=/usr/local/squid --enable-snmp
# make clean
# make all
# make install

Access to the SNMP port is denied by default, so to get it working you'll need to add this in squid.conf:

snmp_access allow all

If "allow all" seems a bit on the open side, access rights can be refined by the use of acls. As a minimal sketch, assuming a single hypothetical monitoring host at 192.168.1.5:
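
acl snmpmanager src 192.168.1.5/255.255.255.255
snmp_access allow snmpmanager
snmp_access deny all

Another situation that could well come up, indeed almost certainly will, is that the log files grow to a large size. These can be rotated using an option of the squid command, which is best scheduled under cron: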

0 0 * * 6 /usr/local/squid/sbin/squid -k rotate

Instead of acting as an HTTP web proxy cache, Squid can also be set up as an HTTP accelerator. The purpose of this is to act on behalf of a webserver rather than for clients' browsing. It is most useful where a webserver sits behind some kind of network bottleneck. The Squid server running as an accelerator is placed beyond the weak point, and it satisfies incoming HTTP requests itself. All the real webserver has to do is be accessible to the accelerator.
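
Loosely sketched, and assuming a hypothetical backend webserver at realserver.mydomain.com listening on port 80, the relevant 2.5-era squid.conf directives would be along these lines (the exact set is worth checking against the documentation):

http_port 80
httpd_accel_host realserver.mydomain.com
httpd_accel_port 80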

This tutorial has covered the basics, but Squid is an application with many different possible uses and a mind-boggling amount of potential configurations. It could almost certainly provide some real benefits to a network of any size that is connected to the Internet.
