About Hololog

Hololog: Holistic Web server logfile analysis.

Hololog is a holistic logfile analysis tool. That is, it is designed to give you an overview of who is using your Web site, and also to let you drill down to see individual browser sessions. You can learn how people are reaching your Web site, which pages they looked at, which pages are most popular, which account for the most bandwidth, and more.

You can use Hololog to improve your Web site. If people are searching for something and finding one of your Web pages, consider adding links on that page to something that might help them more, or even adding more content in that area. If people are reaching the right index page but not finding the content, make the links to the content more prominent. If people look at some of your pages, go back to a search engine, and then return to a different page, consider adding a direct link between the pages.

In other words, by looking at people's behaviour you can make a Web site that's easier to use and more effective.

Hololog was originally written by Liam Quin over a period of several years. Several other people have contributed ideas, including Dr Ian Graham, Laurie Harper, Mark Loeser, and others.

The Summary Page

Use the summary page to see who has been getting the most use out of your Web site. You will quickly get a feel for whether most people look at only one or two pages and then wander off.

There is a row for each Internet address from which your site was visited. This means that a large corporation with a firewall will probably only get a single entry. The columns shown in each row are as follows.

@
This is a link to the corresponding Web site, determined heuristically and sometimes very simplistically.
Internet Host
This is the Internet address of the host, or computer, that accessed your Web site. If you chose the option Show simplified domains you may see a * here to show where names were joined together.
N
This is the total number of hits, including not just HTML documents but also images, thumbnails, icons and any other files transmitted. The number itself is a link that will display the complete list of files.
Description
This is a short description of the given host or domain. It tries to identify the country of origin, whether the host is commercial or non-profit, and sometimes some other details. You can also override the descriptions using a small database included in the Hololog distribution; see the mkdomaindb.pl script to update it. There is not yet a user interface for updating the descriptions.
First Visit
This is the date of the first visit registered in the HTTP server logfile the script is using.
Last Visit
This is the most recent visit from that site that was found in the HTTP server logfile the script is using.

At the end of the summary is a total including internal hits - that is, files fetched from computers in the same Internet domain as the Web server. There is also a form so you can change the options.

Summary Options

The Summary view is generated by the status.cgi CGI script; it understands the following options:

Show simplified domains

This option joins together all hosts from each domain, so that, for example, dirk.holoweb.net and mail.holoweb.net both appear in a single entry as *.holoweb.net and their hits are counted together.

Set join-similar=yes in the URL to enable this.
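
As an illustration of what joining hosts means, here is a minimal Python sketch. It is not Hololog's own code: the function names simplify_host and join_similar are hypothetical, and the rule of keeping the last two labels of the name is an assumption (it would give poor results for domains such as example.co.uk).

```python
from collections import Counter


def simplify_host(host: str) -> str:
    """Collapse a host name to a wildcard entry for its domain.

    Assumption: the domain is the last two labels of the name.
    """
    labels = host.lower().split(".")
    # Leave bare domains and numeric (dotted-quad) addresses unchanged.
    if len(labels) < 3 or all(label.isdigit() for label in labels):
        return host.lower()
    return "*." + ".".join(labels[-2:])


def join_similar(hits):
    """Add up hit counts per simplified host.

    `hits` is an iterable of (host, hit_count) pairs.
    """
    totals = Counter()
    for host, count in hits:
        totals[simplify_host(host)] += count
    return totals
```

With this sketch, join_similar([("dirk.holoweb.net", 3), ("mail.holoweb.net", 2)]) gives a single entry for *.holoweb.net with a count of 5.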

Starting date

This is the single day to include in the summary.

Set start-day, start-month and start-year in the URI, e.g. start-day=22;start-month=Nov;start-year=2002;

If showall is set, these dates are ignored.
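
If you want to generate such links from another program, the following Python sketch builds a summary URI from the three date parameters. The helper name summary_url is hypothetical, and it assumes that status.cgi lives directly under the base URL you pass in.

```python
def summary_url(script_base: str, day: int, month: str, year: int) -> str:
    """Build a status.cgi URI for a single day.

    Parameters are joined with ';' as in the example above.
    Assumption: status.cgi is found directly under script_base.
    """
    params = [
        ("start-day", day),
        ("start-month", month),  # three-letter month name, e.g. "Nov"
        ("start-year", year),
    ]
    query = ";".join(f"{name}={value}" for name, value in params)
    return f"{script_base}status.cgi?{query}"
```

For example, summary_url("http://127.0.0.1/~liam/hololog/", 22, "Nov", 2002) returns http://127.0.0.1/~liam/hololog/status.cgi?start-day=22;start-month=Nov;start-year=2002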

Show entire log (showall)
Set this to yes to see a summary for the entire logfile, not just a single day.

Details by Session

This is the most useful of all the reports. The CGI script nph-sortbyip.cgi produces this report, and can be configured extensively.

Each entry starts with the host name of a computer that contacted your Web server. There's then a count of the number of files downloaded and the total size in bytes, Kilobytes, Megabytes or Gigabytes.
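
The choice of unit for the total size can be pictured with a short Python sketch. This is only an illustration: the function name format_size is hypothetical, and the factor of 1024 between units and the one-decimal rounding are assumptions, not documented Hololog behaviour.

```python
def format_size(n_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps it readable.

    Assumption: each unit is 1024 times the previous one.
    """
    units = ["bytes", "Kilobytes", "Megabytes", "Gigabytes"]
    size = float(n_bytes)
    for unit in units:
        # Stop at the first unit that fits, or at the largest unit.
        if size < 1024 or unit == units[-1]:
            break
        size /= 1024
    if unit == "bytes":
        return f"{n_bytes} bytes"
    return f"{size:.1f} {unit}"
```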

If thumbnails of images were downloaded they are listed separately and not given separate entries later on. This reduces clutter in the summary.

After the heading line you'll see details of each file fetched from your server. Each file is (in Web terminology) a representation of a resource, but the partial URI logged (e.g. /~liam/) is whatever the Web server wrote in the log file.

The hits are numbered as they are read from the log file. Since some Web servers (including Apache) run multiple threads writing to the same log file, the entries are not always in the order you might expect, so the CGI script sorts them by date.
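
The numbering and sorting described above can be sketched in Python as follows. This is not the CGI script's own code; the function names are hypothetical, and it assumes the log uses Apache-style timestamps such as [22/Nov/2005:10:00:01 +0000].

```python
import re
from datetime import datetime

# Assumption: each line carries a timestamp in square brackets.
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def entry_time(line: str) -> datetime:
    """Extract the timestamp of one log line."""
    match = TIMESTAMP.search(line)
    if match is None:
        raise ValueError(f"no timestamp in log line: {line!r}")
    # %b expects English month names in the default (C) locale.
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def number_and_sort(lines):
    """Number lines in file order, then sort the pairs by date."""
    numbered = list(enumerate(lines, start=1))
    return sorted(numbered, key=lambda pair: entry_time(pair[1]))
```

Because the numbers are assigned before sorting, a sorted report can show hit 2 ahead of hit 1 when the server wrote the two lines out of order.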

After the number is the URI that was fetched, after removing the http:// and the name of your Web server (simply to save clutter). This URI may be coloured differently depending on whether or not you've visited that page in this browser before (vlink configuration option).

After the URI comes a date, an HTTP status code such as 404 for file not found, and a byte count. Note that when the file is not found the byte count is the size of your error page.

Some entries are in a slightly different format, showing only Total fetch count in this period and not individual entries. These summary entries are intended for spiders such as Inktomi or Google that fetch every Web page on your site that they can find in order to add them to their indexes.

Session View Options

Note: The nph-sortbyip.cgi script is very configurable. In most cases the defaults should work, but you will need to change the file config.xml as per instructions in that file.

The following options are accepted as CGI parameters:

startdate

Set startdate to the first day you want to include in the report. For example, startdate=22/Nov/2005 would start the report with the first line in the log file matching that date. If there are no matching lines you won't get a report, even if the log file contains entries after that date.

enddate

If you set enddate=23/Nov/2005 then the report will stop at the first line that matches this date. The default is for it to be unset, so that the entire log file is processed starting with the first match of startdate.
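
The way startdate and enddate select lines can be sketched in Python. This is an illustration under assumptions, not the script's implementation: the function name select_lines is hypothetical, dates are matched as text against bracketed timestamps such as [22/Nov/2005:10:00:01 +0000], and the line that matches enddate is assumed to be left out of the report.

```python
def select_lines(lines, startdate, enddate=None):
    """Return the log lines that fall inside the report window.

    Nothing is selected until a line matches startdate. Selection stops
    at the first later line that matches enddate (assumed excluded).
    """
    selected = []
    started = False
    for line in lines:
        if not started:
            if f"[{startdate}:" not in line:
                continue  # still before the first matching line
            started = True
        elif enddate is not None and f"[{enddate}:" in line:
            break
        selected.append(line)
    return selected
```

Note that in this sketch a startdate that matches no line yields an empty result, which mirrors the behaviour described for startdate.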

The Configuration File

The file config.xml must be in the same directory as the CGI scripts. If you are running Mandrake Linux you may find that the default configuration works for you without changes.

The file is in XML, and must be well-formed. If you have the libxml package installed then you can use xmllint config.xml | wc to check that the file is OK. Do this whenever you edit the file by hand.
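
If xmllint is not available, a well-formedness check can also be sketched with Python's standard library. The function name is_well_formed is hypothetical; the check only confirms that the text parses as XML and says nothing about whether the element names are the ones Hololog expects.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

For example, is_well_formed("<config><sitename>Holoweb</config>") returns False because the sitename element is never closed.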

This version of Hololog does not include a GUI for editing the config - sorry. It's high on the todo list. For now, use your favourite text editor and check the file afterwards.

Here is a sample:

<?xml version="1.0"?>
<config>
<!-- site name for the title -->
<sitename>Holoweb</sitename>
<!-- site domain, omitting any leading www. Referring URLs from anywhere under this domain are treated as local. -->
<sitedomain>holoweb.net</sitedomain>
<!-- log files -->
<logfile>/home/lee/public_html/httpd-access.log</logfile>
<errorlog>/var/log/httpd/error_log</errorlog>
<scriptbase>http://127.0.0.1/~liam/hololog/</scriptbase>
<body>
<!-- colours -->
<text>#663366</text>
<bgcolor>#FFFFFF</bgcolor>
<link>#3333ff</link>
<vlink>#993333</vlink>
<alink>#FF0000</alink>
<countbar>#F0CF8D</countbar>
</body>
<!-- next line is maintained by CVS or RCS -->
<id>$Id$</id>
</config>
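
To show how a program might read these settings, here is a minimal Python sketch. It is not how the Hololog scripts load the file: load_settings and is_local_referrer are hypothetical names, the "body." prefix for colour keys is an invention of this example, and the embedded SAMPLE is a shortened copy of the file above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Shortened copy of the sample configuration, for illustration only.
SAMPLE = """<config>
  <sitename>Holoweb</sitename>
  <sitedomain>holoweb.net</sitedomain>
  <logfile>/home/lee/public_html/httpd-access.log</logfile>
  <body>
    <text>#663366</text>
    <vlink>#993333</vlink>
  </body>
</config>"""


def load_settings(xml_text: str) -> dict:
    """Flatten the config into a dict; colours get a 'body.' prefix."""
    settings = {}
    for child in ET.fromstring(xml_text):
        if child.tag == "body":
            for colour in child:
                settings["body." + colour.tag] = (colour.text or "").strip()
        else:
            settings[child.tag] = (child.text or "").strip()
    return settings


def is_local_referrer(referrer: str, sitedomain: str) -> bool:
    """True if the referring URL is on sitedomain or a host under it."""
    host = (urlsplit(referrer).hostname or "").lower()
    return host == sitedomain or host.endswith("." + sitedomain)
```

Calling load_settings(SAMPLE)["sitename"] returns Holoweb, and is_local_referrer("http://www.holoweb.net/index.html", "holoweb.net") returns True.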