Monday, May 26, 2008

s3 logs with webalizer

I recently came across s3stat.com, a simple service that takes your s3 logs and runs them through webalizer to produce nice graphs and stats. The service costs $2/month. I thought to myself that this could surely be done for free.

An hour or so later, I think I have something pretty comparable in features to s3stat. I am in no way trying to put them out of business: they maintain their website and are improving s3stat with more features. I simply wanted a way to view my s3 stats without paying for it.

This script requires that logging is enabled on the s3 bucket; enable it before running the script, or you will just get an error that logging is not turned on. The AWS::S3 rubygem is also required. See the options hash for the required parameters.
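
If logging isn't turned on yet, the same gem can do that for you. Here is a minimal sketch of how I'd enable it; the bucket name, target bucket and 'log/' prefix are just placeholders, and the enable_logging_for option names are based on my reading of the gem, so double-check them against your version.

require 'rubygems'
require 'aws/s3'

AWS::S3::Base.establish_connection!(
  :access_key_id     => 'your-access-key',
  :secret_access_key => 'your-secret-key'
)

#turn on server access logging, writing logs back into the same bucket
#under the 'log/' prefix (both values are placeholders)
bucket = 'your-bucket-name'
unless AWS::S3::Bucket.logging_enabled_for?(bucket)
  AWS::S3::Bucket.enable_logging_for(bucket,
    :target_bucket => bucket,
    :target_prefix => 'log/')
end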


#!/usr/bin/env ruby
require 'rubygems'
require 'aws/s3'
require 'getoptlong'
require 'tempfile'
require 'date'

#workaround for the aws-s3 gem, which expects Date::ABBR_MONTHS to be defined
Date::ABBR_MONTHS = Date::Format::ABBR_MONTHS

#default arguments
options = {
  :access_key  => '',          #the Amazon access key
  :secret_key  => '',          #the Amazon secret key
  :bucket_name => '',          #bucket name to pull logs from
  :folder_name => 'webalizer'  #folder name for local webalizer output
  #:clear_webalizer_folder => true #delete local webalizer data first (add a comma above if enabling)
}

#establish a connection to s3
AWS::S3::Base.establish_connection!(
  :access_key_id     => options[:access_key],
  :secret_access_key => options[:secret_key]
)

#check that logging is enabled for the bucket
puts "Checking for logging for bucket #{options[:bucket_name]}"
if AWS::S3::Bucket.logging_enabled_for?(options[:bucket_name])
  new_log = File.new('bucket_log.log', 'w+')
  log_status = AWS::S3::Bucket.logging_status_for(options[:bucket_name])
  puts "Processing log files"
  AWS::S3::Bucket.logs(options[:bucket_name]).each do |log|
    #convert each line of the amazon s3 log to CLF (Common Log Format)
    log.lines.each do |line|
      new_log << "#{line.remote_ip} - - [#{line.time.strftime("%d/%B/%Y:%H:%M:%S %z")}] \"#{line.request_uri}\" #{line.http_status || '-'} #{line.bytes_sent || '-'} \"#{line.referrer}\" \"#{line.user_agent}\"\n"
    end
  end
  new_log.close
  #make sure the webalizer folder exists, optionally clearing out old output first
  if options[:clear_webalizer_folder] && File.exists?(options[:folder_name])
    Dir["#{options[:folder_name]}/*"].each { |f| puts f; File.delete(f) }
    Dir.delete(options[:folder_name])
  end
  Dir.mkdir(options[:folder_name]) unless File.exists?(options[:folder_name])
  #run webalizer on the current log file
  webalizer_output = `webalizer -o #{options[:folder_name]}/ -D dns.db -N 5 -F clf bucket_log.log`
  puts "output from webalizer:"
  puts webalizer_output
  #upload the generated webalizer files to the logging target bucket
  puts "updating webalizer to s3 bucket #{log_status.target_bucket}"
  Dir["#{options[:folder_name]}/*"].each do |filename|
    puts "uploading file #{filename}"
    AWS::S3::S3Object.store(filename, open(filename), log_status.target_bucket, :access => :public_read)
  end
end



If you have any improvements, please let me know in the comments.

4 comments:

Jason Kester said...

Nice article. I've linked to it from the Resources page over at S3stat:

http://www.s3stat.com/web-stats/S3-resources.ashx

I wouldn't worry too much about putting us out of business. Yours is the 3rd step-by-step tutorial on how to do this yourself, and the first was actually written by me before I launched S3stat as a service.

I hope that people find your code useful!

Jason Kester said...

Oh, hey! Scanning thru your code, it looks like you've missed an important step.

Server Access Logs are not guaranteed to be contiguous. That is, the last entry from one file is not guaranteed to be dated earlier than the first entry from the next file, and vice versa. That means you can't simply stick them together, but rather must sort them on the fly to make sure they end up in the right order.

I imagine they'll probably still parse through Webalizer, but it will end up dropping off some data and slightly under-report usage.

Jason
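
For anyone wanting to apply Jason's suggestion, here is a minimal, untested sketch of the idea: collect the parsed lines from every log first, sort them by timestamp, and only then write them out (variable names match the script above).

all_lines = []
AWS::S3::Bucket.logs(options[:bucket_name]).each do |log|
  all_lines.concat(log.lines)
end

#write the entries out in timestamp order so webalizer sees a contiguous log
all_lines.sort_by { |line| line.time }.each do |line|
  new_log << "#{line.remote_ip} - - [#{line.time.strftime("%d/%B/%Y:%H:%M:%S %z")}] \"#{line.request_uri}\" #{line.http_status || '-'} #{line.bytes_sent || '-'} \"#{line.referrer}\" \"#{line.user_agent}\"\n"
end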

kookster said...

Nice stuff. I had only one issue: at least with my version of webalizer, it likes the date format to have the 3-letter abbreviation for the month, so in the strftime I used %b instead of %B, and it worked like a charm.

-Andrew Kuklewicz
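
For reference, that change is a one-character tweak to the strftime call in the script above; standard CLF timestamps do use the abbreviated month, so this is probably the safer default:

#%b gives the abbreviated month name ("Jan" rather than "January"), which is
#what the Common Log Format timestamp normally uses
line.time.strftime("%d/%b/%Y:%H:%M:%S %z")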

dablyputs said...

Is there an easy way to delete the log once it's been processed? I am trying:
AWS::S3::S3Object.delete(log, log_status.target_bucket)

but I am getting an error:
usr/lib/ruby/gems/1.8/gems/aws-s3-0.5.1/lib/aws/s3/object.rb:300:in `join': can't convert AWS::S3::Logging::Log into String (TypeError)
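
That TypeError is because S3Object.delete expects a string key rather than a Log object. One possible workaround (untested, and assuming the logging status object exposes the target prefix chosen when logging was enabled) is to look the log objects up directly and delete them after processing:

#possible cleanup step: remove the raw log files once they have been processed
#(target_prefix is assumed here; adjust to however the logs are keyed)
AWS::S3::Bucket.objects(log_status.target_bucket, :prefix => log_status.target_prefix).each do |log_object|
  log_object.delete
end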