Commands and references (GNU/Linux kernel 2.4.18-3 and 2.4.18-14)
Linux is a registered trademark of Linus Torvalds. The commands, with their most common usage, are shown in brackets like this: [ command ]. Don't type the brackets, just what is inside of them.
page last modified: 2005-02-04

Can I tell if visitors are finding my site without a search?

Below is a shell script that might help find repeat visitors and/or those who access your site without searching for it.

To make it executable, type chmod +x and then the name of the file. For example: name the file check.sh, then type [ chmod +x check.sh ]. To run it, type [ ./check.sh ]. Keep check.sh in the same directory as the log file you want to process, and don't run it as root. It took 2 minutes to process an 8 MB log file, and 8 minutes to process a 39 MB log file, on a 900 MHz machine.

The input must be a file named combined_log, as this is what the script looks for. The output is a file named unique_visits containing a single column of IP addresses that are, for the most part, unique visitors who have accessed your site without searching for it. If you have several months' worth of log information in one file, you can determine whether there are unique IP addresses accessing your site repeatedly. This is far from perfect because of dynamic IP addresses, but it does give you a good idea of whether the information on your site is of use to people for more than just a quick look.

Lots of traffic and hits is not necessarily a sign that people are interested in the content of your site. The search engines allow people to find your site by the words and/or phrases typed in; but do they like what they see, and do they tell other people about your site?

This script separates unique IP addresses from the thousands that are logged each day. It also separates hits that do not come from search engines from those that do. You end up with a file named "unique_visits" with IP addresses in a column. If you compare the contents of this file each month using [ sdiff -s ], you can find repeat visitors. By counting the number of lines in the file (gvim will tell you this when you open the file with it), you can tell how many hits you are getting that don't come from search engines.
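The month-to-month comparison mentioned above can be sketched like this. The file names (unique_visits_jan, unique_visits_feb) and the sample addresses are hypothetical; comm -12 is used here as an alternative to [ sdiff -s ] because it prints only the lines common to both sorted files, which are exactly the repeat visitors:

```shell
#!/bin/sh
# Hypothetical example: two saved copies of unique_visits from
# different months, one IP address per line, already sorted.
printf '10.0.0.1\n10.0.0.2\n10.0.0.3\n' > unique_visits_jan
printf '10.0.0.2\n10.0.0.4\n' > unique_visits_feb

# comm -12 suppresses lines unique to either file, leaving only
# the addresses that appear in both months (likely repeat visitors).
comm -12 unique_visits_jan unique_visits_feb > repeat_visitors

# wc -l counts the lines -- an alternative to opening the file in
# gvim just to see the line count.
wc -l < repeat_visitors
```

With the sample data above, repeat_visitors would contain the single address 10.0.0.2.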
Don't forget to rename the "unique_visits" file before you re-run the script, or the script will overwrite it and you won't know what happened the last time you ran it. The formatting takes up lots of room because I have not changed the line breaks in the script. It should work if you copy and paste it into a text editor. This works on Red Hat with kernel 2.4.20-6. I haven't tried it on others yet. I do not know if it will work on any other system.

#!/bin/sh
# eliminate search engine referrals and zombie hunters.
# combined_log is the original file
egrep '(google)|(yahoo)|(mamma)|(query)|(msn)|(ask.com)|(search)|(altavista)|(images.google)|(xb1)|(cmd.exe)|(trexmod)|(robots.txt)|(copernic.com)|(POST)' combined_log > search
# now sort them to eliminate duplicates and put them in order
sort -un search > search_sort
# do the same with the original file
sort -un combined_log > combined_log_sort
# now get all the ip addresses. only the numbers
grep -o '[0-9][0-9]*[.][0-9][0-9]*[.][0-9][0-9]*[.][0-9][0-9]*' search_sort > search_sort_ip
grep -o '[0-9][0-9]*[.][0-9][0-9]*[.][0-9][0-9]*[.][0-9][0-9]*' combined_log_sort > combined_log_sort_ip
sdiff -s combined_log_sort_ip search_sort_ip > final_result_ip
# get rid of the extra column
grep -o '^\|[0-9][0-9]*[.][0-9][0-9]*[.][0-9][0-9]*[.][0-9][0-9]*' final_result_ip > bookmarked_ip
# remove stuff like browser versions and system versions
# (note: the dots in these patterns are unescaped, so each dot also
# matches any single character)
egrep -v '(4.4.2.0)|(1.6.3.1)|(0.9.2.1)|(4.0.0.42)|(4.1.8.0)|(1.305.2.109)|(1.305.2.12)|(0.0.43.45)|(5.0.0.0)|(1.6.2.0)|(4.4.5.0)|(1.305.2.137)|(4.3.5.0)|(1.2.0.7)|(4.1.5.0)|(5.0.2.6)|(4.4.9.0)|(6.1.0.1)|(4.4.9.0)|(5.0.8.6)|(5.0.2.4)|(4.4.8.0)|(4.4.6.0)' bookmarked_ip > unique_visits
exit 0

Here is another version. It does not take out the extra column, but it filters a longer list of version numbers and removes its temporary files when it finishes:

#!/bin/sh
# eliminate search engine referrals and zombie hunters.
# combined_log is the original file
egrep '(google)|(yahoo)|(mamma)|(query)|(msn)|(ask.com)|(search)|(altavista)|(images.google)|(xb1)|(cmd.exe)|(trexmod)|(robots.txt)|(copernic.com)|(POST)' combined_log > search
# now sort them to eliminate duplicates and put them in order
sort -un search > search_sort
# do the same with the original file
sort -un combined_log > combined_log_sort
# now get all the ip addresses. only the numbers
grep -o '[0-9][0-9]*[.][0-9][0-9]*[.][0-9][0-9]*[.][0-9][0-9]*' search_sort > search_sort_ip
grep -o '[0-9][0-9]*[.][0-9][0-9]*[.][0-9][0-9]*[.][0-9][0-9]*' combined_log_sort > combined_log_sort_ip
sdiff -s combined_log_sort_ip search_sort_ip > final_result_ip
# remove stuff like browser versions and system versions
egrep -v '(4.4.2.0)|(1.6.3.1)|(1.6.4.0)|(1.3.3.7)|(0.9.2.1)|(4.0.0.42)|(4.1.8.0)|(1.305.2.109)|(1.305.2.12)|(0.0.43.45)|(5.0.0.0)|(1.6.2.0)|(4.4.5.0)|(1.305.2.137)|(4.3.5.0)|(1.2.0.7)|(4.1.5.0)|(5.0.2.6)|(4.4.9.0)|(6.1.0.1)|(4.4.9.0)|(5.0.8.6)|(5.0.2.4)|(4.4.8.0)|(4.4.6.0)|(1.305.2.148)|(4.2.8.0)|(4.2.13.0)|(4.4.7.0)|(4.5.0.0)' final_result_ip > unique_visits
# clean up the temporary files
rm final_result_ip
rm search_sort_ip
rm combined_log_sort_ip
rm search
rm search_sort
rm combined_log_sort
exit 0
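The renaming step mentioned above (saving unique_visits before a re-run overwrites it) can be sketched like this. The date-stamped file name is an assumption, not part of the original script:

```shell
#!/bin/sh
# Hypothetical sketch: archive the previous run's output before the
# main script overwrites it.
echo '192.168.0.1' > unique_visits    # stand-in for a real result file

# If a unique_visits file exists, move it to a name stamped with the
# current year and month, e.g. unique_visits_2005-02.
if [ -f unique_visits ]; then
    mv unique_visits "unique_visits_$(date +%Y-%m)"
fi
```

After this runs, unique_visits is gone and the archived copy remains, so the main script can be re-run without losing last month's results.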