Privoxy Filter for Google Analytics
I wondered why a whole bunch of pages suddenly started sending me cookies called "__utma". The answer was Google Analytics and the current hype about it.
It allows webmasters to easily track their users for free.
But because all this data gets stored and analyzed by Google, I don't want to be tracked by it.
So here is a filter to block it with Privoxy:
Put this at the end of default.filter:
FILTER: google-analytics Remove Google Analytics JS.
s|<script\s[^>]*?google-analytics.com/urchin.js[^>]*>.*?</script>||gis
s|\burchinTracker\(\);||gis
And activate it globally in user.action:
{ +filter{google-analytics} }
/
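To see what the two filter lines do before touching a live Privoxy setup, here is a minimal Python sketch. It rests on some assumptions: Python's re module stands in for the Perl-style expressions Privoxy uses, re.IGNORECASE and re.DOTALL approximate the "i" and "s" flags, and the sample page is made up (modeled on the usual urchin.js snippet). It is not a substitute for testing in Privoxy itself.

```python
import re

# Rough Python equivalents of the two filter jobs. re.sub replaces every
# match, which corresponds to the "g" flag in the filter lines.
FLAGS = re.IGNORECASE | re.DOTALL
SCRIPT_TAG = re.compile(
    r"<script\s[^>]*?google-analytics.com/urchin.js[^>]*>.*?</script>", FLAGS
)
TRACKER_CALL = re.compile(r"\burchinTracker\(\);", FLAGS)


def strip_urchin(html: str) -> str:
    """Apply both substitutions in the order the filter lists them."""
    html = SCRIPT_TAG.sub("", html)
    return TRACKER_CALL.sub("", html)


# Hypothetical sample page; the account ID is a placeholder.
SAMPLE = """<html><body>
<script src="http://www.google-analytics.com/urchin.js" type="text/javascript"></script>
<script type="text/javascript">
_uacct = "UA-0000000-0";
urchinTracker();
</script>
<p>Content</p>
</body></html>"""

cleaned = strip_urchin(SAMPLE)
print(cleaned)
```

The first job removes the whole script tag that loads urchin.js; the second only removes the urchinTracker(); call, so the rest of the inline script (here the _uacct assignment) stays in the page.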
5 comments
Oh my, I'm terribly sorry! Let's try that once more...
As you're probably aware, since you wrote this helpful filter Google has released a newer Analytics script, called "ga.js" instead of "urchin.js", and its function/method is called _gat._getTracker() rather than urchinTracker(). Here's an article about it, for anyone who hasn't come across it yet:
http://www.epikone.com/blog/2007/10/16/gajs-new-google-analytics-tracking-code/
I've tried my best to write a couple of supplemental lines for your filter (supplemental because the old Analytics code your filter deals with is also still in use), but I'm a complete newb with Privoxy and would appreciate any syntax-checking that you or anyone else can grace me with:
s|<script.*?google-analytics\.com/ga\.js*.?</script>||gis
s|<script\s[^>]*?_gat\._getTracker(*.).*?</script>||gis
There are a few things about Privoxy's filter syntax that I still don't grasp after reading the bulk of Chapter 9: "Filter Files" in the manual. This chapter has a sort of storytelling approach, when what I would wish for is a very simple, cut-and-dried reference of filter operators.
Perhaps some things which it seems to me they left out will be more obvious to *nix-minded people since, for better or for worse, I come from a Windows background. I'm hoping someone can help me with the answers to these questions:
1) What does "[^>]*" mean? The manual says the following three things, but I still don't get it:
- "* means: 'Match an arbitrary number of the element left of myself'"
- "The ['"] construct means: 'a single or a double quote'."
- "s/(<body [^>]*)onunload(.*>)/$1never$2/iU
"... we had to use [^>]* instead of .* to prevent the match from exceeding the <body> tag if it doesn't contain "OnUnload", but the page's content does."
My interpretation is that this includes x-number of caret OR closing-bracket symbols in the filter. However, I don't know where caret (^) symbols are used in HTML... or how this would help keep the filter from exceeding the body tag. Probably I'm just ignorant of something simple, about which I hope someone can enlighten me!
2) What does a question-mark indicate? The manual says only the following two things:
- "s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig
"The ? in .*? makes this matching of arbitrary text ungreedy. (Note that the U option is not set)."
- "s/microsoft(?!\.com)/MicroSuck/ig
"Note the (?!\.com) part (a so-called negative lookahead) in the job's pattern, which means: Don't match, if the string ".com" appears directly following "microsoft" in the page."
Maybe I'm just dumb, but I still can't work out what the question mark does from these two anecdotal explanations. Isn't any string you write in the filters searched for regardless of whether or not the question mark precedes it? I don't understand how it changes the filter's behavior. I've tried to use this syntax, er... intuitively in my additions (I guessed!).
3) The second line of your filter begins with "\b". What's that? The manual explains that "\s" indicates a variable amount of whitespace (or none), but I find no mention of "\b".
4) Although my additions to your filters are reported to work by the script...
http://config.privoxy.org/show-url-info?url=[insert URL for testing]
... the theoretically-filtered scripts still show up when I view the source-code of a page. Why is this? Probably it's explained somewhere in the manual, but I don't know where to begin looking and I'm a bit discouraged by my experience with its chapter on filters.
One fella' reports the same unchanged-source behavior I'm describing, in regard to your own filter, at this link:
http://sysblogd.wordpress.com/2007/12/06/how-to-among-others-block-google-analytics-java-script-urchinjs-from-revealing-your-site-usage/
He seems to think it's working, but neither he nor I am sure. Please help explain if you can.
PS: I noticed that you didn't escape your periods from the URL. It ought to process just the same, but formally shouldn't it be written "google-analytics\.com/urchin\.js"?
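On question 1: inside square brackets, a leading ^ negates the character class, so [^>] means "any single character except >", and [^>]* means "any run of characters containing no >". The caret is not something searched for in the HTML. A small Python sketch may make the difference from .* visible; Python's re module is used here only as a stand-in for the Perl-style expressions Privoxy uses, and the sample HTML is made up.

```python
import re

# The <body> tag here has no onunload attribute, but a later tag does.
page = '<body bgcolor="white"><p onunload="bye()">text</p>'

# ".*" may consume ">" characters, so the match runs past the end of <body ...>.
dot_match = re.search(r"<body .*onunload", page, re.IGNORECASE)

# "[^>]*" stops at the first ">", so the match cannot leave the <body> tag.
class_match = re.search(r"<body [^>]*onunload", page, re.IGNORECASE)

# When the attribute really is inside <body>, the negated class still matches.
inside = re.search(r"<body [^>]*onunload", '<body onunload="bye()">', re.IGNORECASE)

print(dot_match is not None, class_match is None, inside is not None)
```

This is the behavior the manual's <body> example relies on: the pattern can only succeed if "onunload" sits before the first closing bracket of the tag.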
2) It makes the match non-greedy (i.e. .*? matches the minimal amount of text that still lets the whole pattern match).
3) It's a word boundary. It means that the match has to start at the beginning of a "word" (e.g. theurchinTracker() won't match but urchinTracker() will).
4) Try refreshing and remember to include the filter rule in user.action and have a trailing "/" so it filters all URLs.
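Answers 2 and 3 can be checked with a short Python sketch. It assumes Python's re module behaves like Privoxy's Perl-style expressions for these constructs, and the test strings are invented. The last part covers the other use of the question mark quoted from the manual, the negative lookahead (?!...).

```python
import re

js = 'window.status = "first"; other = "second";'

# Greedy ".*" runs to the LAST quote that still lets \1 match.
greedy = re.search(r'window\.status\s*=\s*(["\']).*\1', js)
# Lazy ".*?" stops at the FIRST quote that lets \1 match.
lazy = re.search(r'window\.status\s*=\s*(["\']).*?\1', js)

# \b requires a word boundary before "urchinTracker".
tracker = r"\burchinTracker\(\);"
standalone = re.sub(tracker, "", "urchinTracker();")
embedded = re.sub(tracker, "", "theurchinTracker();")

# (?!\.com) is a negative lookahead: match "microsoft" only when it is
# not directly followed by ".com".
lookahead = re.sub(r"microsoft(?!\.com)", "X", "microsoft and microsoft.com")

print(greedy.group(0))
print(lazy.group(0))
print(repr(standalone), repr(embedded), lookahead)
```

So the question mark is never part of the searched text. After a quantifier it switches that quantifier from "as much as possible" to "as little as possible", and in (?!...) it introduces a condition on what may follow.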
I hope this helps.
Take care,
terry
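As for the syntax check requested for the two ga.js lines: "*.?" in the first and "(*.)" in the second look like transposed versions of ".*?" and "(.*)". The second also cannot match an inline script, because [^>]*? never gets past the end of the opening <script ...> tag. Below is one possible repair, sketched and checked only with Python's re module, not inside Privoxy. The pattern names and the sample page are hypothetical; the sample is modeled on the usual ga.js snippet.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# The "(*.)" fragment is not a valid expression for Python's re module.
try:
    re.compile(r"_gat\._getTracker(*.)")
    fragment_compiles = True
except re.error:
    fragment_compiles = False

# A <script src="...ga.js"> tag, mirroring the original urchin.js pattern.
GA_SRC_TAG = re.compile(
    r"<script\s[^>]*?google-analytics\.com/ga\.js[^>]*>.*?</script>", FLAGS
)
# An inline script whose body mentions ga.js or calls _gat._getTracker(.
GA_BLOCK = re.compile(
    r"<script\b[^>]*>"        # opening tag; attributes cannot cross ">"
    r"(?:(?!</script>).)*?"   # script body, never past its own </script>
    r"(?:google-analytics\.com/ga\.js|_gat\._getTracker\()"
    r".*?</script>",          # rest of the same script block
    FLAGS,
)


def strip_ga(html: str) -> str:
    """Remove ga.js script tags and inline blocks that use the tracker."""
    return GA_BLOCK.sub("", GA_SRC_TAG.sub("", html))


SAMPLE = """<script type="text/javascript">var keep = 1;</script>
<script src="http://www.google-analytics.com/ga.js" type="text/javascript"></script>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-0000000-0");
pageTracker._trackPageview();
</script>"""

cleaned = strip_ga(SAMPLE)
print(cleaned)
```

Written as filter lines, the two patterns would go between the s|...| delimiters with the same gis flags as the original filter. Whether Privoxy accepts them unchanged is an assumption that should be verified there.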
{ +block{Google crap} +handle-as-empty-document}
.google-analytics.com/.*\.js$
.googlesyndication.com/.*\.js$
