27.11.05

Privoxy Filter for Google Analytics

I had been wondering why a whole bunch of pages suddenly started sending me cookies called "__utma". The answer is Google Analytics and the current hype around it.

It allows webmasters to easily track their users for free.

But because all this data is stored and analyzed by Google, and I don't want to be tracked by them, I'd rather block it.

So here is a filter to block this with Privoxy:

Put this at the end of default.filter:

FILTER: google-analytics Remove Google Analytics JS.
s|<script\s[^>]*?google-analytics.com/urchin.js[^>]*>.*?</script>||gis
s|\burchinTracker\(\);||gis

And activate it globally in user.action:

{ +filter{google-analytics} }
/
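
To see what the two substitutions do before touching a live configuration, they can be tried out in Python. This is only a sketch: Python's re module stands in for the Perl-style regular expressions Privoxy uses, the g, i and s options are approximated with re.sub, re.IGNORECASE and re.DOTALL, and the sample page and the name strip_analytics are made up for the example.

```python
import re

# Python approximations of the two filter jobs.
# g -> re.sub replaces every match, i -> IGNORECASE, s -> DOTALL.
FLAGS = re.IGNORECASE | re.DOTALL
SCRIPT_TAG = re.compile(
    r"<script\s[^>]*?google-analytics.com/urchin.js[^>]*>.*?</script>", FLAGS
)
TRACKER_CALL = re.compile(r"\burchinTracker\(\);", FLAGS)


def strip_analytics(html: str) -> str:
    """Apply both substitutions in order (hypothetical helper)."""
    html = SCRIPT_TAG.sub("", html)
    return TRACKER_CALL.sub("", html)


# Hypothetical page fragment, invented for this example.
page = (
    "<p>Hello</p>\n"
    '<script src="http://www.google-analytics.com/urchin.js" '
    'type="text/javascript"></script>\n'
    '<script type="text/javascript">\n'
    '_uacct = "UA-0000000-0";\n'
    "urchinTracker();\n"
    "</script>\n"
)

cleaned = strip_analytics(page)
print(cleaned)
```

The first job removes the script tag that loads urchin.js, the second removes the call that would report the page view. The rest of the page, including the now harmless _uacct assignment, is left alone.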

By Daniel in Privacy, 2005-11-27

5 comments

Comment from: Forest [Visitor]
Thanks!
2008-07-20 @ 05:07
Comment from: Casey Jones [Visitor]

Oh my, I'm terribly sorry! Let's try that once more...

As you're probably aware, since the time you wrote this helpful filter Google has released a newer Analytics script, called "ga.js" instead of "urchin.js", and its function/method is called _gat._getTracker() rather than urchinTracker(). Here's an article about it, for anyone who hasn't come across it yet:

http://www.epikone.com/blog/2007/10/16/gajs-new-google-analytics-tracking-code/

I've tried my best to write a couple supplemental (* because the old Analytics code your filter deals with is also still in use) lines for your filter, but I'm a complete newb with Privoxy and would appreciate any syntax-checking that you/anyone can grace me with:

s|<script.*?google-analytics\.com/ga\.js*.?</script>||gis
s|<script\s[^>]*?_gat\._getTracker(*.).*?</script>||gis

There are a few things about Privoxy's filters' syntax that I still don't grasp after reading the brunt of Chapter 9: "Filter Files" in the manual. This chapter has a sort of storytelling approach, when what I would wish for is a very simple, cut-and-dry reference of filter operators.

Perhaps some things which it seems to me they left out will be more obvious to *nix-minded people since, for better or for worse, I come from a Windows background. I'm hoping someone can help me with the answers to these questions:

1) What does "[^>]*" mean? The manual says the following three things, but I still don't get it:

- "* means: 'Match an arbitrary number of the element left of myself'"
- "The ['"] construct means: 'a single or a double quote'."
- "s/(<body [^>]*)onunload(.*>)/$1never$2/iU
"... we had to use [^>]* instead of .* to prevent the match from exceeding the <body> tag if it doesn't contain "OnUnload", but the page's content does."

My interpretation is that this includes x-number of caret OR closing-bracket symbols in the filter. However, I don't know where caret (^) symbols are used in HTML... or how this would help keep the filter from exceeding the body tag. Probably I'm just ignorant of something simple, and I hope someone can enlighten me!

2) What does a question-mark indicate? The manual says only the following two things:

- "s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig
"The ? in .*? makes this matching of arbitrary text ungreedy. (Note that the U option is not set)."
- "s/microsoft(?!\.com)/MicroSuck/ig
"Note the (?!\.com) part (a so-called negative lookahead) in the job's pattern, which means: Don't match, if the string ".com" appears directly following "microsoft" in the page."

Maybe I'm just dumb, but my mind still isn't sure it comprehends this question-mark abstract from these two anecdotal explanations. Isn't any string you write in the filters searched for regardless of whether or not the question-mark precedes it? I don't understand how it changes the filter's behavior. I've tried to use this syntax, er... intuitively in my additions (I guessed!).

3) The second line of your filter begins with "\b". What's that? The manual explains that "\s" indicates a variable amount of whitespace (or none), but I find no mention of "\b".

4) Although my additions to your filters are reported to work by the script...
http://config.privoxy.org/show-url-info?url=[insert URL for testing]
... the theoretically-filtered scripts still show up when I view the source-code of a page. Why is this? Probably it's explained somewhere in the manual, but I don't know where to begin looking and I'm a bit discouraged by my experience with its chapter on filters.

One fella' reports the same unchanged-source behavior I'm describing, in regard to your own filter, at this link:

http://sysblogd.wordpress.com/2007/12/06/how-to-among-others-block-google-analytics-java-script-urchinjs-from-revealing-your-site-usage/

He seems to think it's working, but neither he nor I are sure. Please help explain if you can.

PS: I noticed that you didn't escape your periods from the URL. It ought to process just the same, but formally shouldn't it be written "google-analytics\.com/urchin\.js"?

2008-12-05 @ 06:06
Comment from: terry chay [Visitor]
1) [^>]* means any run of text that isn't a greater-than sign. Inside square brackets a leading ^ negates the set, so the match cannot run past the end of the tag.
2) After a quantifier it means no greedy regex capture (i.e. .*? = capture the minimal amount of text that matches this pattern). The (?! in (?!\.com) is a separate construct, the negative lookahead the manual describes.
3) It's a word boundary. It means that the match has to start at the beginning of a "word" (i.e. theurchinTracker() won't match but urchinTracker() will).
4) Try refreshing, and remember to include the filter rule in user.action with a trailing "/" so it filters all URLs.
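
The first three points are easy to try out in Python, whose re module handles these constructs the same way. Treat this as a sketch, not a tested Privoxy filter: the sample strings are made up, and GA_JS is only a guess at a working form of the ga.js line quoted above (dots escaped, and the misplaced "*.?" replaced by "[^>]*>.*?").

```python
import re

# 1) [^>]* : any run of characters except ">", so the match stays in one tag.
tag = '<body class="x" onunload="bye()"><p>text > more</p>'
inside_tag = re.match(r"<body [^>]*", tag).group(0)

# 2) .* is greedy (longest match); .*? is ungreedy (shortest match).
html = "<b>one</b> and <b>two</b>"
greedy = re.search(r"<b>.*</b>", html).group(0)
lazy = re.search(r"<b>.*?</b>", html).group(0)

# 3) \b : word boundary, so the name must start a "word".
boundary_hit = re.search(r"\burchinTracker\(\);", "urchinTracker();")
boundary_miss = re.search(r"\burchinTracker\(\);", "theurchinTracker();")

# Assumed correction of the ga.js line, not taken from the original post.
GA_JS = re.compile(
    r"<script\s[^>]*?google-analytics\.com/ga\.js[^>]*>.*?</script>",
    re.IGNORECASE | re.DOTALL,
)
sample = '<p>a</p><script src="http://www.google-analytics.com/ga.js"></script>'
without_ga = GA_JS.sub("", sample)

print(inside_tag)
print(greedy)
print(lazy)
print(boundary_hit is not None, boundary_miss is not None)
print(without_ga)
```

Note that a pattern like this only catches a literal script tag in the page source. If a page writes the tag from JavaScript, there is no such tag for the filter to match.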

I hope this helps.

Take care,

terry
2008-12-24 @ 23:46
Comment from: Casey Jones [Visitor]
Thanks very much, Terry. Not long after posting that, I did come to the realization that Privoxy uses UNIX regular expressions. Believe it or not, I hadn't known of their existence before, and thought that Privoxy was using some kind of proprietary pattern-matching expressions which they didn't explain very well... But it was written directly at the beginning of Chapter 9 of Privoxy's manual that one should "be familiar with HTML syntax, and, of course, regular expressions." A forehead-slap is in order! Now I am enlightened...
2009-05-22 @ 11:44
Comment from: Dan [Visitor]
I believe you can also just do this:

{ +block{Google crap} +handle-as-empty-document}
.google-analytics.com/.*\.js$
.googlesyndication.com/.*\.js$
2009-06-23 @ 00:09
