Blocking Analytics Spam

Google Analytics is great for gathering data on who uses your web application, but becomes worthless if spam sessions start infesting your data. Here’s how we’ve tried to combat the problem for oddbird.net.

Like many websites, we use Google Analytics to gather data about our users – what OS and browser they used, how they came to our site, etc. But a number of months ago we started seeing lots of this:

Google Analytics spam

It’s not a new problem, but it’s particularly problematic for smaller sites that don’t receive lots of traffic. On a given day, spam hits were accounting for anywhere from ten to ninety (!) percent of our sessions.

Google Analytics total sessions and filtered (non-spam) sessions

There are many solutions out there; since we mostly saw spam in the “referral” field, we wanted a simple way to block spam referrals from being included in our analytics data.

One common approach is to exclude from the data any visit where document.referrer matches a known spam domain. There are free services that create the necessary Google Analytics “filters” for you, but they must be re-configured frequently as new spammers are added to the list.

Instead, we tried spam-referrals-blocker, a script that blocks referrals found on a community-contributed list of referrer spammers. Rather than relying on the script's owner to update it periodically with the latest disallowed-list (or maintaining our own fork of the repo), we decided to fetch the latest list as part of our build/deploy process, using gulp and gulp-download:

const gulp = require('gulp');
const download = require('gulp-download');

gulp.task('update-spammers', () => {
  const url = 'https://raw.githubusercontent.com/matomo-org/referrer-spam-blacklist/master/spammers.txt';
  return download(url).pipe(gulp.dest('path/to/js/'));
});

Once we have an up-to-date disallowed-list, we import it with the webpack raw-loader and block any referrer found on the list:

import spammers from 'raw-loader!./spammers.txt';

window.isSpamReferral = function () {
  // spammers.txt lists one domain per line, so split on newlines.
  const list = spammers.split(/\r?\n/);
  const currentReferral = document.referrer;
  if (currentReferral) {
    for (const spammer of list) {
      if (spammer && currentReferral.indexOf(spammer) !== -1) {
        return true;
      }
    }
  }
  return false;
};

And in our HTML, after the JS file has been executed:

<script>
  if(!window.isSpamReferral()) {
    // ... initialize Google Analytics
  }
</script>
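
Because indexOf matches anywhere in the referrer string, a listed domain that merely appears in a query string, or inside a longer hostname, would also trigger a block. A minimal sketch of a stricter check, assuming the list holds one domain per line, compares parsed hostnames instead. The name isSpamReferrer and its parameters are hypothetical, and the function takes the referrer as an argument so it can run outside a browser:

```javascript
// Hypothetical helper: returns true when the referrer's hostname is a listed
// domain or a subdomain of one. `listText` is the raw contents of
// spammers.txt, assumed to hold one domain per line.
function isSpamReferrer(referrer, listText) {
  if (!referrer) {
    return false;
  }
  let hostname;
  try {
    hostname = new URL(referrer).hostname.toLowerCase();
  } catch (err) {
    // Not a parseable absolute URL; treat it as non-spam.
    return false;
  }
  const domains = listText
    .split(/\r?\n/)
    .map((line) => line.trim().toLowerCase())
    .filter(Boolean);
  // Match the exact domain or any subdomain of it, never a bare substring.
  return domains.some(
    (domain) => hostname === domain || hostname.endsWith(`.${domain}`)
  );
}
```

In the browser this could be called as isSpamReferrer(document.referrer, spammers), in place of the substring loop.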

Bonus: Excluding Internal Traffic

Without much extra work, we can also exclude internal traffic from our analytics data:

const devHosts = [
  // List your local development servers
  'localhost:3000',
  '127.0.0.1:3000'
];

window.isDevelopment = () => devHosts.indexOf(window.location.host) !== -1;

And our modified HTML:

<script>
  if(!window.isSpamReferral() && !window.isDevelopment()) {
    // ... initialize Google Analytics
  }
</script>

This approach has worked relatively well – in the first two weeks, we only saw nine spam sessions sneak through. But we weren’t entirely thrilled with it, either.

First of all, a disallowed-list of domains-to-block is much more difficult to maintain than an allowed-list of domains-to-allow (even if we’ve off-loaded most of the maintenance to the community). And second, there’s something less-than-ideal about fetching a raw .txt file directly from someone else’s GitHub repo, making assumptions about the format of the file contents, and then relying on it as part of our build/deploy process.
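
One way to soften the second concern would be to validate the downloaded file before the build relies on it. The sketch below is an assumption, not something the build described here does; parseSpammerList is a hypothetical helper that fails loudly when the file is empty or contains lines that don't look like domains:

```javascript
// Hypothetical helper: turn the raw text of spammers.txt into an array of
// domains, throwing if the contents do not match the assumed format
// (one domain per line, no whitespace or slashes inside an entry).
function parseSpammerList(text) {
  const domains = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter(Boolean);
  // Loose shape check: at least one dot, no whitespace or slashes.
  const looksLikeDomain = /^[^\s\/]+\.[^\s\/]+$/;
  const invalid = domains.filter((domain) => !looksLikeDomain.test(domain));
  if (domains.length === 0 || invalid.length > 0) {
    throw new Error(
      `Unexpected spammer list format (${domains.length} entries, ` +
        `${invalid.length} invalid)`
    );
  }
  return domains;
}
```

Run at build time, a check like this would stop a deploy if the upstream file were ever replaced with an error page or changed format, rather than silently shipping a list that blocks nothing.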

So we’ve recently also implemented some improvements, most notably using an allowed-list filter to exclude any hostnames we haven’t explicitly authorized. This takes care of most of the spam, and is arguably cleaner and easier to maintain.
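
The filter itself isn't shown here. Assuming it is an include filter on the Hostname field that accepts a regular expression, a pattern could be built and sanity-checked along these lines. The hostnames listed and the helper names are hypothetical:

```javascript
// Hypothetical list of hostnames that legitimately serve the site.
const allowedHosts = ['oddbird.net', 'www.oddbird.net'];

// Escape regex metacharacters so dots in hostnames match literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build an anchored pattern that matches only the listed hostnames.
function buildHostnamePattern(hosts) {
  return `^(${hosts.map(escapeRegExp).join('|')})$`;
}

// The string to paste into the filter, plus a matcher for local checks.
const pattern = buildHostnamePattern(allowedHosts);
const matcher = new RegExp(pattern, 'i');
```

Anchoring with ^ and $ matters: without the anchors, a hostname that only contains an allowed name as a substring would still pass the filter.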

We haven’t been using this technique for long, but so far the results have been positive. If it continues to work well, we’ll likely remove the referral-blocking code entirely.

If you use Google Analytics, how have you tackled the problem of spam infecting your data? Let us know via Twitter!
