Don’t Be Fooled by Data: 4 Data Analysis Pitfalls & How to Avoid Them

Posted by Tom.Capper

Digital marketing is a proudly data-driven field. Yet, as SEOs especially, we often have to work with data so incomplete or questionable that we end up jumping to the wrong conclusions when we try to substantiate our arguments or quantify our issues and opportunities.

In this post, I’m going to outline 4 data analysis pitfalls that are endemic in our industry, and how to avoid them.

1. Jumping to conclusions

Earlier this year, I conducted a ranking factor study around brand awareness, and I posted this caveat:

“…the fact that Domain Authority (or branded search volume, or anything else) is positively correlated with rankings could indicate that any or all of the following is likely:

  • Links cause sites to rank well
  • Ranking well causes sites to get links
  • Some third factor (e.g. reputation or age of site) causes sites to get both links and rankings”
    ~ Me

However, I want to go into this in a bit more depth and give you a framework for analyzing these yourself, because it still comes up a lot. Take, for example, this recent study by Stone Temple, which you may have seen in the Moz Top 10 or Rand’s tweets, or this excellent article discussing SEMRush’s recent direct traffic findings. To be absolutely clear, I’m not criticizing either of the studies, but I do want to draw attention to how we might interpret them.

Firstly, we tend to suffer from a little confirmation bias — we’re all too eager to call out the cliché “correlation vs. causation” distinction when we see successful sites that are keyword-stuffed, but all too approving when we see studies doing the same with something we believe is or was effective, like links.

Secondly, we fail to critically analyze the potential mechanisms. The options aren’t just causation or coincidence.

Before you jump to a conclusion based on a correlation, you’re obliged to consider various possibilities:

  • Complete coincidence
  • Reverse causation
  • Joint causation
  • Linearity
  • Broad applicability

If those don’t make any sense, then that’s fair enough — they’re jargon. Let’s go through an example:

Take Tyler Vigen’s famous spurious correlation between US per-capita cheese consumption and deaths by becoming tangled in bedsheets. Before I warn you not to eat cheese because you may die in your bedsheets, I’m obliged to check that it isn’t any of the following:

  • Complete coincidence – Is it possible that so many datasets were compared, that some were bound to be similar? Why, that’s exactly what Tyler Vigen did! Yes, this is possible.
  • Reverse causation – Is it possible that we have this the wrong way around? For example, perhaps your relatives, in mourning for your bedsheet-related death, eat cheese in large quantities to comfort themselves? This seems pretty unlikely, so let’s give it a pass. No, this is very unlikely.
  • Joint causation – Is it possible that some third factor is behind both of these? Maybe increasing affluence makes you healthier (so you don’t die of things like malnutrition), and also causes you to eat more cheese? This seems very plausible. Yes, this is possible.
  • Linearity – Are we comparing two linear trends? A linear trend is a steady rate of growth or decline. Any two statistics which are both roughly linear over time will be very well correlated. In Vigen’s chart, both statistics trend linearly upwards; if the chart were drawn with different scales, they might look completely unrelated, but because they both change at a steady rate, they’d still be very well correlated (see the sketch after this list). Yes, this looks likely.
  • Broad applicability – Is it possible that this relationship only exists in certain niche scenarios, or, at least, not in my niche scenario? Perhaps, for example, cheese does this to some people, and that’s been enough to create this correlation, because there are so few bedsheet-tangling fatalities otherwise? Yes, this seems possible.
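Two of those checks are easy to convince yourself of with a quick simulation. Below is a minimal sketch in Python (with invented numbers standing in for the cheese and bedsheet data) showing that any two steadily trending series correlate almost perfectly, and that comparing enough random datasets guarantees some strong correlations by chance:

```python
import numpy as np

rng = np.random.default_rng(42)

# Linearity: two unrelated but steadily trending series correlate almost
# perfectly. These numbers are invented stand-ins, not Vigen's data.
t = np.arange(10)
cheese_lbs = 29.0 + 0.35 * t + rng.normal(0, 0.1, 10)        # per-capita cheese consumption
bedsheet_deaths = 320.0 + 15.0 * t + rng.normal(0, 5.0, 10)  # bedsheet-tangling fatalities
print(np.corrcoef(cheese_lbs, bedsheet_deaths)[0, 1])        # ~0.99

# Complete coincidence: compare enough random walks and some pair will
# correlate strongly purely by chance (the multiple comparisons problem).
walks = rng.normal(size=(1000, 10)).cumsum(axis=1)
corr = np.corrcoef(walks)                   # 1000 x 1000 pairwise correlations
strongest = np.abs(np.triu(corr, k=1)).max()
print(strongest)                            # typically > 0.99 across ~500k pairs
```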

So we have 4 “Yes” answers and one “No” answer from those 5 checks.

If your example doesn’t get 5 “No” answers from those 5 checks, it’s a fail, and you don’t get to say that the study has established either a ranking factor or a fatal side effect of cheese consumption.

A similar process should apply to case studies, which are another form of correlation — the correlation between you making a change, and something good (or bad!) happening. For example, ask:

  • Have I ruled out other factors (e.g. external demand, seasonality, competitors making mistakes)?
  • Did I increase traffic by doing the thing I tried to do, or did I accidentally improve some other factor at the same time?
  • Did this work because of the unique circumstance of the particular client/project?

This is particularly challenging for SEOs, because we rarely have data of this quality, but I’d suggest an additional pair of questions to help you navigate this minefield:

  • If I were Google, would I do this?
  • If I were Google, could I do this?

Direct traffic as a ranking factor passes the “could” test, but only barely — Google could use data from Chrome, Android, or ISPs, but it’d be sketchy. It doesn’t really pass the “would” test, though — it’d be far easier for Google to use branded search traffic, which would answer the same questions you might try to answer by comparing direct traffic levels (e.g. how popular is this website?).

2. Missing the context

If I told you that my traffic was up 20% week on week today, what would you say? Congratulations?

What if it was up 20% this time last year?

What if I told you it had been up 20% year on year, up until recently?

It’s funny how a little context can completely change this. This is another problem with case studies and their evil inverted twin, traffic drop analyses.

If we really want to understand whether to be surprised at something, positively or negatively, we need to compare it to our expectations, and then figure out what deviation from our expectations is “normal.” If this is starting to sound like statistics, that’s because it is statistics — indeed, I wrote about a statistical approach to measuring change way back in 2015.
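To make that concrete, here is a minimal sketch (Python, with invented traffic numbers) of the kind of check I mean: build an expectation from recent history, and only treat a change as surprising if it falls outside the normal week-to-week noise:

```python
import numpy as np

def is_surprising(weekly_history, this_week, z_threshold=2.0):
    """Crude z-score test: is this week outside normal variation?"""
    baseline = np.mean(weekly_history)
    spread = np.std(weekly_history, ddof=1)
    z = (this_week - baseline) / spread
    return abs(z) > z_threshold, z

# Invented data: a year of weekly organic sessions, noisy but flat ~10k.
rng = np.random.default_rng(0)
history = rng.normal(10_000, 1_200, 52)
surprising, z = is_surprising(history, 12_000)  # a "+20%" week
print(surprising, round(z, 2))
# Whether +20% clears the bar depends entirely on how noisy your history is.
```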

If you want to be lazy, though, a good rule of thumb is to zoom out, and add in those previous years. And if someone shows you data that is suspiciously zoomed in, you might want to take it with a pinch of salt.

3. Trusting our tools

Would you make a multi-million dollar business decision based on a number that your competitor could manipulate at will? Well, chances are you do, and the number can be found in Google Analytics. I’ve covered this extensively in other places, but most analytics platforms have some major problems around:

  • How easy they are to manipulate externally
  • How arbitrarily they group hits into sessions
  • How vulnerable they are to ad blockers
  • How they perform under sampling, and how obvious they make this

For example, did you know that the Google Analytics API v3 can heavily sample data whilst telling you that the data is unsampled, once you pass a certain amount of traffic (~500,000 sessions within the queried date range)? Neither did I, until we ran into it whilst building Distilled ODN.
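If you query the v3 API yourself, the usual defensive move is to request samplingLevel=HIGHER_PRECISION and shrink each query’s date window, rather than trusting the sampling flags alone. A rough sketch, assuming an authorized google-api-python-client service object (OAuth setup omitted):

```python
from datetime import date, timedelta

def fetch_daily(service, view_id, start, end, metrics="ga:sessions"):
    """Query the GA Core Reporting API (v3) one day at a time. Small
    windows keep each query under the sampling threshold; per the issue
    described above, the containsSampledData flag alone may not be
    trustworthy at high traffic volumes, so we don't rely on it."""
    day = start
    while day <= end:
        resp = service.data().ga().get(
            ids="ga:" + view_id,
            start_date=day.isoformat(),
            end_date=day.isoformat(),
            metrics=metrics,
            samplingLevel="HIGHER_PRECISION",  # request the least-sampled report
        ).execute()
        if resp.get("containsSampledData"):
            print(day, "sampled:", resp.get("sampleSize"), "/", resp.get("sampleSpace"))
        yield day, resp
        day += timedelta(days=1)

# Usage (view ID is hypothetical):
# for day, resp in fetch_daily(service, "12345678", date(2017, 10, 1), date(2017, 10, 7)):
#     ...
```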

Similar problems exist with many “Search Analytics” tools. My colleague Sam Nemzer has written a bunch about this — did you know that most rank tracking platforms report completely different rankings? Or how about the fact that the keywords grouped by Google (and thus tools like SEMRush and STAT, too) are not equivalent, and don’t necessarily have the volumes quoted?

It’s important to understand the strengths and weaknesses of tools that we use, so that we can at least know when they’re directionally accurate (as in, their insights guide you in the right direction), even if not perfectly accurate. All I can really recommend here is that skilling up in SEO (or any other digital channel) necessarily means understanding the mechanics behind your measurement platforms — which is why all new starts at Distilled end up learning how to do analytics audits.

One of the most common solutions to the root problem is combining multiple data sources, but…

4. Combining data sources

There are numerous platforms out there that will “defeat (not provided)” by bringing together data from two or more of:

  • Analytics
  • Search Console
  • AdWords
  • Rank tracking

The problems here are that, firstly, these platforms do not have equivalent definitions, and secondly, ironically, (not provided) tends to break them.

Let’s deal with definitions first, with an example — let’s look at the traffic a single landing page receives from a single channel:

  • In Search Console, these are reported as clicks, and can be vulnerable to heavy, invisible sampling when multiple dimensions (e.g. keyword and page) or filters are combined.
  • In Google Analytics, these are reported using last non-direct click, meaning that your organic traffic includes a bunch of direct sessions, sessions that timed out and then resumed, etc. That’s without getting into dark traffic, ad blockers, etc.
  • In AdWords, most reporting uses last AdWords click, and conversions may be defined differently. In addition, keyword volumes are bundled, as referenced above.
  • Rank tracking is location specific, and inconsistent, as referenced above.

Fine, though — it may not be precise, but you can at least get to some directionally useful data given these limitations. However, about that “(not provided)”…

Most of your landing pages get traffic from more than one keyword. It’s very likely that some of these keywords convert better than others, particularly if they are branded, meaning that even the most thorough click-through rate model isn’t going to help you. So how do you know which keywords are valuable?

The best answer is to generalize from AdWords data for those keywords, but it’s very unlikely that you have analytics data for all those combinations of keyword and landing page. Essentially, the tools that report on this make the very bold assumption that a given page converts identically for all keywords. Some are more transparent about this than others.
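To see just how bold that assumption is, here is what such tools are effectively doing under the hood, as a minimal sketch with hypothetical numbers (not any particular vendor’s implementation):

```python
def naive_keyword_conversions(page_conversions, keyword_clicks):
    """Spread a landing page's conversions across its keywords in
    proportion to their clicks (e.g. from Search Console), i.e. assume
    every keyword converts at the same rate for this page."""
    total_clicks = sum(keyword_clicks.values()) or 1
    return {kw: page_conversions * clicks / total_clicks
            for kw, clicks in keyword_clicks.items()}

# Hypothetical: 60 conversions on one page, split across two queries.
print(naive_keyword_conversions(60, {"acme shoes": 800, "buy shoes online": 400}))
# -> {'acme shoes': 40.0, 'buy shoes online': 20.0}
# In reality the branded query almost certainly converts far better,
# so this 2:1 split quietly misattributes value.
```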

Again, this isn’t to say that those tools aren’t valuable — they just need to be understood carefully. The only way you could reliably fill in these blanks created by “not provided” would be to spend a ton on paid search to get decent volume, conversion rate, and bounce rate estimates for all your keywords, and even then, you’ve not fixed the inconsistent definitions issues.

Bonus peeve: Average rank

I still see this way too often. Three questions:

  1. Do you care more about losing rankings for ten very low-volume queries (10 searches a month or less) than for one high-volume query (millions plus)? If the answer isn’t “yes, I absolutely care more about the ten low-volume queries,” then this metric isn’t for you, and you should consider a visibility metric based on click-through rate estimates.
  2. When you start ranking at 100 for a keyword you didn’t rank for before, does this make you unhappy? If the answer isn’t “yes, I hate ranking for new keywords,” then this metric isn’t for you — because that new ranking will lower your average rank. You could of course treat all non-ranking keywords as position 100, as some tools allow, but is a drop of 2 average rank positions really the best way to express that 1/50 of your landing pages have been de-indexed? Again, use a visibility metric, please.
  3. Do you like comparing your performance with your competitors? If the answer isn’t “no, of course not,” then this metric isn’t for you — your competitors may have more or fewer branded keywords or long-tail rankings, and these will skew the comparison. Again, use a visibility metric; there’s a sketch of one after this list.
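For the avoidance of doubt, a visibility metric can be as simple as volume-weighted estimated clicks. A minimal sketch follows; the CTR curve in it is purely illustrative, so substitute your own or a published click-through rate study:

```python
# Illustrative CTR-by-position curve; positions beyond 10 contribute ~0.
CTR_BY_POSITION = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05,
                   6: 0.04, 7: 0.03, 8: 0.025, 9: 0.02, 10: 0.018}

def visibility(rankings):
    """rankings: {keyword: (position, monthly_volume)}, a hypothetical input.
    Unranked keywords simply contribute zero (no position-100 fudge), and
    one high-volume query rightly outweighs ten tiny ones."""
    return sum(volume * CTR_BY_POSITION.get(position, 0.0)
               for position, volume in rankings.values())

site = {"widgets": (3, 50_000), "blue widgets": (1, 200), "widget repair": (55, 1_000)}
print(visibility(site))  # 5000.0 + 60.0 + 0.0 = 5060.0 estimated monthly clicks
```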

Conclusion

Hopefully, you’ve found this useful. To summarize the main takeaways:

  • Critically analyze correlations & case studies by checking whether you can explain them as complete coincidence, as reverse causation, as joint causation via some third factor (e.g. reputation or age of site), as two merely linear trends, or as something that only applies in niche scenarios.
  • Don’t look at changes in traffic without looking at the context — what would you have forecasted for this period, and with what margin of error?
  • Remember that the tools we use have limitations, and do your research on how that impacts the numbers they show. “How has this number been produced?” is an important component in “What does this number mean?”
  • If you end up combining data from multiple tools, remember to work out the relationship between them — treat this information as directional rather than precise.

Let me know what data analysis fallacies bug you, in the comments below.




How to Use the "Keywords by Site" Data in Tools (Moz, SEMrush, Ahrefs, etc.) to Improve Your Keyword Research and Targeting – Whiteboard Friday

Posted by randfish

One of the most helpful functions of modern-day SEO software is the idea of a “keyword universe,” a database of tens of millions of keywords that you can tap into and discover what your site is ranking for. Rankings data like this can be powerful, and having that kind of power at your fingertips can be intimidating. In today’s Whiteboard Friday, Rand explains the concept of the “keyword universe” and shares his most useful tips to take advantage of this data in the most popular SEO tools.

How to use keywords by site


Video Transcription

Howdy, Moz fans, and welcome to another edition of Whiteboard Friday. This week we’re going to chat about the Keywords by Site feature that exists now in Moz’s toolset — we just launched it this week — and SEMrush and Ahrefs, who have had it for a little while, and there are some other tools out there that also do it, so places like KeyCompete and SpyFu and others.

In SEO software, there are two types of rankings data:

A) Keywords you’ve specifically chosen to track over time

Basically, the way you can think of this is, in SEO software, there are two kinds of keyword rankings data. There are keywords that you have specifically selected or your marketing manager or your SEO has specifically selected to track over time. So I’ve said I want to track X, Y and Z. I want to see how they rank in Google’s results, maybe in a particular location or a particular country. I want to see the position, and I want to see the change over time. Great, that’s your set that you’ve constructed and built and chosen.

B) A keyword “universe” that gives wide coverage of tens of millions of keywords

But then there’s what’s called a keyword universe, an entire universe of keywords that’s maintained by a tool provider. So SEMrush has their particular database, their universe of keywords for a bunch of different languages, and Ahrefs has their own universe of keywords that they’ve selected. Moz now has its keyword universe, a universe of, in our case, I think about 40 million keywords in English in the US that we track every two weeks, so we’ll basically get rankings updates. SEMrush tracks their keywords monthly. I think Ahrefs also does monthly.

Depending on the degree of change, you might care or not care about the various updates. Usually, for keywords you’ve specifically chosen, it’s every week. But in these cases, because it’s tens of millions or hundreds of millions of keywords, they’re usually tracking them weekly or monthly.

So in this universe of keywords, you might only rank for some of them. It’s not ones you’ve specifically selected. It’s ones the tool provider has said, “Hey, this is a broad representation of all the keywords we could find that have some real search volume, where people might be interested in who’s ranking in Google, and we’re going to track this giant database.” So you might see some of these your site ranks for. In this case, seven of these keywords your site ranks for, four of them your competitors rank for, and two of them both you and your competitors rank for.

Remarkable data can be extracted from a “keyword universe”

There’s a bunch of cool data, very, very cool data that can be extracted from a keyword universe. Most of these tools that I mentioned do this.

Number of ranking keywords over time

So they’ll show you how many keywords a given site ranks for over time. So you can see, oh, Moz.com is growing its presence in the keyword universe, or it’s shrinking. Maybe it’s ranking for fewer keywords this month than it was last month, which might be a telltale sign of something going wrong or poorly.

Degree of rankings overlap

You can see the degree of overlap between several websites’ keyword rankings. So, for example, I can see here that Moz and Search Engine Land overlap here with all these keywords. In fact, in the Keywords by Site tool inside Moz and in SEMrush, you can see what those numbers look like. I think Moz actually visualizes it with a Venn diagram. Here’s Distilled.net. They’re a smaller website. They have less content. So it’s no surprise that they overlap with both. There’s some overlap with all three. I could see keywords that all three of them rank for, and I could see ones that only Distilled.net ranks for.
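Under the hood, this overlap is just set arithmetic over each site’s ranking keywords. A tiny sketch, with made-up keyword sets standing in for real exports:

```python
# Made-up ranking-keyword sets standing in for "keyword universe" exports.
moz = {"seo tools", "link building", "keyword research", "serp features"}
sel = {"seo news", "link building", "keyword research", "algorithm update"}
distilled = {"seo consulting", "keyword research"}

print(moz & sel)               # overlap between two sites
print(moz & sel & distilled)   # keywords all three rank for
print(distilled - moz - sel)   # only Distilled.net ranks for these
print(sel - moz)               # the gap: their keywords you don't rank for
```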

Estimated traffic from organic search

You can also grab estimated traffic. So you would be able to extract out — Moz does not offer this, but SEMrush does — you could see, given a keyword list and ranking positions and an estimated volume and estimated click-through rate, you could say we’re going to guess, we’re going to estimate that this site gets this much traffic from search. You can see lots of folks doing this and showing, “Hey, it looks like this site is growing its visits from search and this site is not.” SISTRIX does this in Europe really nicely, and they have some great blog posts about it.

Most prominent sites for a given set of keywords

You can also extract out the most prominent sites given a set of keywords. So if you say, “Hey, here are a thousand keywords. Tell me who shows up most in this thousand-keyword set around the world of vegetarian recipes.” The tool could extract out, “Okay, here’s the small segment. Here’s the galaxy of vegetarian recipe keywords in our giant keyword universe, and this is the set of sites that are most prominent in that particular vertical, in that little galaxy.”

Recommended applications for SEOs and marketers

So some recommended applications, things that I think every SEO should probably be doing with this data. There are many, many more. I’m sure we can talk about them in the comments.

1. Identify important keywords by seeing what you rank for in the keyword universe

First and foremost, identify keywords that you probably should be tracking, that should be part of your reporting. It will make you look good, and it will also help you keep tabs on important keywords where if you lost rankings for them, you might cost yourself a lot of traffic.

Monthly granularity might not be good enough. You might want to say, “Hey, no, I want to track these keywords every week. I want to get reporting on them. I want to see which page is ranking. I want to see how I rank by geo. So I’m going to include them in my specific rank tracking features.” To do that in Moz’s Keywords by Site, you’d go to Keyword Explorer, select the root domain instead of the keyword, and plug in your website, which maybe is Indie Hackers, a site that I’ve been reading a lot lately and I like a lot.

You could see, “Oh, cool. I’m not tracking stock trading bot or ark servers, but those actually get some nice traffic. In this case, I’m ranking number 12. That’s real close to page one. If I put in a little more effort on my ark servers page, maybe I could be on page one and I could be getting some of that sweet traffic, 4,000 to 6,000 searches a month. That’s really significant.” So great way to find additional keywords you should be adding to your tracking.

2. Discover potential keyword targets that your competitors rank for (but you don’t)

Second, you can discover some new potential keyword targets when you’re doing keyword research based on the queries your competition ranks for that you don’t. So, in this case, I might plug in “First Round.” First Round Capital has a great content play that they’ve been doing for many years. Indie Hackers might say, “Gosh, there’s a lot of stuff that startups and tech founders are interested in that First Round writes about. Let me see what keywords they’re ranking for that I’m not ranking for.”

So you plug in those two to Moz’s tool or other tools. You could see, “Aha, I’m right. Look at that. They’re ranking for about 4,500 more keywords than I am.” Then I could go get that full list, and I could sort it by volume and by difficulty. Then I could choose, okay, these keywords all look good, check, check, check. Add them to my list in Keyword Explorer or Excel or Google Docs if you’re using those and go to work.

3. Explore keyword sets from large, content-focused media sites with similar audiences

Then the third one is you can explore keyword sets, and I’m going to urge you to. I don’t think this is something that many people do, but I think it really should be: look outside of the little galaxy of yourself and your direct competitors, to large content players that serve your audience.

So in this case, I might say, “Gosh, I’m Indie Hackers. I’m really competing maybe more directly with First Round. But you know what? HBR, Harvard Business Review, writes about a lot of stuff that my audience reads. I see people on Twitter that are in my audience share it a lot. I see people in our forums discussing it and linking out to their articles. Let me go see what they are doing in the content world.”

In fact, when you look at the Venn diagram, which I just did in the Keywords by Site tool, I can see, “Oh my god, look there’s almost no overlap, and there’s this huge opportunity.” So I might take HBR and I might click to see all their keywords and then start looking through and sort, again, probably by volume and maybe with a difficulty filter and say, “Which ones do I think I could create content around? Which ones do they have really old content that they haven’t updated since 2010 or 2011?” Those types of content opportunities can be a golden chance for you to find an audience that is likely to be the right types of customers for your business. That’s a pretty exciting thing.

So, in addition to these, there’s a ton of other uses. I’m sure over the next few months we’ll be talking more about them here on Whiteboard Friday and here on the Moz blog. But for now, I would love to hear your uses for tools like SEMrush and the Ahrefs keyword universe feature and Moz’s keyword universe feature, which is called Keywords by Site. Hopefully, we’ll see you again next week for another edition of Whiteboard Friday. Take care.

Video transcription by Speechpad.com




Google test surfaces user data for publishers as part of new Insights Engine Project

Several new initiatives are aimed at bringing machine learning into publisher products and offering solutions for driving subscriptions.

Please visit Search Engine Land for the full article.



Google adds structured data for subscription & paywalled content for new flexible sampling program

Excited for the new flexible sampling program for Google web search and Google News? Well, make sure you don’t get in trouble for cloaking by using this new structured data.

Please visit Search Engine Land for the full article.



The non-developer’s guide to reducing WordPress load times up to 2 seconds (with data)

Wondering where to start with page speed improvements? Columnist Tom Demers shares how he tackled page speed improvements on several WordPress sites without (much) input from a developer.

Please visit Search Engine Land for the full article.



Announcing 5 NEW Feature Upgrades to Moz Pro’s Site Crawl, Including Pixel-Length Title Data

Posted by Dr-Pete

While Moz is hard at work on some major new product features (we’re hoping for two more big launches in 2017), we’re also working hard to iterate on recent advances. I’m happy to announce that, based on your thoughtful feedback, and our own ever-growing wish lists, we’ve recently launched five upgrades to Site Crawl.

1. Mark Issues as Fixed

It’s fine to ignore issues that don’t matter to your site or business, but many of you asked for a way to audit fixes, or just to let us know that you’ve made a fix prior to our next data update. So, from any issues page, you can now select items and “Mark as fixed.”

Fixed items will immediately be highlighted and, like Ignored issues, can be easily restored.

Unlike the “Ignore” feature, we’ll also monitor these issues for you and warn you if they reappear. In a perfect world, you’d fix an issue once and be done, but we all know that real web development just doesn’t work out that way.

2. View/Ignore/Fix More Issues

When we launched the “Ignore” feature, many of you were very happy (it was, frankly, long overdue), until you realized you could only ignore issues in chunks of 25 at a time. We have heard you loud and clear (seriously, Carl, stop calling) and have taken two steps. First, you can now view, ignore, and fix issues 100 at a time. This is the default – no action or extra clicks required.

3. Ignore Issues by Type

Second, you can now ignore entire issue types. Let’s say, for example, that Moz.com intentionally has 33,000 Meta Noindex tags. We really don’t need to be reminded of that every week. So, once we make sure none of those are unintentional, we can go to the top of the issue page and click “Ignore Issue Type.”

Look for this in the upper-right of any individual issue page. Just like individual issues, you can easily track all of your ignored issues and start paying attention to them again at any time. We just want to help you clear out the noise so that you can focus on what really matters to you.

4. Pixel-length Title Data

For years now, we’ve known that Google cuts display titles by pixel length. We’ve provided research on this subject and have built our popular title tag checker around pixel length, but providing this data at product scale proved challenging. I’m happy to say that we’ve finally overcome those challenges, and “Pixel Length” has replaced character length in our title tag diagnostics.

Google currently uses a 600-pixel container, but you may notice that you receive warnings below that length. Because Google needs to leave room for the “…” (among other considerations), our research has shown that the true cut-off point it uses is closer to 570 pixels. Site Crawl reflects our latest research on the subject.
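If you want to spot-check titles yourself, you can approximate the measurement with any font-rendering library. Here is a rough sketch using Pillow, assuming desktop SERP titles render at roughly 18px Arial and using the ~570px cut-off above (the font path is system-specific):

```python
from PIL import ImageFont  # pip install pillow (getlength needs Pillow >= 8)

FONT = ImageFont.truetype("arial.ttf", 18)  # adjust path/size for your system
CUTOFF_PX = 570  # effective cut-off per the research above

def title_fits(title: str) -> bool:
    """Measure the rendered pixel width of a title tag."""
    width = FONT.getlength(title)
    print(f"{width:7.1f}px  {'OK ' if width <= CUTOFF_PX else 'CUT'}  {title[:60]}")
    return width <= CUTOFF_PX

title_fits("Announcing 5 NEW Feature Upgrades to Moz Pro's Site Crawl, "
           "Including Pixel-Length Title Data")  # almost certainly truncated
```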

As with other issues, you can export the full data to CSV to sort and filter as desired.

Looks like we’ve got some work to do when it comes to brevity. Long title tags aren’t always a bad thing, but this data will help you much better understand how and when Google may be cutting off your display titles in SERPs and decide whether you want to address it in specific cases.

5. Full Issue List Export

When we rebuilt Site Crawl, we were thrilled to provide data and exports on all pages crawled. Unfortunately, we took away the export of all issues (choosing to divide those up into major issue types). Some of you had clearly come to rely on the all-issues export, and so we’ve re-added that functionality. You can find it next to “All Issues” on the main “Site Crawl Overview” page.

We hope you’ll try out all of the new features and report back as we continue to improve on our Site Crawl engine and UI over the coming year. We’d love to hear what’s working for you and what kind of results you’re seeing as you fix your most pressing technical SEO issues.



