And How Analysts and Tools Attempt to Skirt Them

Axioms are supposed to be immutable and unchallengeable, but hey, in this day and age is there anything that goes unchallenged?  Axioms are assumptions that cannot be proven, but if they are true then subsequent theorems and proofs follow.  In the case of web analytics, the data and what it represents follow from a specific set of assumptions from which all subsequent metrics and reports are derived.  These assumptions cannot be proven true and are often challenged as not true, but nonetheless all measurement and analysis that depends upon this data must accept these assumptions as the initial axioms of the analysis.

I realize that when forming analogies, it is best to compare and map the unknown (in this case Web Analytics) into something that the audience is very familiar with (Euclidean Geometry, duh?).  OK! You are probably more comfortable with Web Analytics (as uncertain as that is) than with that old Plane Geometry course in High School. But look! With Geometry everything seemed to fit together using nothing more than logic and a handful of assumptions called axioms. With these one could perform amazing feats of deduction with things called theorems and proofs.  It is, after all, simple enough to teach a 10th grader with hormones in full surge.  Is it possible that Web Analytics, or any analytics, could be capable of such feats of logic and marvels of insight? Could it be as simple?

I’ll assume that you at some time took a high school course in Plane Geometry and may have even complained at the time: how does this have anything to do with life in the real world?  I’ll also assume that early in the course, before you had this gestalt and succumbed to other urges, you were introduced to the 5 Euclidean Axioms. Does “the shortest distance between two points is a straight line” ring a bell?  The axioms are fundamental assumptions that have to be accepted as true without proof; otherwise it would not be Plane Geometry.  For example, if we stated that the shortest distance between two points was a great circle we would have spherical geometry, and if we stated that it was a geodesic we would have Riemannian Geometry.  There are no good and bad geometries, just different geometries because of different axioms.

The exact same thing occurs when dealing with data from different sources. In the same sense, there are no good and bad sources, just different sources because of different assumptions in collecting and processing data. What are the fundamental assumptions, implicit in the data, that have to be accepted as axioms?  What can be derived (the analogs to theorems and proofs) from that data if these axioms hold true? This will be the central theme of this discussion.

When dealing with data sets from different sources, we need to determine if the assumptions are the same for the various sources and, if not, whether there exist mappings among the data sets (geometries).  What makes web analytic data the same as or different from other sources of data such as transaction engines, CRM, financials, marketing surveys and panels, or any other source of marketing / business intelligence? It seems that whenever there is a need for data, different potential sources will pop up as options to consider. So how does one slog through and determine the best source or combination to meet the need?  The following gives a starting point for understanding how web analytic data is different from other sources and an approach for evaluating alternative data sources.

Background

This will not be some silly philosophical mind bender. What is presented here has been field tested over a long time.  These axioms came about when I was developing web analytic tools and environments that were precursors to many of the tools that are available today. For a time I worked with a nefarious and sometimes devious PM, who would come to me with fantastic feats achieved by the competition.  He would claim that a competitor was able to identify users without having to set cookies (or requiring any particular effort from the customer)!  Another was able to set parameters without “tagging”! Of course none of this was true, and the purpose was to make sure that I was not adding unnecessary steps in the setup of analytic environments. The effort necessary to set up web analytic tools was an issue then and continues to be one. Today there may be more acceptance that getting data vital to internal business decisions requires diligent planning, implementation and monitoring.  Back then we were in the “show me” phase where cookies and tagging were barriers to acceptance.

It did get me to think about what was fundamental to web analytics and why the effort had to be what it is.  I could then come back and say: “That’s impossible because it violates the second axiom with respect to cookies or axioms 1, 3, and 5 for tags.” This would save time having to root through the details to show that he was wrong.  I tried to be fair, so I only had 5 axioms, and if he had a proposal that was consistent with these 5 then it was worth further investigation and discussion.

Later, when working within a large corporation, Yahoo!, where there were many philosophies and approaches to collecting and processing data, these axioms became invaluable for expressing how web analytic data was different and how it contributed to the business needs of the company and its customers.  Heck! At one point I even had to build financial models and a P&L (Profit and Loss) statement for collecting and processing specific data. I actually got it approved by Financial Ops as a potential profit center!?! This information did not necessarily bring consensus; opinions can be strongly held and difficult to change. It did provide a framework for discussing issues among various business units and for standing ground when the fans were being hit.

The Five Axioms, No More, No Less

There are 5 Axioms that are assumed and accepted by all web analytic providers (WAP) in the market today and that distinguish the data from all other sources of data.  It is because of this congruence (oops, another geometry term) that one can more or less move from one vendor to another and get pretty much the same results.  The differences one might find arise from how well the vendor addresses the complexities, or attempts to ignore the finer implications, of each axiom.  Like most axioms, each seems self-evident on the surface but underneath presents constraints and complexities that must be handled to generate consistent quality data.

Every metric or KPI you define will ultimately depend upon compliance with these axioms.  The following presents each axiom in its simplest form, along with the typical complexities and violations that one may encounter and that must be addressed in a site optimization plan.

Axiom 1: The owner of the data is uniquely identified.

The simplest expression of this axiom is the account identifier that one receives when signing up with a specific vendor.  This is used to establish T&Cs and SLAs between the vendor and customer as well as how the service will be budgeted and billed.  Typically associated with the account are users that can access the account and the data collected for the account.  These users have roles and privileges that can be different in both perspective and scope.  It is around these roles that some vendors allow the customer to set up different report suites and tools for the different audiences within a company, or allow each audience to customize its reports based upon access to the same data. However, one must not confuse ownership, which is administered by account, with the multiple perspectives allowed by the owner via report suites.

In a majority of cases this axiom is satisfied by equating the account ID with the business ID of the company where everyone in the company has access to the same data.  However in many cases this is not sufficient because of access controls and protections the company may place on specific data fields and views that a particular web analytic provider may not sufficiently control.

Another reason is that different owners such as business units may want to collect different data that is germane to their business but may not be to any other BUs.  This is particularly true when BUs operate independently in both business model and marketing optimization approach.   Regardless, there are corporate interests that require specific measures and metrics to be uniformly enforced over all business units. These allow comparison and attribution of company-wide performance and can be linked back to the different business models that are providing the higher tier metrics or key performance indicators (KPIs).

One workaround that introduces further complexity is for a company to establish multiple WAP accounts that keep the data separated for each owner.  In these cases, since the company ultimately owns all the data, the vendor must provide a means of “rolling up” reports to allow a consistent summary view of the data over all accounts. This potentially compromises the quality of the data in that much of the data (owned by the company) is left within the business unit accounts with the possibility that the trace from the rolled up view to the account view has been lost. One must have the ability to drill down from the summary view to the details in the BU view.

The practical implication of this axiom is that every piece of data that is collected for the account must have the account ID associated with it.  For client-side (browser) collection of web analytic data, this means that the server must insert a collection tag that includes the account ID on every page viewed by the visitor.  Hence the fundamental need for tagging and instrumentation to collect data.
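As a rough illustration of what "every piece of data carries the account ID" means for a client-side beacon, here is a minimal Python sketch. The collector address and the parameter names `acct` and `url` are made up for the example and do not belong to any particular vendor.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical collector endpoint and parameter names, for illustration only.
COLLECTOR = "https://collect.example.com/b.gif"


def beacon_url(account_id: str, page_url: str, **tag_vars: str) -> str:
    """Build a beacon request URL that always carries the owning account ID."""
    if not account_id:
        raise ValueError("every beacon must carry an account ID")
    # The account ID is written last so that no tag variable can overwrite it.
    params = {**tag_vars, "acct": account_id, "url": page_url}
    return f"{COLLECTOR}?{urlencode(params)}"


def account_of(beacon: str) -> str:
    """Recover the owner from a collected beacon; reject data with no owner."""
    values = parse_qs(urlsplit(beacon).query).get("acct")
    if not values:
        raise ValueError("beacon has no account ID")
    return values[0]
```

The collection side refuses any beacon that cannot be linked to an owner, which is the whole point of the axiom: unowned data never enters the data set.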

On the server side, the same rule applies where all data must ultimately link to a unique owner.  For dynamically generated pages and web flows, determining the owner becomes an integral aspect of the presentation layer that serves and brings together the page content.  Hence the only difference in computation and processing is whether the data is collected from server side logs or client side from tagged beacons.

For most large web sites, the content viewed in a page will come from sources that are developed, managed and monitored by different groups within the company.  Each group will want to collect and control the analytic data related to their content.  Therefore it becomes important to break down the ownership or mutual interest in collected data as it relates to the different sources for the content on each page. In these cases, additional tag variables must be included and additional back-end processing rules introduced to implement a data ownership (control) scheme among all the interested parties.

You have complied with this axiom if access to all fields and relationships complies with company policy and governance without compromising the overall integrity and unity of the data collected.  If the data becomes an aspect of a business process requiring Sarbanes-Oxley (SOX) compliance, then the ownership and control requirements become even more stringent and may go beyond what a particular vendor is capable of.  Likewise the company’s privacy and data governance policies must also be considered and enforced, which may require separate enforcements related to ownership relationships.  For example, areas of a site that require the user to sign in will have different policy issues than the anonymous areas of the site.

It is important that development of the ownership / governance planning be independent of the site architecture and design.  The latter, in an ideal world, should come from the policy planning, not the other way around, even though the other way around may be the case in practice. It is best at that point to pretend that policy drives design for the sake of future generations that may follow. There will be another axiom (4) later that covers sites. Then it will become clearer what issues will arise if you attempt to equate sites with ownership.  If you encounter a solution that equates user-ID with account-ID or account-ID with site-ID, run, don’t walk, away from that tool.

Axiom 2: The client source of the data is uniquely identified.

Here client is interpreted as either an individual or a user-agent such as a mail app or browser.  The ideal would be to track individuals, but to perform this feat requires explicit action or opt-in by the user, such as signing in or installing a tracking plug-in such as a toolbar, so that the individual can be uniquely identified independent of the user-agent (assuming he or she has installed the plug-in on all his or her browsers).  Though this may be deemed the most detailed data, the implementation often leaves gaps in the action stream (what did the user do before signing in?) or is a highly self-selecting non-random sample (Oh Boy! I am a member of a panel!).

The key aspect of web analytic collection is that the identification is anonymous and applied uniformly over all visitors so that the events are random and amenable to measurement and analysis.  Since user-agents do not identify themselves uniquely, the implication is that a cookie must be set by the site or WAP that anonymously identifies the visitor while on the instrumented web site.  However, as many know and are quick to point out, not all user-agents allow cookies to be set and some users implement privacy policies that disallow some cookies being set.  In these cases, it becomes important to detect and designate these cases as Cookies-Disabled visitors and separate them from the clients that can be uniquely identified.

The clients that can be uniquely and anonymously identified conform to the second axiom. Those that can’t do not.  This does not mean that the data is bad or unusable. Only that the data collected from the Cookies-Disabled visitors follows a different geometry.  Any WAP solution must be able to handle and work between these two “geometries”.

For a while I worked for the Navy on an IFF tracking system, where IFF stands for Identification Friend or Foe.  The idea is to send an encrypted signal to all aircraft.  All the aircraft that transmit back an identifier are friends and those that do not are foes.  The same applies here: by allowing a cookie to be set with the proper user-agent designated, the client can be properly tracked with a high likelihood of being a real person or “eye-ball”.  Those that do not allow themselves to be identified may have many different reasons, some innocent and others more clandestine.

Returning to the tracking analogy of a previous blog, the tracker must be able to identify and track everything. So if we are tracking an individual and a flock of chickens followed by a pack of ninnies crosses our path, we need to separate the different trails to return to the trail in which we are interested.  So the tracker attempts to find patterns that allow her to group and separate the contributions.

In this case an identifier is created (without cookies) that includes a number of fields that distinguish the visitors, not perfectly but sufficiently.  For example, combining IP-Address, Time-Zone and User-Agent can help track a visitor for short periods.  All these fields can be spoofed by a determined user-agent, so one has to continue to look at other patterns such as IP-Clusters at a specific time (denial of service) or loading every page on your web site (scraper).  Whereas IT will attempt to put in a defense that blocks these accesses before they enter the site, WAP attempts to track all the sources and includes sophisticated monitoring (hopefully) to separate the foes from the innocent privacy-concerned users.
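To make the cookie-less identifier concrete, here is a minimal Python sketch that combines the three fields named above into a short-lived key. The function name, the choice of SHA-256 and the 16-character truncation are assumptions made for the example; a real tool would add more fields and expire the key quickly, since every input can be spoofed or shared (think of users behind one proxy).

```python
import hashlib


def fallback_visitor_key(ip: str, time_zone: str, user_agent: str) -> str:
    """Derive an approximate, anonymous key for a Cookies-Disabled visitor.

    The key is only "sufficient", not unique: two visitors that share all
    three fields collapse into one key.
    """
    # A separator that cannot appear in the fields keeps ("ab", "c") and
    # ("a", "bc") from producing the same input string.
    raw = "\x1f".join((ip.strip(), time_zone.strip(), user_agent.strip()))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
```

Keeping these keys in a separate namespace from cookie-based Visitor-IDs is what lets the reporting treat the two populations as different "geometries".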

The general rule of thumb is to assume that most are behaving well by properly setting HTTP request fields and following appropriate protocols.  For example, well-behaved web crawlers will properly identify themselves in the user-agent field.  Assume the best but verify!  From my experience the number of cookies-off visitors is small (< 5.0%).  I tend not to believe reports of users with cookies turned off greater than 10% unless the site markets spyware tools. What I believe occurs is something I found early on: if each page view from a cookies-off visitor is treated as a separate visit, then these visitors overwhelm session and visitor counts and percentages. So the actual number depends on how well these visitors can be individually identified and tracked. Some browsers do have default settings that can prevent cookies being set by third parties, but all allow first party cookies by default with appropriate compact privacy policies. Therefore most vendors provide a first party cookie approach for uniquely identifying visitors.  You can quickly verify that default browser settings are not blocking your data by checking the browsers / operating system report and confirming that you are collecting from all sources.

The primary way that this axiom is violated by WAP tools is by not separating the two visitor populations in the reporting or by assuming that assigned visitor identifiers never change.  Some providers such as Coremetrics provide thorough and elaborate processing to handle cookie churn and maintain visitor identities throughout their lifetime. On the other hand I still find tools that make it difficult to separate Cookies-Disabled from Cookies-Enabled visitors.  As a rule, consider Cookies-Disabled visitors as THE first (must have) visitor segment.

Related to setting a Visitor-ID is setting a Session-ID, a cookie with a limited time-to-live (TTL) that mimics how sessions were defined in the days of counting click streams from server logs. As you will see, I am not an enthusiastic advocate for session metrics, but being able to determine if an event such as a call to action occurred within the same session as a marketing ad click gives a strongly correlated conversion metric. Therefore some events should be attributed to actions within sessions while others may span multiple visits; the latter are called visitor based conversions. For example, research shows that a visitor will perform 7 to 14 searches before purchasing on a particular site over 1 to 4 weeks, depending upon the amount of the purchase. A particular site may receive several of these search referrals, which can be legitimately attributed to the purchase event. The site must definitely track and understand these from the visitor perspective to optimize its marketing spend.
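The difference between a session based and a visitor based conversion can be sketched in a few lines of Python. The 30-minute inactivity TTL and the event names `ad_click` and `purchase` are assumptions chosen for the example, not values taken from any tool.

```python
from datetime import datetime, timedelta

SESSION_TTL = timedelta(minutes=30)  # assumed inactivity timeout


def assign_sessions(timestamps):
    """Give each timestamp (one visitor, ascending order) a session index."""
    sessions, current, previous = [], 0, None
    for ts in timestamps:
        # A gap longer than the TTL means the session cookie has expired.
        if previous is not None and ts - previous > SESSION_TTL:
            current += 1
        sessions.append(current)
        previous = ts
    return sessions


def conversion_scope(events):
    """Classify each purchase of one visitor by where the ad click happened.

    events: list of (timestamp, name) tuples for a single Visitor-ID.
    Returns "session" when an ad click occurred earlier in the same session,
    "visitor" when one occurred only in an earlier session, else "none".
    """
    events = sorted(events)
    session_ids = assign_sessions([ts for ts, _ in events])
    click_sessions, scopes = set(), []
    for (_, name), sid in zip(events, session_ids):
        if name == "ad_click":
            click_sessions.add(sid)
        elif name == "purchase":
            if sid in click_sessions:
                scopes.append("session")
            elif click_sessions:
                scopes.append("visitor")
            else:
                scopes.append("none")
    return scopes
```

A purchase ten minutes after the click comes back as "session"; a second purchase three days later, with no new click, comes back as "visitor".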

Axiom 3: The time of any event is known precisely and universally.

The primary distinguishing feature of web analytic data is that it is event data. Even more than that, it is events associated with individual users, from which we can reconstruct event time-lines and paths through a web site.  So it’s not surprising that a significant key is the time-stamp for the event. Precisely means we know the time with sufficient precision to determine the order of events, and universally means that we can observe the order exactly as it occurred on the user client. Seems simple and straightforward.

We assume that, because we have identified visitors anonymously without requiring opt-in or self-selecting action on their part, all measurements of time and counts conform to the Central Limit Theorem, allowing them to be treated as normal distributions with mean and variance.  Hence this becomes the basis for all our measurements and metrics. However, because the measures are distributed throughout the web and the web is an open system, any number of uncontrolled processes and events can occur, and these systematic events may invalidate this assumption. Sometimes these errors in the low level measurement processing become visible in the higher level reporting, where the user is led to believe he is observing a rational closed system: his web site.

We have the same duality in the tracking analogy. The tracker is putting together evidence from which he interprets direction and intent. If the captain trusts his tracker, he accepts the interpretation without having to completely understand or appreciate the risk and uncertainty. However when this trust is lost, the captain goes searching for another tracker.

In the ideal world the visitor goes from page one to page two, but in an open system such as the world wide web, data from page two can arrive before data from page one for a number of reasons, so that the second page, which is clearly an internal page, can on rare occasions become the session entry page.  One might ask, why not measure time from the client? But this being an open system, clients will have their clocks set at different times, and at any given time there are a number of users resetting their clocks! And what happens when the servers collecting the data are not properly time synced?  Needless to say, even under the best conditions this axiom will occasionally be violated, and the distinction between great tools and average tools can be determined in part by how well the tool identifies and counters situations that may violate this axiom.

Of course, anytime you have a data set where events have been aggregated or quantized into time bins that violate the precision and universality conditions, you have a data set that is very different from a web analytic data set.  Again this is not to say that the data is wrong or bad, but that it requires different analysis and supports different derivations.
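One way a tool might counter out-of-order arrival is sketched below in Python. It assumes the page tag sends a per-visitor sequence counter (`seq`) with each hit; that counter is a hypothetical device for this example, not something the axiom or any specific vendor prescribes.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    visitor_id: str
    seq: int            # hypothetical per-visitor counter set by the page tag
    received_at: float  # collector clock, in seconds
    url: str


def restore_order(hits):
    """Order hits as they occurred on the client, not as they arrived."""
    return sorted(hits, key=lambda h: (h.visitor_id, h.seq))


def arrived_out_of_order(hits):
    """Return hits that reached the collector before an earlier client event."""
    flagged, latest = [], {}
    for h in restore_order(hits):
        seen = latest.get(h.visitor_id)
        if seen is not None and h.received_at < seen:
            flagged.append(h)
        if seen is None or h.received_at > seen:
            latest[h.visitor_id] = h.received_at
    return flagged
```

Because the counter depends on neither the client clock nor the collector clock, it sidesteps both problems raised above. The flagged hits show how often the raw arrival order would have produced a wrong session entry page.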

Axiom 4: What is internal and external to a web site or web property is unambiguously defined.

We are almost through but here things get wild and woolly. We start with a simple question. How do we know when an event occurs internal to a web site or external?  I suppose if you received the data you would assert that you are 100% certain that the data came from your website since you do not have control or ownership of data external to your site.  (Oh you are so innocent.  What about another site using your site as a template dutifully incorporating your tag in their site?  Can that happen? All the time! Don’t get me started. But I digress. )  With each HTTP request there is a Referer-Field (sic) that includes the URL of the page making the request.  These can be external to the web site and provide invaluable information on how the visitor came to your site or particular page content.  So how do you know that URL is external to your site?

I tend to distinguish web sites from web properties by identifying, in the simplest case, the web site with the major domain in the URL and the property with the sub-domain.  My assumption is that there is reasoning in the domain assignments such that sub-domains represent major divisions in the site, where one is interested in tracking how visitors come to the property from other properties within the site as well as from external sites.  Some in marketing would refer to these as internal and external marketing channels.  For example, how many visitors to a property came by way of internal search, identified by search.mysite.com? Here search is an internal channel.

It is extremely rare, if not impossible, to set up and maintain this kind of discipline.  Sometimes the property will have its own domain, or part of checkout is contracted to an external site (e.g. PayPal) that should be considered an internal property in the business work flow.  Sometimes the home page will have hundreds of domains mapped to it; other times the full HTTP domain never changes and the property is identified by an obscure parameter in the path or query part of the URL.  All these factors and many more make up the wild part of trying to comply with this axiom. You will have to counter this wildness by first establishing the marketing and business interest in understanding external and internal channels and then working through a means of identifying these channels in the data streams, typically through tags added to URLs.

The woolly part comes from attempting to use this information in real time.  Like discovering a Woolly Mammoth, we will be able to eventually extract from the collected evidence what happened.  However also like the Woolly Mammoth, this information may be ancient history of an extinct species. The information is invaluable if it is utilized in a timely manner.  For example, one can target content to a visitor if we know that the visitor came in response to a specific marketing campaign ad. So how do we conform with this axiom in real-time behavior targeting?

To address this question, let us assume that all the information that we know about the visitor comes with the HTTP request (ignoring for the moment the possibility that visitor data can be retrieved by the server). This consists primarily of the Time-Stamp, the Requested URL (often called the Landing Page URL), the Referer-URL (called the Referring Page URL), the User-Agent and Cookie Values.  When it can be determined that the Referring Page URL is external to the Landing Page URL, then the request is declared an Introduction and represents a boundary crossing into the web site or property.  The introduction is an opportunity to market to the visitor and engage the visitor in the site.

From the fields associated with the introduction, we need to determine where and how the visitor came to the web property.  Typically the landing page URL is a tracking URL that includes parameters that identify the marketing campaign for a paid Ad channel.  So part of the woolly process is having consistent discipline in assigning tracking parameters to all forms of marketing campaigns and having a consistent plan for tracking all types of marketing channels.  The referral URL may also have valuable information, including references and links to events that occurred prior to coming to the property, such as a search on a keyword, placement within the results page, or the content context of the referring page.  So another woolly process is developing affiliate and partner relationships that allow for understanding events before the referral click that can be communicated through the Referer-URL field.
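The Introduction test can be sketched in Python using only the two URLs from the request. The set of internal domains is something the site owner has to declare; the names below (`mysite.com`, plus `paypal.com` standing in for a contracted checkout) are illustrative. Treating a missing referrer as external is an assumption of this sketch.

```python
from urllib.parse import urlsplit


def is_internal(url: str, internal_domains) -> bool:
    """True when the URL's host is a declared domain or one of its sub-domains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in internal_domains)


def is_introduction(landing_url: str, referrer_url: str, internal_domains) -> bool:
    """True when the request crosses the boundary into the site or property.

    An empty referrer has no host, so it counts as external here
    (an assumption; some tools report it as a separate "direct" channel).
    """
    return is_internal(landing_url, internal_domains) and not is_internal(
        referrer_url, internal_domains
    )
```

With `{"mysite.com", "paypal.com"}` declared internal, a click from an outside search engine is an Introduction, while a return from the contracted checkout or from search.mysite.com is not. Narrowing the declared set to a single sub-domain would move the boundary from the site to the property.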
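A minimal Python sketch of pulling the "where and how" out of an Introduction follows. The campaign parameter `cid` is a made-up convention for the example, and `q` is only one of several parameter names that referring search pages have used for the keyword; both would need to be set by your own tracking plan.

```python
from urllib.parse import urlsplit, parse_qs


def introduction_fields(landing_url: str, referrer_url: str) -> dict:
    """Extract channel information from the two URLs of an Introduction."""
    landing_query = parse_qs(urlsplit(landing_url).query)
    referrer = urlsplit(referrer_url)
    referrer_query = parse_qs(referrer.query)
    return {
        # Hypothetical tracking parameter appended to campaign landing URLs.
        "campaign": landing_query.get("cid", [None])[0],
        "referring_host": referrer.hostname,
        # Assumed keyword parameter on the referring search results page.
        "keyword": referrer_query.get("q", [None])[0],
    }
```

If a campaign goes out without its tracking parameter, the `campaign` field comes back empty and the visit is indistinguishable from an unpaid referral, which is why the discipline matters.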

Critical to all this is the ability to distinguish at presentation time what is internal and external to your site or property.  This is at the core and is the essence of targeting – having actionable new information at the time of content serving and presentation.

To be able to conform to this axiom takes a great deal of planning, discipline and tenacity. You can typically find violations rather quickly, if you look at the session domain report.  This report typically gives the referring domain for the first URL that initiated the session called the session entry page. If the referrals appear to be all external to the site or property then the tool properly understands and enforces the internal / external boundary.  If this does not occur, then the discipline is not there to implement a targeting approach.   My position is that sessions and session entry and exit are mostly meaningless and of little value, though the major WAP tools seem to obsess about these measures.  Since my view on this subject is contrarian and controversial, this will have to be the subject of another blog.

Suffice it to say that if one can create accurate and timely Introductions that conform to the boundary of a site or property, then the data set complies with this axiom.  From that point one can work to extract the segments that distinguish the different channels for visitors coming to your site. How this is done is covered in the next and final axiom.

Axiom 5: Data that is the same will be interpreted the same.

This axiom has been encountered and discussed before.  At first glance this appears similar to the quote from Buckaroo Banzai, “No matter where you go, there you are.”  However sometimes the obscure axiom can be the most profound.  From this axiom comes the entire concept of tagging and the motivation to instrument web pages to collect and capture visitor behavior.

Looking at the fundamental keys we have accumulated from the other axioms (Business-ID/Account-ID/User-ID, Visitor-ID/Session-ID, Time-Stamp, and Site-ID/Property-ID), we seem to be missing an important fundamental identifier: the Page-ID / Content-ID.  The page and page-view are typically identified by the landing URL.  The page represents a specific downloading of content, and page-view the viewing of that content, potentially multiple times during a browser session.

Using URLs to identify pages was more appropriate in the far past when pages were more static. Today, more often than not, content within a page changes dynamically without loading a new page, and the same page URL can vary in content among visitors based upon their characteristics and preferences.   The identification of a page and its content must be dynamic as well.  Apart from treating each page view as a different event, how does one segment and track the various treatments over all visitors?

If we apply this axiom, then when the URL does not change, all the page-views associated with the URL are interpreted as the same, even when we know they are different.  Looking at this from the web analytic collection perspective using a beacon URL, the URLs will appear the same unless there is something added or different to distinguish the different content.  This is done by adding tag parameters to the URL or tag variables in the page that identify the differences and distinguish the beacon request URLs for different segments. It is not enough to assign a unique identifier to each page view unless the identifier can be expanded into data dimensions that allow the page to be segmented in different ways, a process that Omniture refers to as classification.

As we found in the case of tracking plug-ins for browsers, it is impossible to distinguish events generated from these plug-ins from normal browser actions unless the plug-in provides information to the server that it has contributed to the request.  In the absence of this additional contextual data, the data will be the same and will be interpreted the same as a normal browser request.

So the implications of this axiom are to first generate a pattern that distinguishes the behavior we want to capture and then make sure that every time the behavior occurs it will be interpreted the same because it appears the same.  The reason for the awkward framing of the implications as distinguishing patterns is that the semantic meaning of the differences does not necessarily need to be collected or communicated.
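The idea of expanding an identifier into dimensions can be shown with a small Python lookup. The table contents and the dimension names `template` and `treatment` are invented for the example; in practice the table would be maintained alongside the tagging plan.

```python
from collections import Counter

# Hypothetical classification table: page identifier -> descriptive dimensions.
CLASSIFICATION = {
    "home-a": {"template": "home", "treatment": "A"},
    "home-b": {"template": "home", "treatment": "B"},
    "prod-a": {"template": "product", "treatment": "A"},
}


def segment_counts(page_ids, dimension):
    """Count page views along one dimension of the classification table."""
    counts = Counter()
    for page_id in page_ids:
        dims = CLASSIFICATION.get(page_id, {})
        # Identifiers missing from the table stay visible instead of vanishing.
        counts[dims.get(dimension, "unclassified")] += 1
    return dict(counts)
```

The same stream of identifiers can then be cut by treatment, by template, or by any dimension added later, without touching the tags on the pages.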

Drawing from another discipline, Genetic Algorithms, what we are referring to is a genetic encoding where the bits either individually or in patterns map to various segments. If we are able to achieve such an encoding then one can apply and adapt to the codes without having to communicate all the potential personal identifying information associated with those codes. The receiver should be able to separate and segment behaviors that are correlated to actions such as conversions or use the variations to perform cluster analysis and classification to discover new target segments.

This is the basis of all the adaptive testing and optimization tools on the market today using Multivariate Testing, Bayesian Classifiers, Wisdom of the Crowd, Neural Networks, or Gradient Descent.  If variations in the data stream are correlated to behaviors in the real world then one has the potential for extracting and tracking these behaviors in the data set. This is the true implication of the fifth axiom.  It is an implication that has been confirmed many times in many different ways with uncanny effectiveness.
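A toy version of such an encoding, in Python, is shown below. The segment names and bit positions are arbitrary choices for the example; the point is only that the receiver can group by the code without ever seeing what a bit means.

```python
# Hypothetical bit assignments, known to the sender only.
SEGMENT_BITS = {"returning": 0, "paid_search": 1, "signed_in": 2}


def encode(segments) -> int:
    """Pack a set of segment names into an integer code, one bit per segment."""
    code = 0
    for name in segments:
        code |= 1 << SEGMENT_BITS[name]
    return code


def has_segment(code: int, segment: str) -> bool:
    """Test one bit of the code (needs the bit table, so sender side only)."""
    return bool(code & (1 << SEGMENT_BITS[segment]))


def conversion_rate_by_code(observations):
    """Receiver side: conversion rate per opaque code, no semantics required.

    observations: iterable of (code, converted) pairs.
    """
    totals, wins = {}, {}
    for code, converted in observations:
        totals[code] = totals.get(code, 0) + 1
        wins[code] = wins.get(code, 0) + (1 if converted else 0)
    return {code: wins[code] / totals[code] for code in totals}
```

The receiver only ever sees integers, yet it can still discover that one code converts far better than another, which is the property the adaptive tools rely on.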

Summary

This gave a brief introduction into the 5 axioms from which fundamental keys of web analytic data sets naturally arise.  These include:

1. Account-ID / Business-ID / User-ID
2. Visitor-ID / Session-ID
3. Time-Stamp
4. Site-ID / Property-ID
5. Page-ID / Content-ID / Source-ID

From the application of the fifth axiom also come all the other tag parameters that must be added to distinguish all of the ways that visitors can be segmented and their behavior tracked. The axioms lead directly to the primary keys of the web analytic data schema and what these keys represent if the axiom is valid for the data set.  So to compare different data sources, one wants to see if they share the same keys and how shared keys can be linked, but more importantly how rigorously the keys adhere to the axioms.

There are cases where analysts and WAP tools attempt to skirt these axioms by forming equivalences among the identifiers or by not sufficiently dealing with the complexities or nuances that may violate these assumptions.  If a data set conforms with these axioms then it will support the various derivations that we call web analytics as well as provide consistent and quality data that can drive marketing and site optimization. This is fundamental to the data itself.

Now we need to develop the analogs of theorems and proofs that build out the capability for interpreting and deriving information from this data, such as visitor state transitions (more commonly referred to as conversions), or fundamental metrics such as page-views, conversion rates, bounce rate, stickiness, engagement, or life-time value.  These will be topics of other discussions.  At least now there is a starting point for these discussions: the intrinsic characteristics of the data itself.
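As a first-pass check when comparing sources, the five keys can be written down as a record type and another source's columns tested against it. This Python sketch uses field names of my own choosing, one per axiom. It only tests whether the keys are present, not how rigorously they adhere to the axioms, which still takes human judgment.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class WebAnalyticEvent:
    """One event carrying the five keys; field names are illustrative."""
    account_id: str   # Axiom 1: owner
    visitor_id: str   # Axiom 2: client source
    timestamp: float  # Axiom 3: time of the event
    site_id: str      # Axiom 4: site or property
    page_id: str      # Axiom 5: page or content


AXIOM_KEYS = tuple(f.name for f in fields(WebAnalyticEvent))


def key_overlap(source_columns):
    """Split the five keys into those a data source shares and those it lacks."""
    columns = set(source_columns)
    shared = [k for k in AXIOM_KEYS if k in columns]
    missing = [k for k in AXIOM_KEYS if k not in columns]
    return shared, missing
```

A transaction table with only `account_id`, `timestamp` and `order_id`, for example, shares two keys and lacks three, so linking it to web analytic data will need a mapping for visitor, site and page.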