My issues with client-side data collection

Having worked in this field for some time now, I have always been bothered by the design pushed by Tag Management Systems: collecting data on the client. I felt I needed to write these issues down somewhere, so here they are:

Time to insight

The first and most painful caveat I have experienced on both sides of the fence (consulting and end customer) is the time it takes to implement a tagging plan on a site. The implementers are usually developers, with their own agenda and little to no incentive to make the implementation happen fast, if at all. This leads to long delays between the discovery of an interesting additional data point to track and the actual collection taking place.

Fragility

But let's say your fancy new dimension has been added, yay! Like any piece of software, this can break, and it usually will, quite often depending on how fast developers are changing the code base. And then what happens? Will you get a nice little notification when the page type dimension comes out empty on product page XYZ? While you could set up a monitoring system, the reality of the industry, as I have experienced it, is that people usually don't. Existing monitoring solutions are a pain to work with, usually too expensive for an analytics team to justify the premium to executives, and, fundamentally, most people do not really care. Companies lose data on a daily basis, only to realize it weeks or months later when they try to build an analysis with it. This makes the whole thing clearly fragile.
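
To make this concrete, here is a minimal sketch of the kind of lightweight check that rarely gets written. Everything in it is an assumption on my part: the expected keys, the way dataLayer pushes are merged, and the /collect-errors reporting endpoint are all hypothetical.

```javascript
// Minimal sanity check for expected dataLayer fields.
// Key names and the /collect-errors endpoint are hypothetical.
function auditDataLayer(expectedKeys) {
  // Merge every object pushed to the dataLayer into one snapshot.
  const snapshot = Object.assign({}, ...(window.dataLayer || []));
  const missing = expectedKeys.filter(
    (key) => snapshot[key] === undefined || snapshot[key] === ''
  );
  if (missing.length > 0) {
    // Console warnings go unseen; ship the alert somewhere
    // a human will actually look at it.
    navigator.sendBeacon('/collect-errors', JSON.stringify({
      page: location.pathname,
      missing,
    }));
  }
}

auditDataLayer(['pageType', 'productId']);
```

Even a check this small needs someone to own the endpoint and act on the alerts, which is precisely what rarely happens.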

Rigidity

I have never been much of a planner; I am more the improvising kind. Even if some things can be planned, you cannot anticipate everything: life can be creative, and you have to deal with its whims rather than the other way around. But with analytics there is really no room for adaptability: you define dimensions and an event taxonomy in advance, and you send all of that into some constrained SQL schema so complex that few people actually understand how to extract insights from it (search for GA4 and BigQuery on Google to see just how many articles share different queries to reproduce the same thing). And that is when you have access to the underlying schema, which is not always the case. This comes from a pragmatic approach: one cannot track everything, so the use case should be clearly defined up front to allow the data to be collected and analyzed correctly later on. But this also prevents you from making discoveries, as you are only tracking known usages of features instead of the actual behaviour. Some may argue that UX tools are a better fit for that kind of discovery-driven analysis, and they would be right. But you now have two different tools, usually with two different teams pursuing different goals.
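
For illustration, this is roughly what the up-front contract looks like: every field below has to be named and agreed on before a single hit is collected. The event and field names here are made up.

```javascript
// A typical pre-agreed event taxonomy. Every dimension must exist
// in the tagging plan before collection starts; names are made up.
window.dataLayer = window.dataLayer || [];
window.dataLayer.push({
  event: 'add_to_cart',     // must match the agreed event list
  pageType: 'product',      // predefined dimension
  productId: 'SKU-1234',
  customerSegment: 'gold',  // anything not declared up front is simply lost
});
```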

Data duplication

What's funny is that, often, most of the requested data is actually already one API call away. Adding it to a client-side data structure like the dataLayer increases the overhead of supplemental logic on the front-end side and, in the worst case, adds useless complexity (think surfacing a piece of information that is not normally available on that page). One so-so alternative is to use JavaScript functions that scrape the content of the page at run time. But this falls short when the page structure and styling change often, breaking previously working CSS selectors.
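
The scraping variant usually looks something like the sketch below. The selectors are hypothetical, and they are exactly the part that silently breaks on the next redesign.

```javascript
// Scraping the rendered page instead of exposing the data directly.
// Selectors are hypothetical; any markup change silently breaks them.
function scrapeProductInfo() {
  const nameEl = document.querySelector('.product-header h1');
  const priceEl = document.querySelector('.price-box .price');
  return {
    name: nameEl ? nameEl.textContent.trim() : null,
    // Fragile: parsing a localized, styled price string.
    price: priceEl
      ? parseFloat(priceEl.textContent.replace(/[^\d.]/g, ''))
      : null,
  };
}
```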

Leakage

Your business data is precious. Still, it is not uncommon to see customer classification IDs, membership types, or other clues that let competitors, or even malicious actors, guess the inner workings of your business. This may sound strange, but give it a try: inspect the dataLayer object on any major e-commerce website and you may be surprised by the amount of information you find in it.
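
As a made-up but representative example, such a snapshot can look like this; none of these values come from a real site.

```javascript
// An entirely fictional dataLayer snapshot. Fields like customerClass
// or marginBand reveal internal business logic to anyone who looks.
window.dataLayer = [{
  pageType: 'product',
  productId: 'SKU-1234',
  customerClass: 'B2B-TIER-3',     // internal customer classification
  membershipType: 'premium_trial', // lifecycle stage, in plain sight
  marginBand: 'high',              // pricing strategy, one F12 away
}];
```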

The root cause

What's sad here is that this is, as is often the case, not a technical issue. It would make sense to collect almost nothing on the client and enrich later on through APIs, using the visitor ID as some sort of foreign key. I have seen some customers do it with great success. The issue, in my eyes, comes from the fact that web analytics and tracking are usually managed by teams that do not get much chance to interact with other data teams. Web analytics is siloed and rarely looked at. And the fact that it is often collected in external tools rather than in the company's data stack makes it even harder to raise interest internally. This leaves the analytics team alone to carry out the implementations, while usually not being staffed with technically skilled people.
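
A sketch of what I mean, with Node.js, Express, and the crm.lookupByVisitorId helper all being assumptions of mine rather than a reference implementation: the client sends only a visitor ID and an event name, and everything else is joined server-side.

```javascript
// Server-side enrichment sketch. Express and the crm module are
// assumptions; the point is that the visitor ID acts as a foreign
// key into internal systems, so nothing sensitive ships to the client.
const express = require('express');
const crm = require('./crm'); // hypothetical internal CRM client

const app = express();
app.use(express.json());

app.post('/collect', async (req, res) => {
  const { visitorId, event } = req.body;
  const customer = await crm.lookupByVisitorId(visitorId);
  await storeEvent({
    event,
    visitorId,
    segment: customer ? customer.segment : null, // enriched server-side only
  });
  res.sendStatus(204);
});

async function storeEvent(row) {
  // Placeholder: write to the warehouse of your choice.
  console.log('store', row);
}

app.listen(3000);
```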

The solution

Anything that could bring the sparse data collected across a company into a unified view would be the solution. To that end, I feel that anything that makes it easier for analysts and non-technical people to link visitor behaviour with other data sources in some sort of low-code fashion would greatly help. It should also happen on the server only, minimizing the footprint on the client to avoid the impacts on web performance and the data leakage mentioned above. I am exploring that kind of solution right now with Dorsia, an integration-free and automatic data collection system leveraging Google Sheets to replace the client-side data layer. Feel free to give it a look; it's in early beta for now, but we're moving fast!