Show HN: Cookieless Conversion Attribution with Pathview

pathview.io

17 points by shanebellone 2 years ago

Morning HN.

I worked on a cloud CMS and then pivoted an analytics feature to a standalone SaaS. Pathview focuses on the conversion path rather than general analytics. I want to help users optimize conversions.

It doesn’t use cookies, needs only 160 characters of JavaScript, and leverages HTTP Messaging for a modern take on an old-school analytics approach.

I’m close to launching a public beta test and could use a sanity check. Any advice, feedback, or questions for this first-time developer?

-sb

mtmail 2 years ago

The JavaScript inserts an image 'https://pathview-analytics.com/response.gif?url='+e(d.locati... I'd add a timestamp to the URL to avoid browser caching. And use a separate domain or subdomain, because with any success adblockers will add the URL to their block lists, and that could block the whole domain.

Some ad trackers use CNAME DNS entries, e.g. mytracker.mydomain.com points to tracker.pathview-analytics.com but adblockers also do the DNS resolution now, an arms race. (https://www.theregister.com/2021/02/24/dns_cname_tracking/)
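
The timestamp suggestion can be sketched roughly as follows. This is a minimal illustration, not Pathview's actual snippet (which is truncated above): only the endpoint and the `url` parameter come from the thread, while the helper name `buildPixelUrl` and the `ref` and `t` parameters are assumptions.

```javascript
// Hypothetical helper: builds a tracking-pixel URL with a cache-busting timestamp.
// Only the endpoint and the `url` parameter appear in the thread; `ref` and `t` are assumed names.
function buildPixelUrl(endpoint, pageUrl, referrer, now = Date.now()) {
  const pixel = new URL(endpoint);
  pixel.searchParams.set('url', pageUrl);
  pixel.searchParams.set('ref', referrer);
  // A value that changes on every hit makes each URL unique,
  // so the browser cannot answer the request from its cache.
  pixel.searchParams.set('t', String(now));
  return pixel.toString();
}

// In a browser the result would be assigned to an image, e.g. `new Image().src = ...`.
const example = buildPixelUrl(
  'https://pathview-analytics.com/response.gif',
  'https://example.com/pricing',
  'https://news.ycombinator.com/',
  1700000000000
);
console.log(example);
```

Sending `Cache-Control: no-store` on the image response would address the same caching concern from the server side.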

  • shanebellone 2 years ago

    The timestamp is an awesome idea. Thank you! Why would an adblocker target an analytics app? Misidentification?

    • tuvan 2 years ago

      Adblockers usually have privacy filters as well. That's why some analytics apps recommend proxying through the website's own server instead of sending requests directly from the client to the analytics app's endpoints.

      • shanebellone 2 years ago

        Thank you for the clarification. That gives me something to think about.

    • korlja 2 years ago

      Analytics is generally (in detail this might or might not apply for this project) seen as an invasion of privacy, wasting bandwidth, increasing load time and lowering performance. There is a population of users who would gladly accept advertisements without analytics, because they see the invasion into their privacy as the predominant evil. This is why most adblockers nowadays either block analytics by default, or at least provide a configuration to also block analytics.

      • shanebellone 2 years ago

        I agree with your statement. I did originally build this for myself, with privacy in mind. I don't like being tracked either. Pathview doesn't rely on personal data but the general perception remains true. Any thoughts on navigating through that stigma? It's worth mentioning that the first hit generally loads in ~200ms and subsequent hits in ~120ms. The difference between first and subsequent is SSL. Speed and footprint represent two of my main design considerations.

        • korlja 2 years ago

          I guess the stigma is too established to get rid of. Maybe you can sway some users by transparency, i.e. a very thorough but user-friendly explanation about what your software is doing and how it cannot possibly be used to invade their privacy.

          But unfortunately, as far as my opinion goes, any kind of analytics and tracking just results in an instant "yuck" reaction, like a spider landing on my lap. I don't bother with analyzing it, I'll just try to get rid of it as quickly as possible.

          The notion of privacy-friendly analytics has also been thoroughly burned by sleazy marketing departments outright lying. Or technical solutions that claimed to be privacy-friendly, but actually didn't really because of technical reasons. Or technical solutions being so complicated and obscure that it might as well be a privacy-protecting voodoo ritual for all a user knows.

          • shanebellone 2 years ago

            This is tough. I dislike tracking but approve of analytics. Without data, websites cannot improve. Without improvements we'd only have Craigslists.

            In your opinion, is there a way to balance the need for feedback with respect for the user? What might that solution look like? Do you have any absolute demands?

            • korlja 2 years ago

              As a user, I've yet to see the user-facing benefits of analytics. I suspect there might be some which I don't know about. But mostly what I see is "we cancelled feature X you care about because analytics told us nobody uses it" and "you now get this annoying newsletter popup, because analytics told us we get more subscriptions that way".

              For that perception to change, you have to educate users about their concrete, relevant and obvious benefit from analytics. I think this is hard or impossible. I also think that all the bad players in the market make this even more impossible, because you get lumped in with them.

              I think the easiest solution is log analytics, preferably from anonymized or pseudonymized logs that are present anyways. That way, you don't collect any extra data, and as long as you do not keep the logs but only aggregated results, privacy isn't an issue. While a privacy policy and legal team need of course be aware of log analytics, the users cannot adblock it away, so that might be a plus. Also, no scripts, no cookies, no performance impact, etc. But of course the insight is limited by whatever is logged. Maybe some (privacy-preserving) data can be added to the URL parameters to augment the logs and provide a little more insight.

              Another solution (that I just thought of, no idea if it would work) is recruiting users to test your website under observation by the UI team. While this might invoke the image of recruiting 20 people off the street and sitting them down in a lab, I have something totally online in mind: Offer a voucher (or something) in return for participation. Participation should be instant. The user's session should be connected such that the UI people on duty can see the website interaction (à la VNC, but limited to the website in question, so this should be possible by getting geometry, mouse position and keypresses alone via JavaScript). In case of difficulties, the UI team can interact with the user via voice chat (preferred) or text chat. After the user has finished their task, maybe ask them a few extra questions. You will gain much better insights, because you can ask for motivations and problems. You can point the user at the intended way and see if it works at all. But of course this approach requires lots of manpower and is technically challenging.

              My absolute demands would be: Respect the relevant laws ala GDPR. Respect the DNT bit my browser sends. That way, you would already be above 99% of the analytics industry imho.
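
The log-analytics idea in the comment above could look something like this. It is a sketch under assumptions: the log lines are taken to be in the common "combined" access-log format, and `aggregatePageViews` is a made-up name. The client address is matched but never captured, so only per-path totals come out.

```javascript
// Assumes the common "combined" access-log format; adjust the pattern for other formats.
// The leading \S+ matches the client address without capturing it.
const LOG_LINE = /^\S+ \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) /;

// Counts successful GET requests per path and returns only the aggregate.
function aggregatePageViews(logLines) {
  const counts = new Map();
  for (const line of logLines) {
    const match = LOG_LINE.exec(line);
    if (!match) continue; // skip malformed lines
    const [, method, target, status] = match;
    if (method !== 'GET' || !status.startsWith('2')) continue;
    const path = target.split('?')[0]; // group hits by path, ignoring the query string
    counts.set(path, (counts.get(path) || 0) + 1);
  }
  return counts;
}
```

If extra data is added to URL parameters as suggested, the query string would need to be parsed rather than dropped as it is here.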

              • shanebellone 2 years ago

                "I think the easiest solution is log analytics, preferably from anonymized or pseudonymized logs that are present anyways."

                "Maybe some (privacy-preserving) data can be added to the URL parameters to augment the logs and provide a little more insight."

                Pathview iterates on the server log approach. JavaScript collects two pieces of information: the current page and the referring page. The rest of the data is acquired by parsing HTTP Messages in real-time.

                The only difference is cloud vs. self-hosted.
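
To make the split between script-supplied and header-derived data concrete, here is a rough sketch of how a collector might turn one pixel request into a hit record. It is not Pathview's code, which is not public. The `url` parameter is from the snippet quoted earlier; the `ref` parameter, the function name `parseHit`, and the choice of headers are assumptions.

```javascript
// Hypothetical hit parser. `headers` uses Node's shape: lowercase names mapped to strings.
function parseHit(requestTarget, headers, now = new Date()) {
  // The base URL is a placeholder so that a relative request target can be parsed.
  const query = new URL(requestTarget, 'https://collector.invalid').searchParams;
  const page = query.get('url');
  if (!page) return null; // not a valid hit
  return {
    // Supplied by the client-side script:
    page,
    referrer: query.get('ref') || null,
    // Read from standard HTTP request headers, with no extra client-side code:
    userAgent: headers['user-agent'] || null,
    language: (headers['accept-language'] || '').split(',')[0] || null,
    receivedAt: now.toISOString(),
  };
}
```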

    • mtmail 2 years ago

      Some users want to block only ads (visible content), some block everything including analytics, support widgets, mouse-over widgets, social media links, "back to top" links on pages. Some lists even block everything by file name, e.g. /tracker.php regardless of domain.

      • shanebellone 2 years ago

        That's wild. I like the luxuries. Thanks for the perspective.

        • Semaphor 2 years ago

          Just to add, some of us even block all third-party domains by default. I had to specifically allow "pathview-analytics.com" to see what your script does ;)

          • shanebellone 2 years ago

            Do you have a reason that extends beyond preference? Is the result worth the price?

            • michaelt 2 years ago

              It tends to be good for security.

              For example, it blocks a great many XSS attacks, as if every website had a strict content-security-policy header. Or if some joker on a website adds <img src="http://192.168.0.1/reboot-router.php"> or suchlike you're protected.

              Websites that want to host sketchy untrusted content use iframes to external domains, so the sketchy content can't grab the user's cookies. If the website didn't trust the third party, why should you?

              It can also block a variety of "features" that are actually annoyances - like third-party live chat popups, third-party cookie consent nag screens etc.

              In terms of the price, how troublesome it is will depend on your web browsing needs. If you're a professional buyer visiting dozens of different companies' websites every day, you might find it inconvenient. But if most of your time is divided between your 10 favourite websites? Once you've got the whitelist right you'll barely notice it.

            • Semaphor 2 years ago

              The vast majority of sites work with some defaults (mainly CDNs), and it stops almost all 3rd party tracking. The minor inconvenience of sometimes having to whitelist some stuff is acceptable to me.

        • i67vw3 2 years ago

          Also, some users block javascript altogether (pathview uses javascript) and only allow it on temporary or permanent basis on selected sites.

          There are also anti-analytics filter lists compatible with adblockers (default or opt-in) which might add pathview to their database.

          • shanebellone 2 years ago

            That's ok. I'm not trying to track everyone. If someone wants to opt out, that should be respected. I'm trying to build utility while respecting visitors' privacy. There are certainly tradeoffs, but I'm comfortable making them.
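
One way a collector could respect an opt-out is to check the request for the `DNT` header that korlja mentions above before recording anything. The sketch below also checks `Sec-GPC` (Global Privacy Control), which is an addition not discussed in the thread. The names `hasOptedOut`, `recordUnlessOptedOut`, and `record` are hypothetical.

```javascript
// Returns true when the request carries an opt-out signal.
// `headers` uses Node's shape: lowercase names mapped to string values.
function hasOptedOut(headers) {
  return headers['dnt'] === '1' || headers['sec-gpc'] === '1';
}

// Hypothetical wrapper: `record` stands in for whatever stores a hit.
function recordUnlessOptedOut(headers, hit, record) {
  if (hasOptedOut(headers)) return false; // drop the hit without storing anything
  record(hit);
  return true;
}
```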

freemint 2 years ago

What would I need to add to my website's privacy policy?

What would I need to do to be GDPR and CCPA compliant?

Those questions go completely unanswered and should be part of https://pathview.io/get-started . If those questions are not answered or addressed, the service is a liability. Answering them also gives your customers a lot of transparency about what you are and are not doing. There should be an automated process, and a promise, to inform people when an update to the privacy policy is required.

  • shanebellone 2 years ago

    You don't need a cookie banner or special privacy policy. It doesn't track personal data. You're absolutely right though, the website should mention those things. I'm hesitant to make claims about GDPR and CCPA. Those claims come with questions about the underlying mechanism which is only covered by a provisional patent. I really need a lawyer to move forward on those fronts.

    Pathview is about two months old, and the website is less than a week. Clarifications will be required, and decisions will be made. Thank you for the extra motivation.

    • littlestymaar 2 years ago

      If you're tracking users, no matter if it's done via cookies or something else, then it falls under GDPR and you at least have something to do about it. GDPR's definition of personal data is much broader than one may expect, so you should definitely ask a lawyer on this subject to be sure you're compliant.

      • shanebellone 2 years ago

        Only if you use personal data as defined by GDPR which includes the transmission, processing, and storing of IP addresses or their hashed equivalents. A salted hash cannot be attributed to an individual user. The EU position regards the IPV6 range as sufficiently small to attribute a hash back to an IP with enough motivation and resources.

        Pathview strictly uses the salted hash to count unique visits. It's as minimally invasive as possible.

        • freemint 2 years ago

          I think salted is not enough but peppered per domain and salted sounds good. That way you can not learn what IP visited different websites you track. So hash(IP+domain_bucket_specified_by_user+SALT)
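
A minimal sketch of that construction, assuming Node's built-in `crypto` module. Instead of plain concatenation it uses HMAC-SHA-256 keyed with the salt, which is a swapped-in variant of `hash(IP+domain_bucket+SALT)` and not necessarily what Pathview implemented. The name `visitorKey` is made up for illustration.

```javascript
const { createHmac } = require('node:crypto');

// Derives a visitor key scoped to one domain bucket.
// `salt` is a server-side secret; `bucket` is the customer-chosen domain group.
function visitorKey(ip, bucket, salt) {
  return createHmac('sha256', salt)
    // The NUL separator keeps ("a", "bc") and ("ab", "c") from producing the same input.
    .update(`${bucket}\0${ip}`)
    .digest('hex');
}

// The same address yields unrelated keys in different buckets,
// so hits cannot be joined across customers by comparing keys.
const keyA = visitorKey('203.0.113.7', 'shop.example', 'server-secret');
const keyB = visitorKey('203.0.113.7', 'blog.example', 'server-secret');
console.log(keyA === keyB); // false
```

Whether such a key still counts as personal data is a legal question the code does not settle; see the discussion further down the thread.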

          • shanebellone 2 years ago

            Fantastic idea! I'll immediately implement this.

            • freemint 2 years ago

              Customized buckets (to aggregate users between multiple (sub-)domains) could be a premium feature if you ever go the freemium route.

              • shanebellone 2 years ago

                I've implemented this solution. Great idea. Thanks again for your feedback.

rgavuliak 2 years ago

What does this have to do with attribution? Attribution is generally concerned with attributing revenue to marketing channels.

  • shanebellone 2 years ago

    I'm attributing individual conversions to conversion paths (starting with the referer). This is not a multi-channel solution. Pathview will help users optimize their websites to improve conversion rates. Does that clarify?

    • njitram 2 years ago

      I was confused about this as well; attribution is a term used by marketers for attributing revenue to channels and campaigns (cost), so you might want to make it clear that this is really about single-website attribution. Is it single-session attribution, or can you measure whether a user enters the website and converts a few days later?

      • shanebellone 2 years ago

        Point noted. I'll be careful with my language. Thanks for reiterating.

        This system is focused on the source of the hit and the page views that led to the conversion. It could be used to measure organic search ROI for individual content pages. You could justify content spend based on performance data and plan future content based on past returns. I find this use case particularly interesting.

        It is capable of multi-session attribution. From a product perspective, the tradeoff is timeframe for efficiency. The longer the historical perspective, the more costly the feature becomes. This represents a current personal debate about tradeoffs.

        • njitram 2 years ago

          You will probably quickly run into asks from customers about 'attribution modelling' concepts and multi touch attribution, you now probably created first touch attribution?

          • shanebellone 2 years ago

            I'm currently showing first touch but can add a section that depicts last touch too. It's just a question of presentation. It would take ~15 minutes to make the change.

            Edit: Took 25 minutes but I pushed an update to the example report to highlight this point.
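
For readers unfamiliar with the two models, here is a small sketch of first-touch versus last-touch attribution over one conversion path. It is an illustration, not Pathview's report logic. It assumes each hit carries an absolute referrer URL (or none), and the name `attributeConversion` and the `'direct'` label are made up.

```javascript
// Each hit is { page, referrer }, ordered oldest first and ending at the conversion.
// A hit counts as a "touch" when its referrer is on a different host than the site itself.
function attributeConversion(hits, siteHost) {
  const touches = hits
    .filter((hit) => hit.referrer) // hits without a referrer carry no source
    .map((hit) => new URL(hit.referrer).hostname)
    .filter((host) => host !== siteHost); // ignore internal navigation
  if (touches.length === 0) return { firstTouch: 'direct', lastTouch: 'direct' };
  return { firstTouch: touches[0], lastTouch: touches[touches.length - 1] };
}
```

Both results come from the same stored path, which fits the remark above that showing last touch is mostly a question of presentation.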

            • njitram 2 years ago

              Nice! There are quite a few other ways that advanced marketers want to analyze this if you google for attribution modelling, but first and last touch is already a great start

t0mas88 2 years ago

I think you should add a bit more information on how you calculate the metrics without using cookies. For example how do you count unique visitors if you don't track any user information and don't set any cookies?

  • shanebellone 2 years ago

    I'm using a salted hash to calculate unique visitors. The cleartext IP is never captured or stored and is only available because it's required to return a response.

    What is the right amount of information to share? Most users won't care but others will. I want to simplify rather than complicate. Finding balance is difficult.

    • closewith 2 years ago

      > I'm using a salted hash to calculate unique visitors.

      Where do you store the salted hash?

      • shanebellone 2 years ago

        The salted hash is stored in a database.

        • closewith 2 years ago

          Where is it stored on the client? Or is it a salted hash of the IP address used to connect? If so, the recent precedent from CNIL wouldn't consider that exempt from GDPR cookie banner unless obscured via a first-party proxy.

          • shanebellone 2 years ago

            It's the salted hash of the IP used to connect.

            I did build my approach around the Internet's backbone to future proof. Banning my approach would fundamentally break the Internet. I've made a note and will read into the precedent. Thank you for the research topic.

            • closewith 2 years ago

              > Banning my approach would fundamentally break the Internet.

              That’s simply false. The legal basis for processing an IP address in order to serve an HTTP request in the EU is Art. 6(1)(b):

              > processing is necessary for the performance of a contract to which the data subject is party or in order to take steps at the request of the data subject prior to entering into a contract;

              This does not give you the right to use that data for any other purpose.

              The approach you’re using is called fingerprinting in the industry (relatively naive fingerprinting given it only uses IP address) and it is the ongoing subject of enforcement action in the EU. Nothing about the approach is compliant.

              Frankly, if it was this easy to forgo the regulatory impact of the GDPR, every analytics service would do it (many shady ones do).

              • shanebellone 2 years ago

                Pathview does not fingerprint. Full stop.

                The salted hash (of the IP address) is used for one purpose: to count unique visits.

                • closewith 2 years ago

                  To associate individual page views to a single real IP address. That is fingerprinting.

                  • shanebellone 2 years ago

                    Pathview does not associate individual page views to single real IP addresses.

                    • closewith 2 years ago

                      Quite honestly, you are not competent to make that determination.

                      For the purposes of compliance with the GDPR, Pathview does associate individual page views with a single real IP address. From the precedent set by the CNIL ruling on Google Analytics in July, the only way for an analytics tool not to process personal data in the form of an IP address is to relay the request via a first-party proxy (and strip the referer).

                      Your analytics tool cannot accept HTTP requests from the users browser and be compliant with the GDPR without gaining consent. Serving an HTTP request made for the purposes of analytics is processing personal data and there's no legal basis for the processing without consent.

                      > The fundamental problem that prevents these measures from addressing the issue of access of data by non-European authorities is that of direct contact, via an HTTPS connection, between the individual's terminal and servers managed by Google.

                      > The resulting requests allow these servers to obtain the IP address of the Internet user as well as a lot of information about his terminal. This information may realistically allow the user to be re-identified and, consequently, to access his or her browsing on all sites using Google Analytics.

                      > Only solutions allowing to break this contact between the terminal and the server can address this issue. Beyond the case of Google Analytics, this type of solution could also make it possible to reconcile the use of other analytics tools with the GDPR rules on data transfer.

                      Source: https://www.cnil.fr/en/google-analytics-and-data-transfers-h...
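
The first-party proxy idea can be sketched as a pure function that decides what the site's own server would pass on to the analytics host. This is an illustration under assumptions, not a statement of what satisfies the CNIL: the function name, the `collector` endpoint, and the header allowlist are all hypothetical. Because the proxy makes its own connection, the analytics host sees the proxy's address rather than the visitor's.

```javascript
// Request headers allowed through to the analytics host (an assumed, deliberately short list).
const FORWARDED_HEADERS = new Set(['accept-language']);

// Builds the request a first-party proxy would send upstream.
// `collector` is a hypothetical analytics endpoint.
function buildUpstreamRequest(incomingTarget, incomingHeaders, collector) {
  const incoming = new URL(incomingTarget, 'https://site.invalid');
  const upstream = new URL(collector);
  // Only the page URL is passed on; a referrer parameter, if present, is dropped.
  const page = incoming.searchParams.get('url');
  if (page) upstream.searchParams.set('url', page);
  const headers = {};
  for (const [name, value] of Object.entries(incomingHeaders)) {
    const lower = name.toLowerCase();
    // Cookie, Referer, X-Forwarded-For and anything else not allowlisted is left out.
    if (FORWARDED_HEADERS.has(lower)) headers[lower] = value;
  }
  return { url: upstream.toString(), headers };
}
```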

                      • shanebellone 2 years ago

                        I've removed the hashed IP address. It is no longer stored and therefore cannot be processed.

                        As I've said multiple times, I will be consulting with a lawyer to ensure Pathview is in compliance with relevant privacy laws. You might be a lawyer (who knows), but I am definitely not.

    • njitram 2 years ago

      I think most people (including me) want to know how it works so they can check whether it really is (somewhat) GDPR compliant. For example: is data stored in aggregate or also as individual records (which would fall under GDPR), are individual users tracked with some kind of identifier, how fast is the identifier deleted, is it stored on disk or only in memory, who can access the identifier, is the identifier pseudonymized, etc.

      • shanebellone 2 years ago

        I'm happy to answer any question about what data is acquired and how it is acquired. I cannot answer every question about how my system processes the data. I built a novel algorithm that represents economic leverage.

        Each hit is stored without personal data but including a salted hash representing the IP. Users are not tracked and are not assigned any type of individual identifier.

        • t0mas88 2 years ago

          OK, that's what I was looking for. Storing hits with any kind of identifier is considered processing of personal data under GDPR. It doesn't matter whether you have assigned the identifier or whether you use one that the user already has (IP, device ID etc). Hashing / salting the identifier does not change that if it's still unique.

          The way to make it compliant is to ask permission for using the data. Or doing your analysis without any user identifiers, but that doesn't get you much useful insights.

          • shanebellone 2 years ago

            To be clear, I need to count unique visits without using a salted IP address even when it cannot be attributed to an individual or used to track them across multiple websites?

            I do have an idea that might work for this scenario. If I can calculate unique visits differently, I can drop the salted hash from the database too. I'm guessing that should be sufficient to satisfy most privacy conscious users.

            • njitram 2 years ago

              I think it's fairly fundamental; to see if a visitor has accessed one page, and later on another page, you need to track that visitor somehow. So you need some kind of identifier. Even using an IP address (hashed or not) or assigning a random ID all falls under GDPR regulation. Alternative 'tricks' like link decoration could maybe work, but you have to rewrite all URLs, which is very error prone. Creating CNAMEs for every customer is another option, called CNAME cloaking, but it has other drawbacks and it's probably also not GDPR compliant. It would definitely be interesting if there were a solution for this problem, as I agree that there are very valid use cases for attribution, but it's very hard to do because (very limited) tracking is almost mandatory for it. You could work 'around' legislation by checking where a visitor comes from and tracking (only) in those regions where it's allowed, and asking for permission in other areas? You will miss visitors, but you can potentially extrapolate counters to compensate.

              • shanebellone 2 years ago

                I can count unique sessions without using or storing the IP address or an identifier. Maybe that represents a fair middle ground?

                Edit: I implemented this approach. It's less accurate but removes the need for any representation of the IP address.

                • njitram 2 years ago

                  Cool. I'm not sure how you did that, of course, and if you want to grow this you should probably get an external code audit or something like that. But great job, also on how fast you change stuff!

                  • shanebellone 2 years ago

                    Thanks! The core code (capture and post-processing) sits around 250 lines. It's short enough to retain which makes iterating quick and fairly painless.

                    Considering I'm about 2-months in, I'm happy with the progress and general direction.

                    • njitram 2 years ago

                      Nice! If you want to discuss more, happy to talk, see my profile for contact info

TekMol 2 years ago

What is the motivation to do the tracking without cookies?

  • shanebellone 2 years ago

    I hate cookie banners.

    • madeofpalk 2 years ago

      fwiw 'cookie banners' aren't about the technology cookies, but rather about gaining consent for tracking and storing data on users. Even if you do this without cookies, if you still track users (like for attribution) for non-essential purposes, then you could still need consent from a "cookie banner".

      a hash of an ip address could still be 'personal data' under the eyes of gdpr.

      • korlja 2 years ago

        > a hash of an ip address could still be 'personal data' under the eyes of gdpr.

        We did something similar for a project, which got approved by the relevant data protection officer: hash(IP + daily secret) as an identifier in the logs. This will be used to count unique visitors, the wraparound at 24:00:00 didn't matter to us. The daily secret is just a random number that our one (small setup) application server generates each day. It is never written out to disk or database, so an appserver restart also recreates that secret, it is strictly kept in RAM. That way, we could argue that, barring extreme measures like attaching a debugger to get the secret, we technically prevented deanonymisation.

        But that was just a small-scale project, has never been tested in court and the usual YMMV, IANAL, ...

        Edit: I think some webservers can be configured to do something similar
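
The daily-secret scheme korlja describes could be sketched like this, assuming Node's built-in `crypto` module. It uses HMAC-SHA-256 keyed with the secret rather than a plain hash of `IP + daily secret`, and it counts in memory rather than writing identifiers to logs; both are substitutions made for this example. The class name and the injectable `clock` are hypothetical.

```javascript
const { createHmac, randomBytes } = require('node:crypto');

class DailyUniqueCounter {
  // `clock` is injectable so the rotation can be exercised without waiting a day.
  constructor(clock = () => new Date()) {
    this.clock = clock;
    this.day = null;
    this.secret = null; // lives only in memory, never written to disk
    this.seen = new Set();
  }

  // Replaces the secret when the UTC date changes; the old one is simply discarded.
  rotateIfNeeded() {
    const today = this.clock().toISOString().slice(0, 10);
    if (today !== this.day) {
      this.day = today;
      this.secret = randomBytes(32);
      this.seen.clear();
    }
  }

  // Identifier that could be written to a log line instead of the address.
  identifier(ip) {
    this.rotateIfNeeded();
    return createHmac('sha256', this.secret).update(ip).digest('hex');
  }

  // Records a visit and returns the number of distinct addresses seen today.
  record(ip) {
    this.seen.add(this.identifier(ip));
    return this.seen.size;
  }
}
```

A process restart creates a fresh secret as well, matching the behaviour described above, so counts restart at that point.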

      • shanebellone 2 years ago

        I appreciate and understand that. I responded with this point to another comment.

        GDPR does aggressively define what can and cannot be done. As far as I know, I'm adhering to their definitions and guidelines. I will consult with a lawyer to confirm before I make claims about GDPR or other laws.

        My goal is to make the best product possible given the constraints set. I believe there is middle ground between user privacy and analytics.

    • TekMol 2 years ago

      There is a popular misconception "If you use cookies, you have to show a cookie banner". The origin is a complete misunderstanding of privacy laws (GDPR etc) around the world. Privacy laws are not about technology. They are about privacy.

      Seeing your answer, I have the feeling that this misconception is the basis for your project.

      • shanebellone 2 years ago

        Are you specifically referring to the difference between 1st party and 3rd party cookies?

        • gildas 2 years ago

          He means that cookies are an implementation detail; using alternative techniques changes absolutely nothing regarding the law. And it is perfectly legal to store cookies (e.g. a session token) without asking the user's permission if it is within the law, i.e. if you don't use it for tracking users without their consent.