Facebook engineers: we have no idea where we keep all your personal data

216 points by mikece 3 years ago

ctur 3 years ago

In many ways you can think of large, long-living tech companies not unlike old cities like, say, London or Paris. The buildings and roads you see are built on top of older buildings and ruins. The streets are weirdly shaped and intersect at odd angles because they were made hundreds of years before and adapted over time as needs evolved. There are catacombs underneath sidewalks and no one genuinely understands it all nor does a single reference exist that explains it.

Literally everything predates everyone who lives there. Generations and generations of original designers, architects, and laborers have arrived, plied their trade, and moved away. There are people who are experts in certain parts, and who can build a new skyscraper at any given spot, but it is just layering and organic growth.

The emergent complexity of centuries of being lived in and adapted belies easy understanding.

Large tech companies are similar. You just can't understand how "it" all works. If you were to build it from scratch, perhaps you could, because it would be simpler and clearer, but nothing was made with the current state in mind. It evolved and adapted over time.

So reading this, I am not surprised. I think you'd get the same answer about many other aspects of data, code, system history, etc at any other 10+ yr old tech giant.

borbulon 3 years ago

yeah "favela" is usually the metaphor I use, but it's the same general gist, if a little messier.
CaptainTaboo 3 years ago

It's not just the tech giants. I work for a company that is fairly dominant in its particular industry (If you're in the US, you've probably done business with one of its customers). Much of our growth has come through the acquisition of other companies, each of which gathers its own user and customer data. From my perspective in IT operations, I don't have any visibility into how we capture user and customer data, but I know we're big enough to track a user across the various products and websites, but still siloed enough that internally we still refer to the various business units as if they were separate organizations, each holding its own subset of the data. To sum up, this description of Facebook's internals doesn't surprise me at all.
kixiQu 3 years ago

I work for a large long-living tech company. Where legal compliance and security are concerned, "it is just layering and organic growth" is not something you get to say. If we were speculating about such a thing on a coffee break, maybe we'd get the kind of answer given here, but if a single reference doesn't exist for something that is necessary for compliance, people get paged until it does.
(There may be other things taken as seriously I'm not thinking of, but compliance and security are the two I've seen drag people out of their beds by their ankles)
- encoderer 3 years ago
  
  Yes and it’s a good thing that most industries do not carry this governance burden.
  
  mschuster91 3 years ago
  
  If more industries did carry this governance burden, we'd see a lot fewer cases of companies getting hacked left and right.
  For way too many companies, "IT" is just another cost center that doesn't bring in any profits and so is barely kept at life support levels. And when something inevitably happens, it's rarely the C-level that gets to go to the unemployment office, it's more one poor soul who has complained for years that he needs more staff and budget to at least get the most egregious bullshit fixed.
  
  phpisthebest 3 years ago
  
  I doubt that, many of the companies that have been hacked were governed by these regulations, and they spent money governing to the regulations not governing to security which is not the same thing
  If I was burdened by regulation and red tape I can assure you my org would be less secure not more so
  
  seti0Cha 3 years ago
  
  Hah, I might have agreed before I worked on a credit card processing system. There were plenty of rules alright, but it's possible to abide by all the rules without actually making a secure system. Some of the rules made it harder to create a secure system because they did not reflect the state of the art, or they made adding additional security too burdensome to implement. In the end, you can't force competence through rules.
  
  closewith 3 years ago
  
  > In the end, you can't force competence through rules.
  I mean, we have in all other forms of engineering. Why is software different?
  
  bigbillheck 3 years ago
  
  Sounds like a pretty good argument for 'software engineering is not engineering'.
  
  closewith 3 years ago
  
  I think we can all (mostly?) agree that most of what's called software engineering doesn't live up to the standards (including responsibilities of the developers/engineers) of the other formalised engineering disciplines.
  
  seti0Cha 3 years ago
  
  The software world evolves much faster than any of the engineering disciplines and has fewer physical constraints. The requirements for software are rarely fully specified when construction starts. Software is faster to build, deploy and fix so getting it right the first time is not as important. The environment in which it is deployed changes rapidly and in entirely unexpected ways, which is not the case for things built by engineers. There are plenty of differences. In practice, software is more like a craft than an engineering discipline.
  
  closewith 3 years ago
  
  > The software world evolves much faster than any of the engineering disciplines and has fewer physical constraints. The requirements for software are rarely fully specified when construction starts. Software is faster to build, deploy and fix so getting it right the first time is not as important. The environment in which it is deployed changes rapidly and in entirely unexpected ways, which is not the case for things built by engineers.
  These are all artefacts of software development as we practice it, rather than truisms.
  > In practice, software is more like a craft than an engineering discipline.
  Maybe we should reserve the term software engineering for the applications of software development where rigour is involved?
  
  seti0Cha 3 years ago
  
  > These are all artefacts of software development as we practice it, rather than truisms.
  Well, they're not laws of nature or anything. But pragmatically, how could it be otherwise? When it comes to bridges and buildings, correctness beats timeliness, but when it comes to software the opposite is usually true. There are exceptions of course (e.g. medical software)
  > Maybe we should reserve the term software engineering for the applications of software development where rigour is involved?
  Agree.
  
  kixiQu 3 years ago
  
  They do have a burden. Following the law is table stakes. Now, are the laws relevant to a business the same across industries and around the globe? No. Does enforcement have its shit together equally everywhere? Obviously not. But making every team in a company classify the XYZ their component stores is absolutely doable and frankly an easy lift when the alternative is not doing business that involves XYZ.
  
  PuppyTailWags 3 years ago
  
  I think specifically any industry working in personal data collection of minors should absolutely have this governance burden. If Meta knows enough about their users to know Instagram is giving teenage girls mental illnesses, Meta engineers should absolutely be hoisted out of their beds if they don't have precise, well-defined, secure data stores.
  
  conductr 3 years ago
  
  The problem with your example is the engineers have built exactly what the company wanted. Perhaps the mental illness was an unintended side effect (perhaps known). But either way a decision was made that they want X and so the engineers built X. What you're talking about is banning X, which is something a bit different than a governance burden/compliance issue as those can generally be followed to a specification and directly measured for deficiency (e.g. how to handle HIPAA data, how long to store records, etc.)
  If the issue is about teenagers being too immature for the product X, then typically we ban it or age restrict it using legislation (eg tobacco, alcohol, porn, etc)
  I think we as a society need to figure out where social media fits in and how we want to contain it and it's still early days for that. As much as I like the idea of an age restriction, I look around the world and see 6 month olds playing with iPads, 6 year olds on tiktok, etc. as the more 'average' way to parent and wonder if I'm in the minority. My kid is 4 and has never touched a device as entertainment, we taught him how to eat out at restaurants/be in public instead of pacifying him. But that seems like an old fashioned parenting method.
  Ok, he's technically used it briefly as a viewfinder for the camera and to facetime his grandparents
  
  TomSwirly 3 years ago
  
  > it's still early days for that.
  We've had social networks for twenty five years.
- nicolas_t 3 years ago
  
  Counterpoint, in my career, I've seen a lot of Security Audits that seem on paper to be very thorough and detailed but when you dig deeper, you can see that it only touched the surface and the auditor missed a lot of things.
  
  delusional 3 years ago
  
  I have in fact only seen these kinds.
  
  vidanay 3 years ago
  
  Same with certifications. (e.g. ISO 900x certification - you can still build a shit product, but it will be a very well documented shit product.)
  
  tremon 3 years ago
  
  I'd guess there's a difference between an external audit performed for regulatory or financial reasons, and an audit by the internal legal team because of a pending legal case or regulatory scrutiny.
  
  bravetraveler 3 years ago
  
  There's a very wide gap in what needs to be covered as a result of this consideration!
  Disaster recovery/business continuity? Absolutely, you need it documented so that someone 'off the street' could be helpful.
  Implementing yet another service gateway? Probably going to take a lot of conversations with people familiar with the streets involved.
  Auditors won't care until it's realized. They'll want things like interconnection diagrams -- painting a picture so they can follow your adherence.
  We have very high compliance requirements where I work -- about half of my time is spent on it.
  They're less worried about how the bread is made, more that you can keep making bread the way you say you do.
  
  kixiQu 3 years ago
  
  I mean, sure, security is complicated, lots of judgment calls, but within this mindset, if a team/org/company said "we have no idea where we're using log4j" it was then their job to have someone working around the clock until they figured that out and fixed it. Figuring out how to meet new requirements that weren't around when systems were begun is, IMO, fully just part of the job. For FB engineers to be throwing up their hands in front of a legal system like oh nooooo how could we possibly track things seems like a stalling technique and makes all of tech look bad.
  
  ACow_Adonis 3 years ago
  
  as per my other comment, in most of the sufficiently big or historical cases you have to distinguish between two things:
  - reality
  - the socially acceptable fiction in the report
  the exercise would be to create an output that allows people to pretend we know where all instances of "log4j" were used, or that sufficient depth and resources have been expended on such.
  but in a sufficiently large and complex organisation, we (and by we I mean technically able, knowledgeable or sufficiently aware individuals) know that the content of such is likely to be as factually true as reading a historian's report entitled: "enumeration of all points in past century where cutlery has been used".
  
  kixiQu 3 years ago
  
  On a spectrum of ridiculousness vs. necessity running from "enumerate where cutlery was used for the past one hundred years" to "keep tabs on the radioactive material we have in storage", it is a choice to vacuum up user data and treat it as the former rather than the latter. It is not unavoidable; it is that Facebook does not want to avoid it. If they did want to avoid it, they could do so in the same way that my employer woke me up to verify that there was no log4j running on some machines that did not have Java installed: unpleasant campaigns, building automated checks, checking the checks, pouring manual effort into the gaps, paying the costs. Slowing down development, maybe. Shifting resources. Rendering some designs infeasible (okay, well, not for log4j, but for a compliance thing I'm thinking of).
  It is very important that we are precise here: they are trying to come up with a story to make this about technicalities, and it is actually about their priorities. People who've done large-scale engineering need to be clear that even within a large organization, things do get hunted down if and only if the organization prioritizes them. Measurements and control mechanisms are built. Systems, social and technical, are designed.
  And if they're not, Facebook can't honestly say oh but it was so hard it's impossible, we're just too big. We're not talking about detecting I-know-it-when-I-see-it obscenity or whatever social thing people hope AI will fix for them, we're talking about knowing how data is being plumbed from one place to another, and where it lives when it's at rest. It's an embarrassment that this is being presented as a Hard Problem.
  
  bentinney 3 years ago
  
  That was an impressively long threaded argument just to agree with what the original poster stated. Essentially this boils down to the following: cities and large tech companies are hugely complicated endeavors and if enough money and/or manpower is placed into a "documentation project" we could fully describe either system (using a combination of reality and socially acceptable fiction.) However, it would most likely be outdated by the time the project was finished - unless we froze the city/company in place which would most likely kill it.
  
  kixiQu 3 years ago
  
  No, I do not agree with what the original poster stated.
  I think it's possible to get too abstract in discussion with an idea of "documenting a system" since that can of course shoot for a level of detail that can't be maintained. Here is a practical example.
  https://en.wikipedia.org/wiki/Personal_Information_Protectio...
  When this law was to come into effect, every company that does business in China had to audit where they were storing all Chinese PII in order to figure out what they might need to change. For large companies, this involved making everyone perform this audit for their own little fiefdoms.
  Being as this wasn't the kind of thing one can have lawyers go challenge in court, there were two options:
  1. Actually do an audit (and if you're not dealing with exceptionally poor engineering, this should not have been as hard as keeping track of dependency libraries), rebuild everything from scratch (wasteful), or nuke absolutely anything that might have a chance of having Chinese PII (...unlikely from a business perspective).
  2. Lie (internally / externally) and thereby break the law.
  People in these comments seem to love the phrase "socially acceptable fiction" and I want to be very crisp that we are dealing with a clean dichotomy. You do the audit, or you lie and break the law.
  Yes, I understand there are some among us who have so little integrity that they'll shrug at the latter option, that they'd fake the VW emissions test, that they'll cheat and embezzle and tell themselves that it was all inevitable anyway and anyone in their places would do the same. But this is not actually true. We would not all do this. Pretending this is normal variance gives cover to scummy people.
  I therefore object to the root comment's characterization of this kind of thing exceeding real understanding as a natural consequence of scale/time. Yes, in terms of how it can come about that no individual knows this stuff, it's accurate – but only in the absence of the institution itself caring to pay attention. I object to succeeding characterizations that it is an inevitable attribute of social systems that meaningful control is impossible, because even educated and carefully hired professionals can apparently only be expected to behave with the ethics one might expect of a sixth grader who would rather let Chegg write his book report.
  From the article, this is what FB is saying:
  > “We do not have an adequate level of control and explainability over how our systems use data, and thus we can’t confidently make controlled policy changes or external commitments such as ‘we will not use X data for Y purpose,’” the 2021 document read.
  This may be accurate about Facebook; I would not know. What I do know is that it is not an inevitable consequence of scale and lifespan.
  
  TomSwirly 3 years ago
  
  Then these organizations should be torn down, their executives jailed, and replaced by organizations that are not deliberately designed so compliance with data privacy and retention laws is completely impossible.
  
  krinchan 3 years ago
  
  Someone has been the subject of a SOC audit.
- ACow_Adonis 3 years ago
  
  Counterpoint: the conclusions of legal, compliance and security rest upon social fictions and power structures, but fundamentally at a certain scale, all things act like the original commenter said.
  In my experience with the above, people get paged until a "socially plausible fiction or resolution" can be created.
  There's several versions of these, one of which is to engage in enough rituals and processes so that we all pretend the task has been completed and the requisite outputs and points of reference exist. An archaeologist knows they don't and they can't and that we lay on bedrocks of speculation, history, ignorance and myth, but we don't have to worry too much about the fiction we create because we also know that disproving such mechanically is also extremely impossible/unlikely.
  And we have a host of other social functions to get out of the cases where it could, such as:
  - throwing people under the bus/human sacrifice/ finger pointing
  - contrition
  - plausible deniability
  - continued non-compliance and non-enforcement
  - bribery
  - bullshit jobs
- marcosdumay 3 years ago
  
  > if a single reference doesn't exist for something that is necessary for compliance, people get paged until it does.
  And you will trust the reference as if it perfectly described reality. Because, really, there is no other way to deal with it. You can test a thing or two and improve it, but you can't test for the absence of necessary information.
  The Facebook situation is inexcusable. But no company has perfect knowledge about this either ("perfect" here carrying a lot of weight).
- vineyardmike 3 years ago
  
  I’ve also worked for big tech companies. On one of my teams, all those docs are written by college hires as a “learning opportunity” and they’re surface level only. Outside of medical or financial systems, I’m guessing most of those docs aren’t worth their weight in bytes.
  Here is one conversation I’ve heard: “We stream our logs [full of PII] to a central repo. Do we need to include that in our docs?”
  “No. It’s just logging. It’s a standardized tool we use. It’s probably assumed. Just include the database and known clients.”
markstos 3 years ago

You can hire one or more data privacy people whose whole job it is to track this. They get an inventory of all the systems and talk to the managers of every system, find the answers and document them.
Meta is choosing not to know or pretending not to know. They have the resources to know where your data goes.
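A minimal sketch of the inventory idea above, assuming each system is recorded with an owner and the categories of personal data it holds. The `SystemRecord` type, its fields, and the system names are hypothetical and not anything Meta is known to use; the only point is that unanswered entries become an explicit to-do list instead of a shrug.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SystemRecord:
    """One row in a hypothetical data inventory."""
    name: str
    owner: str
    # None means nobody has answered yet; an empty set means "confirmed: none".
    personal_data: Optional[set] = None


def undocumented(inventory):
    """Return the names of systems whose personal-data answer is still missing."""
    return sorted(rec.name for rec in inventory if rec.personal_data is None)


def systems_holding(inventory, category):
    """Return the names of systems documented as holding the given category."""
    return sorted(
        rec.name
        for rec in inventory
        if rec.personal_data is not None and category in rec.personal_data
    )


inventory = [
    SystemRecord("profile-db", "team-a", {"email", "birthday"}),
    SystemRecord("log-pipeline", "team-b"),          # not yet answered
    SystemRecord("static-assets", "team-c", set()),  # confirmed none
]

print(undocumented(inventory))
print(systems_holding(inventory, "email"))
```

Keeping "not yet answered" distinct from "confirmed none" is the design choice that matters here: it stops an unexamined system from silently passing as clean.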
- UncleMeat 3 years ago
  
  I think you are underestimating the complexity of this task by about three orders of magnitude. This is the way to do it, but you can't do it for a company of this scale with a handful of people.
  
  RHSeeger 3 years ago
  
  But pretending they can only afford "a handful of people" to work on the problem is a little silly, too. They don't have the answers because they choose not to devote the resources to get the answers; not because they are unable to devote the resources.
  
  UncleMeat 3 years ago
  
  I agree with that. Companies like Facebook should have a large org responsible for this problem. My point is that the problem is way more complex than some might think.
  
  citiguy 3 years ago
  
  Agreed - when I worked in the financial industry there was something called data lifecycle management. Since keeping data around indefinitely meant you could be held liable at any time for anything that happened in the distant past, companies are motivated to track data and delete anything that's no longer necessary for business as soon as possible.
  No such law exists for Facebook, so they don't care. Someone needs to pass a law with a big fine attached and then things will change.
  
  ImPostingOnHN 3 years ago
  
  it's also way simpler than some might think, in that it's simply a matter of Facebook choosing not to devote enough resources to it, and thus choosing not to know
  
  dylan604 3 years ago
  
  Not my problem of how many people because of evilCo's size. That's their problem, and I shouldn't even be aware of it being a problem.
  
  tremon 3 years ago
  
  Then, first and foremost, it is the task of these people to bring down the complexity of the task. You can't reason your way out of legal obligations by throwing your hands up and saying it's too hard.
- klooney 3 years ago
  
  I mean, I've worked with those teams for GDPR stuff. They have no context, and no access or ability to trace data flow, so they often write up very partial and incomplete reports because no one else cares particularly deeply about the problem.
- larnik 3 years ago
  
  Hiring one person to do this at a company of this scale will give you results in the next decade. BTW, when the report is complete, it will no longer be accurate. Sure you can scale the process by putting more people on it, but it won't scale linearly. Additionally you will need buy in from all the system managers to dedicate time to assist the new data privacy team. It is doable, but comes at a cost.
mistrial9 3 years ago

this useful excuse-insight will be provided to all compliance personnel at the next staff meeting, in writing.
BiteCode_dev 3 years ago

Maybe, but one can find out. You can follow the chain of calls of any software, and they have access to everything, as well as the whole history.
It's like when Buffett says in an interview that it's very hard to know where all the money is in the financial system.
Sure, it's a complex system, but it's more of a matter of incentive.
- withinboredom 3 years ago
  
  You can follow the chains to a degree, but that assumes you have access. At work, you can follow things until it gets to a spool file, but you won’t have access to the system/code that reads the file.
- IanCal 3 years ago
  
  Unless it was built with this in mind, you absolutely don't have access to this.
  Full data lineage is a complex task.
  A very simple example would just be that a database is destructively updated by some script at some time. What's the log here? What ran? Can you actually replicate it? Do you have full backup of the data before?
  It's so easy to lose this.
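A minimal sketch of what "built with this in mind" could look like for the destructive-update example, assuming a SQLite database and a hypothetical `lineage_log` table (the table, column, and script names are all made up for illustration). The update records what ran, what it touched, and the prior value in the same transaction as the overwrite.

```python
import json
import sqlite3
from datetime import datetime, timezone


def update_with_lineage(conn, script_name, user_id, new_email):
    """Overwrite a user's email, first recording what ran and the prior value."""
    with conn:  # the log entry and the update commit (or roll back) together
        row = conn.execute(
            "SELECT email FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        before = row[0] if row else None
        conn.execute(
            "INSERT INTO lineage_log (ran_at, script, target, before_json) "
            "VALUES (?, ?, ?, ?)",
            (
                datetime.now(timezone.utc).isoformat(),
                script_name,
                f"users/{user_id}/email",
                json.dumps(before),
            ),
        )
        conn.execute(
            "UPDATE users SET email = ? WHERE id = ?", (new_email, user_id)
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute(
    "CREATE TABLE lineage_log (ran_at TEXT, script TEXT, target TEXT, before_json TEXT)"
)
conn.execute("INSERT INTO users VALUES (1, 'old@example.com')")
conn.commit()

update_with_lineage(conn, "fix_emails.py", 1, "new@example.com")
log = conn.execute("SELECT script, target, before_json FROM lineage_log").fetchall()
print(log)
```

This only helps for writes that go through the wrapper; an ad hoc script that talks to the table directly leaves no trace, which is exactly the failure mode described above.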
- ACow_Adonis 3 years ago
  
  As someone who's arguably professionally employed to do things like tell people where money is in the financial system, no, you can't.
  In the same way you can't enumerate the atoms or molecules in a human body. sure, you can come up with good estimates and summaries that are useful, but not the actual full and truthful state of affairs.
  In addition, you are a self-referential participant in the system: so acting in an attempt to observe such likely immediately makes your information at least partially incorrect.
  Additionally, if we can't get full transparency or introspection into machine learning models or neural networks (and we can't, because I don't believe we even possess the theoretical knowledge of how to do so), it should also be readily apparent that we can't actually know the true lineage or dependencies of systems that implement such in production.
  Not to mention the difficulties as the inputs/outputs of such pass in and out of the notional entity or system being observed.
dvfjsdhgfv 3 years ago

As far as code goes, it's true for many companies. As for data, it was similar in many large European companies in the pre-GDPR era. Today, one crucial question is always asked: when you work with data, is this personal data? If it is, you need to deal with it in a special way. Personal data is both an asset and a liability.
Most companies found a way to do it. It was a long and often very painful process, involving everything from data entry to backup processes and procedures, but somehow we managed to do it.
adrianh 3 years ago

This is a seductive notion, but it's indeed possible for a tech company to understand how a single piece of data goes through its systems.
It might take a while, and it might involve weeks of code spelunking and dozens of conversations with engineers — but it is indeed possible.
A codebase is not encumbered with the physical constraints of an old city. It's not like trying to figure out the physical state of a certain area of London 30 meters deep, which might be impossible without digging and therefore disrupting other structures in the area. A codebase is simply instructions, which are readable and understandable (with the exception of black-box machine-learning models, but those at least have defined inputs and outputs).
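Once the spelunking has produced a list of "system A writes this field into system B" edges, following one piece of data is a plain graph traversal. A minimal sketch, assuming a hypothetical flow map (every system name below is invented for illustration):

```python
from collections import deque


def reachable_systems(flows, source):
    """Breadth-first search: every system the data can reach from `source`."""
    seen = {source}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nxt in flows.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(source)
    return sorted(seen)


# Hypothetical edges: "system A writes the field into system B".
flows = {
    "signup-form": ["profile-db"],
    "profile-db": ["search-index", "log-pipeline"],
    "log-pipeline": ["analytics-warehouse"],
    "ads-ranker": ["ads-cache"],
}

print(reachable_systems(flows, "signup-form"))
```

The traversal is the easy part; the answer is only as complete as the edge list, so the real cost sits in collecting and maintaining `flows`.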
- leros 3 years ago
  
  The thing with tech though is that it's easy to add new connections, connecting anything to anything.
  You might have two independent datasets that can't be stored together due to privacy or security regulations. The teams who work on that know and understand that. But some other team working on something completely unrelated might join those datasets for some non-nefarious purpose, which then creates a new dataset with those data combined. Now other teams who also don't know about the regulations might build on top of that combined dataset. They might even generate new data fields from the data that's not supposed to be joined. This stuff can propagate through layers of systems, queries, service calls, analytical engines, offline spreadsheets, etc. It's very difficult to keep track while keeping your engineering teams independent and nimble.
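One common mitigation for the join problem above is to have sensitivity labels travel with the data, so a derived dataset inherits the union of its parents' labels. A minimal sketch under that assumption; the `Dataset` type, the label names, and the `FORBIDDEN_PAIRS` policy are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    tags: frozenset  # sensitivity labels carried by this dataset


# Hypothetical policy: these label pairs must never end up in one dataset.
FORBIDDEN_PAIRS = [frozenset({"health", "advertising-id"})]


def join(name, left, right):
    """Derive a dataset whose tags are the union of its parents' tags."""
    return Dataset(name, left.tags | right.tags)


def violations(dataset):
    """Return the forbidden pairs fully contained in the dataset's tags."""
    return [pair for pair in FORBIDDEN_PAIRS if pair <= dataset.tags]


health = Dataset("clinic_visits", frozenset({"health"}))
ads = Dataset("ad_clicks", frozenset({"advertising-id"}))
geo = Dataset("store_locations", frozenset())

combined = join("visits_with_clicks", health, ads)
# Built later by a team that never heard of the rule:
downstream = join("regional_report", combined, geo)

print(violations(downstream))
```

The downstream team does not need to know the regulation for the check to fire, but this only works if every join path goes through something like `join`; an offline spreadsheet export drops the labels entirely.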
- rufus_foreman 3 years ago
  
  >> It might take a while, and it might involve weeks of code spelunking and dozens of conversations with engineers — but it is indeed possible
  If it takes a while it will be out of date by the time it is complete.
  A single request might directly impact a dozen separate systems, which in turn call other services. You won't be able to find anyone who knows much past the layer they directly interact with. All of these services are being worked on. Some older ones are in the process of being deprecated in favor of newer ones. Someone is already working on the design of the replacements for the newer ones. Changes are being pushed weekly if not daily. If you send a request and then send it again a minute later, it might go through a completely different set of services based on a feature toggle.
  You can't step into the same river twice.
- notacoward 3 years ago
  
  > it might involve weeks of code spelunking and dozens of conversations with engineers
  It's worse than you think. First, a little background: I worked on one of Facebook's largest storage systems for two years. So let's talk about some concrete examples.
  I knew the rebalancing code better than anyone by the time I left. This is a pretty essential bit of functionality, running regularly on any cluster that has been up for any amount of time and moving quite a bit of data each time. After a chunk was copied, my code would go to delete the old one. I know for a fact that the deletion could fail without my code even getting a useful error from lower layers. I found and fixed many such cases, but I'm sure more remained. That could leave an "orphan" chunk where nobody would know to look for it, and chunks were self-identifying enough that if enough such orphans existed they could be reconstructed into a whole block possibly containing user data.
  I knew the data-repair code almost as well. Same problem. I knew the system-repair code less well. Similar there too. In fact I was involved in pulling back "repaired" systems and using them to recover data that would have been lost, more than once. I watched other engineers get rewarded for cleaning up petabytes' worth of no-longer-reachable data (because of cases like I'd mentioned, or bugs, or whatever) that was still taking up space on our millions of disks - again, multiple times. And that's all just one storage system. I'll bet others had similar issues. Also, the problems with truly erasing disks and particularly SSDs are pretty well known. If a machine was taken out of our system entirely and repurposed for another one, or vice versa, there could still be data on its platters/chips that could be recovered with sufficient forensic effort. (These things were physically destroyed before leaving FB, and I've even seen the impressive machines that do it, but not between "lifetimes" within the company.)
  So knowing the current code is not enough. You'd have to know every past state of the code during a relevant timeframe, including what bugs it had, which is challenging to say the least. Every configuration detail, too. You'd have to know every rebalancing, reconstruction, or repair event that might have affected each disk. It really is like figuring out the physical state of an area in London. Nothing short of scanning every sector, even as more hardware enters and leaves the system almost every minute, could produce an absolute guarantee.
  I'm not saying it can't be done in any system of similar size and complexity. Just that it couldn't in Facebook's, even without malice, largely because so much code was written without that in mind and it's a really hard thing to bolt on afterward. I know there are people working toward it. It's just going to take longer than most people - even technical people - think, and saying it's done would be a lie.
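To make the orphan-chunk failure mode from the rebalancing example concrete, here is a toy sketch. It is not Facebook's code; `FlakyStore`, `rebalance`, and the node names are invented. It assumes one possible mitigation: journal the intent to delete before trying, then verify the result rather than trusting the lower layer to report errors.

```python
class FlakyStore:
    """In-memory stand-in for a storage node; deletes can fail silently."""

    def __init__(self, name, fail_deletes=False):
        self.name = name
        self.chunks = {}
        self.fail_deletes = fail_deletes

    def put(self, chunk_id, data):
        self.chunks[chunk_id] = data

    def delete(self, chunk_id):
        if self.fail_deletes:
            return  # simulates a lower layer that swallows the error
        self.chunks.pop(chunk_id, None)

    def has(self, chunk_id):
        return chunk_id in self.chunks


def rebalance(chunk_id, src, dst, pending_deletes):
    """Copy a chunk to dst, then delete it from src, journaling the intent first."""
    dst.put(chunk_id, src.chunks[chunk_id])
    pending_deletes.add((src.name, chunk_id))  # record before attempting the delete
    src.delete(chunk_id)
    if not src.has(chunk_id):  # verify instead of trusting the call
        pending_deletes.discard((src.name, chunk_id))


old = FlakyStore("node-1", fail_deletes=True)
new = FlakyStore("node-2")
old.put("chunk-42", b"user data")

pending = set()
rebalance("chunk-42", old, new, pending)
print(pending)
```

In a real system the journal would itself have to be durable and replayed by some sweeper, and it says nothing about chunks orphaned by older code that never wrote to it, which is the point made above about needing every past state of the code.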
Jenk 3 years ago

I agree with the point you are making, but I disagree with the conclusion.
A company like Meta that has been building and advancing the field of ML, AI, and even falling foul of the ethics/morality of large scale public manipulation and political campaigning has already demonstrated they have the wherewithal to find where their data exists. Period.
- citiguy 3 years ago
  
  The issue is more that they don't care. They don't really need to track it so they don't.
liampulles 3 years ago

To extend this wonderful metaphor, perhaps the least terrible solution is for the "city workers" to constantly log any issues as they come across them in their normal day-to-day work. It is the responsibility of the city then to proactively go and fix those issues as they come up.
jollyllama 3 years ago

To build on your metaphor though, there are cabbies with "the knowledge" in London, and probably similar folks in Paris. The way I have seen this work in tech is, you have to keep the people who know where the bodies are buried around for a long time, and others will use them as a resource to inform their efforts. The alternative, of letting them walk over the years, is that large sections of your stack will have "here be dragons" written over them and become no-go zones where your only option is to route not only engineering but business efforts around them.
slenk 3 years ago

Not tech giants that care about compliance...it's not undoable.
They are just poor stewards.
rockinghigh 3 years ago

> I think you'd get the same answer about many other aspects of data, code, system history, etc at any other 10+ yr old tech giant.
Processes to access personal data vary wildly among tech companies. At Apple, when a machine learning engineer wants to store and use data on the server side, they need to go through layers of approval from lawyers. Even when approved, it often comes with serious constraints about what can be done with the data. Meta is a lot more lax.
TomSwirly 3 years ago

What you appear to be saying is that large technology companies are designed from the ground up around flouting the laws about data retention.
Why is this OK?
weare138 3 years ago

That's just a truism at any tech company. The issue is Facebook/Meta was misleading about what data is being gathered, and the "Download Your Information" tool didn't actually show you what data FB had gathered about you. By their own admission they don't even know, yet users and regulatory agencies were knowingly misled into believing that information was accurate.
Rackedup 3 years ago

> The buildings and roads you see are built on top of older buildings and ruins
It's easy to delete old data that is never accessed, so I don't get your point...

handity 3 years ago

After reading some more of the transcript, I think the article does such a bad job of describing what was being asked for that it makes the court seem incompetent and Facebook actually rather reasonable. My original comment is below:

Government contracted construction workers: We Have No Idea Where Your Tax Money Goes

When tasked with answering the simple question "Which specific bricks did my tax money buy this year", the two veteran construction workers looked confused and tried to explain that that's not really how taxes work.

The special master at times seemed in disbelief, as when he questioned the engineers over whether any invoices existed for a particular road building contract. “Someone must have a receipt that says this is who the money came from that bought these bricks”

I'm not sure the analogy fits 100%, but it's the closest I could think of. This reads like the author thinks the Facebook algorithm is a human-readable decision tree that only takes into account the data of a single user at a time.

As usual, the problem is not data "collection" or "retention" or "privacy", but "creation". Regulation will always remain woefully inadequate to control such organizations, and the only solution is to adopt systems that don't spew user data everywhere in the first place.

twanvl 3 years ago

Money is fungible, personal data is not.
- jtbayly 3 years ago
  
  Ha! “Fungible” is the word I settled on, too.
- guywithahat 3 years ago
  
  You'd think that, but personal data piles up quickly and the individuals who make it up become dispensable. It's a little like saying any particular dollar is unique; I suppose it is, but if you get enough of them nobody cares. I think expecting Facebook to treat all of their data as if it's ITAR is unfortunately unrealistic. We're not all that unique or special, and if you use the internet your data has already been leaked or is otherwise public.
jtbayly 3 years ago

That’s an absurd analogy that doesn’t fit at all.
FB uses the data it can’t find or tell you anything about to successfully sell ad space to third parties targeted back at you.
Let’s make it concrete.
We know FB keeps track of which webpages you visit. We also know they use that data as a way to help target ads. That data isn’t in the data export they give you, as far as I can tell from a brief search.
Is there data coming in about which Instagram profiles you follow and what pics you clicked and liked? Is there other data they keep track of? Undoubtedly.
I’m not surprised nobody knows the answer to what all it might be. But pretending that data is fungible like money is just misdirection.
- nathanaldensr 3 years ago
  
  Exactly. If Facebook can target ads at specific users to begin with, then how can they possibly claim they don't know where the data is or how to find it? Then how do they know the users are being targeted "correctly" or at all? It's a ridiculous argument that falls apart under any scrutiny. Their entire business model relies on them knowing how to target specific users using huge graphs of information.
  
  closeparen 3 years ago
  
  Engineers know how to find a service or a data store that meets a particular need. If extraordinarily skilled, they might even know how to find the best one. Wizards know how to go exploring in the wilderness beyond an API boundary to track down particular problems. A god’s eye view of all services, all data stores, all relationships, and all flows can at best be machine generated. Only extreme zoom levels of tiny regions are of any meaning to a human reader. It’s the difference between going shopping and accounting for all the flows of goods and services in an economy.
- handity 3 years ago
  
  Agreed, Facebook definitely has the ability to provide timestamped interaction events as part of "Download my Data". Maybe legislation will force them to provide that information at some point, but that won't really have any effect on user privacy.
  The hearing seems to want to get more to the heart of the issue, but is inept to do so. The stored data isn't the concern, it's the predictive capability of models built on it. The decision of a predictive model isn't any more traceable to a single piece of data than a brick in a road is traceable to a given taxpayer, which is what I was trying to get at with the analogy.
  After reading some of the transcript, the special masters seem very on the ball, and it's more the spin of the article I take issue with.
- SliderUp 3 years ago
  
  But it is not absurd. Facebook doesn't target ads at you, Bob Jones.
  It takes all of Bob's activities and pushes them into bins based on how they characterize Bob: male, lives in Iowa, 18-25, etc. If he changes bins (moves to LA, for example), his data will contribute to different bins. This activity is disconnected from Bob at this point, and the data is aggregated away from single interaction events. "Bob visited foo.com" as a single event is gone at this point.
  The models grind on these aggregate data bins.
  Then when ads are targeted at Males who live in Iowa, aged 18-25 -- those ads get shown to Bob, because he is tagged with those tags.
  They don't "keep track of which webpages you visit", not for more than a day. Those events get pushed in large aggregate stores of activities pretty fast. These aggregate stores are vastly smaller than if you kept all the individual data, hence much cheaper.
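The binning described above can be sketched in a few lines. This is only an illustration of the idea as the comment states it, not Facebook's actual pipeline; the profile fields, the `segment_of` and `aggregate` helpers, and the sample data are all hypothetical.

```python
from collections import Counter

def segment_of(profile):
    """Map a user profile to a coarse segment key (hypothetical attributes)."""
    low, high = profile["age_range"]
    return (profile["gender"], profile["region"], f"{low}-{high}")

def aggregate(events, profiles):
    """Fold per-user events into per-segment counts.

    The user id is used only to look up the segment; it is not stored
    in the resulting bins.
    """
    bins = Counter()
    for user_id, site in events:
        bins[(segment_of(profiles[user_id]), site)] += 1
    return bins

# Hypothetical sample data.
profiles = {
    "bob": {"gender": "male", "region": "Iowa", "age_range": (18, 25)},
    "amy": {"gender": "female", "region": "Iowa", "age_range": (18, 25)},
}
events = [("bob", "foo.com"), ("bob", "foo.com"), ("amy", "foo.com")]
bins = aggregate(events, profiles)
```

After aggregation the bins hold counts per segment and site, and nothing in them names Bob. Whether the raw events are really discarded afterwards is the commenter's claim, which the sketch simply takes at face value.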
anigbrowl 3 years ago

> Government contracted construction workers: We Have No Idea Where Your Tax Money Goes
I think it would be a more meaningful analogy if you had posited two government accountants.

jerf 3 years ago

Kind of serves as a good example of "single central conspiracy" versus "many actors given common cause to act in a certain way" ways of thought. It's easy to imagine Zuckerberg going to work every day and laughing maniacally as he personally shepherds your stolen conversation with friends yesterday about your experiences with laundry detergents into The Facebook Info Vault and then personally uses that information to send you ads about Tide laundry detergent, but what Facebook really is by sheer necessity is a whole bunch of agents operating with their own goals. The end result is a massive machine that turns privacy violations into money, but there isn't necessarily a single place where the bad thing happens. In fact you could conceivably be taken on a tour of the system and agree that every individual component is acceptable, or that the vast bulk of them are OK and there's only a single-digit number out of thousands that are problematic.

Nobody could possibly manage Facebook as a centralized single entity, but it's hard to imagine it any other way from the outside.

jkingsbery 3 years ago

While I appreciate this distinction, and it is an interesting way to look at things, I don't think it's inevitable. I've never worked at Facebook, but I work for another large tech company. Every project at our company, particularly the ones that deal with personal data, must produce a threat model [1] before going into production. In another thread, someone claims "Well that's because the question is meaningless," but from a threat modeling standpoint an engineer needs to understand all aspects of "where" data is for the system under design. Where physically is it stored? What database is it stored in? And knowing that, one must look at what threats are available. Who has access to the data? What other systems have access to the data? How are those users/systems authenticated? Is that access logged?
Now, sometimes systems diverge from the design, and sometimes threat models are incomplete. But the exercise of generating a threat model makes understanding who manages data more manageable.
[1]: https://en.wikipedia.org/wiki/Threat_model
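As a rough illustration, the questions listed above map naturally onto a per-data-store record, where any field left unanswered is an open item in the threat model. The `DataStoreEntry` class, its field names, and the `open_questions` helper are made up for this sketch and do not describe any particular company's process.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class DataStoreEntry:
    """One data store in a threat model; None means 'not yet answered'."""
    name: str
    physical_location: Optional[str] = None    # where is it physically stored?
    database: Optional[str] = None             # what database is it stored in?
    human_access: Optional[List[str]] = None   # who has access to the data?
    system_access: Optional[List[str]] = None  # what other systems have access?
    authentication: Optional[str] = None       # how are they authenticated?
    access_logged: Optional[bool] = None       # is that access logged?

def open_questions(entry):
    """Return the names of the fields that are still unanswered."""
    return [f.name for f in fields(entry) if getattr(entry, f.name) is None]

# Hypothetical entry with only two of the questions answered.
entry = DataStoreEntry(name="profile-store", database="users_db", access_logged=True)
missing = open_questions(entry)
```

Here `missing` lists the location, access, and authentication fields, which is the kind of gap a review would presumably need closed before launch.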
- lupire 3 years ago
  
  > understand all aspects of "where" data is for the system under design.
  That's easy! But also useless! The data is everywhere. That's why every "where" was created. Everything is a potential threat vector and needs to be protected by layers of fallible security.
lupire 3 years ago

Not even conspiracy. Governments, free markets, the natural world all work the same way. Many semi-independent actors making local optimizations, with varying levels of sophistication in organization and hierarchy. Many emergent properties. IT in the large is like physics, chemistry, biology, psychology, anthropology, and sociology. It's not building one precise bridge over and over again.

personjerry 3 years ago

Well that's because the question is meaningless - What does it mean "all your personal data"? What does "where" mean? Physically? Which tables? Which data is relevant? Friends? Friends of friends? Ad data? Behavioural data? ANY inferences, models built on user datasets that might include that user?

They don't even know how to ask the right questions to get to what they want to know.

Context: I worked at Facebook

piva00 3 years ago

All your personal data in the article references data generated by inference from personal data, it was pretty clear in some paragraphs and quotes.
The where part is also alluded to as to mean "systems" from what I gathered.
seanhunter 3 years ago

Companies which deal with personal data do so in the context of regulatory frameworks (CCPA, GDPR etc) which define specifically what the definition is and what’s covered. They also have to record what they collect and process and for what purpose.
If you’re Facebook you have to know this in order to process that data. You can’t pretend not to understand the question.
- lupire 3 years ago
  
  The people who write the question don't understand the question. They are "not even wrong". They are telling the businesses to figure it out. Maybe that's desirable, or maybe it's pointless. In broad strokes you can say data is either fully controlled by the users, shared with other user data for product model building, or also used for ad targeting. But it's unclear if being more specific is useful to anyone. You can point to an example like equal housing opportunity enforcement, as it interacts with ML-targeted or attribute-targeted ads. It's equivalent to detecting discrimination anywhere a human makes an opinionated choice. It's a problem no one knows how to solve. It's not a law like "don't shoot or stab people" (which is also complex in some ways).
gthrone 3 years ago

Ya the way he responds saying a full team would be needed to uncover this sounds like he was treating "where" holistically. Every bit of information in every server, application, etc. Feels like a trained response.
- PeterisP 3 years ago
  
  The way I read this, his response saying that a full team would be needed to uncover this is not a flaw in the question - because IMHO that's exactly what was required - but rather that Facebook has not done the work required to meet their legal requirements. Yes, it's quite plausible that Facebook might need a team to do extensive work to produce the analysis and documentation about where private data is flowing - that's not a valid excuse though, Facebook simply needs to make that team and do that work, no matter if they want to or not, until they can properly answer these questions.
  They need to have an exhaustive list of how they're using private data, and they need to have a process ensuring that their engineers are not adding new sources of private data or not using existing private data without the company approving and updating that list. Yes, their current processes aren't fit for that - as Meta documents quoted in TFA say "We do not have an adequate level of control and explainability over how our systems use data, and thus we can’t confidently make controlled policy changes or external commitments such as ‘we will not use X data for Y purpose,’" - so these processes must be changed.
zo1 3 years ago

I just started reading the transcript. They asked a lot of questions trying to figure out details about various products.
It's 850 pages long, and the few spots I jumped to the guy kept answering with "I don't know", "I'm not familiar with this", etc. Even the parts he says "I'm familiar with this service", the guy goes on to answer "I don't know more than what's written there". The guy couldn't even answer if certain databases were accessible by third parties (not implying they were).
This must have been painful to be a part of. 100 variations of "I can neither confirm nor deny".
alexhill 3 years ago

> They don't even know how to ask the right questions to get to what they want to know.
You're right. If the questions were being asked of someone who was genuinely motivated to answer them, this wouldn't be an obstacle.
The phrase "all your personal data" (or the quote from the article, "every bit of data associated with a given user account") feels vague and imprecise only if you have an internal understanding that for practical purposes, it really is impossible. The Facebook engineers know that and everyone here knows that. It's not literally impossible - an internal team of digital archaeologists with unlimited resources and no deadline might be able to do it for a single user. But practically impossible across the whole userbase.
If you were genuinely interested in providing insight to the hearing, you could explain all that, and then explain how you could go about creating a rough sketch of an answer, and how you'd go about adding some detail to that sketch.
The Facebook engineers have no real motivation to provide insight off their own bat. They can take the question literally, and answer honestly that they don't know, that nobody else knows, and that it's probably impossible TO know. I'm not even sure if this is acting in bad faith. I can easily imagine feeling cagey in this situation, and responding as such. Experiencing an emotion isn't acting in bad faith.
All that said...these questions were asked by a court-appointed subject matter expert. Either the subject matter expert is not the right person for the job, or...and now I realise that of course I haven't read the transcript and there's every chance that what's quoted in the article is not the most insight the court managed to prise out of the engineers. posting anyway in 3 2 1
PeterisP 3 years ago

Yes. All of that. That is the right question, the very first question as a starting point. A basic requirement from a company is to provide an exhaustive, true, up-to-date list of all of these - and if they can't, they should not be permitted to handle private data.
- nathanaldensr 3 years ago
  
  Agreed. I'm tired of these companies assuming that it's some kind of natural order or divine right to collect this data to begin with.
AlwaysRock 3 years ago

> Well that's because the question is meaningless
No.
> What does it mean "all your personal data"?
It means all your personal data.
> What does "where" mean?
Where Facebook stores it.
This is only a complicated question if you want to start saying, "Well, this isn't really their personal data, we just use it for x..." If it's data related to my account or me as a user, it's my personal data.
Maybe it's difficult to track all that down, but certainly that is a challenge worth tackling in a world where GDPR exists.
kwertyoowiyop 3 years ago

Sure, and what does “is” mean exactly? And what really is “data”? Their answers seem like passive-aggressive BS. If their lives or careers depended on it, they’d come up with a lot of answers PDQ.

snowwrestler 3 years ago

There is a difference between what is known and what is knowable.

For example if you asked me now if I know where all the receipts are for my tax-deductible donations, I would say no, I don’t know. Why? There is no reason for me to know right now.

But if the IRS told me I was being audited and must produce those records or go to jail, I would find them. And I have confidence that I could, because I know that in general, those are the sorts of records I keep (somewhere).

Most Facebook employees do not need to know “where personal data is” to operate their business, so they don’t know that. But at the same time, Facebook could not operate if the data was “nowhere” (destroyed), or if it was randomly distributed with no queryable structure.

So the question is not really what they know now, it is: what incentive do they have to go find out.

ridgered4 3 years ago

I recall when Windows 10 came out some company had demanded to know all of the things it sent back to Microsoft over the wire. Microsoft had some guy basically run Wireshark and a bunch of network scans while using it and put that in a report. It blew my mind because it implied nobody at Microsoft actually knew what it was collecting and when, any more than your average security researcher, possibly because there are so many sub-entities in the company hoovering it up for different purposes.

kornhole 3 years ago

When I worked in Redmond, they pressed us heavily to collect telemetry where possible. Each line of business basically operated as a separate company with many subsystems and no effective way of connecting customers or users between them. I have been at many big companies and seen similar. If these big companies with expertise and money can't manage their data well, it should be no surprise to anybody how poorly it is handled at smaller companies.

kornhole 3 years ago

Because I suspected this mess, I took a less expedient path to deleting my account. I found a picture of somebody with my same name and made it my picture. I unfriended everybody I actually knew and only kept randos. I liked a bunch of random stuff. I did not actually delete my account until a year later to let the chaotic data propagate into all the subsystems, backups, and partner networks.

LinuxBender 3 years ago

This article reminds me of past audit experiences. It is important to know who to send into the conference room to talk to the auditor. The wrong people can answer too many questions or volunteer the wrong information, leading to more questions and going deeper down the rabbit holes. I suppose the same could apply in this case. It's just a gut feeling.

Delphiza 3 years ago

Many here on HN are saying that such a record of what tech companies do with data is a) impossible because complex systems have evolved over time and the (historical) flows of data are unknown and b) unnecessary because the need to know what happens to data is not relevant to the user or the questioner. The old models of processing data at scale to extract value, as FAANG has been doing, are eventually going to come to an end. Maybe not where there is a power imbalance, such as between Facebook and their users, but at least where users of services have more clout and are paying for the product. I see this in B2B IoT solutions where big customers are very picky about how telemetry is collected by product vendors and, if not pushing back hard, are at least choosing not to use services that are not clear on how data is processed and handled.

The amount of data that we all see, every day, that is grossly mishandled may signal the end of the ML and AI goldrush. You can only build models, and run data through models, with the consent of the owner of that data. Large producers of data (think vehicle fleet operators) are beginning to take ownership of their data, and are only _licensing_ it to processors for very specific purposes. In the example of vehicle fleet operators, they may only want route planning, and not have their data used to sell them tyres based on mileage. Also, while governments may be busy with other stuff currently, at some point they may decide to turn on the regulatory screws.

0xbadcafebee 3 years ago

"The special master at times seemed in disbelief, as when he questioned the engineers over whether any documentation existed for a particular Facebook subsystem. “Someone must have a diagram that says this is where this data is stored,” he said, according to the transcript."

Oh you sweet summer child.

mathgladiator 3 years ago

I believe we need a privacy centered view of technology. Even a traditional database creates a spiderweb of 'where is your personal data spread across so many tables', and I hope there is a reckoning against this approach.

I have considered pivoting my service into a privacy respecting data store ( https://www.adama-platform.com/ ), but I've yet to meet anyone that actually cares about user privacy to rethink their data layer.

fsociety 3 years ago

Actually FB has some wicked privacy tech around their central data store - and are building more. Nuance gets lost in these articles, for example even with proper documentation there are simply too many things for one person to know where data flows in that company. Or data flowing through RPC calls can make it hard to know what will fail downstream if you remove a certain piece of data.
- mathgladiator 3 years ago
  
  Alternatively, how that privacy logic is built and the infra is an albatross. And worse, a checkbox without any guarantees.

citizenpaul 3 years ago

This is misdirection plain and simple. In the best most convincing way. Get tunnel vision (sorry) nerds to make an honest display of their ignorance of the bigger picture at Meta. Aww shucks we don't know where that data comes from....

Of course senior engineers don't know where the data comes from exactly. Does your mechanic know where each tire comes from or care? Does your fav restaurant chef know each field that a produce came from? The suppliers know that info.

I know beyond any doubt that there are most definitively some people that know at least roughly where most data about most subjects are at Meta. I'm sure the data is vastly beyond what a human can process. However, some people there know where to go looking. The truth is in the piles of money they generate by selling that data.

Awww gee wizz your honor. We have no idea how we do it. People just keep throwing piles of money at us so we keep taking it. I don't know why they do it. We definitely don't have mountains of highly specialized data on our users that we tell them we can get exactly what they want then provide that info to them. Somehow though when the court asks without piles of money we just can't make that magic happen...........

notacoward 3 years ago

> there are most definitively some people that know at least roughly
"At least roughly" is doing a lot of work here. I think what you're getting at is that there are people who know which storage systems are likely to hold that data, but knowing exactly which objects do is far more challenging. Copies get made while data pipelines are tweaked, and then forgotten. Copies get made internally on storage systems as disks/machines fail and get repaired, as rebalancing occurs, and in other cases that typically only one or two members of the team understand at all. (I worked on one of Facebook's largest storage systems for two years BTW.) Copies get made for AI training, experiments, shadow-traffic generation, etc.
Could they do better? Should they do better? Yes and yes. But it's not a lie to say nobody knows how to find every last orphan and fragment. They can pretty reliably delete everything they could find to sell, but that's a very different thing and you're conflating the two.
- citizenpaul 3 years ago
  
  I'm saying they have vast mountains of data on people. They make enough money from it to be one of the most profitable companies on the planet. This is not happening by accident.
  I'm quite certain that if I showed up to FB with a $50 million dollar ad campaign to run they would have quite the opposite story about their data knowledge. Marching dozens of people into the room promising they can slice and dice every bit of info down to the atom for my needs.
  Yet in court they want to claim they live by the seat of their pants. It's a deception tactic. Even though I'm sure it's, as you describe, partly true in technical concept.
  I do believe that FB is the least privacy invasive of all the big tech giants, for what it's worth. Nothing against them. I actually think this case is an intentional aggravation to the courts to have regulation created on data collection. That way FAANGM can cement themselves as the official national data brokers with costly startup regulation for any competitors.
  
  notacoward 3 years ago
  
  > I'm quite certain
  Based on what?
  > they would have quite the opposite story about their data knowledge
  As I said, they can probably delete everything they could sell. What's left are fragments, which are still a real concern but not enough to use for ad targeting and certainly not enough to satisfy a big-money customer. They might take that $50M anyway, but it would be under false pretenses and that's a very different issue.

tru3_power 3 years ago

There seems to be a pretty concerted effort to paint Meta as worse than the other big ad-tech companies. Maybe it’s just the contrarian in me, but I’ve noticed a way big uptick in these stories over the year.

pavlov 3 years ago

> The systemic fogginess of Facebook’s data storage made answering even the most basic question futile. At another point, the special master asked how one could find out which systems actually contain user data that was created through machine inference.

“I don’t know,” answered Zarashaw. “It’s a rather difficult conundrum.”

This is not a basic question because of how ML works. Can we say a system contains my data if my data was used as a training input to a neural network at some point in the past?

In that case, can I sue anyone using Stable Diffusion for stealing my data because the billions of images in its training set included something I created?

dannyw 3 years ago

> In that case, can I sue anyone using Stable Diffusion for stealing my data because the billions of images in its training set included something I created?
No, because of the landmark Authors Guild, Inc. v. Google, Inc. case which found that ML training can be considered fair use in some circumstances.
- pavlov 3 years ago
  
  So why do we expect Meta to internally track this kind of data usage? Seems like it’s both fair use and covered by the terms of service that FB users have agreed to.
  
  kmeisthax 3 years ago
  
  Copyright and privacy law are two different animals.
  Copyright law covers things that are expected to be published - i.e. creative works. We decide that authors get to control how and when they are published. Privacy law is different: the subject is information that is held in confidence and should not be published under any circumstance.
  Fair use would circumvent the "copyright my nudes so I can DMCA them off Facebook" kind of action, but not, say, the GDPR. GDPR does not care about fair use, and it's skeptical of TOS agreements.

salawat 3 years ago

I would not be surprised at all. I can count on one hand the number of Engineering candidates that given a system description involving three visible networked laptops in view of the camera have answered "where is the data, right now?" , or can at least give a decent stab at "trace the datapath through the machine".

I can't explain how this is the average candidate, yet somehow... Life goes on.

notacoward 3 years ago

The basic problem is one of tracking (or not tracking) copies of data. Your information at Facebook is stored primarily on several storage systems, but extra copies might exist in backups, data-analysis pipelines, logging systems, etc. At the time I left (two years ago), there was a cutely-named project well under way to ensure deletion of those primary copies and (IIRC) backups. Fragments still in data-analysis systems might still exist, but they're also less personally identifying. I don't recall the state of anonymization and provenance tracking that would allow even these remnants to be found and purged for certain. So we're basically talking about two different questions.

(1) Is it possible to be sure that the primary copies and backups are gone, so that finding anything that's left would require some very specialized knowledge and/or an infeasibly massive scan? I believe the answer to this is probably yes at this point.

(2) Is it possible to be sure that absolutely every last vestige of the person's time on Facebook is gone? I believe the answer to this is still probably no, and likely to remain so for some time. At the very least, some artifacts will remain in those opaque AI models.

I suspect the same two answers exist at many companies. What is HN's data retention policy? Oh, oops, none of my data would be deleted if I left. I suspect that Google and/or Apple, maybe Microsoft as well, are a lot closer to "yes" on that second question, but even then I suspect gaps appear from time to time.

I say this not to condemn or defend anyone, and I know companies under stricter regulatory regimes can give more definite answers. It's just the state of the art at Big Tech companies as I understand it.
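To make the provenance-tracking point concrete, here is a minimal sketch assuming every copy operation were recorded as an edge from a source dataset to its derived copy. The `lineage` mapping, the dataset names, and the `all_copies` helper are hypothetical and say nothing about how Facebook's systems actually record lineage.

```python
from collections import deque

def all_copies(lineage, root):
    """Return every dataset reachable from root via recorded copy edges."""
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen - {root}

# Hypothetical recorded lineage: source -> list of derived copies.
lineage = {
    "primary/user_42": ["backup/2021-03", "pipeline/clicks_v1"],
    "pipeline/clicks_v1": ["training/sample_a"],
}
found = all_copies(lineage, "primary/user_42")
```

The traversal itself is trivial. It finds only what was recorded, so a copy made by a tweaked pipeline or a repair event that never logged an edge (say, a hypothetical `scratch/tmp_copy`) stays invisible to it. That gap is the difference between question (1) and question (2).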

canoebuilder 3 years ago

What are we talking about here? Data associated with specific users?

Well in order for your grandma to log into Facebook her user account must have a primary key associated with it so she sees her info when she logs in and not someone else’s.

We are talking about computers and databases. When did using a computer to search a database become a difficult, nigh impossible thing to do?

Even if design documents and flow charts or whatever don’t exist could they not fairly straightforwardly be reverse engineered by taking a sample of users and searching all databases for associated information?

This seems like a transparent ploy on the part of Facebook to avoid regulation by casting the perfectly doable, searching a database, into some sort of incomprehensible impossible task. The credulous author of the article and many comments here seem to be strangely buying into it.

FB investors don’t seem to have lost faith in the company’s ability to search databases. When it comes to making the company money from those searches, no problem at all. But regarding lawsuits or potential regulation, we can only throw our hands into the air and simply wonder and gasp at this incomprehensible mess we have made.
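For a sense of what "take a sample of users and search all databases" looks like in the simplest possible case, here is a sketch against a single in-memory SQLite database. The schema, the `tables_mentioning` helper, and the sample rows are invented for illustration; it assumes one relational database whose tables can be enumerated, which is an assumption and not a description of Facebook's storage.

```python
import sqlite3

def tables_mentioning(conn, user_id):
    """Return (table, column) pairs where at least one row equals user_id."""
    hits = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the name.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            query = f'SELECT 1 FROM "{table}" WHERE "{column}" = ? LIMIT 1'
            if conn.execute(query, (user_id,)).fetchone():
                hits.append((table, column))
    return hits

# Hypothetical schema and data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER, name TEXT);
    CREATE TABLE likes (account_id INTEGER, page TEXT);
    CREATE TABLE pages (page TEXT);
    INSERT INTO accounts VALUES (42, 'grandma');
    INSERT INTO likes VALUES (42, 'knitting');
    INSERT INTO pages VALUES ('knitting');
""")
hits = tables_mentioning(conn, 42)
```

Even in this toy form the scan checks every column of every table and will report false positives wherever an unrelated value happens to equal the key. It also says nothing about data that was re-keyed, aggregated, or written somewhere that is not a queryable table.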

ctide 3 years ago

It became difficult when your 'database' turned into exabytes of random files in s3.

ElijahLynn 3 years ago

The special master at times seemed in disbelief, as when he questioned the engineers over whether any documentation existed for a particular Facebook subsystem. “Someone must have a diagram that says this is where this data is stored,” he said, according to the transcript. Zarashaw responded: “We have a somewhat strange engineering culture compared to most where we don’t generate a lot of artifacts during the engineering process. Effectively the code is its own design document often.” He quickly added, “For what it’s worth, this is terrifying to me when I first joined as well.”

“We do not have an adequate level of control and explainability over how our systems use data, and thus we can’t confidently make controlled policy changes or external commitments such as ‘we will not use X data for Y purpose,’” the 2021 document read.

egberts1 3 years ago

Not surprising. The same can be said for a certain US government department that hosted a personal mail server in a bathroom closet, trafficked classified materials, and was caught much later by the OIG.

Or that one time I found a dialup modem under the raised floor and attached to the Equifax (then TRW) mainframe.

Or catching someone inserting a USB stick into a PC inside a secured white-lab area.

Absolutely crazy times for large entities.

riddleronroof 3 years ago

Isn’t this a good thing? Every Facebook employee should not know where all PII is. Further, it doesn’t say what data they were looking for. E.g., around the time of the scandal, lots of Facebook quizzes were asking for PII, but it was not stored at Facebook. I’m not a Facebook fan, but I’m not a fan of unfounded scary stories either.

mrweasel 3 years ago

Besides the legal issues, isn’t there also a technical problem? Facebook is potentially storing the same information multiple times, or missing opportunities, because an engineer didn't know that some data was available.

I guess storage is cheap enough, but even for Facebook, being able to save, say, 10% on storage must mean something.

jtbayly 3 years ago

> The special master at times seemed in disbelief, as when he questioned the engineers over whether any documentation existed for a particular Facebook subsystem. “Someone must have a diagram that says this is where this data is stored,” he said, according to the transcript. Zarashaw responded: “We have a somewhat strange engineering culture compared to most where we don’t generate a lot of artifacts during the engineering process. Effectively the code is its own design document often.” He quickly added, “For what it’s worth, this is terrifying to me when I first joined as well.”

The cynic in me wants to say that not documenting certain decisions is a common method of hiding immoral decisions.

namelessoracle 3 years ago

The cynic in me says it's building your own job security. In my experience, documentation only really gets created if leadership forces the issue. You also get to sell the refactor/rewrite because "nobody understands the existing code".
- 988747 3 years ago
  
  I've been a programmer for 15 years and I've never seen it done for "job security". I think this is mostly about most programmers' low tolerance for boredom: writing code is exciting, documenting it is boring, so you do not do it unless forced to.
  
  namelessoracle 3 years ago
  
  I've seen people "neglect" to train peers or document things before, to protect their own position. Anecdotes and all that.
  I mean, people "usually" won't say that out loud, of course.

roland35 3 years ago

Trust me, at (most) big tech companies there isn't anything that nefarious going on at the engineer level. In fact, everyone basically agrees that improving documentation is important, but it constantly gets swept aside in the churn of new features, fires to put out, and frequent meetings. Information gets spread out across various docs, wikis, readmes, etc.

What ends up happening is that each part of the system has a "point of contact" who is the go-to for that piece. Lots of decision history is in the heads of team elders. Hopefully they are still around!

adamsmith143 3 years ago

So FAANG is held up as the gold standard in the industry, and this is their best practice?

> “We have a somewhat strange engineering culture compared to most where we don’t generate a lot of artifacts during the engineering process. Effectively the code is its own design document often.”

laweijfmvo 3 years ago

Imagine that someone asks you to modify a service, so you study the code, make the change, test it, and push it. Now imagine they ask you to update all the docs/wikis to reflect the recent changes, or, more realistically, you just create another doc and store it wherever you feel is appropriate.

In my experience there’s a ~10% chance that an up-to-date, easily discoverable piece of documentation exists for any service or component, and there’s a ~90% chance that at least one out-of-date document exists. The latter seems much more harmful, like when you realize some PM has been referencing an out-of-date doc for the last 9 months.

hk1337 3 years ago

I hear /dev/null is a good place for large amounts of personal data.

y42 3 years ago

I assume it's the same in every company after a certain amount of time. They start clean and then everything grows organically, because you spend more money on innovation than on cleanup and documentation. It's not even a matter of competition or capitalism, just limited resources.

IncRnd 3 years ago

> Zarashaw responded: “We have a somewhat strange engineering culture compared to most where we don’t generate a lot of artifacts during the engineering process. Effectively the code is its own design document often.”