676 points by lelf 6 months ago
Years ago (2010 iirc), I reported to the Postgres list, with a patch, that (one of) the reasons Postgres “didn’t work on NFS” was that it couldn’t deal with short writes. I got told the usual “you’re holding it wrong” instead of an acknowledgement of PG not sticking to the spec.
I patched my own systems (wrap the calls in a loop as per standard practice) and then proceeded to run literally hundreds of thousands of PostgreSQL instances on NFS for many more years with no problems.
The patch was eventually integrated it seems but I never found out why because I lost interest in trying to work with the community after that experience.
This saddens me.
Looking at this thread - the patch is obviously right. It doesn't matter about NFS, Linux, or any host of random crap people brought up.
(I unfortunately suspect if you hadn't mentioned any of that, and just said "write doesn't guarantee full writes, this handles that" it may have gone better).
The documentation (literally on every system) is quite clear:
"The return value is the number of bytes actually written. This may be size, but can always be smaller. Your program should always call write in a loop, iterating until all the data is written."
Literally all documentation you can find about write on all systems that implement it clearly mentions it doesn't have to write all the bytes you ask it to.
Instead of saying "yeah, that's right", people go off in every random direction, with people arguing about your NFS mount options, what things can cause write to do that and whether they are possible or they should care about them.
 spoiler alert: it doesn't matter. The API allows for this, you need to handle it.
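For illustration, the "call write in a loop" advice quoted above can be sketched in Python as follows. This is a minimal sketch and not the patch from the thread: `write_all` and its `write` parameter are hypothetical names, and treating a zero-byte return as an out-of-space error is an assumption made here only so the loop cannot spin forever.

```python
import errno
import os


def write_all(fd, data, write=os.write):
    """Write all of `data` to `fd`, retrying after short writes.

    `write` defaults to os.write; it is a parameter only so the loop
    can be exercised with a stand-in writer (an assumption of this sketch).
    """
    view = memoryview(data)
    total = 0
    while total < len(view):
        # os.write returns the number of bytes actually written,
        # which may be smaller than the number requested.
        written = write(fd, view[total:])
        if written == 0:
            # Assumption: no progress and no exception means "out of space".
            raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
        total += written
    return total
```

A caller that used to compare the result of a single write against the buffer length and error out on a mismatch would call `write_all(fd, buf)` instead and only see an error when no progress can be made.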
I'm a bit confused by this perception of the thread. There was one person doubtful the patch is the right approach (Tom), but still offered concrete review comments (of actual issues that'd at the very least need to be documented in code comments). Others agreed that it's something that needs to be done.
> Instead of saying "yeah, that's right", people go off in every random direction, with people arguing about your NFS mount options, what things can cause write to do that and whether they are possible or they should care about them.
Given that changing the mount option solved the immediate problem for the OP, I fail to see how it's random.
>  spoiler alert: it doesn't matter. The API allows for this, you need to handle it.
Sure, it's not like the return value was ignored before though. I agree retrying is the better response, but erroring out is a form of handling (and will also lead to retries in several places, just with a loop that's not tightly around the write()).
> Given that changing the mount option solved the immediate problem for the OP, I fail to see how it's random.
It's random because pretty much anyone could volunteer that opinion. By diagnosing the problem, making a patch and reaching out, the author has already made their intention of fixing the problem clear. Unless it has been established that the problem isn't valid, a workaround isn't really relevant.
Effective communication and exchange of ideas needs to have a good ratio between work and value. I used to frequent meetups where people would present projects with thousands of hours of work and priorities behind them. There would almost always be someone with less experience stating their "ideas" on what should be done instead. Eventually those people would end up talking among themselves where their ideas could flow freely without any restriction of actual work being done.
Experienced people provide value. They try to understand the problem, add their own experience to it and validate the work that has already been done. They make the problem smaller and closer to a solution. They don't, or shouldn't, casually increase the scope for little reason.
> It's random because pretty much anyone could volunteer that opinion. By diagnosing the problem, making a patch and reaching out, the author has already made their intention of fixing the problem clear. Unless it has been established that the problem isn't valid, a workaround isn't really relevant.
I'm baffled by this. Even if the fix had been immediately committed, the workaround of using nointr still would have been valuable, because a fixed version of postgres wouldn't immediately have been released.
You seem to argue in a way that entirely counteracts your own later comments.
If someone spends x amount of work hours on something, that is what they want feedback on. They aren't looking for quick suggestions on other paths to explore. It isn't a brainstorming session. It is work done being represented by an e-mail, code or a product. You are being presented with their theory for a solution. At some point something else might be relevant, but that isn't something to assume. The assumption should be that the person presenting has made their choices based on their situation.
It is often the same with software. Good feedback on software isn't random ideas, suggestions or feature requests that add hundreds of hours of work on a whim. It is feedback that considers the work that has already been done. Anyone can come up with something else, especially in theory and with a blank slate. It doesn't really require anything other than an opinion. Hacker News certainly is proof of that.
Sure. Postgres misses a bug for years because they can't give even the most basic attention to someone on a mailing list, you know, where you supposedly talk about things like bugs, and then defend the whole thing with excuses, rhetoric and further useless assumptions, but I am the bad guy. This is exactly why people don't bother. Experience is just a liability these days in the "community". Companies maintain their own patches and conversations happen privately. Just sad to see such potential for accessibility wasted on arrogance. I guess it isn't really surprising though when everyone different already left.
You're basically just making shit up at this point. The most critical message in that thread contained pieces like
> You're basically just making shit up at this point.
You keep making my point. All you have is defensive, deflecting and rude comments. If you disagree with something it isn't hard to say "I disagree that x is y because z", like anyone interested in a discussion would. I will keep this in mind when interacting with the community, commissioning work or buying services related to Postgres comes across my desk.
Turns out people don't like one baseless accusation after another. I provided you with concrete references where review comments were made, where people agreed it was necessary. Yet you deny that happened, without refuting anything concrete.
Note that you're the one hiding behind at least the third sockpuppet account.
There is really no point in carrying on this conversation. I have made a number of arguments and you are the one who has accused me of things. First I was contradicting myself, then making shit up and now hiding behind sockpuppets. I haven't denied anything, you singled out a single sentence of my comments where you could snipe at me. The overall point, the discussion, and the repeated arguments I have made still stand.
I am not hiding any more than any other anonymous account and for no other reason than time management and privacy. I shouldn't frequent hacker news for reasons previously stated, but sometimes I tell myself that the decent thing is to reply. Especially as many people won't. Now it is clear that this discussion isn't going anywhere so I won't be creating any other accounts or comments. To prevent any ambiguity this won't change.
No one suggested having an anonymous account is an issue. The issue is the account being a sockpuppet one.
> I agree retrying is the better response, but erroring out is a form of handling
In this case it is “a form” but that specific handling is provably wrong. That is important: everybody could have tried and proved again. Therefore just dismissing the correct handling and keeping the wrong one is, let me repeat, also provably wrong.
They were not review comments. They were arguments about where it ends. He's totally unwilling, and says you shouldn't run off of NFS. But then other people are, and are talking about mount options, when the problem clearly lies in the code not following the specs.
> 1. If writes need to be retried, why not reads? (No, that's not an invitation to expand the scope of the patch; it's a question about NFS implementation.)
> 4. As coded, the patch behaves incorrectly if you get a zero return on a retry. If we were going to do this, I think we'd need to absorb the errno-munging currently done by callers into the writeAll function.
not review comments?
> when the problem clearly lies in the code not following the specs.
There were plenty of questions below about what exactly the fix should look like.
How's that not review?
You had me until you tried to justify bubbling the error instead of handling it at its source while possible.
That can be a correct response, for instance, if you have some sort of async i/o in place so you don't want to wait for a write that's not ready because you'll block.
Hm? I explicitly said that retries would be better?
If “write” expects you to always call it in a loop, I wonder why it wasn’t simply implemented so that it calls itself in a loop in the first place. Or at least provide a “writeAll” wrapper function.
What if the disk is literally unplugged after a partial write? (or the network cable unplugged, etc.)
Callers would still need to be informed that a partial write occurred and would need to do something about it.
When I call write(), it is typically already inside a bigger loop. If the system can't write my whole buffer right now, then I am happy to get the CPU back so I can do more work before trying again (and now I might have more data for it too).
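That usage pattern can be sketched roughly as below, assuming a non-blocking file descriptor. `write_some` is a hypothetical helper, not something from the thread: it makes one write attempt and hands back whatever is still unwritten, so the surrounding loop can do other work (or collect more data) before trying again.

```python
import os


def write_some(fd, pending):
    """Attempt a single write on a non-blocking fd; return the unwritten tail."""
    try:
        written = os.write(fd, pending)
    except BlockingIOError:
        # The descriptor cannot accept data right now; nothing was written.
        return pending
    # A short write is not an error here: the caller keeps the remainder.
    return pending[written:]
```

In an event loop the caller would typically append new output to the returned remainder and call `write_some` again once the descriptor is reported writable.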
This is literally the case study from worse is better :)
Actually as great as Postgres is and as generally approachable the community is - my experience was the same a few times and I read it on the mailing list happening to others:
Someone comes along with a patch or idea. Bunch of big Postgres people come knock it and it dies right there.
Happened to me when I suggested a more multi-tenant approach back around 2010 and today we have Citus. I was told that (paraphrased) no users were asking for that sort of thing.
I see it kind of happening with the Foreign Key Array patch that the author asked for help to rebase and no one bothered to reply.
Someone suggested replacing the Postmaster with a threaded approach so it could scale better (and showed benchmarks of their implementation handling more connections). Community response was there were already 3rd party connection pools that do the job. An outsider looking in considers this nuts - most people would not run additional components if they did not need to!
Another example: NAMEDATALEN restricts every identifier to 63 bytes - a limit that causes frequent issues (more so now that we have partitioning) and also a problem for multi-lingual table names. It’s been proposed to increase this limit a few times. Every time the answer has been: abbreviate your table names or recompile your own version of Postgres.
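To make the multi-lingual point concrete: the limit counts bytes, not characters, so names in scripts that need several bytes per character hit it sooner. A small sketch, assuming a UTF-8 encoding and the default NAMEDATALEN of 64 (`fits_in_namedata` is a made-up helper for illustration, not a PostgreSQL function):

```python
NAMEDATALEN = 64                        # PostgreSQL's default compile-time value
MAX_IDENTIFIER_BYTES = NAMEDATALEN - 1  # one byte is reserved for the terminator


def fits_in_namedata(identifier, encoding="utf-8"):
    """Return True if the encoded identifier fits within the byte limit."""
    return len(identifier.encode(encoding)) <= MAX_IDENTIFIER_BYTES


ascii_name = "a" * 63      # 63 characters, 63 bytes
cyrillic_name = "ж" * 32   # 32 characters, but 64 bytes in UTF-8

print(fits_in_namedata(ascii_name))     # True
print(fits_in_namedata(cyrillic_name))  # False
```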
Could name a few other examples too I’ve noticed over the years and just sighed at. I don’t expect every idea to be accepted or even welcomed - but there is that sense of bias against change.
While this can and does suck when it happens to you, this is exactly what it takes to keep products focused so they don't die death by a thousand cuts (or feature requests). For every awesome feature embarked upon, there's an opportunity cost of bug fixes, stability updates, tech debt reductions, and other inglorious but necessary work. Aggressively de-scoping is the difficult but necessary work of keeping a product alive in a competitive marketplace. And yes, it's a marketplace even if the product is open source.
Yep, I think it’s an unfortunate side effect of dealing with an onslaught of bug reports, many of which are user error or someone else’s problem. It’s common in any kind of user support, you start seeing the same report over and over.
I even saw something similar when I went to the ER recently. Even doctors will pick up on one thing you say and draw conclusions from that and dismiss everything else.
"Even doctors will pick up on one thing you say and draw conclusions from that and dismiss everything else."
This pattern seems really common, and is what scares me about the future in general. The 'experts' concentrate on the stuff they understand / are best at, to the detriment of where the actual focus needs to be. In a lot of cases, this is despite the insistence of the supposed non-expert who is the one suffering as a result.
Some of the worst cases are as a child, where you get in trouble twice. First for wasting adults' time because you didn't tell them properly, and then again for pointing out that you did.
I've had this experience with doctors so often it is chilling.
Elderly relatives writhing in pain, only to have doctors say it's indigestion (it was a perforated ulcer and an uncle had previously died from a wrongly diagnosed ulcer perforating). My partner was misdiagnosed with flu when it was pneumonia which then developed into pleurisy (I'd never seen either of the latter, but was telling the doctor that's what the symptoms looked like - 15 years later he still suffers pain from the pleurisy). I had an arm paralyzed through severe pain and the consultant doctor planned an operation "to cut the nerve" (I said I thought it was a frozen shoulder and that such a procedure was unnecessary; 6 months later the paralysis began to subside and the consultant agreed it was a frozen shoulder). Another relative died of bowel cancer that was said to be back pain (she died in the hospital where she worked). I know of several people who were telling the doctor they had cancer, only to have the doctors dismiss it as trivial, with most of these people dying because of their untreated cancer. As a child I had joint pains for years that were diagnosed as "growing pains" but turned out to be a hip disease (younger cousins ended up with the same condition and because I'd already had it, they were more readily diagnosed by family members).
In both directions (treating trivial as serious and treating serious as trivial) I've seen so many mistakes. I'd be much happier to see a doctor google the symptoms rather than jump to a conclusion about what is wrong.
There's a famous anecdote where junior doctors are taught the importance of observation, by senior doctors tricking them into tasting urine. It doesn't seem to be a lesson they learn. Even when their own objective test results are contra-indicative of their pre-judgement I've seen doctors scratch their heads but stick with their incorrect pre-judgement.
When doctors I know have a family member go into hospital, you should see how attentive my doctor friends get concerning what is being said and done to their relatives. Some doctors will not even allow relatives to go into hospital for non-emergency treatment at certain times of year (because of timetabling there can be very inexperienced doctors on duty at certain times of the year).
As someone who suffered through that on _both_ shoulders I can sympathize. For me, the doctor missed it. The chiropractor I was sent to took one look and said it was 'frozen shoulder'. I had never even heard of such a thing before. It took nearly two years to get full movement on my right shoulder. Then the left froze :-(
I'm just curious. Are these experiences in the UK?
I have seen and heard of some similar things, but my experience is only with the US healthcare system.
I’m a complete outsider to PostgreSQL specifics, but: if that is the reason, then this is a case for improving messaging.
If I’d been in the patch submitter’s shoes, I wouldn’t have thought twice about writing off that community.
If I had gotten your reply, instead: 100% fair play, thank you for your consideration, and send my love to the dev team.
I somehow responded with this to the wrong post earlier:
Honestly, if someone's spending time working on a high value open source project (which PG absolutely is), I'd rather they spend less time (than I do) crafting their internet comments to sound nice and more time contributing to society. And I hope people who actually use the product feel the same way, understand why every single use case can't be carefully considered every time it comes up, and don't take it personally.
But messaging influences public perception influences inclination of future potential contributors to participate influences quality of the software. As we see here.
People like to think they can escape politics. They can’t. Any group of >1 humans will involve politics.
Learning how to be respectful and polite is like learning how to touch type: it’s a small, painful price to pay once, for a lifetime of copious reward.
The less value you provide, the more important politics are. If you aren't doing much, then you damn well better have great messaging. If you're doing a lot, and people are banging on your doors to get what you're selling or giving away, then don't waste your time being polite. Nobody cares. They just want you to do what you do.
> Nobody cares.
Except georgebarnett and jarym.
Or perhaps that was the point, that PostgreSQL isn’t doing much! :)
> I patched my own systems (wrap the calls in a loop as per standard practice) and then proceeded to run literally hundreds of thousands of PostgreSQL instances on NFS for many more years with no problems.
> Actually as great as Postgres is and as generally approachable the community is - my experience was the same a few times and I read it on the mailing list happening to others:
It seems that despite things not being perfect (and they never are), these people are still having positive experiences with the product and the community overall. So, to restate, some people care enough to post about it on HN, but nobody (who continues to use PG, and there are a lot of us) cares all that much.
I disagree that this is required. You can see from Linus Torvalds's backtracking on decades of abrasive behavior that it was never an important part of Linux after all, so an abrasive experience for people trying to help other open source projects is probably going to be superfluous too. You can still reject ideas without disregarding them or the person.
"We don't support running Postgres on NFS" isn't the same thing as "fuck you Intel ACPI team; you're dumber than a chimpanzee". Equating disagreement and criticism with Linus-isms is why the relationship between users and developers is such a mess to begin with. Being a maintainer requires you to say "no" sometimes, but it doesn't require you to be a jerk.
No, it's worse.
Linus was trying to make things work, with profanity. Postgres couldn't be bothered.
Sure, performative profanity isn't everyone's cup of tea, but milquetoast passive-aggressive dismissals of people like OP who ARE TOTALLY RIGHT aren't actually nice.
Please cite which parts of the thread you're referring to. Again, several people +1'd the idea, and reviewed the code.
> You can see from Linus Torvalds's backtracking on decades of abrasive behavior that it was never an important part of Linux after all
It's ten years too soon to conclude that Torvalds backing away from abrasive behaviour didn't kill Linux.
I’m not sure if that logic holds. Who’s to say Linux would not have been more or less successful if Linus had behaved differently? For all we know, Linux may have succeeded despite his behavior, rather than because of it.
That said, I feel that a strong and positive community around a project is always an asset. I’ve seen many more projects fail due to community interaction being bad than I have from it being good.
Are you sure you're disagreeing with the post you responded to? It sounds like you entirely agree with them, but you open as though you disagree...
I think he just got the last two phrases backwards :)
The patch author is doing the work. Telling them “no” isn’t going to make them focus instead on the project leadership’s other priorities like it would in a corporate team.
It's not as simple as that.
The patch author does the "initial" work. But then it's up to the team to learn the patch, understand it and keep maintaining it.
Every line of code is baggage.
If there is no demand for something at the time, it makes sense for maintainers to reject that. It's up to them to maintain that patch from now on.
There is another scenario: I submitted a Pull Request to an OSS project, the authors discussed it and rejected it and then implemented it in the exact same way as I did. That was hurtful.
I can't upvote your comment for some reason, but exactly this
>Telling them “no” isn’t going to make them focus instead on the project leadership’s other priorities like it would in a corporate team.
No one is asking them to. An open source project is not a corporation, it has no shareholders who require growth at all costs. So someone doesn't contribute to the project, if there's enough other contributors to keep it healthy then who cares? No need to try and get every single potential contributor to contribute code to the project.
You can’t pay an opportunity cost on an opportunity you don’t have. Whatever the reason for rejecting the patch in this case, it is not a missed opportunity to work on bugs and tech debt as suggested by the parent.
> crafting their internet comments to sound nice
It's not hard.
One thought. Perhaps it comes easier to you than it does to other folks. We all have different skills, and sometimes what looks like malice is really just incompetence. I find that assuming positive intent adds more value to my life than doing the opposite.
A second, separate thought. Perhaps it's not quite as easy as you think it is, and perhaps you're not as good at it as you think you are. Case in point, the comment you just made was not exactly the nicest I've read today.
IMHO it's way more complicated than it seems, because of a mix of technical, cultural and personal reasons.
- technical: E-mail is not a particularly good medium to convey emotions (in either way). For example someone with a naturally terse communication style may be perceived as harsh, while in person he's actually a very friendly bloke. And there's no way to communicate this impression back.
- cultural: Often what is quite polite in one culture may be seen as quite impolite or even rude in another. It's not just country vs. country, but even region vs. region (like for example East Coast vs. West Coast).
- personal: People in the community may know each other pretty well, in which case the communication style may be quite a bit more direct. But others may lack the context.
It's almost impossible to get it right all the time, without resorting to entirely mechanical corporate communication style. Which is not fun at all.
Of course, this does not mean there are no truly harsh / rude / WTF posts (even on PostgreSQL lists). But I'd say most of the time it's not meant that way.
Yea, that's definitely something that happens. Partially due to some conservatism, partially due to having way more patches than review bandwidth, ...
> Another example: NAMEDATALEN restricts every identifier to 63 bytes - a limit that causes frequent issues (more so now that we have partitioning) and also a problem for multi-lingual table names. It’s been proposed to increase this limit a few times. Every time the answer has been: abbreviate your table names or recompile your own version of Postgres.
I think if it were easy, it'd immediately be fixed, but there's a fair bit of complexity in fixing it nicely. For partially historical and partially good reasons object names are allocated with NAMEDATALEN space, even for shorter names. So just increasing it would further waste memory... We're going to have to fix this properly one of these days.
Re NAMEDATALEN specifically - you at least acknowledge it eventually needs fixing. On the mailing list there were a lot of important Postgres people who don’t seem to agree.
Agreeing that there is something worthy of fixing is a first step. It should have happened with this NFS patch and imo some other stuff. The considerations for how and when should be dealt with separately.
> Agreeing that there is something worthy of fixing is a first step. It should have happened with this NFS patch and imo some other stuff. The considerations for how and when should be dealt with separately.
But there were like multiple people agreeing that it needs to be changed. Including the first two responses that the thread got. And there were legitimate questions around how interrupts need to be handled, about how errors ought to be signaled if writes partially succeed and everything. Do you really expect us to integrate patches without thinking about that? And then the author vanished...
> Do you really expect us to integrate patches without thinking about that?
Whoa! You made a huge leap there. At what point was it suggested that patches be recklessly applied?
That didn't happen. Your quote actually suggests a reasonable progression and at no point is there any suggestion, implied or otherwise, that changes be integrated without due consideration.
Not irrationally dismissing criticism != abandoning sound design and development.
Multiple people on this thread have expressed the opinion that the patch was "too perfect" or that the only reason that the patch wasn't simply accepted was that the maintainers "feel weird when there's nothing to criticize".
Ugh. I don't know a single PostgreSQL committer with that attitude. Maybe my patches are too crap to be in that situation, not sure.
"Multiple people" were left with nothing better to which to attribute maintainer decisions. See the dysfunction there? Don't address the problem, or the proposed solution; just leave it to fester and create discontent.
That's not how well run projects function.
Well, that's the thing - changing NAMEDATALEN is a seemingly small change, but it'll require much more work than just increasing the value. Increasing the value does not seem like a great option, because (a) how long before people start complaining about the new one and (b) it wastes even more memory. So I assume we'd switch to variable-length strings, which however affects memory management, changes a lot of other stuff from fixed-length to variable-length, etc. So testing / benchmarking needed and all of that.
Which is why people are not enthusiastic about changing it, when there are fairly simple workarounds (assuming keeping the names short is considered to be a workaround).
> (a) how long before people start complaining about the new one
Very likely many years, or even never. People don't use large names because they like it, they always prefer small ones.
How much memory are we talking about?
> Very likely many years, or even never. People don't use large names because they like it, they always prefer small ones.
Well, we don't have exactly a barrage of complaints about the current limit either.
> How much memory are we talking about?
The thing is - it's not just about table names. NameData is used for any object name, so it affects pretty much any system catalog storing names. A simple grep on the repo says it affects about 40 catalogs (out of 80), including pg_attribute, pg_class, pg_operator, pg_proc, pg_type (which tend to be fairly large).
So the amount of additional memory may be quite significant, because all of this is cached in various places.
Yea, I think pg_attribute is likely to be the main issue here. For one, it obviously exists many times per table, and there are workloads with a lot of tables. But also importantly it's included in all tuple descriptors, which in turn get created during query execution in a fair number of places. It's currently ~140 bytes, with ~64 bytes of that being the column name - just doubling that would increase the overhead noticeably, and we already have plenty of complaints about pg_attribute. I think it'd be fairly useless to just choose another fixed size, we really ought to make it variable length.
Is it ~140 bytes? pahole says it's 112 (without CATALOG_VARLEN).
The impact of doubling NameData size would be quite a bit worse, though, thanks to doubling of chunk-size in allocset. At the moment it fits into a 128B chunk (so just ~16B wasted), but by doubling NameData to 128B the struct would suddenly be 176B, which requires 256B chunk (so 80B wasted). Yuck.
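The arithmetic in that comment can be checked with a few lines. This is a simplified model that only assumes requests are rounded up to the next power of two, as the comment describes; it ignores chunk headers and any special handling of large allocations, and the function names are made up for illustration.

```python
def allocset_chunk_size(request):
    """Round a request up to the next power of two, minimum 8 bytes (simplified)."""
    size = 8
    while size < request:
        size *= 2
    return size


def wasted_bytes(request):
    """Bytes left unused inside the chunk chosen for this request."""
    return allocset_chunk_size(request) - request


current_struct = 112                  # size reported by pahole in the comment above
doubled_struct = current_struct + 64  # NameData grows from 64 to 128 bytes

print(allocset_chunk_size(current_struct), wasted_bytes(current_struct))  # 128 16
print(allocset_chunk_size(doubled_struct), wasted_bytes(doubled_struct))  # 256 80
```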
> Is it ~140 bytes? pahole says it's 112 (without CATALOG_VARLEN).
Well, but on-disk varlena data is included. pg_column_size() averages 144 bytes for pg_attribute on my system.
> The impact of doubling NameData size would be quite a bit worse, though, thanks to doubling of chunk-size in allocset. At the moment it fits into a 128B chunk (so just ~16B wasted), but by doubling NameData to 128B the struct would suddenly be 176B, which requires 256B chunk (so 80B wasted). Yuck.
I'm not sure that actually matters that much. Most attributes are allocated as part of TupleDescData, but that allocates all attributes together.
> Well, but on-disk varlena data is included. pg_column_size() averages 144 bytes for pg_attribute on my system.
Sure, but I thought we're talking about in-memory stuff as you've been talking about tuple descriptors. I don't think the on-disk size matters all that much, TBH, it's likely just a tiny fraction of data stored in the cluster.
> I'm not sure that actually matters that much. Most attributes are allocated as part of TupleDescData, but that allocates all attributes together.
Ah. Good point.
> I don't think the on-disk size matters all that much, TBH, it's likely just a tiny fraction of data stored in the cluster.
I've seen pg_attribute take up very significant fractions of the database numerous times, so I do think the on-disk size can matter. And there's plenty of places, e.g. catcache, where we store the full on-disk tuple (rather than just the fixed-length prefix); so the on-disk size is actually quite relevant for the in-memory bit too.
> Could name a few other examples too I’ve noticed over the years and just sighed at. I don’t expect every idea to be accepted or even welcomed - but there is that sense of bias against change.
not just bias against change. while there are some very talented and friendly people in the pg community, there are a few bad actors that are openly hostile and aggressive, that feel that they must be part of every discussion. it gets worse when it is a change that they made that is causing an issue, as ego takes a front seat.
unfortunately, these few bad actors make dealing with the pg mailing lists in general very unpleasant, and have made myself (a popular extension maintainer) and others try to keep interaction to an absolute minimum. that's not good for the community.
I'm honestly curious who are those bad actors, and examples of such behavior. I'm not trolling you, I really am curious - because that does not match my experience with the community at all.
I'm sure there were cases of inappropriate / rude communication, but AFAICS those were rare one-off incidents. While you're apparently talking about long-term and consistent behavior. So I wonder who you're talking about? I can't think of anyone who'd consistently behave like that - particularly not among senior community members.
I know that you read the mailing lists, just spending a week on bugs, or following any major or minor discussions on hackers should be enough for you to figure a couple of them out, but I'm not going to personally call anyone out by name on hacker news.
my comment, which is based on the multiple interactions I've had with the community, stands as is: some fantastic people, and a few aggressive bad actors that spoil things.
I don't think this is really limited to Postgres. In any relatively big OS project, the core committers tend to do what they do. I submitted a patch to a pretty big OS project and pinged the person assigned to the area. That person pretty much wrote his own version of my patch. I guess it got fixed so that was good. I guess the good thing about open source is if you want you can just build your own version with whatever you want. However smaller projects are more receptive since they want all the attention and help they can get.
I think the "big project" angle is that maintainability is of higher priority than for smaller projects. So more minor things need to be addressed than in smaller projects. And that then either leads to being very nitpicky in reviews (which costs both sides cycles), or polishing patches pretty liberally (which costs the submitter experience, but reduces the time required for both sides).
> Someone comes along with a patch or idea. Bunch of big Postgres people come knock it and it dies right there.
This is not unique to Postgres. I've seen this behavior on many development mailing lists (e.g. Mutt-dev).
Well, just proposing things and waiting for the team to implement them is easy.
The core team has also to prioritize, deal with already planned features, and shoot down tons of people with inane ideas as well (not just good ones).
Yes, the 63-character identifier limit is a real pain, especially for constraint names, where a long name can convey a lot of valuable info.
NOT IN FKs, i.e. the opposite of current FKs for IS IN, would be useful for XOR constraints. I would guess they could be implemented quite easily using existing FK infrastructure.
Not the worst thing in the world. I have trouble reading code with very long identifiers. Conventionally, a line of code should not exceed 80 characters or so. That's pretty hard if identifiers are 60+ characters by themselves. If your project has standard abbreviations for things, "63 characters ought to be enough for anybody"
I had a project last year where DB names were autogenerated per tenant, with the tenant ID being a UUID, so you're left with 63-36=27 characters. Starts to feel narrow.
(Putting tenant_id as a column was not an option because for each tenant, a third-party software was started that wanted to have its own DB.)
> and today we have Citus.
After reading this, I have been wondering if the other requests/ideas are not startup ideas
I'm sure some of them are. And Citus is not the only startup pushing PostgreSQL in a way the community did not want to.
If you look at https://wiki.postgresql.org/wiki/PostgreSQL_derived_database..., it's a pretty long list - some of the products are successful, some are dead. And then there are products adding some extra sauce on top of PostgreSQL (e.g. timescaleDB).
I kind of wish the pg community came and read this. I guess this is one of the reasons why MySQL was much more dominant in the early days and pg had relatively small usage.
IMHO there were other / more important reasons why MySQL was initially more successful.
The relevant thread:
There's literally one person saying that NFS is a crapshoot anyway, and several +1s for retries. And the former was accompanied by several questions about the concrete implementation.
One of the messages told me to go lobby NFS vendors.
Many others were as you note bikeshedding the commit.
Only one, if I recall, questioned why it was a controversial patch at all.
If you check the patch that was committed eventually, IIRC it’s identical to what I proposed.
As I said, I don’t look back particularly favourably on my interactions with the community.
I feel this is a little unfair to the community and I encourage others to read the thread before forming a negative opinion.
It seems like a fairly good example of an engineering discussion. Even if the patch is correct and conforming and solves a problem, that doesn't mean there is no use for further discussion, and the surrounding discussion does seem to have merit.
* Are there closely-related problems still left unsolved (e.g. retrying reads)?
* Is something not configured according to best practices known by other community members?
* Expressing an opinion that the use case is dangerous, so that nobody (including other people reading the thread in the future) will take it as an endorsement for running postgres on NFS.
* Some legitimate-sounding questions about the specifics of the patch (around zero returns, what writeAll handles versus its callers, etc.).
* At least one person agreed with you that it's a reasonable thing to do.
I don't think HN is really the right place to assess the technical merits of a postgres patch, but the discussion itself seems well within reasonable bounds.
Which of the messages on the thread do you consider as bikeshedding? I personally don't see any - that's not to say I agree with everything said on the thread, but overall it seems like a reasonable discussion.
IMHO it's a bit strange to assume everyone will agree with your patch from the very beginning. People may lack proper understanding of the issue, or they may see it from a different angle, and so on. Convincing others that your solution is the right one is an important part of getting a patch done. I don't see anything wrong with that.
Looking at your submission, it was very good: a clean way to reproduce, reference to the specification that tells users of a syscall how to use it. I'm perplexed that you got so much pushback on this when it seemed pretty straight-forward. I'd guess that this was your first patch to postgres and the default reaction is to be defensive and decline the patch.
Thinking about the social structure for a minute, I honestly think you might have done better to leave out some of the content of your patch submission, let people push back on the more obvious stuff, and have ready answers. There's a phenomenon where curators feel they must push back, and they feel weird when there's nothing to criticize - you can get around this by giving them something to criticize, with a ready-at-hand response!
Sorry this happened.
Somewhat OT, but that is generally referred to as a duck, and it _definitely_ has its uses.
> The artist working on the queen animations for Battle Chess... did the animations for the queen the way that he felt would be best, with one addition: he gave the queen a pet duck. He animated this duck through all of the queen’s animations, had it flapping around the corners. He also took great care to make sure that it never overlapped the “actual” animation.
> As expected, he was asked to remove the duck, which he did, without altering the real Queen animation.
I had an American friend who lived in Germany for many years who had a similar approach to dealing with immigration officials. He learned quickly that no matter how thorough he was when applying to extend his stay, his packet would always be "missing" some form. So he started leaving a few forms out of his packet, that way what they asked for was something he already had.
Ooh, you know it seems like you could use that method to misdirect the official from noticing a forged document.
In fact in open source I can imagine a strategy to get an underhanded bug accepted by pairing it with an obvious, trivial bug. The obvious one gets caught and submitter apologizes and resubmits. Then the dev merges the underhanded bug, comforted that a fairly common iteration cycle has been completed with a newbie.
It’s still wrong to suggest that every contributor put “the duck” in his contribution, when the fact is that it’s the people who handle the contribution who are the problem.
I've seen references to something similar in home remodeling - required paint and trim are stored in a closet during the work, but that closet itself never gets repainted. When the homeowner says "hey, you missed this!" they get the option of a discount and doing it themselves or the contractor sending someone to use the materials already on site. People who complete it themselves have a higher feeling of accomplishment because they were involved rather than simply hiring it done.
Haha cool, I've been around a while but I've never heard of this. I wonder if this concept can be combined with confessional debugging, aka rubber-ducking? :)
This is good advice and something I’ve learned over the past years.
One of the interesting points I’ve reflected on over time has been how _my_ issue was solved with the patch, however for others it wasn’t.
When I think about this I of course recognise that Postgres owes me nothing, however we both had similarly aligned objectives (fix bugs, do good), but because of our inability to successfully communicate, we didn’t get to a productive outcome and so a fixable bug sat in the release for some time.
I do wonder how we go about improving that kind of circumstance.
It's hard. I'm pretty sure that in the general case, the problem isn't solvable.
In Morgan Llywelyn's Finn Mac Cool ( https://www.amazon.com/dp/0312877374/ ), Finn is a man with no family from an inferior tribe, and he's a troop leader in a band of scummy soldiers of no social standing whatever. On the other hand, his potential is obvious -- he will eventually work his way up to self-made king.
While stationed in the capital, he falls in love with a respectable, middle-class blacksmith's daughter. She won't give him the time of day because of the difference in social class, but over time, as his respectability steadily climbs, she warms up to him.
They have a failed sexual encounter, and everything falls apart. She feels too awkward to approach him anymore. He believes (incorrectly) that they've become married at a level inappropriate to her class, and eventually comes to her home to propose a much higher grade of marriage. But she doesn't know how to accept without -- in her own eyes -- suffering a loss of dignity. A painful scene follows in which it's obvious that he wants to marry her, she wants to marry him, her mother wants her to marry him, but somehow none of them can see how to actually get to that point.
People really don't go for innovation in social protocols, even when the protocols they know are failing them.
I know that as ‘Take out the duck’.
> you can get around this by giving them something to criticize, with a ready-at-hand response!
So you suggest that the patches should be obviously “worse” in order to have better chance to be accepted?
He is (so much was plain). If those sub-optimal patches are accepted against expectations, a follow up patch with improvements can, in due time, be submitted.
Then please explain how would you create a patch where the only new content is a loop around the existing call to write but “sell” it “with a duck“ easier?
Try with a never-ending loop? That surely gets rejected ;-)
This reply from Tom Lane reminded me of the PC losering issue from the “worse is better” essay.
> 2. What is the rationale for supposing that a retry a nanosecond later will help? If it will help, why didn't the kernel just do that?
I have found the Postgres community extremely helpful and friendly. Perhaps this part of the interaction explains why they dropped the issue for a while?
Do you have some examples where you disagreed with them or pointed out their mistakes? Obviously everyone is "extremely helpful and friendly" when you agree with them or don't point out their mistakes.
The "asshole reviewer" is definitely a thing.
I don't think I've encountered "asshole reviewer" in postgres community, but maybe my "pain threshold" is just higher than yours. More often than not it's a valid disagreement about technical matters.
Review in the PostgreSQL community does tend to be on the "harsh" side.
I've only submitted one patch (\pset linestyle unicode for psql), and with a few rounds of review and revision it made it in. Overall I found this process nitpicky but productive. However, there is a point at which the harshly critical review can be detrimental, and from the other comments mentioned here, it sounds like this has been the case in the past.
With regard to the semantics of write(2) and fwrite(3) these are clearly documented in SUS and other standards, and the "valid disagreement" in this case may have been very counter-productive if it killed off proper review and acceptance of a genuine bug. There are, of course, problems with fsync(2) which have had widespread discussion for years.
> Review in the PostgreSQL community does tend to be on the "harsh" side.
Can you share an example of a review that you consider harsh? (feel free to share privately, I'm simply interested in what you consider harsh)
I admit some of the reviews may be a bit more direct, particularly between senior hackers who know each other, and from time to time there are arguments. It's not just rainbows and unicorns all the time, but in general I find the discussion extremely civilized (particularly for submissions from new contributors). But maybe that's just survivor bias, and the experience is much worse for others ...
> I've only submitted one patch (\pset linestyle unicode for psql), and with a few rounds of review and revision it made it in. Overall I found this process nitpicky but productive. However, there is a point at which the harshly critical review can be detrimental, and from the other comments mentioned here, it sounds like this has been the case in the past.
Oh, 2009 - good old days ;-) Thanks for the patch, BTW.
There's a fine line between nitpicking and attention to detail. We do want to accept patches, but OTOH we don't want to make the code worse (not just by introducing bugs, even code style matters too). At some point the committer may decide the patch is in "good enough" shape and polish it before the commit, but most of that should happen during the review.
> > Review in the PostgreSQL community does tend to be on the "harsh" side.
> Can you share an example of a review that you consider harsh? (feel free to share privately, I'm simply interested in what you consider harsh)
I wouldn't use the word harsh, but I do think there's plenty reviews where too much bikeshedding happens.
As the other responder noted, "harsh" was probably not the best adjective to use here, so I retract that.
Harsh? All reviews I have got have been nice. The only annoying thing is how some issues can be bikeshedded to death.
I think your patch was too perfect. Always leave some small thing for the gatekeepers to latch on, then be all "oh, yes, totally overlooked this, thank you, I'll fix it ASAP".
I think one of the Dilbert books had this concept. Make one bullet point ridiculous so your boss can have some input and ask you to remove the “Be involved in a land war in Asia” step in your ten point plan.
This is the first I'm hearing of the "take out the duck" concept. I've run software development teams for quite some time now and I have zero problem taking patches that are perfect without nitpicking them.
What the hell is wrong with people?
People have insecurities and egos. Sometimes you come across someone who needs to feel useful, or needs to feel important, or has a need to assert their power. Or who simply doesn't trust you, and assumes that if they don't see a problem, something is lurking under the surface.
With a lot of people you will never need this. But especially in larger projects where any number of people might latch on and review submissions, the chances for someone to find something to complain about go up rapidly.
And when one person has found something to complain about, all kinds of other social dysfunction quickly becomes an issue too.
We’re taking barely-evolved apes whose brains are made for running down antelope and picking bugs out of each other’s hair, and putting them in charge of incredibly complex machines that have only existed for a short time. There are bound to be some problems.
It's something that emerges in big open source projects. Not really related to software development teams that all answer to the same boss.
It's by far not limited to only open source projects. There are many situations where some fault, no matter how small or irrelevant, has to be found in order to sate the ego of the person doing the review or consideration of a problem or piece of work.
I don't think it's limited to open source projects either. But open source projects are unusual in that often even when commit rights etc. are controlled, being able to opine on submitted patches is open, and a way of building social status within the project. And while that often works well, it also does attract people who may even mean well, but will be overly critical because it's their way of feeling they're contributing.
In most corporate projects, odds are you're dealing with a much smaller pool of reviewers.
Did you actually read the thread and conclude it to be unreasonable gatekeeping, or are you assuming that it was?
Claiming that there is no problem is in some ways better than gatekeeping and in some ways worse. But you're right, I should not have called this gatekeeping, since the motivations are muddled.
That's an excellent patch. Partial writes (and reads) are a known behavior which is often overlooked. It rarely (if ever) occurs on local file systems, and even the network ones manifest it only from time to time, usually when there is some kind of congestion/segmentation going on.
It's a pity a lot of developers out there do not have an ingrained awareness about this.
Everyone in that thread knows what a partial write is. That's why the patch author saw this error message:
> 2011-07-31 22:13:35 EST postgres postgres [local] LOG: connection authorized: user=postgres database=postgres
> 2011-07-31 22:13:35 EST ERROR: could not write block 1 of relation global/2671: wrote only 4096 of 8192 bytes
> 2011-07-31 22:13:35 EST HINT: Check free disk space.
> 2011-07-31 22:13:35 EST CONTEXT: writing block 1 of relation global/2671
> 2011-07-31 22:13:35 EST [unknown] [unknown] LOG: connection received: host=[local]
The proposed patch retries rather than throwing an error.
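For anyone who hasn't seen the "wrap the call in a loop" idiom, here is a minimal sketch in C. `write_all` is a hypothetical helper name and this is not the code from the actual patch. Treating a zero return as `ENOSPC` is an assumption made here so the loop can never spin forever.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the function from the actual patch.
 * Calls write(2) repeatedly until all len bytes are written or a real
 * error occurs. Returns 0 on success, -1 on failure with errno set.
 */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing anything: retry */
            return -1;          /* genuine error: report it to the caller */
        }
        if (n == 0) {
            errno = ENOSPC;     /* assumption: no progress means no space left */
            return -1;
        }
        p += n;                 /* short write: skip what was already written */
        len -= (size_t) n;
    }
    return 0;
}
```

With this, a caller that wants an 8192-byte block written calls `write_all(fd, block, 8192)` once and only sees an error if the kernel actually reported one, instead of failing on "wrote only 4096 of 8192 bytes".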
By "short writes" you mean partial writes?
Yes, thank you.
You can't tell people anything.
In any large organization, fiefdoms are bound to develop. If you don't have principles for dealing with it, what ends up happening is that new ideas will be crowded out and claimed by one of the leaders of the fiefdom. The Linux model has formalized this process. I think more large open source projects should use this model.
The same pattern emerges in big commercial projects too. Why do you think it took 20 years to fix console resizing in Windows?
Ballmer and his ilk wouldn't allow it.
Yeah, that’s my experience with most groups of people who work with technology. Deny deny deny, default everyone is wrong but me. It’s why I never even tried to be part of some active open source community or god forbid, be the maintainer of something lots of people use.
I don’t understand how anyone intends to learn or grow with that mindset. Some are more polite about it than others but it’s still there.
So many pretend they work in a meritocracy. But when someone from outside their clique tries to follow their rules, well, it’s always something.
this is why we have facebook
Try committing to a Microsoft open source project or working with a lot of ex-Microsoft employees. You’ll want to kill yourself.
It sounds like you have an interesting anecdote to share.
If you can just give us the anecdote rather than making a vague, sweeping critique of MS OSS projects, you might avoid the downvotes.
<off-topic>I committed to a Microsoft open source project and worked with (ex-)Microsoft people. The experience was great :)</off-topic>
I've had this experience when I've tried to contribute to ASP.NET Core and .NET Core - in the case of ASP.NET Core, they weren't in the slightest bit interested in fixing something that's obviously broken. In the case of .NET Core, it was made clear that adding something new to the encryption library was going to take a long time, possibly years, even though there was demand for it.
I've given up on both :/
> In short, PostgreSQL assumes that a successful call to fsync() indicates that all data written since the last successful call made it safely to persistent storage. But that is not what the kernel actually does. When a buffered I/O write fails due to a hardware-level error, filesystems will respond differently, but that behavior usually includes discarding the data in the affected pages and marking them as being clean. So a read of the blocks that were just written will likely return something other than the data that was written.
> Google has its own mechanism for handling I/O errors. The kernel has been instrumented to report I/O errors via a netlink socket; a dedicated process gets those notifications and responds accordingly. This mechanism has never made it upstream, though. Freund indicated that this kind of mechanism would be "perfect" for PostgreSQL, so it may make a public appearance in the near future.
A real-life example can be found at https://stackoverflow.com/questions/42434872/writing-program...
The linked LWN article (from April 2018) is a great summary of the problem and potential solutions and its cause:
Ted Ts'o, instead, explained why the affected pages are marked clean after an I/O error occurs; in short, the most common cause of I/O errors, by far, is a user pulling out a USB drive at the wrong time. If some process was copying a lot of data to that drive, the result will be an accumulation of dirty pages in memory, perhaps to the point that the system as a whole runs out of memory for anything else. So those pages cannot be kept if the user wants the system to remain usable after such an event.
That justification is bogus however. There's already separate logic for the case the entire underlying device vanishes.
> entire underlying device vanishes
And that would then fail because of the hardware layer's bugs with reporting a device disconnect correctly. I mean, if the user follows the rules and pulls the stick out of a host port or a powered hub, sure, it's likely going to work per spec. But if it's on a daisy-chained 2003-era USB2 hub connected to a cheap USB3 hub? Yeah, good luck.
Is that really justification for incurring unsignalled dataloss? If that's actually common enough, count the number of uncleared errors on per-mount basis, and shut down the filesystem if the memory pressure gets too high while significant portions of memory are taken by dirty buffers that can't be cleaned due to IO errors.
Honestly it would be simpler just to make the "mark clean on write error" behavior a tunable flag rather than try to finesse this. Having the block layer not starve the system on bad hardware as a default behavior seems correct to me.
Also USB bus resets are not unheard of. Or moving devices from one port to another. If the device comes back within a minute or two you probably shouldn't throw out those writes.
If someone quickly pulls a USB drive, plugs it in another system, and then plugs it back in to the original system, then flushing writes could cause massive data corruption if those writes are relative to an outdated idea of what's on the block device. Sounds like a misfeature to me.
> If someone quickly pulls a USB drive, plugs it in another system, and then plugs it back in to the original system, then flushing writes could cause massive data corruption
That's user error, though. The kernel should react to removable media being pulled by sending a wall message to the appropriate session/console, stating something similar to "Please IMMEDIATELY place media [USB_LABEL] back into drive!!", with [Retry] and [Cancel] options. That way, the user knows what to expect -- OS's used to do this as a matter of course when removable media was in common use. In fact, you could even generalize this, by asking the user to introduce some specific media (identified by label) when some mountpoint is accessed, even if no operations were actually in progress.
RISC OS works like you propose. If you access the path "ADFS::MyDisk.$.Foo" (that is the Advanced Disk FileSystem, disk called "MyDisk", $ is the root directory and within that the file "Foo"), the user will get a pop-up asking them to insert the disk "MyDisk" into any available disk drive, then press OK to continue the I/O operation successfully. (The user can also click Cancel in which case the I/O operation will return an error.)
You don't ever have to have interacted with the "MyDisk" disk before. Simply access it, and (by the time the Disk I/O system call returns) the disk will be there (by virtue of asking the user to insert it.)
AmigaOS worked similarly. I don't remember how it'd behave if you yanked a floppy after having started a write, but certainly on attempting to open a file or change working directory or list directory contents of media that was not currently in a drive would get you a dialog box.
MSDOS used to do this too. And Atari GEMDOS.
I don't think MSDOS did the following part:
> In fact, you could even generalize this, by asking the user to introduce some specific media (identified by label) when some mountpoint is accessed, even if no operations were actually in progress.
In MSDOS the paths were like "A:FOO.TXT" as far as I remember. That means you had no facility, as a program, to request the user insert a particular disk identified by a particular label.
For example, on RISC OS (but not on MSDOS) you could implement a program to copy files between two disks by simply reading from the disk with the source label and writing to the disk with the destination label. Even on a machine with a single disk drive. The OS would request the user insert the source and destination disks as appropriate.
The FAT partition header has always had an ID field, the OS could request you to reinsert a removed disk and would know if the wrong one was inserted.
Also, on single-floppy systems there was a virtual B: so you could do disk to disk copies just by saying "copy A:foo B:" etc.
In FreeBSD there’s an (optional) mechanism for just that: gmountver.
Disagree, it is a kernel error if it cannot gracefully deal with removable media being, you know, removed.
The drive was in a corrupt state the first time it got unplugged. And it's nothing to shrug off, it might have been in the middle of rewriting a directory and lose all the contents.
So what are the odds that A) you get it back into a non-corrupt state B) the sectors affected by finishing the write will re-corrupt it C) you do this in one minute?
> Also USB bus resets are not unheard of.
They're initiated by the host, not by a USB device.
> Or moving devices from one port to another. If the device comes back within a minute or two you probably shouldn't throw out those writes.
This would be a nice feature. Although these writes would need to be buffered. Probably also throttled. There'd also be some risk with devices that have identical serial numbers. Some manufacturers give all of their USB disks / memory sticks same serial number...
> They're initiated by the host, not by a USB device.
Or by a power flicker. Which can be caused by plugging in other devices too.
> Although these writes would need to be buffered. Probably also throttled.
You don't necessarily have to allow new writes, the more important part is preserving writes the application thinks already happened. But that could be useful too.
> Some manufacturers give all of their USB disks / memory sticks same serial number...
You have the partition serial number too, usually.
> Or by a power flicker. Which can be caused by plugging in other devices too.
That sounds pretty unlikely. Any references about this? The pulse needs to be pretty particular and the USB device needs to be powered.
Personal experience, so not really.
And it doesn't have to be very particular. Bad grounding can cause lots of ports to reset.
Sounds intentional, probably something sent by the USB host controller driver. I guess some hub chips might send it independently as well in some scenarios.
USB bus resets (both D+ and D- down for 10ms) are a signal for the USB device software (well, firmware) to initialize device configuration. Basically to set configuration, data toggles and stalls to their defaults.
(I've written USB device firmware.)
Well yes it's an intentional recovery from an error state, but the point is this can happen unexpectedly.
> There'd also be some risk with devices that have identical serial numbers. Some manufacturers give all of their USB disks / memory sticks same serial number...
On most OSes the HW serial number of the disk is now usually supplemented in the disk management logic with the GPT “Disk GUID”, if available. Most modern disks (including removable ones like USB sticks) are GPT-formatted, since they rely on filesystems like ExFAT that assume GPT formatting. And those that aren’t are effectively already on a “legacy mode” code-path (because they’re using file systems like FAT, which also doesn’t support xattrs, or many types of filenames, or disk labels containing lower-case letters...) so users already expect an incomplete feature-set from them.
Plus: SD cards, the main MBR devices still in existence, don’t even get write-buffered by any sensible OS to begin with, precisely because you’re likely to unplug them at random. So, in practice, everything that needs write-buffering (and will ever be plugged into a computer running a modern OS) does indeed have a unique disk label at some level.
Does it always work? I had issues with encrypted filesystems that would stay mounted after the device itself disconnected and required a forced unmount before I could use them again.
Did I just get downvoted for giving an example where the OS doesn't seem to handle the disconnect of the physical device at all?
Write error is EIO, device disappearing is (or should be) ENXIO. Kernel should be able to tell the difference instead of just ignoring write failures.
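From an application's point of view, telling the two cases apart could look roughly like the sketch below. The enum and `classify_write_errno` are hypothetical names, and the mapping assumes the kernel really does report `ENXIO` for a vanished device, which the comment above only says it should.

```c
#include <errno.h>

/* Hypothetical classification of a failed write(2) or fsync(2). */
typedef enum {
    WRITE_RETRY,        /* interrupted, safe to call again */
    WRITE_IO_ERROR,     /* EIO: the write itself failed */
    WRITE_DEVICE_GONE,  /* ENXIO: assumed to mean the device disappeared */
    WRITE_OTHER         /* anything else, e.g. ENOSPC or EBADF */
} write_outcome;

write_outcome classify_write_errno(int err)
{
    switch (err) {
    case EINTR:
        return WRITE_RETRY;
    case EIO:
        return WRITE_IO_ERROR;
    case ENXIO:
        return WRITE_DEVICE_GONE;
    default:
        return WRITE_OTHER;
    }
}
```

A caller would pass `errno` right after the failing call, before anything else can overwrite it.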
Bigtable, and presumably Spanner, goes beyond that. Data chunks have checksums and they're immediately re-read and verified after compactions, because it turns out that errors do happen and corruption also happens, even when you use ECC — roughly once every 5PB compacted, per Jeff Dean figures.
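The write-then-verify idea can be sketched in a few lines of C. This is only an illustration, not how Bigtable does it: FNV-1a stands in for whatever checksum is really used, `write_and_verify` is a hypothetical helper, and on a normal system the re-read may well be served from the page cache rather than from the device.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* FNV-1a 64-bit hash, used here only as a stand-in checksum (an assumption). */
uint64_t fnv1a64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;     /* FNV offset basis */

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;              /* FNV prime */
    }
    return h;
}

/*
 * Hypothetical helper: write buf to path, sync it, read it back and compare
 * checksums. Returns 0 if the re-read data matches, -1 otherwise.
 * For brevity a short write or short read counts as a failure here.
 */
int write_and_verify(const char *path, const void *buf, size_t len)
{
    int rc = -1;
    char *copy = NULL;
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);

    if (fd < 0)
        return -1;
    if (pwrite(fd, buf, len, 0) != (ssize_t) len || fsync(fd) != 0)
        goto done;
    copy = malloc(len ? len : 1);
    if (copy == NULL || pread(fd, copy, len, 0) != (ssize_t) len)
        goto done;
    if (fnv1a64(copy, len) == fnv1a64(buf, len))
        rc = 0;                             /* what we read is what we wrote */
done:
    free(copy);
    close(fd);
    return rc;
}
```

A real system would store the checksum next to the chunk so it can be verified again on every later read, not just right after the write.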
Sounds like Google is continuing to act poorly when it comes to upstreaming their code.
I like to beat on them just like everyone else but what is the expectation? Everything they write if it runs at an OS level?
Google should be working with the kernel team to get their code mainlined, rather than keep their patches out of tree for years on end.
Perhaps they tried. Do you have evidence otherwise?
Doesn't this affect all databases? Or is it a different issue?
SQLite does a "flush" or "fsync" operation at key points. SQLite assumes that the flush or fsync will not return until all pending write operations for the file that is being flushed have completed. We are told that the flush and fsync primitives are broken on some versions of Windows and Linux. This is unfortunate. It opens SQLite up to the possibility of database corruption following a power loss in the middle of a commit. However, there is nothing that SQLite can do to test for or remedy the situation. SQLite assumes that the operating system that it is running on works as advertised. If that is not quite the case, well then hopefully you will not lose power too often.
Also this seems related:
That results in consistent behavior and guarantees that our operation actually modifies the file after it's completed, as long as we assume that fsync actually flushes to disk. OS X and some versions of ext3 have an fsync that doesn't really flush to disk. OS X requires fcntl(F_FULLFSYNC) to flush to disk, and some versions of ext3 only flush to disk if the inode changed (which would only happen at most once a second on writes to the same file, since the inode mtime has one second granularity), as an optimization.
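The OS X part of that is easy to handle in code. A minimal sketch, with `flush_to_disk` as a hypothetical helper name. Falling back to plain `fsync()` when `F_FULLFSYNC` fails is an assumption about what a caller would want, since not every filesystem supports the full flush.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush a file as durably as the platform allows.
 * Returns 0 on success, -1 on failure with errno set.
 */
int flush_to_disk(int fd)
{
#ifdef F_FULLFSYNC
    /* macOS: ask the drive to flush its own cache as well. */
    if (fcntl(fd, F_FULLFSYNC) != -1)
        return 0;
    /* Assumption: fall back to fsync() where F_FULLFSYNC is unsupported. */
#endif
    return fsync(fd);
}
```

On platforms without `F_FULLFSYNC` the `#ifdef` compiles away and this is just `fsync()`, so it does nothing about the ext3 behavior mentioned in the quote.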
The linked OSDI '14 paper looks good:
We find that applications use complex update protocols to persist state, and that the correctness of these protocols is highly dependent on subtle behaviors of the underlying file system, which we term persistence properties. We develop a tool named BOB that empirically tests persistence properties, and use it to demonstrate that these properties vary widely among six popular Linux file systems.
Yeah, IMO it's a Linux fsync bug. fsync() should not succeed if writes failed. fsync() should not clean dirty pages if writes failed. Both of these behaviors directly contravene the goals of user applications invoking fsync() as well as any reasonable API contract for safely persisting data.
Arguably, POSIX needs a more explicit fsync interface. I.e., sync these ranges; all dirty data, or just dirty data since last checkpoint; how should write errors be handled; etc. That doesn't excuse that Linux's fsync is totally broken and designed to eat data in the face of hardware errors.
That Dan Luu blog post you linked is fantastic and one I really enjoyed.
I worked on PostgreSQL's investigation and response to this stuff and I agree with you FWIW. But apparently only FreeBSD (any filesystem) and ZFS (any OS) agree with us. Other systems I looked at throw away buffers and/or mark them clean, and this goes all the way back to ancient AT&T code. Though no one has commented on any closed kernel's behaviour.
I doubt anyone worried much when disks just died completely in the good old days. This topic is suddenly more interesting now as virtualisation and network storage create more kinds of transient failures, I guess.
I'm a FreeBSD dev, that probably colors my opinion :-).
Yet another reason to use ZFS I guess.
There have been bugs involved on both sides.
On the kernel side the error reporting was not working reliably in some cases (see  for details), so the application using fsync may not actually get the error at all. Hard to handle an error correctly when you don't even get notified about it.
On the PostgreSQL side, it was the incorrect assumption that fsync retries past writes. It's an understandable mistake, because without the retry it's damn difficult to write an application using fsync correctly. And of course we've found a bunch of other bugs in the ancient error-handling code (which just confirms the common wisdom that error-handling is the least tested part of any code base).
Right; I've been following the headlines on this saga for years :-).
> On the PostgreSQL side, it was the incorrect assumption that fsync retries past writes.
That's the part I'm claiming is a Linux bug. Marking failed dirty writes as clean is self-induced data loss.
> It's an understandable mistake, because without the retry it's damn difficult to write an application using fsync correctly.
This is part of why it's a bug. Making it even more difficult for user applications to correctly reason about data integrity is not a great design choice (IMO).
> And of course we've found a bunch of other bugs in the ancient error-handling code (which just confirms the common wisdom that error-handling is the least tested part of any code base).
I'm not sure.
It's easy to blame other layers for not behaving the way you want/expect, but I admit there are valid reasons why it behaves the way it does and not the way you imagine. The PostgreSQL fsync thread started with exactly "kernel is broken" but that opinion changed over time, I think. I still think it's damn difficult (border-line impossible) to use fsync correctly in anything but trivial applications, but well ...
Amusingly, Linux has sync_file_range() which supposedly does one of the things you describe (syncing file range), but if you look at the man page it says "Warning: ... explanation why it's unsafe in many cases ...".
> It's easy to blame other layers for not behaving the way you want/expect,
Sure; to be clear, I work on both sides of this layer boundary (kernel side as well as on userspace applications trying to ensure data is persisted) on a FreeBSD-based appliance at $DAYJOB, but mostly on the kernel side. I'm saying, from the perspective of someone who works on kernel and filesystem internals and page/buffer cache, that cleaning dirty data without a successful write to media is intentional data loss. Not propagating IO errors to userspace makes it more difficult for userspace to reason about data loss.
> but I admit there are valid reasons why it behaves the way it does and not the way you imagine.
How do you imagine I imagine the Linux kernel's behavior here?
> The PostgreSQL fsync thread started with exactly "kernel is broken" but that opinion changed over time, I think. I still think it's damn difficult (border-line impossible) to use fsync correctly in anything but trivial applications, but well ...
The second sentence is a good argument for the Linux kernel's behavior being broken.
I think the only real defense of the linux et al behaviour here is that:
a) there's no standardized way to recover from write errors when not removing dirty data. I personally find that more than an acceptable tradeoff, but obviously not everybody considers it that way.
b) If IO errors, especially when triggered by writeback, cause kernel resources to be retained (both memory for dirty data, but even just a per-file data to stash a 'failed' bit), it's possible to get the kernel stuck in a way it needs memory without a good way to recover from that. I personally think that can be handled in smarter ways by escalating per-file flags to per-fs flags if there's too many failures, and remounting ro, but not everybody sees it that way...
That sounds reasonable to me.
> b) If IO errors, especially when triggered by writeback, cause kernel resources to be retained (both memory for dirty data, but even just a per-file data to stash a 'failed' bit), it's possible to get the kernel stuck in a way it needs memory without a good way to recover from that.
To put it more succinctly: userspace is not allowed to leak kernel resources. (The page cache, clean and dirty, is only allowed to persist outside the lifetime of user programs because the kernel is free to reclaim it at will — writing out dirty pages to clean them.)
> I personally think that can be handled in smarter ways by escalating per-file flags to per-fs flags if there's too many failures, and remounting ro, but not everybody sees it that way...
Yeah, I think that's the only real sane option. If writes start erroring and continue to error on a rewrite attempt, the disk is failing or gone. The kernel has different error numbers for these (EIO or ENXIO). If failing, the filesystem must either re-layout the file and mark the block as bad (if it supports such a concept), or fail the whole device with a per-fs flag. A failed filesystem should probably be RO and fail any write or fsync operation with EROFS or EIO.
If the device has been failed, it's ok to clean the lost write and discard it, releasing kernel resources, because we know all future writes and fsyncs to that file / block will also report failure.
This model isn't super difficult to understand or implement and it's easier for userspace applications to understand. The only "con" argument might be that users are prevented from writing to drives that have "only a few" bad blocks (if the FS isn't bad block aware). I don't find that argument persuasive — once a drive develops some bad sectors, they tend to develop more rapidly. And allowing further buffered writes can really exacerbate the amount of data lost, for example, when an SSD reaches its lifetime write capacity.
Isn't one of the difficulties the inability to decide when the I/O error is transient (e.g. running out of space with thin provisioning) vs. permanent (e.g. drive eaten by a velociraptor)?
Also, isn't it possible to use multipath to queue writes in case of error? I wonder if that's safe, though, because it will keep it in memory only and make it look OK to the caller.
> Isn't one of the difficulties the inability to decide when the I/O error is transient (e.g. running out of space with thin provisioning) vs. permanent (e.g. drive eaten by a velociraptor)?
I don't really think so. The thin provisioning thing really is just a series of bugs in the kernel, because the thin provisioning code just didn't (and maybe doesn't) bubble up errors with enough information.
Even if errors are transient, at some point the decision to shut down the FS on a device with enough transient errors is a much better fix than throwing away data.
Up until very recently (2-3 years ago) there was very little discussion of how to do consistent disk I/O properly, so it doesn't surprise me at all that many applications don't actually work.
Combine this with I/O error handling often being broken (in kernels, file systems, applications) and applications that are supposed to implement transaction-semantics can easily turn into "he's dead jim" at the first issue.
Tomas Vondra goes over this in the talk. There's a rationale for that behavior: pulling a USB stick out of the USB socket may have been what triggered the fsync() failure. In that case, there's no way the kernel will be able to retry reliably.
Silent error ignoring is never a great API. Especially for a data integrity operation.
fsync in case of the USB disappearing should simply return an error and drop the dirty pages.
> Doesn't this affect all databases? Or is it a different issue?
Possibly, https://wiki.postgresql.org/wiki/Fsync_Errors notes both MySQL and MongoDB had to be changed.
Note that the issue here is different from either of the bits you quote. The problem here is that if you call fsync(2), it fails, and you call fsync(2) again, on many systems the second call will always succeed because the first one has invalidated/cleared all extant buffers, and thus there's nothing left for the second one to sync. Which counts as a success.
AKA, because of these shortcuts, an fsync success effectively means "all writes since the last fsync have succeeded", not "all writes since the last fsync success have succeeded". Writes between a success and a failure may be irrecoverably lost.
> Doesn't this affect all databases?
Yes, most. And several did similar changes (crash-restart -> recovery) to handle it too. It's possible to avoid the issue by using direct IO too, but often that's not the default mode of $database.
No, this is an artifact of storage engine design. Direct I/O is the norm for high-performance storage engines -- they don't use kernel cache or buffered I/O at all -- and many will bypass the file system given the opportunity (i.e. operate on raw block devices directly) which eliminates other classes of undesirable kernel behavior. Ironically, working with raw block devices requires much less code.
Fewer layers of abstraction between your database and the storage hardware make it easier to ensure correct behavior and high performance. Most open source storage engines leave those layers in because it reduces the level of expertise and sophistication required of the code designers -- it allows a broader pool of people to contribute -- though as this case shows it doesn't necessarily make it any easier to ensure correctness.
> Most open source storage engines leave those layers in because it reduces the level of expertise and sophistication required of the code designers -- it allows a broader pool of people to contribute -- though as this case shows it doesn't necessarily make it any easier to ensure correctness.
Another reason is that it's easier to deploy — you can just use some variable space on your filesystem rather than shaving off a new partition.
In practice, implementations can be deployed either way. The storage hardware is always dedicated to the database for large-scale or performance-sensitive databases anyway, making it less of an inconvenience. For some common environments, raw block devices substantially increase throughput versus the filesystem, so there are real advantages to doing so when it makes sense.
Yea, but the largest portion of databases these days is not deployed on dedicated hardware that's been carefully capacity planned, and has a dedicated DBA.
It's not like it shouldn't be possible to make a durable application with buffered IO though. The OS designers just haven't given it much focus.
Notably, Go had failed to use F_FULLFSYNC on MacOS until I reported it. Fix landing in 1.12, but won't be back-ported.
I think it is a different matter.
MacOS documented its abnormal fsync behavior, Golang just didn't follow what was clearly described in those documents. Linux didn't document what happens on fsync failure, there is really nothing for applications to follow.
Also note that, the strange MacOS fsync() behavior is obvious, its fsync latency on the most recent mbp by default is close to the one observed on Intel Optane when we all know that mbp comes with much cheaper/slower SSD compared to Intel Optane. The same can't be said for the Linux fsync issue here.
According to https://wiki.postgresql.org/wiki/Fsync_Errors MacOS also has the Linux issue anyway.
Didn't affect LMDB. If an fsync fails the entire transaction is aborted/discarded. Retrying was always inherently OS-dependent and unreliable, better to just toss it all and start over. Any dev who actually RTFM'd and followed POSIX specs would have been fine.
LMDB's crash reliability is flawless.
> Any dev who actually RTFM'd and followed POSIX specs would have been fine.
So I'm trying to do that, and it seems to me the documentation directly implies that a second successful call to fsync() necessarily implies that all data was transferred, even if the previous call to fsync() had failed.
I say this because the sentence says "all data for the open file descriptor" is to be transferred, not merely "all data written since the previous fsync to this file descriptor". It follows that any data not transferred in the previous call due to an error ("outstanding I/O operations are not guaranteed to have been completed") must now be transferred if this call is successful.
What am I missing?
Technically, there is nothing besides "all data written since the previous fsync" since the previous fsync already wrote all the outstanding data at that point in time. (Even if it actually failed.) I.e., there is never any to-be-written data remaining after fsync returns. Everything that was queued to be flushed was flushed and dequeued. Whether any particular write failed or not doesn't change that fact.
> the previous fsync already wrote all the outstanding data at that point in time. (Even if it actually failed.)
I'm sorry, but this... just doesn't make sense. An event can't both happen and also fail to happen.
Also, the queue business is an implementation detail. Our debate is over what the spec mandates, not how a particular implementation behaves.
> I'm sorry, but this... just doesn't make sense. An event can't both happen and also fail to happen.
Consider a multithreaded application where a and b point to the same file. Note this isn't exactly what Postgres does:
No? It says "all data for the open file descriptor", not "all data for all open file descriptors".
The file descriptors point to the same file and can even be the same descriptor.
Maybe this is clearer/simpler, writing x and y to the same offset:
I'm confused, where is the reading happening that is causing T2 to be surprised? But in any case, I'm saying that if a and b are the same descriptor, then fsync(b)'s success should imply both a and b are written to disk, by the specification of fsync. I don't see where you're seeing a contradiction.
I understand what you're saying. If a and b are the same descriptor, then fsync(b)'s success DOES NOT mean that it was written to because the error was already reported by fsync(a)'s failure report. I think you're missing that there are two fsync() calls, one that fails and the other that doesn't.
And FYI: Even if they're not the same descriptor, you still have risk.
You're telling me what's actually happening, but that's not where we're disagreeing. I agree that's what happens. What I'm saying is that it's wrong. It goes against the spec of fsync. It says all data is to be written, not merely all data written since the last call to fsync. Your example didn't show any contradiction in this reasoning.
> It [says] all data is to be written, not merely all data written since the last call to fsync.
No it doesn't.
The spec† says: If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed. That clearly permits the Linux behaviour: there is no "data" that hasn't been "transferred to the storage device" at this point. You can see it from the perspective of the Linux kernel:
write() = success
fsync() = error: outstanding I/O not guaranteed††.
fsync() = success: no outstanding I/O
††: It would've been a different story if the spec said, "If the fsync() function fails, all future fsync() calls on the same file descriptor must fail," but it doesn't say this.
No, the spec is longer than that one sentence. That sentence of the spec is merely saying that if fsync fails then the I/O operations haven't necessarily been completed. That's completely true and obvious and yes, it's also consistent with Linux. That's not the problem. The problem is the other sentence I already quoted, which is that "all data for the open file descriptor is to be transferred to the storage device". This includes any data that hasn't been written, which includes data that has failed to write on a previous call to fsync. Like I already said once, you can't claim data that was sent to the device but failed to write to persistent storage is considered "transferred", since by that logic fsync would never guarantee any data was actually written to persistent storage (read the entire spec). I already explained all this but I'm repeating myself since it seems like every time you respond you're only choosing to read half of the spec and half of what I'm writing, but I'm tired of continuing.
But the "spec" uses that line, so you can't very well ignore it either. Most unixes also have this behaviour, so this isn't some nonconformant outlier, this is the standard behaviour. It might not be ideal, but most of the UNIX IO model and heritage isn't.
Apologies for assuming you didn't understand the behaviour when you said it "doesn't make sense": I'd assumed you meant that you were having trouble understanding the behaviour, rather than "all the application developers and unix implementors" that expect this behaviour, not understanding "the spec".
The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to
the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined.
The fsync() function shall not return until the system has completed that action or until an error is detected.
There's no mention of the OS retrying, or leaving the system in a state that a subsequent fsync can retry from where it left off. So you can't assume anything along those lines.
So you're suggesting "storage device" here isn't the same thing as "storage medium"? i.e. that merely transferring the data to a volatile buffer on the device is also considered being transferred to the "storage device", rather than to the (nonvolatile) storage medium?
By that logic even after a successful call you can't rely on the data being persisted, which goes squarely against the entire point of the function and its documentation: "The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk."
So this interpretation is wrong...
Nonsense. If the device returns "success" then everything was persisted. If it returns an error, then some/all of it was not persisted. There is no way for you to determine whether it is some or all or which. The only safe action for a user is to assume all of it failed.
And by the way, there are devices out there that lie, and claim the data is successfully persisted even though it only made it into a volatile cache.
Everything you said in this comment is consistent with what I've been saying. Nowhere did I suggest the user can assume data was written if an fsync call fails. I'm saying if an fsync call fails, the documentation of fsync (which I quoted multiple times) implies the next call will attempt to write data that wasn't successfully written on the last call.
Yeah, and we still disagree there. Nothing in the doc implies that just calling fsync() again will do anything useful, there is no implication of retryability.
> Nothing in the doc implies that just calling fsync() again will do anything useful
"Nothing" in the doc? More like everything in the doc? It literally says "all data for the open file descriptor is to be transferred" and "If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed." Together these are literally saying that all data that wasn't transferred due to a failure in one call will be transferred in the next successful call.
I could maybe buy it if the doc described a notion of "dequeueing" that was separate from writing, but it doesn't. It just talks about queued operations completing. So either they complete successfully (and are subsequently dequeued because that is common sense) or they don't.
Like if your boss had assigned you tasks A, B, and then C, and then he ordered you to finish all your tasks, and you failed to do B, and then he made the same request again, you wouldn't then suddenly turn to him and say "I have nothing to do because I already popped B and C off my to-do list". You'd get fired for that (at least if you persisted) because it literally wouldn't make sense.
You're going in circles.
You are equating "transferred" with "completed" but they are clearly not the same.
All the data was transferred. After transfer, the data on the device may have been written or may have been lost. The OS doesn't remember what has been transferred after the transfer - there is no language about this anywhere in this text.
But the spec doesn't imply in any way that you'll need to write() again if fsync() fails. And dropping dirty flags seems way out of spec because now you can read data from cache that is not on disk and never will be even if future fsync() succeeds.
So I don't buy the spec argument.
You have me curious now... I don't know anything about LMDB but I wonder if its msync()-based design really is immune... Could there be a way for an IO error to leave you with a page bogusly marked clean, which differs from the data on disk?
The spec leaves the system condition undefined after an fsync failure. The safe thing to do is assume everything failed and nothing was written. That's what LMDB does. Expecting anything else would be relying on implementation-specific knowledge, which is always a bad idea.
> I don't know anything about LMDB but I wonder if its msync()-based design really is immune.
By default, LMDB doesn't use msync, so this isn't really a realistic scenario.
If there is an I/O error that the OS does not report, then sure, it's possible for LMDB to have a corrupted view of the world. But that would be an OS bug in the first place.
Since we're talking about buffered writes - clearly it's possible for a write() to return success before its data actually gets flushed to disk. And it's possible for a background flush to occur independently of the app calling fsync(). The docs specifically state, if an I/O error occurs during writeback, it will be reported on all file descriptors open on that file. So assuming the OS doesn't have a critical bug here, no, there's no way for an I/O error to leave LMDB with an invalid view of the world.
> The spec leaves the system condition undefined [...]. The safe thing to do is assume everything failed
This is key.
Often programmers do 'assumption based programming'.
"Surely the function will do X, it's the only reasonable thing to do, right?". As human as this is, it is bad practice and leads to unreliable systems.
If the spec doesn't say it, don't assume anything about it, and keep asking. To show that this approach is feasible for anyone, here is an example:
Recently I needed to write an fsync-safe application. The question of whether close()+re-open()+fsync() is safe came up. I found it had been asked on StackOverflow (https://stackoverflow.com/questions/37288453/calling-fsync2-...) but received no answers for a year. I put a 100-reputation bounty on it and quickly got a competent reply quoting the spec and summarising:
> It is not 100% clear whether the 'all currently queued I/O operations associated with the file indicated by [the] file descriptor' applies across processes. Conceptually, I think it should, but the wording isn't there in black and white.
With the spec being unclear, I took the question to the kernel developers (https://marc.info/?m=152535098505449), and was immediately pointed at the Postgres fsyncgate results.
So by spending a few hours on not believing what I wished was true, I managed to avoid writing an unsafe system.
Always ask those in the know (specs and people). Never assume.
Yeesh. The POSIX manual on fsync is intentionally vague to hide cross-platform differences. There are basically no guarantees once an error happens. I guess that's one interpretation of RTFMing, but... clearly it doesn't match user expectations.
Mr. Chu, I hope you never lose your tenacity with respect to writing blurbs on LMDB's performance and reliability. I have enjoyed the articles comparing LMDB with other databases' performance and hope you continue to point the spotlight on the superior design decisions of LMDB.
Thanks, glad you're getting something out of those writeups. Hopefully it helps other developers learn a better path.
What if another process calls fsync? It will get the error. Then when LMDB calls fsync no error will be reported. And thus the transaction will not be retried.
Is this scenario dealt with?
Newer versions of Linux (but not plenty of other OSs) guarantee that each write failure will be signalled once to each file descriptor open before the error occurred. So that ought to be handled, unless the FD is closed in between.
LMDB is single-writer. Only one process can acquire the writelock at a time, and only the writer can call fsync() so this scenario doesn't arise.
If you open the datafile with write access from another process without using the LMDB API, you're intentionally shooting yourself in the foot and all bets are off.
On some systems, fsync is system-wide. Another process fsyncing an unrelated file descriptor can still consume the error meant for the LMDB file descriptor. Same thing goes for a user running the sync command from the terminal. A write lock won't protect you from this, unless you can prevent all other processes from calling fsync. It's got nothing to do with opening the LMDB datafile concurrently. If you share a physical disk device with any other process, you're at risk.
fsync() is not documented to be system-wide (while sync() is). That behavior would also be an OS bug. The question this person asked was specifically about fsync().
Sorry, that was confusing, I meant "system-wide" as in file system, not OS, i.e. "if you share a physical disk device with any other process".
When you flush a particular FD, some file systems just flush every dirty buffer for the entire disk. I wouldn't actually be surprised, though, if there are some kernels that flush all disks too, regardless of whether it's considered a bug or not.
"The question this person asked was specifically about fsync()."
Sure, but as you acknowledged elsewhere, if sync is called, it flushes everything, and this impacts upon the person who "asked specifically about fsync" since there's a chance on some kernels that sync will eat the error that fsync was expecting.
I'm pretty sure it does affect any database/application relying on buffered I/O. Even if you use the fsync() interface correctly, you're still affected by the bugs in error reporting.
Technically there is no bug in error reporting. fsync() reported an error as it should. The application continued processing, instead of stopping. fsync() didn't report the same error a second time, which leads to the app having problems.
The application should have stopped the first time fsync() reported an error. LMDB does this, instead of futile retry attempts that Postgres et al do. Fundamentally, a library should not automatically retry anything - it should surface all errors back to the caller and let the caller decide what to do next. That's just good design.
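A rough sketch of the "no retry, surface the error" approach in C, under the assumption of a helper named commit_no_retry (hypothetical, and not how LMDB or any other library is actually written). It loops over short writes, calls fsync() once, and hands any failure straight back to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes and flush them, with no retry of
 * fsync(). Returns 0 on success or a positive errno value; on any
 * non-zero return the caller must treat everything written since the
 * last successful call as lost. */
static int commit_no_retry(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing anything */
            return errno;
        }
        if (n == 0)
            return EIO;         /* no progress: give up instead of spinning */
        p += n;                 /* short write: continue with the rest */
        len -= (size_t)n;
    }
    if (fsync(fd) != 0)
        return errno;           /* surface it; a second fsync proves nothing */
    return 0;
}
```

What the caller does with a non-zero return (abort the transaction, crash and run recovery) is deliberately left out of the helper. As the reply below points out, this still depends on the kernel reporting the error in the first place.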
No. Kernels before 4.13 may or may not report the fsync error correctly, depending on various conditions.
There are more details in the talk  I posted earlier, and in the LWN articles related to this issue.
Ah I see. https://lwn.net/Articles/718734/
Someone issuing a sync() could cause an error to be cleared before the app's fsync() happens. That's a drag.
In the reported case, fsync does not persist all data to disk, but it reports success. How does LMDB deal with that situation?
In the reported case, fsync reported an error, then (more data may or may not have been written), then fsync is tried again and reports a success, which masks the fact that data from the previous fsync didn't get fully written.
As I already wrote - in LMDB, after fsync fails, the transaction is aborted, so none of the partially written data matters.
Hold on, they are saying that if sync fails (for example if someone types "sync" at the console) and then the database calls fsync(), it will not fail even though the data is gone. I don't see how any database that uses the buffer cache could guard against this case.
The kernel should never do this. If sync fails, all future syncs should also fail. This could be relaxed in various ways: sync fails, but we record which files have missing data, so any fsync for just those files also fails.
(Otherwise I agree with LMDB- there should be no retry business on fsync fails).
You're right, that was another error case in the article that I missed the first time.
In LMDB's case you'd need to be pretty unlucky for that bug to bite; all the writes are done at once during txn_commit so you have to issue sync() during the couple milliseconds that might take, before the fsync() at the end of the writes. Also, it has to be a single-sector failure, otherwise a write after the sync will still fail and be caught.
If only a single page is bad, and the majority of the writes are ok, then you still have to be unlucky enough for the sync to run after the bad write; if it runs before the bad write then fsync will still see the error.
We've been using LMDB at Cloudflare to store small-ish configuration data, it has been rock solid.
Thank you and the rest of the contributors for such a great library.
LMDB's claimed speed and reliability seem remarkable (from a quick glance). I would guess that's easier to achieve for a KV store than for a much more complex relational database. Got me thinking though.
Maybe Postgres could take advantage of LMDB?
Maybe by using LMDB as its cache instead of using the OS page cache, or maybe by writing the WAL to LMDB?
LMDB itself only uses the OS page cache. The way for LMDB to improve an RDBMS is for it to replace the existing row and index store, and eliminate any WAL. This is what SQLightning does with SQLite.
Have looked at replacing InnoDB in MySQL, but that code is a lot harder to read, so it's been slow going. Postgres doesn't have a modular storage interface, so it would be even uglier to overhaul.
Thanks, makes sense. I think Postgres is planning to have a pluggable storage interface in the next version, 12; would that help?
Also, nobody has mentioned the data checksums added in v9.3; do you know if they help avoid this kind of fsync-related corruption?
> Also, nobody has mentioned the data checksums added in v9.3; do you know if they help avoid this kind of fsync-related corruption?
Not really, I think. Page-level checksums don't protect against entire writes going missing, unfortunately.
LMDB is not a great fit for something like the WAL, where new data is written at the end, and old data discarded at the start. It leads to fragmentation (especially if WAL entries are larger than a single page).
Maybe with https://github.com/kellabyte/rewind which should support WAL for lmdb
"have an fsync that doesn't really flush to disk. OS X requires fcntl(F_FULLFSYNC) to flush to disk"
I remember highlighting this problem to the Firebird db developers maybe 13 years ago. AFAIR they were open to the problem I'd pointed out and went about fixing it on other platforms so that the db behaved everywhere as it did on OS X. I'm probably in the bottom 5% of IT professionals on this site. I'm rather amazed to find out that so late in the day Postgres has come round to fixing this.
I haven't used Firebird in years and can't find a link to the discussion (could have been via email).
Postgres has offered F_FULLFSYNC on OSX since 2005?
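For readers who have not run into the macOS quirk quoted above, a minimal sketch of what "use F_FULLFSYNC where it exists" can look like. `full_sync` is a hypothetical helper name, and the fallback to plain fsync when the fcntl call is rejected is an assumption about sensible behaviour, not something taken from Postgres or Firebird.

```python
import fcntl
import os


def full_sync(fd):
    """Flush fd as far towards stable storage as the platform lets us ask for."""
    full = getattr(fcntl, "F_FULLFSYNC", None)  # only defined on macOS
    if full is not None:
        try:
            fcntl.fcntl(fd, full)   # macOS: also asks the drive to flush its cache
            return
        except OSError:
            pass                    # assumed fallback: some filesystems reject it
    os.fsync(fd)                    # everywhere else: plain fsync()
```

On Linux and the BSDs the constant is absent, so this reduces to `os.fsync(fd)`.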
> Both Chinner and Ts'o, along with others, said that the proper solution is for PostgreSQL to move to direct I/O (DIO) instead.
Wait, is "direct I/O" the same as O_DIRECT?
The same O_DIRECT that Linus skewered in 2007?
> There really is no valid reason for EVER using O_DIRECT. You need a buffer whatever IO you do, and it might as well be the page cache. There are better ways to control the page cache than play games and think that a page cache isn't necessary.
> Side note: the only reason O_DIRECT exists is because database people are too used to it, because other OS's haven't had enough taste to tell them to do it right, so they've historically hacked their OS to get out of the way.
More background from 2002-2007: https://yarchive.net/comp/linux/o_direct.html
Turns out sometimes people other than Linus have more experience with IO than Linus.
I think there's pretty good reasons to go for DIO for a database. But only when there's a good sysadmin/DBA and when the system is dedicated to the database. There's considerable performance gains in going for DIO (at the cost of significant software complexity), but it's much more sensitive to bad tuning and isn't at all adaptive to overall system demands.
Yes, more than anything I'm amused by the fact that when you do the Linus-approved thing in 2007 it leaves you in this terrible situation in 2019, and when the other kernel experts rub their heads together their solution is to abandon the gentle advice from 10 years earlier.
Yeah, which is one of the reasons PostgreSQL went with buffered I/O, not to have to deal with this complexity. And it served us pretty well over time, I think.
I don't think that's really true. It worked well enough, true, but I think it allowed us to not fix deficiencies in a number of areas that we should just have fixed. IOW, I think we survived despite not offering DIO (because other things are good), rather than because of it.
I don't think we disagree, actually.
Yes - from a purely technical point of view, DIO is superior in various ways. It allows tuning to specific I/O patterns, etc.
But it's also quite laborious to get right - not only does it require a fair amount of new code, but AFAIK there is significant variability between platforms and storage devices. I'm not sure the project had enough developer bandwidth back then, or differently - it was more efficient to spend the developer time on other stuff, with better cost/benefit ratio.
I have to say he’s wrong when he says there’s no reason for EVER using O_DIRECT.
One HUGE reason is performance for large sequential writes when using fast media and slow processors. Specifically, when the write speed of the media is in the same ballpark as the effective memcpy() speed of the processor itself, which, believe it or not, is very possible today (but was probably more unlikely in 2007) when working with embedded systems and modern NVMe media.
Consider a system with an SoC like the NXP T2080 that has a high speed internal memory bus, DDR3, and PCIe 3.0 support. The processor cores themselves are relatively slow PowerPC cores, but the chip has a ton of memory bandwidth.
Now assume you had a contiguous 512 MiB chunk of data to write to an NVMe drive capable of a sustained write rate of 2000 MB/s.
The processor core itself can barely move 2000 MB/s of data, so it’s clear why direct IO would perform better since you’re telling the drive to pull the data directly from the buffer instead of memcpy-ing it into an intermediate kernel buffer first. With direct IO, you can perform zero-copy writes to the media.
This is why I’m able to achieve higher sequential write performance on some modern NVMe drives than most benchmark sites report, all while using a 1.2 GHz PowerPC system.
> Wait, is "direct I/O" the same as O_DIRECT?
No, there are other ways of doing DIO besides that particular interface.
Care to elaborate? And won't those different interfaces use the same direct I/O implementation, much like e.g. raw devices? I take the above question as rhetorical and would think the discussion above, as well as Linus' rant, applies to those too.
Historic note (like from the 80's) - any time a machine was rebooted we'd type sync; sync; sync; reboot - the explanation was that the only guarantee was that the second sync wouldn't start until the first sync successfully completed, plus one for good luck...
people were told 'do several sync commands, typing each by hand'
But this mutated to just 'sync three times', so of course people started writing 'sync; sync; sync'
I still type sync (just once) before I reboot. Just a habit.
The last thing I do before I leave my work desktop for the day is
./eod.sh && sync && sync
The only machine I ever developed on that needed a sync before reboot was an ancient SPARC workstation.
I forgot to sync almost every time, but it would always boot after a fsck.
I always get some kind of error message from dmraid about having been unable to stop the RAID array on shutdown/reboot. I thus manually do a sync(1) in hopes that the data survives. Hasn't failed me thus far, at least.
I have a flash drive that I sometimes put a video on to watch it on a small TV in the basement and I've noticed that Linux doesn't copy the file right away. The 'cp' does finish quickly but the data is not on the flash drive yet. You either have to eject and wait or sync and wait for it to actually transfer.
Needless to say, this tripped me up a few times and videos weren't fully transferred.
That behaviour is defined by a mount option.
What's the proper way to mount flash devices?
Try the sync option. From the man page:
> All I/O to the filesystem should be done synchronously. In the case of media with a limited number of write cycles (e.g. some flash drives), sync may cause life-cycle shortening.
Thanks so much!
I also noticed that on Gnome. I never investigated the implementation details of that progress bar, but my gut feeling is that the file is read from (or written to) the buffer cache quickly and the progress bar goes to nearly 100%, then stays there until the last writes succeed and actually reach the USB stick. Then the eject button sometimes needs extra time to finish the sync and tells me to wait a little. I always remove the stick only when it tells me it's safe to do so.
Your intuition is correct. The writes are quickly buffered to RAM and then fsync or close writes them out to the slow media. The (naive) progress bar probably only tracks progress buffering the writes; it's the simplest way to track progress, if inaccurate.
Yes. You can also run "watch cat /proc/meminfo" in console and watch "Dirty" and "Writeback" fields to see dynamics (and how long you have to wait).
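The same check can be scripted. The sketch below pulls the "Dirty" and "Writeback" fields out of /proc/meminfo; `parse_meminfo` and `pending_writeback_kib` are hypothetical helper names, and it assumes a Linux system where those fields are reported in kB.

```python
def parse_meminfo(text, fields=("Dirty", "Writeback")):
    """Return {field: value in kB} for the requested /proc/meminfo fields."""
    wanted = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in fields:
            wanted[name] = int(rest.split()[0])  # first token is the number
    return wanted


def pending_writeback_kib(path="/proc/meminfo"):
    """Read the live counters (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Polling `pending_writeback_kib()` after a `cp` and waiting for both numbers to fall back towards their idle level gives a rough idea of when the copy has really finished. The counters are system-wide, so other disk activity shows up in them too.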
"sync;sync;halt" - followed by turning off the power so we could open up the cabinet to do maintenance...
I think it's worthwhile to note that this, even before both kernel and postgres changes, really only is an issue if there are very serious storage issues, commonly causing more corruption than just forgetting the data due to be fsynced. The kernel has its own timeout / retry logic and if those retries succeed, there's no issue. Additionally, most filesystems remount read-only if there's any accompanying journaling errors, and in a lot of cases PG IO will also have a metadata effect.
i/o errors are an ongoing issue for AWS M5 and C5 instance types with their cloudy nvme devices, they have a tendency to randomly disappear or have extended timeouts
unfortunately the upstream patches being backported don't provide any real write guarantee
I don't really understand why people went with this headline and not "PostgreSQL developers discover most programs have used fsync incorrectly for decades, including PostgreSQL".
Because most normal programs do open-write-sync-close, and that mostly works as expected.
Postgres does open-write-close and then later, in another unrelated process, open-fsync-close. They discovered this doesn’t always work, because if somebody somewhere does a fsync on that file for any reason, their process might miss the error as it doesn’t stick.
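As a rough illustration of the "normal program" pattern described here, this sketch does open-write-fsync-close on a single descriptor, so any error the kernel reports at fsync time surfaces in the same place that did the writing. `write_and_fsync` is a hypothetical helper name, not code from Postgres or any other project.

```python
import os


def write_and_fsync(path, data):
    """open-write-fsync-close on one descriptor; raises OSError on failure."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:        # write() may accept fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                # the error, if any, is reported to this fd
    finally:
        os.close(fd)
```

The Postgres pattern differs in that the fsync happens later, on a descriptor opened by a different process, which is where the error can go missing.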
I agree, the initial report on pgsql-hackers had a subject that blamed PostgreSQL too strongly and I think that carried through to articles and blogs and Twitter etc.
You are right, this affects other things too. And indeed several projects took action to remove any notion of fsync retry, referencing the LWN article etc.
POSIX is probably underspecified. An Austin Group defect report would be interesting...
It's too long by 26 characters?
Am I misunderstanding this or does this mean Linux literally does not provide any way to ensure writes are flushed to a file in the presence of transient errors?
Does anyone know if Windows's FlushFileBuffers is susceptible to this as well? (P.S., interesting bit about FILE_FLAG_WRITE_THROUGH not working as you might expect: https://blogs.msdn.microsoft.com/oldnewthing/20170510-00/?p=...)
For the record and not mentioned in Tomas's talk: the PostgreSQL release due out next week will add a PANIC after any fsync failure (in other words, no longer retry). The same thing has been done by MySQL and MongoDB and probably everyone else doing (optional) buffered IO for database-ish stuff who followed this stuff on LWN etc.
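A minimal sketch of the "no retry" policy, under the assumption that the caller has a write-ahead log to recover from after a restart. `FsyncPanic` and `fsync_or_panic` are hypothetical names; the real PostgreSQL change is in C and is not reproduced here.

```python
import os


class FsyncPanic(Exception):
    """Hypothetical marker: stop now and recover from the log on restart."""


def fsync_or_panic(fd):
    """Treat any fsync failure as fatal instead of retrying it."""
    try:
        os.fsync(fd)
    except OSError as exc:
        # Deliberately no second fsync() here: the kernel may already have
        # dropped the dirty pages and cleared the error, so a retry could
        # report success for data that never reached the disk.
        raise FsyncPanic("fsync failed: %s" % exc) from exc
```

A server would let `FsyncPanic` take the process down rather than catch it and carry on.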
Only FreeBSD & Illumos do the sane thing.
The wiki uses the word "presumably". I take that as "the devs say it works but nobody tested it".
OpenBSD has brought forth a patch earlier this month to try and make fsync(2) less of a mess to use on OpenBSD, though it hasn't been committed yet.
Note that that patch doesn't really fix the issue. You can write();close();open();fsync(); and you'll miss the issue if the OS failed during writeback before the open(). That's worse than on new-ish linux, where at least it'll only forget the error if there's a lot of memory pressure.
I can see a mention of SQLite and MySQL. Also worth mentioning since this can affect any system:
I wonder how Oracle handles this. Raw device/partition and its own FS business logic?
Direct IO is the default for raw devices, and while apparently not the default otherwise IIRC it's pretty widely used by Oracle shops.
While Oracle supports raw devices, and has its own reasonably good way of managing them, most installations I see use files for storage.
(Which leaves the question of how oracle handles this unanswered, of course)
There is an old paper http://pages.cs.wisc.edu/~remzi/Classes/736/Papers/iron.pdf that analyzes how various filesystems do error handling and not too long ago it was fairly bad. My own experience was some of the older Windows would not even check if a write command failed. I used to laugh when developers would state how robust their databases were when the underlying filesystem does not even check for many errors. Hopefully things are better now.
FWIW MongoDB fixed this last year in their WiredTiger storage engine.
The kernel itself isn't really a transaction manager. If there is an I/O error on a file, then I'd only expect it to percolate up in that immediate "session". When should that error flag be expected to carry over until, filesystem remount? Or even longer, with the talk of storing a flag in the superblock?
Specifically it seems like asking for trouble to open() a path a second time, and expect fsync() calls to apply to writes done on the other file descriptor object - there's no guarantee they're even the same file ! At the very least, pass the actual file descriptor to the other process. Linux's initial behavior was in violation of what I've said here, but the patch to propagate the error to every fd open at the time of the error should resolve this to satisfaction.
I would think about the only reasonable assumption you could make about concurrent cross-fd syncing is that after a successful fsync(fd), any successful read(fd) should return data that was on disk at the time of that fsync(fd) or later. In other words, dirty pages shouldn't be returned by read() and then subsequently fail being flushed out.
disclaimer: It's been a while since I've written syscall code and I've had the luxury of never really having to really lean into the spec to obtain every last bit of performance. Given how many guarantees the worse-is-better philosophy forewent, I don't see much point to internalize what few are there.
 Ya sure, you can assert that the user shouldn't be otherwise playing around with the files, I'm just pointing out that translating path->fd isn't a pure function.
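For what "pass the actual file descriptor to the other process" can look like in practice, here is a small sketch using a Unix-domain socket. `send_fd` and `recv_fd` are hypothetical helper names; `socket.send_fds` and `socket.recv_fds` need Python 3.9 or later on a Unix platform.

```python
import os
import socket


def send_fd(sock, fd):
    """Hand an open descriptor to the peer of a Unix-domain socket."""
    socket.send_fds(sock, [b"F"], [fd])   # one dummy byte carries the fd


def recv_fd(sock):
    """Receive one descriptor sent with send_fd()."""
    _msg, fds, _flags, _addr = socket.recv_fds(sock, 1, 1)
    return fds[0]
```

The receiver gets a new descriptor number for the same open file, so there is no second path lookup that could resolve to a different file. Whether an fsync on it sees errors from the sender's earlier writes still depends on the kernel's error-reporting rules discussed in this thread.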
Have there been any real-world consequences from this, and how can they be prevented?
Does MySQL have the same flaw?
The original report from Craig Ringer was based on a real system that experienced a failure like this on thin provisioning.
MySQL in buffered IO mode surely has the same problem, and has implemented the same solution: PANIC rather than retrying. Same for WiredTiger (MongoDB).
Other systems using direct (non buffered) IO are not affected.
This can happen when an I/O device has an intermittent failure: a bad device or a faulty cable.
Interestingly enough, such issues are becoming more common. It's not just about devices being less reliable, but e.g. thin provisioning being used more widely etc.
I read this at lwn.net a while ago but it seems there is no fix to it. How is MySQL doing? I believe Oracle etc are not having this problem as they deal with disk directly.
I think that all databases are affected by this to some degree when not using Direct-IO, and I think Oracle and MySQL can both run with or without direct-IO.
Does anyone know if FoundationDB is affected?
FoundationDB does not use buffered IO, so no.
The meta point here is that with OSS, folks just assume that the code has been reviewed since it's open, but the reality is that unless someone actually reviews it, you really don't know. The popularity of a project doesn't mean there aren't fatal flaws.
No need to bring out that dead horse for a beating. No one in this thread has yet made the claim that OSS code is flawless; you're picking a fight with a straw man.
It's a genuine issue, and one that is made time and again by people who think "open == security" whenever there's a discussion about something like Google or iMessage, when the armchair security experts come out of the woodwork to promote their favourite whatever-it-is.
Sure, it mightn't be made in this thread yet, but that doesn't make it an irrelevant, invalid, or uninteresting observation. I think that the spirit of discussion, so integral to what separates HN from other websites, means we should not poo-poo this line of inquiry just because you're bored of it.
I think the point is more often closer to "open is a prerequisite to security".
It's more like "open is a prerequisite for personal verification of security".
A system can be closed and secure, just you can't verify it.
Yes, but I tend to view security as a somewhat epistemological phenomenon. It's not enough for the security to exist "somewhere out there in the universe" in an absolute, objective sense. If you have no way of verifying it, it could simply be a lie, and is thus useless for threat modelling.
> Sure, it mightn't be made in this thread yet, but that doesn't make it an irrelevant, invalid, or uninteresting observation.
I really think it does. It's like "the sky isn't green!" or "the earth isn't flat!" or "vaccines don't cause autism!" Sure, these are all true things, but they weren't exactly topics of discussion on this thread before you brought them up.
By all means, discuss the article, and rebut comments you feel espouse an inaccurate worldview. (IMO) preemptive rebuttals like this are only useful or interesting when they're somewhat novel, or represent some special insight into a particular field that outsiders wouldn't have. This one has neither.
My particular take on why this dead horse is irrelevant (as well as tedious and boring):
Fsync isn't a security issue, it's a data loss issue. Arguably, the Postgres behavior is quite reasonable and the article's headline is just inaccurate. Linux has been reviewed, e.g., https://danluu.com/file-consistency/ from 2017, summarizing research from 2001-2014, all of which pointed towards deficiencies in its data preservation behavior. The Linux community know they lose data and propose that users should accept it.
The Postgres <-> Linux fsync investigation has been ongoing for a long time, with lots of eyeballs on both sides of the kernel/userspace boundary. This isn't really a "bug escapes major application developers for 20 years!" so much as "Linux can't agree to provide an API to make file data consistent."
> but these weren't exactly topics of discussion on this thread before you brought them up
Well, we're sorry we didn't recognise you as the discussion warden, but I think that's how a conversation works: people are free to bring up the points that they feel relevant, and people can either continue the train of thought or not. If it has no appeal to you, you're free to let it die a natural death rather than make pronouncements on what's relevant or not.
Humorously, someone in this thread has now actually made the claim that their open source database product has flawless crash reliability: https://news.ycombinator.com/item?id=19127011
Multiple independent research teams with their crash analysis/fuzzing tools confirmed the fact. Along with over 7 years deployment in large enterprises: Zero crash-induced corruption, zero startup/recovery time. Crash-proof by design.
Then make whatever appropriate response to that comment you feel is necessary; it still isn't a reasonable top-level comment.
but not that it's because it's free software, rather that it's because they read the docs.
No one is beating a dead horse here, there is a valid point - "The popularity of a project doesn't mean there aren't fatal flaws"
The point is valid, but uninteresting. It's the default. Everyone who has skimmed HN for more than two weeks has seen this by example, if not by comment, time and time again.
Personally I assume if a project is at least somewhat popular it works most of the time and frequent/serious bugs are reported and researchable. For that to work I also report or second bugs I encounter. That doesn't mean there aren't any bugs in even the most popular OSS, especially in edge cases and rare scenarios.
With closed source though I often don't know the popularity and how many/what kind of bugs have been reported, but just the reputation of the vendor.
I prefer a popular software from a highly reputable vendor over a somewhat popular OSS. But I also prefer the most popular and battletested OSS, like postgresql and linux, over any closed system, e.g. SQL Server and Windows.
Any project can contain fatal flaws irrespective of its review policies. Particularly when we are concerned with the subtle behaviour of the interaction between write and fsync. Even if you read the standard it's not clear exactly how the system as a whole should behave; there are a number of situations which aren't mentioned in the standard at all.
It would be quite possible for a review, and even extensive testing, to fail to pick up on some system-specific subtleties.
This is a contrarian view and I will sound phobic but for this very reason- false sense of security + possibility of malicious check ins- I now place less emphasis on open/closed source and more on repute. I am happier to download exe's from good companies these days than a package from arch-aur
It's a perfect scenario to hear some rants from Linus. Though the "code" has changed and he really has to calm down for this :)