I think the typical presentation of 4NF makes more sense when you consider the ergonomics of data processing systems when relational modeling emerged. Predating computing, data processing was all centered around "records." Relational databases are often also described in terms of "records"; relational modeling emerged out of an attempt to formalize data processing practices, and the practitioners of the era were learning relational modeling from a data processing background, so it's no surprise that relational concepts are often presented "as opposed to" the way things were done in data processing.
In traditional data processing systems, a "record" was far more independent than we think of in a relational database. Records were hand-punched onto cards or keyed onto magnetic tape to be sorted, summarized, or whatever operation you cared about by a data processing machine. In this environment, joins were extremely expensive operations, often requiring that the operator feed the card stack over and over again (a rather literal O(n^2)). So, the "weird composed" schema is absolutely what you would do. And a lot of computer software was built around the exact same concepts, which made sense anyway as computers often continued to use sequential-access storage devices with similar (but less severe) problems around joins. This era famously persisted for a long time, with relational modeling as an academic concept well predating successful implementations of relational databases.
One could argue that all of the normal forms are pretty much "stop doing it the old sequential access (card-stack) way and use random access concepts (keys and joins) instead."
Of course that leaves the question of whether or not we should teach it that way... we don't tend to tell students about how memory is now random access, so perhaps by the same turn the historical approach isn't useful for teaching here. But it would undoubtedly be more effective if you give a little of the history.
I wonder how many people actually have seen those index card shelves[1] IRL. Everyone has heard enough of the music box analogy, the Tacoma Narrows, etc., but I don't remember this technology being covered often enough.
One set of shelves has cards for books sorted by ISBN. Another by titles, by authors, by keywords... Libraries (of books, of employee records, of manufacturing blueprints...) would have had whole rooms' worth of these shelves for searching and indexing.
Data normalization, SELECT, and JOIN are just separating information on cards into separate shelves, hand-picking cards, and laying out matching ones side by side on photocopiers. I've never had to mess with it, but apparently that was literally the database.
1: https://en.wikipedia.org/wiki/Library_catalog
Records had to be batched into blocks and blocks had to be organized for fast access on a spinning drum or platter. This is why we have B+ trees for databases, and not simple binary trees or whatnot. You need the branching factor in order to reduce the seek times: each tree node traversal goes to a different block of records. If they are on the same track, it takes rotational latency to get to the block. If they are on a different track, seek time plus rotational latency.
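As a rough back-of-the-envelope (my numbers, not anyone's benchmark: a billion records, a guessed fanout of ~500 keys per block), the branching factor is what turns dozens of seeks into a handful:

```sql
-- Levels ~ log_fanout(N); each level down the tree is potentially one more seek.
SELECT ceil(ln(1e9) / ln(500)) AS btree_seeks,   -- ~4 block reads
       ceil(ln(1e9) / ln(2))   AS binary_seeks;  -- ~30 block reads
```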
Reducing rotational latency is why we got 7200 RPM and faster drives: going from 3600 to 7200 means any block is available twice as often at the drive head. It doesn't matter how many or how few have been crammed onto the track; density doesn't fix latency.
What might be needless is the "4" in "4NF," because the first three (and a half) are obsolete. Or rather, as you mention, were only relevant when things were read sequentially. "Normalization" is really fine on its own.
Things that aren't normalized are "ad hoc," or really just "input."
Sequential is still faster in a lot of cases even on SSDs and RAM (prefetching)
> Things that aren't normalized are "ad hoc," or really just "input."
I agree that every denormalization is "ad hoc". It may not even feel this way because sometimes such ad-hocs become intimately tied into the system performance architecture.
However I'm not sure if I agree with "just input", or maybe that's a different way of saying it: sometimes (often) you denormalize from 3NF to avoid a join. Often you can demonstrate that avoiding the join is beneficial by looking at the query plan.
But the problem with teaching is that people become generally more afraid of joins, somehow thinking that every join is killing performance or something. ChatGPT produces a lot of FUD around joins and performance, unfortunately...
Thank you, the second paragraph of your comment is illuminating. There are some more replies in the same vein. I did not understand the mutual independence of "cards", and how you can use a different data layout if you only need to put it on a single card.
> with relational modeling as an academic concept well predating successful implementations of relational databases.
This is also non-obvious, thank you!
I’ve seen this design again and again, but not in relational databases. I usually see it in spreadsheets. I would be completely unsurprised to see a three column Excel sheet with Employee, Skill, and Language headers. (https://kerricklong.com/tmp/demo-4nf-need.jpg) The spreadsheet designer’s intuition is that Skill and Language are lists, and Employee is a section header. Smith might have a few rows with his languages and skills listed, one per cell, in arbitrary top-to-bottom order and with no consideration for how language and skill in a row relate to each other—only to Smith. The Employee column has a single merged cell for Smith, as many rows tall as it needs to be to contain the larger of his lists. When migrating this from a spreadsheet to a database, you lose merged cells as a feature and can no longer rely on order—so you copy the value Smith to each cell that had been merged.
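For what it's worth, here is roughly how that sheet tends to land in SQL versus the 4NF split (a minimal sketch; the table and column names are made up):

```sql
-- Spreadsheet-style "composed" table: skill and language get paired
-- arbitrarily row by row, padded with NULLs where one list is shorter.
CREATE TABLE employee_skill_language (
    employee text,
    skill    text,   -- NULL when the row only carries a language
    language text    -- NULL when the row only carries a skill
);

-- 4NF: one table per independent multivalued fact.
CREATE TABLE employee_skills (
    employee text NOT NULL,
    skill    text NOT NULL,
    PRIMARY KEY (employee, skill)
);

CREATE TABLE employee_languages (
    employee text NOT NULL,
    language text NOT NULL,
    PRIMARY KEY (employee, language)
);
```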
This is a very good graphical demonstration (especially the merged cells), thank you.
It is also a variation "2b" from Kent's list (just for reference).
I think that the textual demonstration of the same effect could be found in https://www.cargocultcode.com/normalization-is-not-a-process... ("So where did it begin?" section). I rediscovered this blog post after I published the article.
Perhaps TFA could explain what happened to math between 1977 and 2024. Because the meaning of "atomic" didn’t change, nor did the meaning of 1NF. The author pretends to explain 4NF, but never actually begins. He calls himself an "historian", which he might be. But he betrays very little knowledge of the relational model.
Repeating fields, which he proclaims as some kind of innovation, violate 1NF. They’re not atomic. Maybe the easiest way to understand it is that the list of languages can’t be used in SQL. If one teacher speaks Italian and French, SQL won’t find WHERE LANGUAGE = ‘FRENCH’. Nor will SQL join two teachers on that column.
SQL doesn’t recognize repeating fields in a column because it already has a way to represent them: as rows. That a teacher speaks two languages is not one fact, but two; not surprisingly, those two facts appear in two rows. ‘Twas true in 1977. ‘Tis true today. Thus ‘twill ever be.
I mostly agree with you, however one could argue that character strings aren’t atomic as well. They are sequences (“arrays”, if you will) of characters, and query operators like LIKE demonstrate that strings aren’t just used as opaque values. You could in principle normalize character strings into a String table with columns “string ID”, “character position”, and “character value”. Somewhat similarly, types like DATETIME and INTERVAL can be regarded as non-atomic.
Whether a database column has a composite type is a trade-off regarding performance, storage requirements, and needing special operators for queries to work on the substructure of a composite value.
> Because the meaning of "atomic" didn’t change, nor did the meaning of 1NF.
So what is the meaning of "atomic"? Are array-typed values atomic? (I think you can treat it that way, a lot of lists are immutable in different programming languages.)
> But he betrays very little knowledge of the relational model.
This is absolutely possible. Let me ask a couple of questions to borrow from your understanding of relational model.
> Repeating fields, which he proclaims as some kind of innovation, violate 1NF. They’re not atomic.
Sorry, what is "repeating fields"? I don't remember ever using this; what does it look like? Are you talking about array-typed columns? If yes, then why are they not atomic?
Also, I don't think I proclaim absolutely anything as a "kind of innovation"? PostgreSQL has supported array-typed columns since 2001, and it was already planned in 1986 as part of other "post-relational" features.
Two schemas "(instructor_id, List<skill_id>); PK(instructor_id)", and "(instructor_id, skill_id); PK(instructor_id, skill_id)" are isomorphic, thus I may be wrong but I don't see any anomalies here.
> Maybe the easiest way to understand it is that the list of languages can’t be used in SQL. If one teacher speaks Italian and French, SQL won’t find WHERE LANGUAGE = ‘FRENCH’.
Ah, so you seem to be talking about array-typed columns. I mean, you can find it easily by using "WHERE languages @> ARRAY['FRENCH']". The difference is mostly syntactical.
If you think that you are not allowed to access single elements of "languages" array because it violates "atomicity" of the value, then we should also say that you're not allowed to use queries like "WHERE name LIKE 'A%'", because they violate the atomicity of a string value. Don't you agree?
> SQL doesn’t recognize repeating fields in a column
Array-typed columns appeared in ANSI SQL:1999, a quarter of a century ago. I'm not sure what you're talking about.
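To make the "mostly syntactical" point concrete, here is a hedged PostgreSQL sketch (hypothetical table names); both forms answer "who speaks French?":

```sql
-- Array-typed column.
CREATE TABLE instructors_arr (
    instructor_id int PRIMARY KEY,
    languages     text[] NOT NULL
);

SELECT instructor_id
FROM instructors_arr
WHERE languages @> ARRAY['FRENCH'];   -- array "contains" operator

-- Classic normalized form.
CREATE TABLE instructor_languages (
    instructor_id int NOT NULL,
    language      text NOT NULL,
    PRIMARY KEY (instructor_id, language)
);

SELECT instructor_id
FROM instructor_languages
WHERE language = 'FRENCH';
```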
Atomic means nondecomposable [1]. I dare say array-typed values are not atomic.
Re: "(instructor_id, List<skill_id>); PK(instructor_id)”, are your Lists sorted ? Are your Lists actually Sets?: what prevents `insert values (100, [2, 2, 1])` ?
If you think those are isomorphic, how about “(json_data); PK(json_data)” ? With, you know, all the same data in the JSON. Tomato, tomato ?
I’ll just link this [2] PDF that seems to describe the present
> Does this mean that SQL:1999 allows databases that do not satisfy first normal form? Indeed, it does, in the sense that it allows “repeating groups”, which first normal form prohibits. (However, some have argued that SQL:1999’s ARRAY type merely allows storage of information that can be decomposed, much as the SUBSTRING function can decompose character strings—and therefore doesn’t truly violate the spirit of first normal form.)
[1] at least, according to Codd
[2] https://www.cl.cam.ac.uk/teaching/0304/Databases/sql1999.pdf
>20 years ago, when I started with databases, I found it natural to map a user object in my software to a single 'user' table with many columns. (Excel style)
It's only when I discovered how things were implemented, and the required space and combinatorial complexity, that 4NF started to make sense.
I think the natural 'naive' approach is to have many columns with redundant information inside.
At some point, with more experience, we see how this system is harder to maintain (burden on the code logic) and how much space is wasted. I discovered 4NF (without giving it a name) by myself to solve those issues.
The problem is that we teach 4NF to people who have never implemented code against a database and never had to solve the problems you get with only a pile of columns. So the examples seem artificial.
When I first saw 4NF in my lectures, it was easy, as I already knew all the issues you run into without it.
> I think the natural 'naive' approach is to have many columns with redundant information inside.
Interesting, thank you. I either don't remember that phase in my own learning, or I was lucky to read about the "correct" approach first. Maybe that's why I was so extra confused about 4NF explanations, haha.
I think it depends on what intuitions you start out with.
If you teach yourself programming as a kid like I did, and get to know about object-oriented programming very early on, normalization is more intuitive. "Of course the addresses go in a separate table from the users, just as you'd have an `address` object instead of just putting all the fields inside `User`."
If you start out using Excel a lot, and learn programming much later, the Excel way probably seems right at first. "Of course there's going to be a table with customer information, one row per customer, with separate fields for street, city, zipcode etc."
You are right, now that I think about it, I was using PHP prior to its object-oriented support.
So I had no objects; the users were entities in my head, I had to keep track of their data, so a multi-column entry in a database made total sense to me at the time.
Also not stopping to think of further scenarios like shipping addresses. Like, what if a customer wants to store multiple addresses for shipping? How many of those do they need? Is 10 enough? Do I now need street1 through street10?
Ahh true, the object-oriented approach certainly helps making sense of it, and I assumed it by default. I'll think about that.
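A minimal sketch of the alternative to street1..street10 mentioned a couple of comments up (hypothetical names): one row per address instead of N columns per customer.

```sql
CREATE TABLE customers (
    customer_id serial PRIMARY KEY,
    name        text NOT NULL
);

-- Any number of shipping addresses per customer, no street1..street10 columns.
CREATE TABLE customer_addresses (
    address_id  serial PRIMARY KEY,
    customer_id int  NOT NULL REFERENCES customers,
    street      text NOT NULL,
    city        text NOT NULL,
    zipcode     text NOT NULL
);
```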
If you maintain data records manually in Excel sheets, you tend to not “normalize” the data into several tables, because having to chase foreign-key references means you quickly lose track and don’t see everything related to a single entity in one place anymore. That’s because Excel doesn’t provide you with “joined” views. When inspecting a database interactively with a database GUI/TUI, this is very similar, unless you start building queries that let you join the data together again (which is harder and less intuitive than just browsing tables and filtering/searching in them).
Of course, not normalizing the data makes it more laborious and error-prone to keep things consistent, the more non-1:1 data associations there are. When databases are not presented as an interactive table-browsing tool, but as an application’s data store, where the application can join data as needed, then the benefits of having everything related to an entity “in one place” (in a single table) are outweighed by the drawbacks of non-normalized data. However, this is not obvious when you start out in the table-browsing paradigm. People naively tend to structure the data the way they would for manual organization.
It was and is still common to have non-experts design databases. And in many cases, normal form doesn't make sense. Tables with hundreds or thousands of columns are commonly the best solution.
What is rarely stressed about NF is that update logic must exist someplace and if you don't express the update rules in a database schema, it must be in the application code. Subtle errors are more likely in that case.
In the 1990s I was a certified Novell networking instructor. Every month their database mysteriously delisted me. The problem was never found but would have been prevented if their db was in normal form. (Instead I was given the VP of education's direct phone number and he kept my records on his desk. As soon as I saw that I was delisted, I would call him and he had someone reenter my data.)
Adding fields to tables and trying to update all the application code seems cheaper than redesigning the schema. At first. Later, you just need to normalize tables that have evolved "organically", so teaching the procedures for that is reasonable even in 2024.
> if you don't express the update rules in a database schema, it must be in the application code. Subtle errors are more likely in that case.
This is a common sentiment, but I think that you can always demonstrate a use case that is very much practical, not convoluted, and cannot be handled by the classical features of relational databases, CHECK primarily (without arrays of any kind).
A couple of examples are outlined here: https://minimalmodeling.substack.com/i/31184249/more-on-stru...
Basically I'm currently pretty much sure that it's impossible to "make invalid states unrepresentable" using a classical relational model.
Also, I think that if your error feels subtle then you should elevate the formal model to make it less subtle. You can have "subtle" errors when the checks are encoded as database constraints too.
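To illustrate the CHECK point with a hedged sketch (hypothetical schema): a per-row rule is easy to declare, but a cross-row rule already falls outside plain CHECK (PostgreSQL, for one, rejects subqueries and aggregates there), so it ends up in a trigger or in application code.

```sql
-- Per-row rule: fits in a CHECK constraint.
CREATE TABLE project_allocations (
    employee_id int NOT NULL,
    project_id  int NOT NULL,
    percent     int NOT NULL CHECK (percent BETWEEN 1 AND 100),
    PRIMARY KEY (employee_id, project_id)
);

-- Cross-row rule "SUM(percent) per employee must stay <= 100" cannot be
-- written as a CHECK here; it needs a trigger, a separate constraint
-- mechanism, or application code -- which is the point above.
```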
> Basically I'm currently pretty much sure that it's impossible to "make invalid states unrepresentable" using a classical relational model.
But that's what 9th NF is for! :-)
> The problem was never found but would have been prevented if their db was in normal form.
Sorry to be pedantic - if they didn't find the problem, how do we know that was the solution?
I suppose I don't know for sure. But normal forms prevent false data creation and data loss, and I know their db was not normalized from conversations with their IT support, so it's a pretty good guess.
"The existence of the theory implies that the “composed” design somehow arises, naively(?) or naturally."
Given how many first timers come up with really weird schema designs, I'm not necessarily surprised, although I agree presenting it as perhaps a default-but-wrong approach doesn't help much.
Not only beginners. I work with large teams with almost only seniors (10+ years of programming job experience) for large corps who don't know what 4NF is; they just chuck stuff in a table, update the same data across multiple tables, don't use indices or cascading rules, etc. Everything is in code; the db just stores it, badly. They treat everything like it's mongo, and this doesn't even work in mongo (no indices and it dies as well, of course).
4NF was explained confusingly because it's quite a complex theory. The fact the author says 'to the best of his understanding' it is 4NF is kind of telling.
With many things in software dev, it all doesn't matter until it matters. With few records and few users, like the average department LoB crud thing, it's not really a problem; it will be when either or both grow. My 'colleagues' (I am external) claim to know all this stuff, but in reality there is no proof they do; I have never seen them do things correctly with respect to dbs or software in general. That makes me a lot of money, but it is depressing really.
I was initially going to add some stuff I find a bit... shocking in the 2020s related to what I think is basic database stuff (maybe not specifically 4NF related). Held off, but I'll add one.
A colleague had to argue that an 'address' table should have a synthetic primary key, instead of a natural key. The argument for the initial 'design' held that 'the address itself is unique - there's not 2 '123 west main streets'!". Wasn't my company directly, but the crux of the conversation was shared with me, and ... in 2022, with however many articles, tutorials, books, classes, stackoverflow, etc... having to argue that 'no, someone's entire address should not be the natural key for a table' was ... flummoxing to me.
> 4NF was explained confusingly because it's quite complex theory. The fact the author says 'to be best of his understanding' it is 4nf is kind of telling.
> My 'colleagues' (I am external) claim to know all this stuff, but in reality there is no proof they do; I have never seen them do things correctly with respect to dbs or software in general.
Wait I don't get it. Are you saying that "(instructor_id, skill_id); PK (instructor_id, skill_id)" (and the same for languages) is NOT in 4NF; and I am not right? What is in 4NF, then?
I think you misunderstood me? I am talking about my colleagues... Or the many different colleagues I have throughout the years.
Ah, okay, yes, sorry, thank you!
Performance characteristics of the underlying hardware dictate the software abstractions that make sense. In today's machines, we still have differential cost for sequential, aligned, rightly sized block io operations vs random io operations of random sizes. And we have a hierarchy of storages with different latency, performance and costs – cpu caches, ram – and ram, ssd, spinning disks – and these via local attached vs in disaggregated and distributed clusters.
So, if you want absolutely optimal performance, you still have to care about your column sizes and the order of columns in your records and how your records are keyed and how your tables involved in join operations are organized and what kinds of new queries are likely in your application. This matters for both very small scale (embedded devices) and very large scale systems. For mid-scale systems with plenty of latency and cost margins, it matters a lot less than it used to. Hence, we have the emergence of nosql in the last decade and distributed sql in this decade.
There are several issues with this:
1. 4NF is dependent on context. This context is described as dependencies.
In the example of employees, skills and languages the design with two separate "link tables" employee_skills and employee_language is not obvious at all without specifying requirements _first_. If there is a requirement that an employee skill depends on the language (ie. an employee can type in a particular language only) - the design with a single 3 attributes table is the right one.
2. Multivalued attributes discussion is missing several important aspects for example referential integrity. It is not at all obvious what should happen when a language is deleted from languages table. Once you start adding rules and language to specify referential integrity for multivalued attributes you very quickly end up with... splitting multivalued attributes into separate relations.
> is not obvious at all without specifying requirements _first_. If there is a requirement that an employee skill depends on the language (ie. an employee can type in a particular language only) - the design with a single 3 attributes table is the right one.
In the "Baseline" section I do exactly that: specify requirements first. Nowhere does it say anything about skill depending on the language, so I assumed that it implies that they are independent.
"Suppose that we’re building a system that lets people find sports instructors. Each instructor has a number of skills they can teach: yoga, weightlifting, swimming, etc. Also, instructors can speak one or more languages: English, French, Italian, etc."
Not sure how to improve this wording to make it clearer that skills and languages are independent.
> Multivalued attributes discussion is missing several important aspects for example referential integrity. It is not at all obvious what should happen when a language is deleted from languages table.
I'm not sure what you mean here because multivalued attributes as defined in 1977 preserve referential integrity. But array-typed values do not, you're right.
This is a very important consideration, thank you! I did not realize that (because I don't really think that classical enforced FKs are that crucial in modern practice.) But from a theory point of view having a list of IDs is not enforced, yes.
It would take a really long time to count all the issues caused by dangling references to nonexistent rows I had to deal with in my career.
So yes - classical enforced FKs are still important - even if practice became „modern” (whatever that means)
You can't enforce FKs if you just split a database into several, due to regulatory, performance/maintenance, organizational or architectural (m*croservices) reasons.
Enforced FKs in a modern setting basically only work in a single, uhmmm, "transaction domain" (in the sense of ACID). You can certainly use two-phase commits or whatever, but that would kill performance so much that nobody (I think) seriously uses that.
Like, I think most modern "NoSQL" solutions don't have enforced FKs.
Even in a single transaction domain you will get performance improvements if you get rid of enforced FKs. I remember a very old podcast (maybe 2005 or even earlier), a conversation with Werner Vogels where he explained how they were fixing performance issues with Oracle that they were using at that time. For me it was an eye-opener.
The story basically is that when they created an order, they allocated a new ID for an order, then inserted rows in the "order_items" table, and only at the end they inserted a row in "orders" table. All this was NOT in a transaction.
If there was an error during this, they just left the dangling "order_items" rows, and cleaned them up regularly with the script.
Otherwise, Amazon (the web shop) could not handle the load, and it was 20 years ago on a state-of-the-art Oracle relational database(r)(tm) with all Amazon's resources.
This is when I realized that God is dead and everything is allowed.
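If it helps to picture it, that cleanup script presumably boils down to something like this; the table and column names are my guesses, not anything from the podcast:

```sql
-- Remove order_items whose parent order never made it in.
DELETE FROM order_items oi
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.order_id = oi.order_id)
  AND oi.created_at < now() - interval '1 hour';  -- avoid racing an in-flight insert
```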
This is a false alternative:
The fact that there are situations when implementing referential integrity is difficult does not mean you shouldn’t do it at all.
Secondly: relational model is based on logical-physical independence. Referential integrity is specified at the logical level and does not dictate any particular implementation. OTOH violating 1NF makes it impossible to even specify RI constraints.
NoSQL databases do not implement RI exactly because it is not possible to even specify it due to the lack of formal data model. And that’s not their strength but one of the many weaknesses.
The story about optimising Oracle performance by giving up data integrity is only an illustration of why you should not use Oracle :) Just use Postgres.
There is no problem with RI in Postgres as long as you remember to have an index on FK columns. Using CTEs allows you to perform data manipulation in a single statement with constraints enforced at the very end.
Similarly: giving up data integrity due to the microservice architecture cargo cult is… the tail wagging the dog.
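A minimal sketch of the CTE pattern mentioned above (hypothetical orders/order_items tables): the parent and child rows are inserted in a single statement, and the non-deferred FK is checked at the end of that statement.

```sql
WITH new_order AS (
    INSERT INTO orders (customer_id) VALUES (42)
    RETURNING order_id
)
INSERT INTO order_items (order_id, product_id, qty)
SELECT new_order.order_id, x.product_id, x.qty
FROM new_order
CROSS JOIN (VALUES (7, 2), (9, 1)) AS x(product_id, qty);
```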
> The fact that there are situations when implementing referential integrity is difficult does not mean you shouldn’t do it at all.
> Referential integrity is specified at the logical level and does not dictate any particular implementation. OTOH violating 1NF makes it impossible to even specify RI constraints.
Now we're bargaining. If referential integrity does not dictate any particular implementation, then having application-level RI and no database-level constraints should also be acceptable. Yet this is not a common position among FK enthusiasts.
We just had a discussion of DSQL where a lot of people were saying things like "no FKs thus no real database". Well, I dunno.
A possible clue is that the weird composed form is, still, the only possible form in which you can actually get results out of your SQL database.
Perhaps people worked backwards from sample query results, reasoning that an output that looks like that should come from a table that looks like that. Of course that begs the question of why that ever seemed like a sensible format for output, but apparently the SQL establishment has never seen fit to change it, so either the SQL people are all idiots or there's some advantage to that form.
(Yes, I know there is an obsolete meaning of that phrase still recorded in some dictionaries, but I prefer to use live English)
I was about to say I thought you might be mistaken because you can always use some variant of "group concat" to get sets of fields to show in comma-separated form...
But then I realized even with that, I couldn't get the output under discussion without subqueries. And I don't consider that a win for me.
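For reference, the "group concat" shape being discussed is roughly this (a hedged sketch; string_agg is the PostgreSQL spelling, GROUP_CONCAT the MySQL one, and the table name is hypothetical). It gets you one collapsed list per instructor; collapsing skills and languages into the same result rows is where the subqueries creep in.

```sql
SELECT instructor_id,
       string_agg(language, ', ' ORDER BY language) AS languages
FROM instructor_languages
GROUP BY instructor_id;
```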
When I learned quantum mechanics as part of a Chemistry BSc, it was broken down so as to be easier to digest. It made it much fucking harder for me. When I did it as a physicist, they just laid out the math and it was super straightforward. I think back when they came up with 4NF, everyone was still trying to figure out what the fuck they were talking about, and so explanations were needlessly complicated.
Anyway, now with non-relational databases, 4NF can slow you down.
Excellent article. 1NF is part of the problem as the article mentions.
I've seen some articles say that atomic in 1NF means you cannot have things that can be decomposed. All that matters is that you can perform the relational algebra on the table and get a sensible result.
To the why of presenting normalisation from those wacky forms, I think that also comes from data ingestion. You'll get a list of attributes for an employee from their CV and might naively just load that 'as is' into a table.
Indeed. You couldn’t store integers because those are decomposable into bits.
Atomicity was always a BS concept, and Chris Date and Hugh Darwen themselves have written as much on this topic. It didn’t help that although Edgar Codd had a math background, a lot of things published on the relational model just didn’t have formal rigor, but people interpreted it as such.
haha, I don't want to express disrespect to The Ancients, but I'm bewildered at the entire 3NF/BCNF thing, couldn't they just patch 3NF already by including the BCNF fix? Why do we need to bother with "3NF-but-not-BCNF" as a separate concept?
> a lot of things published on the relational model just didn’t have formal rigor, but people interpreted it as such.
I wonder if you may have something that substantiates this claim? I agree that relational theory doesn't look very complicated in principle, and it doesn't seem to be very "productive" as a theory. Compared to group theory, for example.
Well without atomicity you can't have 1NF, and thus none of the later ones. So let's hope you never store the text content of a configuration file or a log.
I believe it's a joke. However ... if you store a CV as a blob then separately store skills, experiences in a different table --- does that make the db design imperfect?
Yeah, I was thinking about the way how to actually build a non-1NF database.
So far my best idea is to store sqlite files as blobs in a different database.
Even that is 1NF, probably. As long as you don't have the id,blob duplicated in a table.
1NF (depending on the definition) has a few subtleties that make it either impossible to avoid in any relational database or almost impossible to enforce!
By impossible to avoid, I mean the basic rectangular shape of a table.
By impossible to enforce, I mean the avoidance of duplicates (without indexes) and NULLs.
> I've seen some articles say that atomic in 1NF means you cannot have things that can be decomposed.
Yeah, this seems to be a vague argument, but it's not clear what it means. If I write "WHERE name LIKE 'A%'", does it mean that I "decompose" the string?
I don't see a difference between that and "where the column with array of languages contains string 'fr'".
My goto example is phone numbers; typically you store the whole thing, but what if you want to do analysis on zones? You'd just extract the zone, not redesign your schema and rebuild the database.
You could argue that things like LIKE and SUBSTR etc are not really part of the relational model at all, but merely application code leaking into the SQL for convenience.
Therefore, you shouldn't expect / demand that the results are a formal relation.
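A hedged sketch of the "analysis on zones" case from a couple of comments up, assuming a hypothetical customers table with a plain text phone_number column and a fixed-length prefix:

```sql
-- Group by the leading digits (the "zone") without redesigning the schema.
SELECT substring(phone_number FROM 1 FOR 3) AS zone,
       count(*) AS cnt
FROM customers
GROUP BY zone
ORDER BY cnt DESC;
```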
The origin of classic explanation probably comes from people attempting to computerize paper forms. The school would have profiles for each teacher, probably a series of pages in a manila folder stored in a filing cabinet, that was filled out by hand/a typewriter. A space for the name. A series of blanks to be filled in with a list of skills (that maxes out at however many blanks they could put on the page). A series of blanks for languages. Woe is the teacher who knows more languages than the available blanks that are on the form!
People still create tables today (yes, even in 2024) like (teacher, skill, language) as the row definition. Someone looks at the list of information they are collecting, concludes that the only axis they need to query on is the teacher, and makes wide tables with many columns with disjoint purposes, that are then difficult to query and are inflexible.
Consider a library card catalog and the Dewey Decimal Classification system. The cards are arranged in the drawers by the topics standardized in the DDC. It looks like the major axis is the DDC numbering and their associated topics. While a card catalog lets you search for specific books if you know the topic and the author, it can not easily tell you all the books by the same author, or the list of topics that an author covers. Or find all the books that multiple authors have written together. But if one was to take the card catalog as it existed and computerize it, the naïve implementation would look like the unnormalized or lower normalization forms. The explanation of the progression of increasing normalization tells you how to incrementally move from a data schema that was heavily influenced by the limitations of physical reality, such as a card catalog, to a system that does not have those constraints.
The example of storing the skills and language as a list in a column is grossly inefficient in usage (ignoring the inefficiencies in storage and performance) and ignores that the "list" in 4NF is actually the result set, the set of rows returned. I suppose it could help one to think of the result set as horizontal columns (a list) rather than a vertical set of rows, but that's more of a side effect of the data presentation than the relational algebra underpinnings. Despite the fact that databases like postgres let you query and index on expressions applied to a JSON-typed column, you end up with something more complex, with different methods of enforcing data hygiene at the row and the individual column level, because you've got sets of data in a single column, when the set is defined by the rows. 4NF lets you answer way more questions about the data than you might initially anticipate. I've worked with a number of schemas that someone else created with no or little normalization that literally could not answer the desired questions. In this example, improper normalization means that finding out the skills of a teacher is easy, but finding out if the school is weak in any areas and what they should hire for is hard. But when the data is 4NF, you can answer questions easily in both directions without jumping through any data conversion hoops and you can have high confidence that the data remains hygienic.
What if the languages available do depend on what the lesson is about? That would require the composed form. Skipping it entirely is not the way to go.
You could start with decomposed and then compose it, I guess. I don't think either order is inherently easier to learn.
I agree that the wording of the normal forms, before getting to examples, is usually confusing.
> What if the languages available do depend on what the lesson is about?
Then it would be a completely different schema that would just be in 3NF.
If you look at Kent's variations you will see that they work hard (graphically) at making sure that you do not assume dependency between language and lesson.
The point is, the schema is not inherently unreasonable as the article is implying. There's a reason to make the comparison and take both versions seriously, not just show off the normalized version.
For one: you _always_ have to deal with the memory hierarchy, as there is no way to fit all data into the fastest memory. So you need to design with this in mind.
Secondly: the point of databases is not to implement a single specific task efficiently but:
a) provide means to maintain data so that it is easy to implement many different tasks (both current and future)
b) to safely _share_ data between various tasks/programs/users
There are all kinds of situations in which, if all your data fits into dynamic RAM, you don't have to give any other damn. (This is basically the bulk of the premise of technology like Python, for instance).
Nobody ever rebooted their machine because it became unusable due to their L1 or L2 cache thrashing.
You really have to care more about that particular hierarchy level transition, RAM to external storage, more than the others, because traditionally the transition is very steep, like suddenly hitting a brick wall of slowness.
The higher hierarchy levels can be left to generic management mechanisms, like the hardware itself or compiler register allocation. The RAM-to-storage transition is managed at run-time by software. Databases have to bypass the operating system's version of that management and provide their own, because it is so critical.
> Every time I read explanations of any database anything, it looks childishly simplistic to what I've done with C pointers and structs.
Basically it all exists because of storage devices:
One of the key motivations for relational was to have a standardized system for data storage so data consumption was simpler.
Prior to relational, every system had its own unique method of storing data (e.g. your pointers and structs, and other things etc.) which required unique custom programming for every new data consumption use case.
With relational, there is a simpler (not simple, but much simpler) method of access that no longer requires a developer write a custom program to get the data for each new request.
> Once you have fast random access, it's data structures and algorithms: forget the database cruft.
I mean, yeah, but in practice you have to provide ACID or other invariants, you have to provide distributed processing and storage, you have to provide replication etc., etc.
Also, "once you have fast random access" sounds like "once you have gajillion dollars". I too want a pony!
I think the typical presentation of 4NF makes more sense when you consider the ergonomics of data processing systems when relational modeling emerged. Predating computing, data processing was all centered around "records." Relational databases are often also described in terms of "records," relational modeling emerged out of an attempt to formalize data processing practices, the practitioners of the era were learning relational modeling from a data processing background, so it's no surprise that relational concepts are often presented "as opposed to" the way things were done in data processing.
In traditional data processing systems, a "record" was far more independent than we think of in a relational database. Records were hand-punched onto cards or keyed onto magnetic tape to be sorted, summarized, or whatever operation you cared about by a data processing machine. In this environment, joins were extremely expensive operations, often requiring that the operator feed the card stack over and over again (a rather literal O(n^2)). So, the "weird composed" schema is absolutely what you would do. And a lot of computer software was built around the exact same concepts, which made sense anyway as computers often continued to use sequential-access storage devices with similar (but less severe) problems around joins. This era famously persisted for a long time, with relational modeling as an academic concept well predating successful implementations of relational databases.
One could argue that all of the normal forms are pretty much "stop doing it the old sequential access (card-stack) way and use random access concepts (keys and joins) instead."
Of course that leaves the question of whether or not we should teach it that way... we don't tend to tell students about how memory is now random access, so perhaps by the same turn the historical approach isn't useful for teaching here. But it would undoubtedly be more effective if you give a little of the history.
I wonder how many people actually have seen those index card shelves[1] IRL. Everyone have heard enough of music box analogy, the Tacoma Narrows, etc, but I don't remember this technology covered often enough.
One set of shelves has cards for books sorted by its ISBN. Another by titles, by authors, by keywords, ... libraries(of books, of employee records, of manufacturing blueprints...) would have had whole rooms worth of these shelves for searching and indexing.
Data normalization, SELECT, and JOIN, are just separating information on cards into separate shelves, hand picking cards, and laying out matching ones side by side on photocopiers. I've never had to mess with it, but apparently that was literally the database.
1: https://en.wikipedia.org/wiki/Library_catalog
Records had to be batched into blocks and blocks had to be organized for fast access on a spinning drum or platter. This is why we have B+ trees for databases, and not simple binary trees or whatnot. You need the branching factor in order to reduce the seek times: each tree node traversal goes to a different block of records. If they are on the same track, it takes rotational latency to get to the block. If they are on a different track, seek time plus rotational latency.
Reducing rotational latency is why we got 7200 RPM and faster drives: going from 3600 to 7200 means any block is available twice as often at the drive head. It doesn't matter how many or how few have been crammed onto the track; density doesn't fix latency.
What might be needless is the "4" in "4NF," because the first three (and a half) are obsolete. Or rather, as you mention, were only relevant when things were read sequentially. "Normalization" is really fine on its own.
Things that aren't normalized are "ad hoc," or really just "input."
Sequential is still faster in a lot of cases even on SSDs and RAM (prefetching)
> Things that aren't normalized are "ad hoc," or really just "input."
I agree that every denormalization is "ad hoc". It may not even feel this way because sometimes such ad-hocs become intimately tied into the system performance architecture.
However I'm not sure if I agree with "just input", or maybe that's a different way of saying that: sometimes (often) you denormalize from 3NF to avoid a join. Often you can demonstrate that avoiding join is benefitial, by looking at query plan.
But the problem with teaching is that people become generally more afraid of joins, somehow thinking that every join is killing performance or something. ChatGPT produces a lot of FUD around joins and performance, unfortunately...
Thank you, the second paragraph of your comment is illuminating. There are some more replies in the same vein. I did not understand the mutual independence of "cards", and how you can use a different data layout if you only need to put it on a single card.
> with relational modeling as an academic concept well predating successful implementations of relational databases.
This is also non-obvious, thank you!
I’ve seen this design again and again, but not in relational databases. I usually see it in spreadsheets. I would be completely unsurprised to see a three column Excel sheet with Employee, Skill, and Language headers. (https://kerricklong.com/tmp/demo-4nf-need.jpg) The spreadsheet designer’s intuition is that Skill and Language are lists, and Employee is a section header. Smith might have a few rows with his languages and skills listed, one per cell, in arbitrary top-to-bottom order and with no consideration for how language and skill in a row relate to each other—only to Smith. The Employee column has a single merged cell for Smith, a number of rows tall it needs to be to contain the largest of his lists. When migrating this from a spreadsheet to a database, you lose merged cells as a feature and can no longer rely on order—so you copy the value Smith to each cell that had been merged.
This is a very good graphical demonstration (especially the merged cells), thank you. It is also a variation "2b" from Kent's list (just for reference).
I think that the textual demonstration of the same effect could be found in https://www.cargocultcode.com/normalization-is-not-a-process... ("So where did it begin?" section). I rediscovered this blog post after I published the article.
Perhaps TFA could explain what happened to math between 1977 and 2024. Because the meaning of "atomic" didn’t change, nor did the meaning of 1NF. The author pretends to explain 4NF, but never actually begins. He calls himself an "historian", which he might be. But he betrays very little knowledge of the relational model.
Repeating fields, which he proclaims as some kind of innovation, violate 1NF. They’re not atomic. Maybe the easiest way to understand it is that the list of languages can’t be used in SQL. If one teacher speaks Italian and French, SQL won’t find WHERE LANGUAGE = ‘FRENCH’. Nor will SQL join two teachers on that column.
SQL doesn’t recognize repeating fields in a column because it already has a way to represent them: as rows. That a teacher speaks two languages is not one fact, but two; not surprisingly, those two facts appear in two rows. ‘Twas true in in 1977. ‘Tis true today. Thus ‘twill ever be.
I mostly agree with you, however one could argue that character strings aren’t atomic as well. They are sequences (“arrays”, if you will) of characters, and query operators like LIKE demonstrate that strings aren’t just used as opaque values. You could in principle normalize character strings into a String table with columns “string ID”, “character position”, and “character value”. Somewhat similarly, types like DATETIME and INTERVAL can be regarded as non-atomic.
Whether a database column has a composite type is a trade-off regarding performance, storage requirements, and needing special operators for queries to work on the substructure of a composite value.
> Because the meaning of "atomic" didn’t change, nor did the meaning of 1NF.
So what is the meaning of "atomic"? Are array-typed values atomic? (I think you can treat it that way, a lot of lists are immutable in different programming languages.)
> But he betrays very little knowledge of the relational model.
This is absolutely possible. Let me ask a couple of questions to borrow from your understanding of relational model.
> Repeating fields, which he proclaims as some kind of innovation, violate 1NF. They’re not atomic.
Sorry, what is "repeating fields"? I don't remember ever using this, how does it look like? Are you talking about array-typed columns? If yes, then why are they not atomic?
Also, I don't think I proclaim absolutely anything is "kind of innovation"? Postgresql supports array-typed columns since 2001, and it was planned in 1986 already as part of other "post-relational" features.
Two schemas "(instructor_id, List<skill_id>); PK(instructor_id)", and "(instructor_id, skill_id); PK(instructor_id, skill_id)" are isomorphic, thus I may be wrong but I don't see any anomalies here.
> Maybe the easiest way to understand it is that the list of languages can’t be used in SQL. If one teacher speaks Italian and French, SQL won’t find WHERE LANGUAGE = ‘FRENCH’.
Ah, so you seem to be talking about array-typed columns. I mean, you can find it easily by using "WHERE languages @> ARRAY['FRENCH']". The difference is mostly syntactical.
If you think that you are not allowed to access single elements of "languages" array because it violates "atomicity" of the value, then we should also say that you're not allowed to use queries like "WHERE name LIKE 'A%'", because they violate the atomicity of a string value. Don't you agree?
> SQL doesn’t recognize repeating fields in a column
Array-typed columns appeared in ANSI SQL:1999, a quarter of century ago. I'm not sure what you're talking about.
Atomic means nondecomposable [1]. I dare say array-typed values are not atomic.
Re: "(instructor_id, List<skill_id>); PK(instructor_id)”, are your Lists sorted ? Are your Lists actually Sets?: what prevents `insert values (100, [2, 2, 1])` ?
If you think those are isomorphic, how about “(json_data); PK(json_data)” ? With, you know, all the same data in the JSON. Tomato, tomato ?
I’ll just link this [2] PDF that seems to describe the present
> Does this mean that SQL:1999 allows databases that do not satisfy first normal form? Indeed, it does, in the sense that it allows “repeating groups”, which first normal form prohibits. (However, some have argued that SQL:1999’s ARRAY type merely allows storage of information that can be decomposed, much as the SUBSTRING function can decompose character strings—and therefore doesn’t truly violate the spirit of first normal form.)
[1] at least, according to Codd
[2] https://www.cl.cam.ac.uk/teaching/0304/Databases/sql1999.pdf
>20 years ago, when I started with databases, I found natural to map a user object in my software with a unique 'user' table with many columns. (Excel style)
It's only when I discovered how things were implemented and the required space and combinatorial complexity that the 4NF started to make sense.
I think the natural 'naive' approach is to have many columns with redundant information inside.
At some point, with more experience, we see how this system is harder to maintain (burden on the code logic) and how much space is wasted. I discovered 4NF (without giving it a name) by myself to solve those issues.
The problem is that we teach 4NF to people that never implemented code with databases and had to solve the problems it generates if you only have multiple columns. So the examples seems artificial.
When for the first time I saw 4NF in my lectures, it was easy as I knew all the issues without.
> I think the natural 'naive' approach is to have many columns with redundant information inside.
Interesting, thank you. I either don't remember that phase in my own learning, or I was lucky to read about the "correct" approach first. Maybe that's why I was so extra confused about 4NF explanations, haha.
I think it depends on what intuitions you start out with.
If you teach yourself programming as a kid like I have, and get to know about object-oriented programming very early on, normalization is more intuitive. "of course the addresses go in a separate table from the users, just as you'd have an `address` object instead of just putting all the fields inside `User`."
If you start out using Excel a lot, and learn programming much later, the Excel way probably seems right at first. "Of course there's going to be a table with customer information, one row per customer, with separate fields for street, city, zipcode etc."
You are right, now that I think about it, I was using php prior to their implementation of object oriented.
So I had no objects, the users were entities in my head, I had to keep track of their data, so a multi column entry in a database made totally sense to me at the time.
Also not stopping to think of further scenarios like shipping addresses. Like what if customer wants to store multiple addresses for shipping. How many of those they need? Is 10 enough? Do I now need street1 through street10?
Ahh true, object oriented approach certainly helps making sense, and I assumed it by default. I'll think about that.
If you maintain data records manually in Excel sheets, you tend to not “normalize” the data into several tables, because having to chase foreign-key references means you quickly lose track and don’t see everything related to a single entity in one place anymore. That’s because Excel doesn’t provide you with “joined” views. When inspecting a database interactively with a database GUI/TUI, this is very similar, unless you start building queries that lets you join the data together again (which is harder and less intuitive than just browsing tables and filtering/searching in them).
Of course, not normalizing the data makes it more laborious and error-prone to keep things consistent, the more non-1:1 data associations there are. When databases are not presented as an interactive table-browsing tool, but as an application’s data store, where the application can join data as needed, then the benefits of having everything related to an entity “in one place” (in a single table) are outweighed by the drawbacks of non-normalized data. However, this is not obvious when you start out in the table-browsing paradigm. People naively tend to structure the data the way they would for manual organization.
It was and is still common to have non experts design databases. And in many cases, normal form doesn't make sense. Tables with hundreds or thousands of columns are commonly the best solution.
What is rarely stressed about NF is that update logic must exist someplace and if you don't express the update rules in a database schema, it must be in the application code. Subtle errors are more likely in that case.
In the 1990s I was a certified Novell networking instructor. Every month their database mysteriously delist me. The problem was never found but would have been prevented if their db was in normal form. (Instead I was given the VP of education's direct phone number and he kept my records on his desk. As soon as I saw that I was delist, I would call him and he had someone reenter my data.)
Adding fields to tables and trying to update all the application code seems cheaper than redesigning the schema. At first. Later, you just need to normalize tables that have evolved "organically", so teaching the procedures for that is reasonable even in 2024.
> if you don't express the update rules in a database schema, it must be in the application code. Subtle errors are more likely in that case.
This is a common sentiment but I think that you can always demonstrate a use case that is very much practical, not convoluted, and could not be handled by classical features of relational databases, CHECK primarily (without arrays of any kinds).
A couple of examples is outlined here: https://minimalmodeling.substack.com/i/31184249/more-on-stru...
Basically I'm currently pretty much sure that it's impossible to "make invalid states unrepresentable" using a classical relational model.
Also, I think that if your error feels subtle then you should elevate the formal model to make it less subtle. You can have "subtle" errors when the checks are encoded as database constraints too.
>Basically I'm currently pretty much sure that it's impossible to "make invalid states unrepresentable" using a classical relational model.
But that's what 9th NF is for! :-)
> The problem was never found but would have been prevented if their db was in normal form.
Sorry to be pedantic - if they didn't find the problem, how do we know that was the solution?
I suppose I don't know for sure. But normal forms prevent false data creation and data loss, and I know their db was not normalized from conversations with their IT support, so it's a pretty good guess.
"The existence of the theory implies that the “composed” design somehow arises, naively(?) or naturally."
Given how many first timers come up with really weird schema designs, I'm not necessarily surprised, although I agree presenting it as perhaps a default-but-wrong approach doesn't help much.
Not only beginners. I work with large teams with almost only seniors (10+ years of programming job experience) for large corps who don't know what 4nf is; they just chuck stuff in a table, update the same data across multiple tables, don't use indices or cascading rules etc. Everything is in code; the db just stores it, badly. They threat everything like it's mongo and this doesn't even work in mongo (no indices and it dies as well of course).
4NF was explained confusingly because it's quite complex theory. The fact the author says 'to be best of his understanding' it is 4nf is kind of telling.
With many things in software dev; it all doesn't matter until it matter. With few records and few users, like the average department LoB crud thing, it's not really a problem; it will be when either or both grow. My 'colleagues' (I am external) claim to know all this stuff, but in reality there is no proof they do; I have never seen them do things correctly with respect to dbs or software in general. That makes me a lot of money, but it is depressing really.
I was initially going to add some stuff I find a bit... shocking in the 2020s related to what I think are basic database stuff (maybe not specifically 4nf related). Held off, but I'll add one.
A colleague had to argue that an 'address' table should have a synthetic primary key, instead of a natural key. The argument for the initial 'design' held that 'the address itself is unique - there's not 2 '123 west main streets'!". Wasn't my company directly, but the crux of the conversation was shared with me, and ... in 2022, with however many articles, tutorials, books, classes, stackoverflow, etc... having to argue that 'no, someone's entire address should not be the natural key for a table' was ... flummoxing to me.
> 4NF was explained confusingly because it's quite complex theory. The fact the author says 'to be best of his understanding' it is 4nf is kind of telling.
> My 'colleagues' (I am external) claim to know all this stuff, but in reality there is no proof they do; I have never seen them do things correctly with respect to dbs or software in general.
Wait I don't get it. Are you saying that "(instructor_id, skill_id); PK (instructor_id, skill_id)" (and the same for languages) is NOT in 4NF; and I am not right? What is in 4NF, then?
I think you misunderstood me? I am talking about my colleagues... Or the many different colleagues I have throughout the year.
Ah, okay, yes, sorry, thank you!
Performance characteristics of the underlying hardware dictate the software abstractions that make sense. In today's machines, we still have differential cost for sequential, aligned, rightly sized block io operations vs random io operations of random sizes. And we have a hierarchy of storages with different latency, performance and costs – cpu caches, ram – and ram, ssd, spinning disks – and these via local attached vs in disaggregated and distributed clusters. So, if you want absolutely optimal performance, you still have to care about your column sizes and order of columns in your records and how your records are keyed and how your tables involved in join operations are organized and what kinds of new queries are likely in your application. This matters for both very small scale (embedded devices) and very large scale systems. For mid-scale systems with plenty of latency and cost margins, it matters a lot less than it used. Hence, we have the emergence of nosql in the last decade and distributed sql in this decade.
There are several issues with this:
1. 4NF is dependent on context. This context is described as dependencies.
In the example of employees, skills and languages the design with two separate "link tables" employee_skills and employee_language is not obvious at all without specifying requirements _first_. If there is a requirement that an employee skill depends on the language (ie. an employee can type in a particular language only) - the design with a single 3 attributes table is the right one.
2. The multivalued attributes discussion is missing several important aspects, for example referential integrity. It is not at all obvious what should happen when a language is deleted from the languages table. Once you start adding rules and language to specify referential integrity for multivalued attributes, you very quickly end up with... splitting multivalued attributes into separate relations.
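A rough sketch of the two designs being contrasted (hypothetical names, assuming employees, skills and languages each have a single-column primary key):

    -- Skills and languages independent: two link tables (the 4NF design).
    CREATE TABLE employee_skills (
        employee_id int NOT NULL REFERENCES employees,
        skill_id    int NOT NULL REFERENCES skills,
        PRIMARY KEY (employee_id, skill_id)
    );
    CREATE TABLE employee_languages (
        employee_id int NOT NULL REFERENCES employees,
        language_id int NOT NULL REFERENCES languages,
        PRIMARY KEY (employee_id, language_id)
    );

    -- Skill depends on language (e.g. 'can type in French'):
    -- the single three-attribute table is the right design.
    CREATE TABLE employee_skill_languages (
        employee_id int NOT NULL REFERENCES employees,
        skill_id    int NOT NULL REFERENCES skills,
        language_id int NOT NULL REFERENCES languages,
        PRIMARY KEY (employee_id, skill_id, language_id)
    );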
> is not obvious at all without specifying requirements _first_. If there is a requirement that an employee's skill depends on the language (i.e. an employee can type in a particular language only), the design with a single three-attribute table is the right one.
In the "Baseline" section I do exactly that: specify requirements first. Nowhere does it say anything about skill depending on the language, so I assumed that it implies that they are independent.
"Suppose that we’re building a system that lets people find sports instructors. Each instructor has a number of skills they can teach: yoga, weightlifting, swimming, etc. Also, instructors can speak one or more languages: English, French, Italian, etc."
Not sure how to improve this wording so that it's clearer that skills and languages are independent.
> The multivalued attributes discussion is missing several important aspects, for example referential integrity. It is not at all obvious what should happen when a language is deleted from the languages table.
I'm not sure what you mean here because multivalued attributes as defined in 1977 preserve referential integrity. But array-typed values do not, you're right.
This is a very important consideration, thank you! I did not realize that (because I don't really think that classical enforced FKs are that crucial in modern practice). But from a theory point of view, having a list of IDs is not enforced, yes.
It would take a really long time to count all the issues caused by dangling references to nonexistent rows I had to deal with in my career.
So yes - classical enforced FKs are still important - even if practice became "modern" (whatever that means).
You can't enforce FKs if you just split a database into several, due to regulatory, performance/maintenance, organizational or architectural (m*croservices) reasons.
Enforced FKs in a modern setting basically only work in a single, uhmmm, "transaction domain" (in the ACID sense). You can certainly use two-phase commit or whatever, but that would kill performance so much that nobody (I think) seriously uses it.
Like, I think most modern "NoSQL" solutions don't have enforced FKs.
Even in a single transaction domain you will get performance improvements if you get rid of enforced FKs. I remember a very old podcast (maybe 2005 or even earlier), a conversation with Werner Vogels where he explained how they fixed performance issues with the Oracle database they were using at the time. For me it was an eye-opener.
The story, basically, is that when they created an order, they allocated a new ID for the order, then inserted rows into the "order_items" table, and only at the end did they insert a row into the "orders" table. All of this was NOT in a transaction.
If there was an error along the way, they just left the dangling "order_items" rows and cleaned them up regularly with a script.
Otherwise, Amazon (the web shop) could not handle the load - and that was 20 years ago, on a state-of-the-art Oracle relational database(r)(tm), with all of Amazon's resources.
This is when I realized that God is dead and everything is allowed.
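Roughly the pattern as I understood it, as a sketch (made-up table names, not Amazon's actual schema):

    -- No transaction around the whole thing, no FK from order_items to orders.
    -- 1. Allocate a new order id (say, from a sequence): 12345.
    -- 2. Insert the child rows first.
    INSERT INTO order_items (order_id, product_id, qty) VALUES (12345, 987, 2);
    INSERT INTO order_items (order_id, product_id, qty) VALUES (12345, 988, 1);
    -- 3. Insert the parent row last; its existence marks the order as complete.
    INSERT INTO orders (order_id, customer_id) VALUES (12345, 42);

    -- Periodic cleanup of orphaned items whose parent never arrived
    -- (in practice you'd also filter on age so in-flight orders don't get reaped):
    DELETE FROM order_items oi
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.order_id = oi.order_id);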
This is a false alternative:
The fact that there are situations when implementing referential integrity is difficult does not mean you shouldn’t do it at all.
Secondly: the relational model is based on logical-physical independence. Referential integrity is specified at the logical level and does not dictate any particular implementation. OTOH violating 1NF makes it impossible to even specify RI constraints.
NoSQL databases do not implement RI exactly because it is not even possible to specify it, due to the lack of a formal data model. And that's not their strength but one of their many weaknesses.
The story about optimising Oracle performance by giving up data integrity is only an illustration of why you should not use Oracle :) Just use Postgres.
There is no problem with RI in Postgres as long as you remember to have an index on FK columns. Using CTEs allows you to perform data manipulation in a single statement with constraints enforced at the very end.
Similarly: giving up data integrity due to the microservice architecture cargo cult is... the tail wagging the dog.
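To illustrate the CTE point: something like this (hypothetical orders/order_items schema, Postgres data-modifying CTEs) inserts the parent and the children in one statement, so the FK is satisfied by the time constraints are checked at the end of it:

    WITH new_order AS (
        INSERT INTO orders (customer_id)
        VALUES (42)
        RETURNING order_id
    )
    INSERT INTO order_items (order_id, product_id, qty)
    SELECT new_order.order_id, item.product_id, item.qty
      FROM new_order,
           (VALUES (987, 2), (988, 1)) AS item (product_id, qty);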
> The fact that there are situations when implementing referential integrity is difficult does not mean you shouldn’t do it at all.
> Referential integrity is specified at the logical level and does not dictate any particular implementation. OTOH violating 1NF makes it impossible to even specify RI constraints.
Now we're bargaining. If referential integrity does not dictate any particular implementation, then having application-level RI and no database-level constraints should also be acceptable. Yet that is not a position FK enthusiasts commonly take.
We just had a discussion of DSQL where a lot of people were saying things like "no FKs thus no real database". Well, I dunno.
A possible clue is that the weird composed form is, still, the only possible form in which you can actually get results out of your SQL database.
Perhaps people worked backwards from sample query results, reasoning that an output that looks like that should come from a table that looks like that. Of course that begs the question of why that ever seemed like a sensible format for output, but apparently the SQL establishment has never seen fit to change it, so either the SQL people are all idiots or there's some advantage to that form.
(Yes, I know there is an obsolete meaning of that phrase still recorded in some dictionaries, but I prefer to use live English)
I was about to say I thought you might be mistaken, because you can always use some variant of "group concat" to get sets of fields to show in comma-separated form...
But then I realized even with that, I couldn't get the output under discussion without subqueries. And I don't consider that a win for me.
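What I had in mind, roughly (Postgres's string_agg standing in for "group concat"; table names made up to match the article's instructor example):

    -- One row per instructor, skills and languages flattened into comma-separated lists.
    SELECT i.name,
           (SELECT string_agg(s.name, ', ')
              FROM instructor_skills isk
              JOIN skills s ON s.skill_id = isk.skill_id
             WHERE isk.instructor_id = i.instructor_id) AS skills,
           (SELECT string_agg(l.name, ', ')
              FROM instructor_languages il
              JOIN languages l ON l.language_id = il.language_id
             WHERE il.instructor_id = i.instructor_id) AS languages
      FROM instructors i;
    -- Joining both link tables directly and grouping would cross-multiply
    -- skills x languages, which is why the correlated subqueries are hard to avoid.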
When I learned quantum mechanics as part of a Chemistry BSc, it was broken down so as to be easier to digest. It made it much fucking harder for me. When I did it as a physicist, they just laid out the math and it was super straightforward. I think back when they came up with 4NF, everyone was still trying to figure out what the fuck they were talking about, and so explanations were needlessly complicated.
Anyway, now with non-relational databases, 4NF can slow you down.
Excellent article. 1NF is part of the problem as the article mentions.
I've seen some articles say that atomic in 1NF means you cannot have things that can be decomposed. All that matters is that you can perform the relational algebra on the table and get a sensible result.
As to why normalisation is presented starting from those wacky forms, I think that also comes from data ingestion. You'll get a list of attributes for an employee from their CV and might naively just load that 'as is' into a table.
> you cannot have things that can be decomposed.
Indeed. You couldn’t store integers because those are decomposable into bits.
Atomicity was always a BS concept, and Chris Date and Hugh Darwen themselves have written as much on this topic. It didn’t help that although Edgar Codd had a math background, a lot of things published on the relational model just didn’t have formal rigor, but people interpreted it as such.
haha, I don't want to express disrespect to The Ancients, but I'm bewildered at the entire 3NF/BCNF thing, couldn't they just patch 3NF already by including the BCNF fix? Why do we need to bother with "3NF-but-not-BCNF" as a separate concept?
> a lot of things published on the relational model just didn’t have formal rigor, but people interpreted it as such.
I wonder if you may have something that substantiates this claim? I agree that relational theory doesn't look very complicated in principle, and it doesn't seem to be very "productive" as a theory. Compared to group theory, for example.
Technically you're storing the order of the bits, which isn't decomposable.
Well, without atomicity you can't have 1NF, and thus none of the later forms. So let's hope you never store the text content of a configuration file or a log.
I don't understand that comment. What do you mean, "let's hope", if I can easily store the text content of a configuration file or log in a column?
I believe it's a joke. However... if you store a CV as a blob, then separately store skills and experiences in a different table - does that make the db design imperfect?
Ah, okay.
Yeah, I was thinking about how one would actually build a non-1NF database. So far my best idea is to store SQLite files as blobs in a different database.
Even that is 1NF, probably. As long as you don't have the (id, blob) pair duplicated in a table.
1NF (depending on the definition) has a few subtleties that make it either impossible to avoid in any relational database or almost impossible to enforce!
By impossible to avoid, I mean the basic rectangular shape of a table.
By impossible to enforce, I mean the avoidance of duplicates (without indexes) and NULLs.
Thank you!
> I've seen some articles say that atomic in 1NF means you cannot have things that can be decomposed.
Yeah, this seems to be a vague argument, and it's not clear what it means. If I write "WHERE name LIKE 'A%'", does that mean I "decompose" the string?
I don't see a difference between that and "where the column with array of languages contains string 'fr'".
My go-to example is phone numbers: typically you store the whole thing, but what if you want to do analysis on zones? You'd just extract the zone, not redesign your schema and rebuild the database.
You could argue that things like LIKE and SUBSTR etc are not really part of the relational model at all, but merely application code leaking into the SQL for convenience.
Therefore, you shouldn't expect / demand that the results are a formal relation.
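For the phone-number case, I mean something like this (Postgres's split_part here; made-up column and storage format) rather than a schema change:

    -- Assuming numbers are stored whole as '+1-212-555-0100':
    -- count rows per area code by slicing the stored value.
    SELECT split_part(phone_number, '-', 2) AS area_code,
           count(*) AS customers
      FROM customers
     GROUP BY area_code
     ORDER BY customers DESC;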
Database normalization: https://en.wikipedia.org/wiki/Database_normalization
4NF: https://en.wikipedia.org/wiki/Fourth_normal_form
The origin of the classic explanation probably comes from people attempting to computerize paper forms. The school would have a profile for each teacher, probably a series of pages in a manila folder stored in a filing cabinet, filled out by hand or on a typewriter. A space for the name. A series of blanks to be filled in with a list of skills (maxing out at however many blanks fit on the page). A series of blanks for languages. Woe to the teacher who knows more languages than there are blanks on the form!
People still create tables today (yes, even in 2024) with (teacher, skill, language) as the row definition. Someone looks at the list of information they are collecting, concludes that the only axis they need to query on is the teacher, and makes wide tables with many columns with disjoint purposes, which are then difficult to query and inflexible.
Consider a library card catalog and the Dewey Decimal Classification system. The cards are arranged in the drawers by the topics standardized in the DDC. It looks like the major axis is the DDC numbering and its associated topics. While a card catalog lets you search for specific books if you know the topic and the author, it cannot easily tell you all the books by the same author, or the list of topics that an author covers, or find all the books that multiple authors have written together. But if one were to take the card catalog as it existed and computerize it, the naïve implementation would look like the unnormalized or lower normal forms. The explanation of the progression of increasing normalization tells you how to incrementally normalize a data schema that was heavily influenced by the limitations of physical reality, such as a card catalog, into a system that does not have those constraints.
The example of storing the skills and languages as a list in a column is grossly inefficient in usage (ignoring the inefficiencies in storage and performance) and ignores that the "list" in 4NF is actually the result set, the set of rows returned. I suppose it could help to think of the result set as horizontal columns (a list) rather than a vertical set of rows, but that's more of a side effect of the data presentation than of the relational algebra underpinnings. Even though databases like Postgres let you query and index expressions applied to a JSON-typed column, you end up with something more complex, with different methods of enforcing data hygiene at the row and the individual column level, because you've got sets of data in a single column when the set is already defined by the rows.

4NF lets you answer way more questions about the data than you might initially anticipate. I've worked with a number of schemas that someone else created with little or no normalization that literally could not answer the desired questions. In this example, improper normalization makes finding out the skills of a teacher easy, but finding out whether the school is weak in any areas, and what it should hire for, hard. But when the data is in 4NF, you can answer questions easily in both directions without jumping through any data conversion hoops, and you can have high confidence that the data stays hygienic.
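For instance, with a normalized link table both directions are a single query (hypothetical names):

    -- Easy direction: the skills of a given teacher.
    SELECT s.name
      FROM teacher_skills ts
      JOIN skills s ON s.skill_id = ts.skill_id
     WHERE ts.teacher_id = 7;

    -- The "other" direction: skills nobody on staff has (what to hire for).
    SELECT s.name
      FROM skills s
      LEFT JOIN teacher_skills ts ON ts.skill_id = s.skill_id
     WHERE ts.teacher_id IS NULL;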
Thank you, the first three paragraphs with examples are very useful to think about.
What if the languages available do depend on what the lesson is about? That would require the composed form. Skipping it entirely is not the way to go.
You could start with decomposed and then compose it, I guess. I don't think either order is inherently easier to learn.
I agree that the wording of the normal forms, before getting to examples, is usually confusing.
> What if the languages available do depend on what the lesson is about?
Then it would be a completely different schema that would just be in 3NF.
If you look at Kent's variations you will see that they work hard (graphically) at making sure that you do not assume dependency between language and lesson.
The point is, the schema is not inherently unreasonable as the article is implying. There's a reason to make the comparison and take both versions seriously, not just show off the normalized version.
Yes, absolutely, just present it in "historical misunderstandings" section.
My only concern is that it is presented as if it was something natural to begin with.
Every time I read explanations of any database anything, it looks childishly simplistic compared to what I've done with C pointers and structs.
Basically it all exists because of storage devices:
- large transfer factors (spinning platter hard drive track access, flash erase blocks)
- latency (spinning platter track to track seek, and rotational latency)
Once you have fast random access, it's data structures and algorithms: forget the database cruft.
That’s a very simplistic view.
For one: you _always_ have to deal with the memory hierarchy, as there is no way to fit all the data into the fastest memory. So you need to design with this in mind.
Secondly: the point of databases is not to implement a single specific task efficiently but:
a) provide means to maintain data so that it is easy to implement many different tasks (both current and future)
b) to safely _share_ data between various tasks/programs/users
There are all kinds of situations in which, if all your data fits into dynamic RAM, you don't have to care about much else. (This is basically the bulk of the premise of technology like Python, for instance.)
Nobody ever rebooted their machine because it became unusable due to their L1 or L2 cache thrashing.
You really have to care about that particular hierarchy-level transition, RAM to external storage, more than the others, because traditionally the transition is very steep, like suddenly hitting a brick wall of slowness.
The higher levels of the hierarchy can be left to generic management mechanisms, like the hardware itself or the compiler's register allocation. The RAM-to-storage transition is managed at run time by software. Databases have to bypass the operating system's version of that management and provide their own, because it is so critical.
> Every time I read explanations of any database anything, it looks childishly simplistic compared to what I've done with C pointers and structs. Basically it all exists because of storage devices:
One of the key motivations for relational was to have a standardized system for data storage so data consumption was simpler.
Prior to relational, every system had its own unique method of storing data (e.g. your pointers and structs, and other things, etc.), which required unique custom programming for every new data consumption use case.
With relational, there is a simpler (not simple, but much simpler) method of access that no longer requires a developer write a custom program to get the data for each new request.
> Once you have fast random access, it's data structures and algorithms: forget the database cruft.
I mean, yeah, but in practice you have to provide ACID or other invariants, you have to provide distributed processing and storage, you have to provide replication etc., etc.
Also, "once you have fast random access" sounds like "once you have gajillion dollars". I too want a pony!
Systems which handle everything in memory can also be distributed and be required to provide consistency guarantees.
"Once you have fast random access" happens just from moving up a step in the memory hierarchy; that's what I mean.