> habr.com
interesting to see this forum show up again.
I remember 15 years ago there were posts about a DIY drone from some random guy, with lots of theoretical physics deriving stability conditions. It got a lot of criticism. Now, looking back and following what DJI is doing with sensors, his approach was totally wrong and that community nailed it with its feedback. The forum got some extravagant ideas and some worthy criticism. At least back then.
I remember visiting this site daily 10-15 years ago, in Russian, of course. The moderation standards were super high, the karma system worked great, and the content quality was astonishing. Then the site changed owners, they tried heavy monetization with corpo-pseudo-blogpost-marketing crap, and it all went downhill from there.
It went downhill when they allowed getting an invitation via a single blog post, requiring just one person to like it enough to hand out an invitation. Which wasn't hard to write - just translate something popular from Hacker News before anyone else did.
Shortly after, it became hilariously easy to farm and manipulate karma balances across the entire site. With 50 accounts (multi-accounts or real people, all the same) you could create a new account a day.
Monetization started when it was already in a death spiral.
don't forget the awful redesign, including completely replacing the post formatter
all the mastery experienced authors had accumulated in crafting posts - gone overnight
habr is an institution. it's like the "runet hn", minus the wild west vc ecosystem, plus integrated blog posting the way the lj ogs intended. probably helps a lot with original work like TFA getting traction. more power to that!
runet sites of that era were often born out of the hacker's characteristic contrarian "because we can" attitude. attempts to monetize them in more recent years are bound to accomplish little more than fuck up the content quality and/or end in the "owner cashes out and opens a cafe" thing.
nevertheless, to this day, when i think of habrahabr, i think of a way higher bar for technical competence than hn. it's all in the attitude.
What are the modern equivalents of habr?
There's probably none. The Russian Internet has been Eternal Septembered too much for something similar to appear.
if i knew any, i sure as fuck wouldn't post them on hn of all places.
It was also notoriously politics-free, until something happened.
Improvement idea -- in my experience "valid_from" is always a date (no time, no timezone). That's how it's reported in documents (e.g. contract validity period).
Rows that need seconds (e.g. bank transactions) are events; they aren't "valid" from a particular point in time forward, they just happen.
In my experience, validity time may start at the start of a business day, and likely has a specific time zone attached. In particular, I've seen documents related to stock trading on NASDAQ specify Eastern Standard Time as the applicable timezone.
I understand how convenient it is to use UTC-only timestamps. It works in most cases but not all.
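For example, here is a minimal Python sketch (the date and contract are made up) of what gets dropped when a business-day start in a named timezone is stored as a bare UTC timestamp:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# Hypothetical contract that becomes valid at the start of the business day
# on 2024-01-15, US Eastern time; the timezone is part of the fact.
eastern = ZoneInfo("America/New_York")
valid_from_local = datetime(2024, 1, 15, 0, 0, tzinfo=eastern)

# The UTC rendering is a perfectly good instant, but the original zone
# (and with it the notion of "business day") is no longer recoverable.
valid_from_utc = valid_from_local.astimezone(timezone.utc)

print(valid_from_local.isoformat())  # 2024-01-15T00:00:00-05:00
print(valid_from_utc.isoformat())    # 2024-01-15T05:00:00+00:00
```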
No point losing information like that. What do you do if someone opens and closes an account on the same day? Changes their email address three times in one day? Etc
Agreed - if your updates carry a time, then it should be a datetime. It's just that in my work everything is date-only, e.g. employment starts on a date, not a datetime, so there's no data loss.
> country_code 01K3Y07Z94DGJWVMB0JG4YSDBV
A 7th normal form should mandate that no identifiers should ever be assigned to identifiers.
Not sure about that. Some folks would argue you should always use surrogate keys.
I probably wouldn't for country_code specifically, but for most things it's useful even when you have a 'natural' key.
It can be useful in some cases but can also be a hindrance. First, because identifiers are more useful when they actually let you identify the thing; and also because now they can change from instance to instance, from customer to customer, etc.
This format requires temporal validity with `valid_from`, but doesn't include `valid_to`. I don't understand how `valid_from` interacts with the (also required) `recorded_at`.
I don't have any additional insight into the format, but I think the idea is that there is an implied ->infinity on every date range. Every bank can only have one bank_name, so multiple bank_names for the same bank entity can be sorted on the 'valid' and 'recorded' axes to find the upper bound of each.
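Roughly something like this - a rough Python sketch with made-up records, showing only the 'valid' axis ('recorded' would be handled the same way):

```python
from datetime import date
from itertools import groupby
from operator import itemgetter

# Hypothetical bank_name records carrying only start times:
# (bank_id, bank_name, valid_from)
records = [
    ("bank-1", "First National", date(2019, 1, 1)),
    ("bank-1", "FN Financial",   date(2022, 6, 1)),
    ("bank-2", "Acme Savings",   date(2020, 3, 15)),
]

def with_implied_valid_to(rows):
    # Each record is implicitly valid until the next record's valid_from
    # for the same entity; the latest record is open-ended (None = infinity).
    rows = sorted(rows, key=itemgetter(0, 2))
    out = []
    for _, group in groupby(rows, key=itemgetter(0)):
        group = list(group)
        for current, nxt in zip(group, group[1:] + [None]):
            out.append((*current, nxt[2] if nxt else None))
    return out

for row in with_implied_valid_to(records):
    print(row)
```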
In the bitemporal model, both system and valid times are half-open intervals, and both the preceding and following interval can either have a different value or no value. Using only start times means that while records can be updated in either time stream, they cannot be logically deleted (in transaction time) or invalidated (in valid time) once they exist. There are databases where this assumption is valid, but in general it is problematic.
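To illustrate on the valid-time axis only (hypothetical records; transaction time would be handled analogously), explicit half-open intervals can express an invalidation that start-times-only cannot:

```python
from datetime import date

# (bank_id, bank_name, valid_from, valid_to); valid_to=None means open-ended.
bank_names = [
    ("bank-1", "First National", date(2019, 1, 1), date(2022, 6, 1)),
    ("bank-1", "FN Financial",   date(2022, 6, 1), date(2024, 1, 1)),
    # After 2024-01-01 the entity simply has no bank_name on record:
    # expressible here, but not with start times alone.
]

def name_as_of(rows, bank_id, as_of):
    # Half-open interval check: valid_from <= as_of < valid_to.
    for bid, name, start, end in rows:
        if bid == bank_id and start <= as_of and (end is None or as_of < end):
            return name
    return None  # invalidated or never recorded

print(name_as_of(bank_names, "bank-1", date(2023, 5, 1)))  # FN Financial
print(name_as_of(bank_names, "bank-1", date(2024, 5, 1)))  # None
```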
Odd seeing this right now for me. I recently implemented a 6NF schema for parsed XBRL files from EDGAR. The architecture was the right call... too bad the data is not useful for analytics.
Looks interesting, but there are few comments on the forum & even a negative vote count ATM. The format kinda looks "old school" in terms of defining records, but I guess that can be a positive in some circumstances?
What does "old school" mean here? Do you want to wrap this format in JSON, like JSON-LD? I don't mind.
I would say it is a niche solution that solves a specific problem.
Modern data sources increasingly lean towards and produce nested and deeply nested semi-structured datasets (i.e. JSON) that are heavily denormalised and rely on organisation-wide entity IDs rather than system-generated referential-integrity IDs (PKs and FKs). That is why modern data warehouse products (e.g. Redshift) have added extensive support for nested data processing – it neither makes sense to flatten/un-nest the nested data, nor is it easy to do anyway.
This is a fairly common problem. Data is often transferred between information systems in denormalized form (tables with hundreds of columns/attributes). In the data warehouse they are normalized (duplication is eliminated by replacing repeated values with references to reference tables) to make it easier to run complex analytical queries against the data. Usually they are normalized to 3NF and very rarely to 6NF, since there is still no convenient tool for 6NF (see my DSL: https://medium.com/@sergeyprokhorenko777/dsl-for-bitemporal-... ). And then the data is denormalized again in data marts to generate reports for external users. All these cycles of normalization - denormalization - normalization - denormalization are very expensive for IT departments. Therefore, I had the idea of transferring data between information systems directly in normalized form, so that nothing further would have to be normalized. The prototypes were the Anchor Modeling and (to a much lesser extent) Data Vault methodologies.
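To give a feel for the decomposition (not the exchange format itself - the attribute names and timestamps below are invented), here is a rough Python sketch of splitting one denormalized row into 6NF / Anchor-Modeling-style per-attribute rows:

```python
from datetime import date, datetime, timezone

# A denormalized source row (hundreds of columns in practice, only a few here):
source_row = {
    "customer_id": "cust-42",
    "name": "Alice Example",
    "email": "alice@example.com",
    "country_code": "DE",
}

valid_from = date(2024, 1, 1)
recorded_at = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)

def to_sixth_normal_form(row, entity_key, valid_from, recorded_at):
    # One "table" per attribute; each row is
    # (entity_id, value, valid_from, recorded_at), and the entity is an anchor.
    entity_id = row[entity_key]
    return {
        attr: [(entity_id, value, valid_from, recorded_at)]
        for attr, value in row.items()
        if attr != entity_key
    }

for attribute, rows in to_sixth_normal_form(
        source_row, "customer_id", valid_from, recorded_at).items():
    print(attribute, rows)
```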
Nice. Anchor Modelling is underappreciated.
Gonna have a look at your DSL.