> habr.com
interesting to see this forum show up again.
I remember 15 years ago there were posts about a DIY drone from some random guy, with lots of theoretical physics deriving stability conditions. It got a lot of criticism. Now, looking back and following what DJI is doing with sensors, his approach was totally wrong and that community nailed it with its feedback. The forum got some extravagant ideas and some worthy criticism. At least back then.
I remember visiting this site daily 10-15 years ago, in Russian, of course. The moderation standards were super high, the karma system worked great, and the content quality was astonishing. Then the site changed owners, they tried heavy monetization with corpo-pseudo-blogpost-marketing crap, and it all went downhill from there.
It went downhill when they allowed getting an invitation via a single blog post, requiring just one person to like it enough to hand out an invitation. Which wasn't hard to write - just translate something popular from Hacker News before anyone else did.
Shortly after, it became hilariously easy to farm and manipulate karma balances across the entire site. With 50 accounts (multi-accounts or real people, all the same) you could create a new account a day.
Monetization started when it was already in a death spiral.
don't forget the awful redesign, including completely replacing the post formatter
all the mastery experienced authors had accumulated in crafting posts - gone overnight
habr is an institution. it's like the "runet hn", minus the wild west vc ecosystem, plus integrated blog posting the way the lj ogs intended. probably helps a lot with original work like TFA getting traction. more power to that!
runet sites of that era were often born out of the hacker's characteristic contrarian "because we can" attitude. attempts to monetize them in more recent years are bound to accomplish little more than fuck up the content quality and/or end in the "owner cashes out and opens a cafe" thing.
nevertheless, to this day, when i think of habrahabr, i think of a way higher bar for technical competence than hn. it's all in the attitude.
What are the modern equivalents of habr?
There's probably none. The Russian Internet has been Eternal Septembered too much for something similar to appear.
if i knew any, i sure as fuck wouldn't post them on hn of all places.
It was also notoriously politics-free, until something happened.
Improvement idea -- in my experience "valid_from" is always a date (no time, no timezone). That's how it's reported in documents (e.g. contract validity period).
Rows that need seconds (e.g. bank transactions) are events; they aren't "valid" from a particular point in time forward, they just happen.
In my experience, validity time may start at the start of a business day, and likely has a specific time zone attached. In particular, I've seen documents related to stock trading on NASDAQ specify Eastern Standard Time as the applicable timezone.
I understand how convenient it is to use UTC-only timestamps. It works in most cases but not all.
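For example, here is a minimal Python sketch (the date and contract are made up) of what gets dropped when a business-day start in a named timezone is stored as a bare UTC timestamp:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# Hypothetical contract that becomes valid at the start of the business day
# on 2024-01-15, US Eastern time; the timezone is part of the fact.
eastern = ZoneInfo("America/New_York")
valid_from_local = datetime(2024, 1, 15, 0, 0, tzinfo=eastern)

# The UTC rendering is a perfectly good instant, but the original zone
# (and with it the notion of "business day") is no longer recoverable.
valid_from_utc = valid_from_local.astimezone(timezone.utc)

print(valid_from_local.isoformat())  # 2024-01-15T00:00:00-05:00
print(valid_from_utc.isoformat())    # 2024-01-15T05:00:00+00:00
```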
No point losing information like that. What do you do if someone opens and closes an account on the same day? Changes their email address three times in one day? Etc
Agreed - if your updates carry a time, then it should be a datetime. It's just that in my work everything is date-only, e.g. employment starts on a date, not a datetime, so there's no data loss.
> country_code 01K3Y07Z94DGJWVMB0JG4YSDBV
A 7th normal form should mandate that no identifiers should ever be assigned to identifiers.
Not sure about that. Some folks would argue you should always use surrogate keys.
I probably wouldn't for country_code specifically, but for most things it's useful even when you have a 'natural' key.
It can be useful in some cases but can also be a hindrance. First, because identifiers are more useful when they actually let you identify the thing; and also because now they can change from instance to instance, from customer to customer, etc.
This format requires temporal validity with `valid_from`, but doesn't include `valid_to`. I don't understand how `valid_from` interacts with the (also required) `recorded_at`.
I don't have any additional insight into the format, but I think the idea is that there is an implied ->infinity on every date range. Every bank can only have one bank_name, so multiple bank_names for the same bank entity can be sorted on the 'valid' and 'recorded' axes to find the upper bound of each.
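Roughly something like this - a rough Python sketch with made-up records, showing only the 'valid' axis ('recorded' would be handled the same way):

```python
from datetime import date
from itertools import groupby
from operator import itemgetter

# Hypothetical bank_name records carrying only start times:
# (bank_id, bank_name, valid_from)
records = [
    ("bank-1", "First National", date(2019, 1, 1)),
    ("bank-1", "FN Financial",   date(2022, 6, 1)),
    ("bank-2", "Acme Savings",   date(2020, 3, 15)),
]

def with_implied_valid_to(rows):
    # Each record is implicitly valid until the next record's valid_from
    # for the same entity; the latest record is open-ended (None = infinity).
    rows = sorted(rows, key=itemgetter(0, 2))
    out = []
    for _, group in groupby(rows, key=itemgetter(0)):
        group = list(group)
        for current, nxt in zip(group, group[1:] + [None]):
            out.append((*current, nxt[2] if nxt else None))
    return out

for row in with_implied_valid_to(records):
    print(row)
```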
In the bitemporal model, both system and valid times are half-open intervals, and both the preceding and following interval can either have a different value or no value. Using only start times means that while records can be updated in either time stream, they cannot be logically deleted (in transaction time) or invalidated (in valid time) once they exist. There are databases where this assumption is valid, but in general it is problematic.
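To illustrate on the valid-time axis only (hypothetical records; transaction time would be handled analogously), explicit half-open intervals can express an invalidation that start-times-only cannot:

```python
from datetime import date

# (bank_id, bank_name, valid_from, valid_to); valid_to=None means open-ended.
bank_names = [
    ("bank-1", "First National", date(2019, 1, 1), date(2022, 6, 1)),
    ("bank-1", "FN Financial",   date(2022, 6, 1), date(2024, 1, 1)),
    # After 2024-01-01 the entity simply has no bank_name on record:
    # expressible here, but not with start times alone.
]

def name_as_of(rows, bank_id, as_of):
    # Half-open interval check: valid_from <= as_of < valid_to.
    for bid, name, start, end in rows:
        if bid == bank_id and start <= as_of and (end is None or as_of < end):
            return name
    return None  # invalidated or never recorded

print(name_as_of(bank_names, "bank-1", date(2023, 5, 1)))  # FN Financial
print(name_as_of(bank_names, "bank-1", date(2024, 5, 1)))  # None
```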
Odd seeing this right now for me. I recently implemented a 6NF schema for parsed XBRL files from EDGAR. The architecture was the right call... too bad the data is not useful for analytics.
Looks interesting, but there are few comments on the forum & even a negative vote count ATM. The format kinda looks "old school" in terms of defining records, but I guess that can be a positive in some circumstances?
What does "old school" mean here? Do you want to wrap this format in JSON, like JSON-LD? I don't mind.
I would say it is a niche solution that solves a specific problem.
Modern data sources increasingly lean towards and produce nested and deeply nested semi-structured datasets (i.e. JSON) that are heavily denormalised and rely on organisation-wide entity IDs rather than system-generated referential-integrity IDs (PKs and FKs). That is why modern data warehouse products (e.g. Redshift) have added extensive support for nested data processing – it neither makes sense to flatten/un-nest the nested data, nor is it easy to do anyway.
This is a fairly common problem. Data is often transferred between information systems in denormalized form (tables with hundreds of columns/attributes). In the data warehouse they are normalized (duplication is eliminated by replacing repeated values with references to reference tables) to make it easier to run complex analytical queries against the data. Usually they are normalized to 3NF and very rarely to 6NF, since there is still no convenient tool for 6NF (see my DSL: https://medium.com/@sergeyprokhorenko777/dsl-for-bitemporal-... ). And then the data is denormalized again in data marts to generate reports for external users. All these cycles of normalization - denormalization - normalization - denormalization are very expensive for IT departments. Therefore, I had the idea of transferring data between information systems directly in normalized form, so that nothing further would have to be normalized. The prototypes were the Anchor Modeling and (to a much lesser extent) Data Vault methodologies.
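To give a feel for the decomposition (not the exchange format itself - the attribute names and timestamps below are invented), here is a rough Python sketch of splitting one denormalized row into 6NF / Anchor-Modeling-style per-attribute rows:

```python
from datetime import date, datetime, timezone

# A denormalized source row (hundreds of columns in practice, only a few here):
source_row = {
    "customer_id": "cust-42",
    "name": "Alice Example",
    "email": "alice@example.com",
    "country_code": "DE",
}

valid_from = date(2024, 1, 1)
recorded_at = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)

def to_sixth_normal_form(row, entity_key, valid_from, recorded_at):
    # One "table" per attribute; each row is
    # (entity_id, value, valid_from, recorded_at), and the entity is an anchor.
    entity_id = row[entity_key]
    return {
        attr: [(entity_id, value, valid_from, recorded_at)]
        for attr, value in row.items()
        if attr != entity_key
    }

for attribute, rows in to_sixth_normal_form(
        source_row, "customer_id", valid_from, recorded_at).items():
    print(attribute, rows)
```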
Nice. Anchor Modelling is underappreciated.
Gonna have a look at your DSL.