FridgeSeal a year ago

As someone who works in the data-{engineering, science, etc.} space, I really don’t understand the community’s obsession with unwieldy tools like DBT.

It’s like there’s some obsession with ignoring any kind of established practice from wider software engineering, or innovation in the PL theory space, in favour of…gluing SQL together using string templates?

It seems like, for the most part, there’s very little innovation or progress in the space. It’s just 15 variations on a theme: configure some YAML for your Jinja-interpolated strings to dump ever more data into that cost-centre black hole known as a “modern data lake”.

I’m sure there’s some interesting things going on in small corners, but it’s difficult to find, and if it exists, it’s being studiously ignored by mainstream tooling.

  • lunatuna a year ago

    We saw a big improvement in productivity and visibility when we introduced dbt along with Snowflake. The concepts of DataOps became real. We've been able to work a lot more like software developers, to our advantage.

    I didn't get it either. I thought it just looked like SQL with some complicated additions. Then I did some work with it and the flow got way better. We could work a lot more incrementally, and as a larger team rather than as individuals tripping all over each other. Maybe there's a better way of doing something else, but this clicked with the team.

    Not sure I get your cost-centre black hole comment, but for the most part storage is so cheap that it isn't a concern. ELT has made it pretty easy for us to spend less effort on the data-gathering side, which is low value, and more on the modeling side, which is higher value. Further, analysts can play with the raw data and find what they want, and if they like it they'll give it to us to model in a formal way. No need to work with an ETL team and application team to bring the data over. In ye olden times this was overthought and created friction: "If you don't know exactly what you're going to do with the data, how am I going to know what to bring in and how to land it?" "But how am I to know what to do with the data if I can't query it and see what opportunities there are?"

    On the unwieldy part, it sounds like you've had a bad experience. Ours has been the opposite. Really easy to set up and use. An easy incremental step for the team to get their heads around. Integration with Git and Snowflake was really easy. I'm still a bit worried that something is going to get really messed up with the declarative approach that dbt has, but we've not seen any issues yet.

    I've been curious as well about what more can be done to make the development lifecycle better. Most of the other "big data" approaches that I've worked with seemed to make it worse, not better. A lot more complexity, not thinking in sets, layers and layers of things I don't care about that can go wrong. I can't believe I'm still working with SQL.

    • FridgeSeal a year ago

      > I've been curious as well about what more can be done to make the development lifecycle better. Most of the other "big data" approaches that I've worked with seemed to make it worse, not better

      Interesting. What other approaches and tools are you thinking of that were worse/failed/etc.?

      • lunatuna a year ago

        I've worked with both Hadoop/Spark and Greenplum: Hadoop for pre-processing and Greenplum for analytics. There was way too much time spent on the platform and optimizations. The use of both didn't improve our workflow either; it was substantially more complex, and looking back I think a full rack of Oracle Exadata would have been better. This was a bit over 10 years ago, and if I could do it all today I would stick with what we are doing now with dbt and Snowflake.

        All the promises of the power of these platforms were lost in the additional effort of managing the complexities. Developer productivity, for the work that I've been doing, is way more important than hw/sw costs. I can't think of any time when this trade went the other way, where focusing on sw/hw costs rather than developer time worked out better.

  • mythhouse a year ago

    > I really don’t understand the community’s obsession with unwieldy tools like DBT.

    It lets me write test-first SQL transforms. I never thought TDD SQL would be possible. My SQL is so much more readable with common logic extracted into ephemeral models. I apply the same methods I use to write clear code to writing SQL, e.g. too many mocks = refactor into a separate model (class).

    I think DBT made this possible with refs that can be swapped out with mocks. This is the awesome library I am using: https://github.com/EqualExperts/dbt-unit-testing
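
    To give a flavour, here's a minimal sketch in the style of that library's README (the model and column names are made up): you declare mocks for the refs your model reads from, plus the expected output, and the library swaps the mocks in when compiling the model under test.

        {% call dbt_unit_testing.test('customer_totals', 'sums order amounts per customer') %}
          {% call dbt_unit_testing.mock_ref('stg_orders') %}
            select 1 as customer_id, 10 as amount
            union all
            select 1 as customer_id, 15 as amount
          {% endcall %}
          {% call dbt_unit_testing.expect() %}
            select 1 as customer_id, 25 as total_amount
          {% endcall %}
        {% endcall %}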

    • akhmatova a year ago

      > My SQL is so much more readable with common logic extracted into ephemeral models.

      Sounds great. My concern though with DBT is the cognitive overhead and just plain ... weirdness and clunkiness it seems to introduce as a cost to obtaining benefits like these.

      For example if we look at their own documentation:

      > When using the view materialization, your model is rebuilt as a view on each run, via a create view as statement.

      https://docs.getdbt.com/docs/build/materializations

      Now, isn't your reaction, upon reading this, some version of: "Whaaat? Views are not materializations - by definition, they are essentially the opposite of materializations."

      And yet here the DBT documentation is describing the results of plain old "create view" statements as materializations.
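
      To spell out what that means mechanically (the model and schema names here are invented for illustration), a model file like:

          {{ config(materialized='view') }}

          select customer_id, sum(amount) as total
          from {{ ref('stg_orders') }}
          group by customer_id

      just gets wrapped and executed as something like:

          create view analytics.customer_totals as (
            select customer_id, sum(amount) as total
            from analytics.stg_orders
            group by customer_id
          );

      So the "materialization" is nothing more than the DDL wrapper dbt puts around your select.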

      And then we're not getting into even more basic issues - like why we're supposed to believe that code like:

          select
          ...
          from {{ source('jaffle_shop', 'orders') }}
          left join {{ source('jaffle_shop', 'customers') }} using (customer_id)
      
      Is more readable or maintainable than what it compiles into:

          select
          ...
          from raw.jaffle_shop.orders
          left join raw.jaffle_shop.customers using (customer_id)
      
      I'm happy it works for you. But I'm just not seeing the magic here.

      • mb7733 a year ago

        > Now, isn't your reaction, upon reading this, some version of: "Whaaat? Views are not materializations - by definition, they are essentially the opposite of materializations"

        I'd never heard of dbt before today, but from a quick scan of that page it's clear to me what a 'materialization' is in dbt. It's not the same as a SQL 'materialized view'; it could be a SQL view, a SQL table, a CTE, or a table with extra machinery for incremental updating.

        In my experience in software it's common for similar words to be used for different concepts in different contexts. I think the dbt documentation is well done so I don't see a big issue.

        • akhmatova a year ago

          > In my experience in software it's common for similar words to be used for different concepts in different contexts.

          Right, but when it happens it's usually because these words get used differently in different domains.

          What's happening here is that they're carelessly (in my view) mangling a well-established concept in a field (speaking of the SQL and database world) where folks are (notoriously) careful about word choice and about what things actually, in fact, mean.

          It just feels weird, and on a certain gut level - koolaid-y.

          • FridgeSeal a year ago

            The whole hype around dbt and snowflake feels koolaid-y to me.

            They’re products that have inexplicably captured so much attention for so little value, and now strut around acting like their approach is the only true way of doing things.

    • FridgeSeal a year ago

      > It lets me write test-first SQL transforms. I never thought TDD SQL would be possible

      I went through the library you linked, but I’m really, really not convinced. This just looks like a messier, more unclear way to test some SQL. A feat you could accomplish with some Spark code (for example) in a significantly less confusing and boilerplate-y manner.

      I think the benefit is that dbt appears to give some cleaner way to separate and then re-compose SQL. Which is good, we do need that. However, it does it in the most awkward and unwieldy way possible (string templating), and yet it’s heralded as “the true way forward”.

      • mythhouse a year ago

        Sure, a full-fledged programming stack will obviously have more mature TDD facilities. This is what you sign up for if you choose to write SQL.

        SQL vs Spark is probably a discussion for another time :0

        > This just looks like a messier, more unclear way to test some SQL.

        As a software engineer I've written my own share of unit tests. They are supposed to be documentation of how a unit works, but it rarely ever plays out that way.

        I found that to hold much more true for CSV tests for SQL

        https://github.com/EqualExperts/dbt-unit-testing#different-w...

        because they are always in the same format: input CSV -> output CSV.

        I am more motivated to look at the tests to see what the SQL is doing. It's pretty straightforward.
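
        For illustration, a CSV-style mock and expectation look roughly like this (following the pattern in that README; the model and column names are made up):

            {% call dbt_unit_testing.mock_ref('stg_orders', options={"input_format": "csv"}) %}
              customer_id, amount
              1, 10
              1, 15
            {% endcall %}
            {% call dbt_unit_testing.expect(options={"input_format": "csv"}) %}
              customer_id, total_amount
              1, 25
            {% endcall %}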

  • vaughan a year ago

    People don't have enough time to step back from their tools and think about what the ideal thing would look like.

    I'm convinced this entire space should be visual. We always visualize data pipelines in our minds and on whiteboards... well, visually. Two-way code-diagram syncing is needed to allow the best of both worlds. We should serialize to yaml from imperative code, but allow manipulating the diagram to modify the imperative code.

    Ideally what you want is also to track the dependencies of every atom of data in your org, and then have something cache and incrementally compute updates.

    The biggest thorn in this vision is SQL and the relational model, and I don't think a lot of people realize it.

    It favors representing data such that the query planner can optimize query execution... rather than being able to track data dependencies and visualize data flow. It wasn't designed with the complex ETL pipelines and many external data sources of today's world in mind.

    • taeric a year ago

      As someone that is completely non-visual, I personally don't buy it. Trying to keep things visual will almost certainly choke on all of the extra information that comes with being visual.

      I think it is easy to lob complaints at SQL. The problem is every successor seems worse. Teams I've been close to lately have taken the Python pill. Pandas, they push, is so much easier to understand. This works to an extent. But where we once dodged large SQL files, we now have giant Python files. Worse, they are trying to engineer things in "logical" ways for code sharing.

      What I'm seeing is that they've moved all of the complexity of most ETL engines into each Python file. This feels great at the start, but it's like how every todo-list project looks clean in basically any language. Ramp up the variety and size of data sources, and get ready for that clean idea to turn into the mess it thought it prevented.

      For my two cents, if your ETL setup doesn't factor in Conway's law and who owns different parts of the data, it will fail in time. Trying to hide split ownership of data is a sure sign that changes will get misunderstood and data will be lost or misused.

      • vaughan a year ago

        > all of the extra information that comes with being visual

        I think the key is to not make everything visual. You want to be able to view your system as you would draw it on a whiteboard. I believe it could be done using a restricted syntax of a proglang.

        > Teams I've been close to lately have taken the python pill...

        Most people have a desired result and want it as fast as possible.

        When imperatively using map, groupBy, filter, you can usually get the immediate result you want very quickly. This is the appeal of Pandas, Python notebooks, etc.

        But when someone asks "I want to re-compute on change", which pretty much applies to every task, their quick solution is not optimal.

        And then if the schema changes, and you're doing all this intermediary stuff...then refactoring is next to impossible.

        Any SQL query could also be written in an imperative prog lang, and it's probably easier to understand, tweak, and trace. A SQL relation does not maintain any info about its source, which then makes incremental updates difficult, and also refactoring.

        You can basically ask the question: for every piece of data you see in your company, can you easily see the "data lineage" from its source and all the transformations it was involved in, and how clean and efficient is this pipeline when receiving new data. Then a step further is, how easy is it to change something.

        • taeric a year ago

          The problem with "as you would draw it on a whiteboard" is that much of what makes whiteboard drawings useful is how much you omit.

          Such that I agree being able to visualize things is nice. But don't confuse the visualization with the system, a mistake many visual tools push you towards.

          I'm pretty much agreed on most of what you wrote, though.

          • vaughan a year ago

            > But don't confuse the visualization with the system.

            I think this is where people have got it wrong in the past.

            This tool is an example of something in the right direction: https://www.ilograph.com/

            It's just completely disconnected from code.

            • taeric a year ago

              Agreed that I think that is where folks go wrong.

              And the very valid counter claim from folks that do some of this is that there needs to be some mechanism to keep things synchronized between the visualization and the execution.

    • FridgeSeal a year ago

      > I'm convinced this entire space should be visual.

      At my last 2 jobs I spent entirely too much time debugging Matillion jobs, which are visual. I have my doubts that it’s the panacea that it appears to be.

      That said, you may find Enso particularly interesting: https://github.com/enso-org/enso

      • vaughan a year ago

        Thanks for the link. Enso looks very cool. The traceability aspect is nice. I'd never want to use a visual interface to "code", though, like they show in their intro video. They have the two-way code-diagram sync, which is cool, and it's to an imperative language, but it's their own strange programming language. I just can't see myself coding in Enso. If it were something familiar like Python or JS, then we'd be talking!

    • debarshri a year ago

      My first job was building pipelines in Informatica's ETL tool and Pentaho. As far as I can remember, it used to be very visual back in the day. Then Hadoop and large-scale pipelines came into the picture, and all of a sudden these pipelines were super complex to manage, nobody knew what was happening, and we were riding the big-data hype. With a scripting language, it is easy to test out these pipelines and showcase value before deploying them. Secondly, I think there is some sense of satisfaction or accomplishment when people script a pipeline rather than visually building it, even if the latter might be more efficient in some cases. So visual representation is a sexier pitch, but it might be a hard sell.

    • lukesingham a year ago

      > I'm convinced this entire space should be visual. We always visualize data pipelines in our minds and on whiteboards... well, visually. Two-way code-diagram syncing is needed to allow the best of both worlds

      I'm also convinced by this! Currently building something like this at https://www.ilumadata.io/. We have a pipeline builder (not yet on our website) that lets you visually build out the data pipeline. At the same time, it creates the corresponding files for each node, which you can switch to.

    • mr_gibbins a year ago

      No, that isn't correct. The query planner optimises query execution in a layer of abstraction below the representation of the tables, the relational integrity and structure of the user-accessible objects.

      Tables, their attributes, relationships etc. are easily represented visually. Entity-Relationship Diagrams have standard notation (Chen, Crow's foot, UML). We can represent models conceptually, logically or physically; we can even indicate the existence of views and indexes. In terms of queries, DML is essentially multi-filtered, multi-dimensional slicing of tables and can be represented as output sets which are as modellable as the source tables.

      On the wider point, DBT looks promising. It seems to be a good middle ground between the purists (like me) who wish back-end devs would just learn how to model data, and the adventurists who prefer to lock everything behind OOP principles.

      • vaughan a year ago

        The query planner can always change how it runs and is a black box. Nodes in the plan thus cannot be cached. This means that for streaming we are usually re-running the entire query, or doing some custom stuff.

        When a source table changes, I want it to automatically and efficiently update anything that would change. I think pretty much every system would prefer real-time stuff like this.

        If you are doing this with SQL you will start looking into "Incremental View Maintenance" which is quite complex and still quite heavy.

        Then you realize that if you take control instead of handing it off to the planner, you could code your joins and transformations by hand, cache intermediary steps as needed, and have a query that is more efficient than SQL.

        But for this I would argue you need a better way to visualize your data flow and the dependency graph, because people can easily write slow imperative stuff.

  • AnEro a year ago

    I'm all in on DBT for several reasons:

    1. One source of transforms for static data

    2. Easy to do plug-ins or changes; even 'materializations' are Jinja, so I can have a custom implementation for a specific issue

    3. DBT-to-Metabase documentation; still super new, but imagine other tools taking in metadata and metrics from dbt

    4. I'm the data team, managing a huge number of workflows. If I just write queries as selects rather than DML, it will 'orchestrate' them for me.

    4.1. I can have a quick, version-controlled workflow for a new report/KPI/metric that we don't yet know is even important

    5. The speed trade-offs are made up for by the time saved from point 4

    6. Embeddable into the data mesh I dream of: no one knowing I use DBT or any software at all, it's all just [company branding] magic

    • FridgeSeal a year ago

      So if I’m understanding this correctly, what dbt is giving you is really just a straightforward, reliable way to get SQL executed? You can write some queries, it’ll make sure they’re run (possibly on a schedule or in a dependency flow), it has a bunch of integrations, and it gives some handy lineage metadata. Is that about it?

      This whole thread really just makes me want to write an alternative to dbt that does this stuff, but better. I’m convinced there must be better solutions to sql composability than string interpolation and too-many-yaml-files.

      • tomnipotent a year ago

        > write an alternative to dbt that does this stuff, but better

        How are you going to do that when your comments betray that you barely know what dbt does in the first place?

        > solutions to sql composability

        dbt isn't a solution to SQL composability, but it offers something similar in spirit because of how its DAG-based processing works.

        • FridgeSeal a year ago

          > How are you going to do that when your comments betray that you barely know what dbt does in the first place?

          God forbid I ask questions to fill in my knowledge gaps. “Barely know” is a bit strong; I understand what it is, I just don’t personally see the value proposition. The question about composability is hence a proxy: if I don’t see the value prop, I need to understand what the underlying benefit is from people who do value it. To me it just looks like some shiny bells and whistles on top of gluing a bunch of sql execution together. Other products have failed at that, so it’s got to be slightly more than that. The other thing it offers is composition via string templates, and seemingly, some level of orchestration as a direct result. Sometimes you have to ask “dumb looking” questions to make sure you don’t miss things.

          • tomnipotent a year ago

            You're not asking questions, but making statements about why you think dbt isn't useful.

            > To me it just looks like some shiny bells and whistles on top of gluing a bunch of sql execution together

            So is Pentaho, Matillion, Tableau Prep, and most tooling that enables this kind of work on an RDBMS. At the end of the day, they're all just generating SQL strings from (usually proprietary) metadata. dbt lets me skip the UI bullshit and focus on soup-to-nuts data modeling in pure SQL, and doesn't try to hide the fact that we're working with an RDBMS or that we're spitting out SQL strings via Jinja.

            > The other thing it offers is composition via string templates

            Here's an example of one of those statements. What dbt does do is provide the ability to generate dynamic SQL using a mustache-style template syntax via Python/Jinja, giving SQL authors the ability to further exploit that combination to aid in dynamic SQL generation on an as-needed basis. You can use as much or as little dynamic generation as you want; it's up to the SQL author to be disciplined in how it's applied.

            Most models I've seen have little-to-no dynamic SQL, outside of incremental load predicates in the WHERE clause. The few instances I can think of top-of-mind are using things like information_schema to dynamically generate lists of columns for self-updating models.
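
            For example, the canonical incremental predicate looks like this (standard dbt pattern; the source/column names are placeholders):

                {{ config(materialized='incremental') }}

                select event_id, user_id, event_ts
                from {{ source('app', 'events') }}
                {% if is_incremental() %}
                -- on incremental runs, only pull rows newer than what's already in the target
                where event_ts > (select max(event_ts) from {{ this }})
                {% endif %}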

            What you're thus "composing" is a DAG of views/tables generated via dynamic SQL. It would not be accurate to say you're composing an individual SQL model, since you're really just stitching strings together. This is an important distinction, because "composition" is a terribly overloaded term in software and in this context might lead the reader to believe dbt is trying to re-invent SQL via a DSL that supports composition, which is clearly not the case.

            You won't even find the word "compose" in the dbt docs outside of references to Docker.

            > some level of orchestration as a direct result

            Each dbt model becomes either a view or a table, and dbt SQL authors can build models using other models as if they were existing views/tables. Under the hood, dbt parses models pre-exec to build a DAG and determine what order views/tables need to be created in. The SQL author doesn't need to do anything more than include the model via the {{ ref(model_name) }} macro in a template.
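
            Concretely (model names invented for illustration): if one model selects from another via ref(), dbt knows to build the upstream model first.

                -- models/stg_payments.sql
                select order_id, amount
                from {{ source('stripe', 'payments') }}

                -- models/order_totals.sql: the ref() below puts stg_payments upstream in the DAG
                select order_id, sum(amount) as total
                from {{ ref('stg_payments') }}
                group by order_id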

            dbt is about embracing the database and exploiting it to the best of its capabilities, rather than delegating to an intermediary to fill in the missing pieces. I don't need Java to split a string or concat an array of values in 2022, when most vendors now have incredibly robust built-in functions or UDF support. I don't need Airflow to define workflow order. I don't need to bug IT to stand up infra to run software they have to manage to support my workloads.

            What you get is:

            - Project structure and conventions for a team to collaborate on SQL-based data models

            - Tooling that ensures data & transformations don't need to leave the database

            - Tooling to generate SQL and execute statements in the order of dependencies, including parallelization

            - Built-in support for incremental loading and snapshots

            - Ability to control how models are managed (view vs. table) and optimize as needed

            - Conventions for enriching metadata via YAML (such as for docs)

            I currently have a small 2TB database on Snowflake with ~100 models that I can rebuild from scratch in about ~5 mins. I had to spend more time learning Snowflake syntax than I did having to learn or support dbt.

  • panda888888 a year ago

    You have to go back a step, to ETL vs ELT. ELT is powerful because you're doing the transformations as the final step, so you can quickly ingest the data and then transform it however you want and in different ways. While you may not love DBT, the innovative thing here is that the IT department no longer needs to do this work (this is in contrast to ETL). A data analyst or business analyst or anyone who knows SQL can now do it. So if you're a software engineer, you're probably not the target audience for DBT. I agree with your criticism that the tool itself isn't necessarily that innovative, but if you think of DBT as version control/templating for people who aren't already software engineers, it's easier to see why it's useful.

    • FridgeSeal a year ago

      > You have to go back a step, to ETL vs ELT. ELT is powerful because you're doing the transformations as the final step, so you can quickly ingest the data and then transform it however you want and in different ways

      No, I get it: land your data, apply transforms, land it again, repeat. I’m already following this approach, so maybe the value prop of dbt is lost on me.

      My concern is that I’ve worked in places where the proverbial “reins” of “do whatever” with the data were handed off to anyone who asked for them, and I was involved in trying to fix the inevitable mess that was created. There’s zero coherency, so 8 different people create 5 different, incompatible reports for the same thing, mishandled money and date values everywhere, etc. That’s not an experience I wish to relive any time soon.

      • panda888888 a year ago

        That sounds like a data governance problem that goes above and beyond the things that DBT does. The typical way to handle this is by implementing a gold/silver/bronze architecture.

  • jdub a year ago

    The whole point of dbt is to bring software engineering principles to the handling of data for analytics. It's still SQL (as are all the modern data warehouses) because nothing else has beaten it.

    • FridgeSeal a year ago

      I’ve been using software engineering principles in my data work (and plenty of SQL) and haven’t had a need for dbt.

      SQL is amazing, I’m not convinced string templates and this one tool offer enough value to be worth the hype they receive.

  • alexyz12 a year ago

    What do you use instead?

    • datalopers a year ago

      regular ole views and mviews work just fine and provide the same composability.
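
      A sketch of what I mean, in Postgres syntax with made-up names: views layer on views, and an mview caches the expensive step.

          create view clean_orders as
            select order_id, amount, created_at
            from raw_orders
            where status = 'complete';

          create materialized view daily_revenue as
            select date_trunc('day', created_at) as day, sum(amount) as revenue
            from clean_orders
            group by 1;

          -- refresh on whatever schedule you like
          refresh materialized view daily_revenue;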

      • EDEdDNEdDYFaN a year ago

        And how do you manage the versioning and deployment of your views and mviews? What tooling do you use to refresh those materialized views on a schedule?

        Is it something you've written and maintain yourself? What testing do you have on them?

        I find that dbt answers all of these questions pretty simply in a way that avoids the need to reinvent the wheel. It's not some godlike piece of software, just a helpful framework wrapper around plain old SQL.

        • FridgeSeal a year ago

          > And how do you manage the versioning and deployment of your views and mviews?

          Underlying SQL lives in git, along with the code. Deployment happens via normal CI/CD pipelines.

          > What tooling do you use to refresh those materialized views on a schedule?

          A database that automatically rebuilds views when the underlying table changes (ClickHouse), or, if I’m using Postgres, a “dumb” job that simply updates the views periodically.
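
          For the ClickHouse case, a rough sketch (table/column names invented): the materialized view is fed incrementally as rows are inserted into the source table, so there's no separate refresh job to run.

              -- ClickHouse: each insert into `events` also writes into this view
              create materialized view events_per_day
              engine = SummingMergeTree
              order by day
              as select toDate(ts) as day, count() as views
              from events
              group by day;
              -- note: reads should sum(views) until background merges collapse partial rows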

          > Is it something you've written and maintain yourself? What testing do you have on them?

          Written and maintained myself? Not really, it’s either Spark or Rust code that’s a very thin wrapper to drive the underlying SQL. Everything runs in Kubernetes, as either plain deployments (for ongoing/streaming workloads) or cronjobs for periodic workloads. SQL and wrapper code are tested using the normal in-language testing tools.

          > I find that dbt answers all of these questions pretty simply in a way that avoids the need to reinvent the wheel.

          I don’t think people not using dbt are reinventing the wheel, in my case I’m not doing anything that wouldn’t be considered exceedingly boring in architecture and usage.

          Avoiding weird proprietary tools also means that the technology and approach used is almost identical to what the devs use, which means more shared knowledge and better reusability.

      • Helmut10001 a year ago

        I've loved mviews since discovering them! So easy to split complex queries into multiple smaller ones, which are then queried sequentially through mviews.

        • epgui a year ago

          Sounds just like dbt, but less organized and without jinja templates.

          • adammarples a year ago

            People who don't use dbt usually have a half-baked implementation they wrote themselves in Python.

            • Helmut10001 a year ago

              I write my SQL queries in Jupyter Lab, mixed with Python code and f-strings, which is enough of a template engine for my needs atm. I can also directly visualize stuff with pandas (etc.)

                  species: str = 'acer'
                  structure: str = 'trees.table'
                  # string literals need quoting (better: use parameterized queries)
                  query = f"""
                  SELECT * FROM {structure} WHERE species = '{species}'
                  """
b-luu a year ago

Thanks for sharing!!

Author here: yeah, I ended up insisting a bit too much on dbt as a basis, although the method I'm describing in the article can definitely be applied to any kind of modeling framework (or lack thereof).

Anyway, please let me know what you think of the general method and the bottom-up vs top-down approach ;-)

  • sails a year ago

    Good detailed breakdown of how to use dbt, but I'm not sure about the data modelling part (the specific approach suggested), as it seems possibly contradictory:

    > In contrast, the approach I want to outline here is primarily bottom-up, in order to:

    - deliver value and insights as quickly as possible

    - not pile up too much (data modeling) technical debt along the way.

    > The goal will be to aim towards either:

    - a dimensional model/star schema

    - or a limited collection of "One Big Tables"

    (or any other top-down theory you’re most comfortable with...)

    I get (and agree) that you want to start off organic and get things moving with some integrations. But I quite strongly feel that the suggested approach (which feels to me like: just start with the dbt part, don't worry about the final design) will likely end up in a technical-debt mess, especially if you don't decide _up front_ what capabilities you, your team, and your client have, and choose an approach based on that.

    My suggestion would be, unless otherwise advised, to just adopt the dimensional model, put that in the hands of your users early, and start getting feedback. In the meantime you can do exactly what is described, but with a feedback loop kicked off much earlier than you would have otherwise.

    edit - maybe contradictory is not correct, the suggestion stands though :)

    • b-luu a year ago

      Thanks for your suggestion.

      I have nothing against dimensional modeling per se, just the upfront effort (and initial feedback lag) that it generally brings with it in the beginning.

      Another issue I've observed with teams trying to formalize their modeling "too" soon is confusion in the models created. I find that it's sometimes difficult for teams to understand what should be in each model, how to separate the different dimensions, etc. Hence my emphasis on starting naively in the beginning and then aiming for that dimensional model when things start becoming clearer...

      • sails a year ago

        Agree it is very sensitive to the context.

        The outcomes are a mixed bag, but my biggest failure was on a project that started naively with OBT and waited for things to become clearer. It did indeed soon become clear that we should have taken a more methodical approach, as we had the dual problem of huge dependencies and a very brittle data model!

        (I don't think there is an easy answer here btw, I am hoping for someone to describe a detailed and relatively generalisable playbook, hopefully soon!)