stux a year ago

As the article notes,

> Redundant conditions are nice because they require no changes to the database [...] This makes them useful for queries that are only sometimes run or where indexes can't be easily added to the main conditions

but where possible I typically prefer adding an index on the expression that the query is using (if your database supports it) since it expresses your intent more clearly. A redundant condition is more likely to get "optimized away" by other developers or changes in the query planner.

  • shrdlu2 a year ago

    Yeah, I would think functional/function-based indexes should be mentioned. Add an index or add a redundant column? Almost all of time I would prefer the index.

    Of course if your database can't be changed then maybe you can use this trick if there is already a suitable indexed column.

karmakaze a year ago

The largest query I wrote was for estimating student loan repayments over a large time period. It solved a problem by running a dense 2 page query for 2 minutes in DB2 on a mainframe. The only way it ran at all was with many heuristic bounding conditions that limited the search space for the more specific conditions, as well as applying the most recent PTFs (patches) to increase query complexity limits. It wasn't however at all maintainable so was legacy code from day one.

berkle4455 a year ago

> In this example, we could use a range scan instead of the YEAR function to obtain the same result.

The range scan will miss all items with a created_at from 2023-12-31 00:00:01 to 2023-12-31 23:59:59

  • aarondf a year ago

    That's true, I forgot to add the time. I'll update! Thanks

  • slt2021 a year ago

    from 2023-12-31 00:00:00.000 to 2023-12-31 23:59:59.999

    • berkle4455 a year ago

      wouldn't '2023-12-31' cast to '2023-12-31 00:00:00.000'? And it's the +1 ms where the cutoff begins?

web3-is-a-scam a year ago

Does this technique generally work with postgres as well?

  • wswope a year ago

    Yeah, TIMESTAMPTZ expressions don’t index nicely because they’re nondeterministic (since they depend on session time zone). So if you’re running heavy time-series queries, you can speed them up by adding a predicate that’ll use a AT TIME ZONE ‘UTC’ expression, which is more easily indexable.

    E.g.:

    WHERE event_timestamp >= NOW() - ‘30 days’::INTERVAL AND (event_timestamp AT TIME ZONE ‘UTC’)::DATE >= NOW() - ‘31 days’::INTERVAL

    (^This particular example probably has a better solution, but I’ve used the same pattern for DATE_TRUNC expressions and seen dramatic improvements.)

    • sa46 a year ago

      > Yeah, TIMESTAMPTZ expressions don’t index nicely because they’re nondeterministic

      Could you expand on why it's not deterministic? I thought timestamptz was 8-byte microseconds since the epoch. Is the problem that the timestamptz uses the server timezone?

      • wswope a year ago

        Yeah, you've got the right idea. Specifically, functions that operate on TIMESTAMPTZ aren't typically marked as IMMUTABLE (because their output implicitly depends on the connection-level time zone setting), and that IMMUTABLE flag is required for functions used in indexed expressions.

        https://www.postgresql.org/docs/current/xfunc-volatility.htm...

    • luhn a year ago

      You are correct that date_trunc will not utilize an index. To the query planner that's an opaque function; it has no idea how to translate that to an index condition.

      However, the example you gave works just fine. It's all just UTC in backend and very much deterministic.

          postgres=# create table timetest as select t.time, 'hi' as foo from generate_series('2000-01-01'::timestamptz, '2010-01-01'::timestamptz, '5 minutes'::interval) t(time);
          SELECT 1052065
          postgres=# create index on timetest(time);
          CREATE INDEX
          postgres=# explain analyze select * from timetest where time > now() - '30 days'::interval;
          ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
          │                                                         QUERY PLAN                                                          │
          ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
          │ Index Scan using timetest_time_idx on timetest  (cost=0.43..4.45 rows=1 width=11) (actual time=0.019..0.019 rows=0 loops=1) │
          │   Index Cond: ("time" > (now() - '30 days'::interval))                                                                      │
          │ Planning Time: 1.163 ms                                                                                                     │
          │ Execution Time: 0.081 ms                                                                                                    │
          └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
      
      If you query frequently with date_trunc, you may want to use an expression index:

          postgres=# create index on timetest(date_trunc('day', time at time zone 'US/Pacific'));
          CREATE INDEX
          postgres=# explain analyze select * from timetest where date_trunc('day', time at time zone 'US/Pacific') > now() - '30 days'::interval;
          ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
          │                                                            QUERY PLAN                                                             │
          ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
          │ Index Scan using timetest_date_trunc_idx on timetest  (cost=0.43..4.45 rows=1 width=11) (actual time=0.075..0.076 rows=0 loops=1) │
          │   Index Cond: (date_trunc('day'::text, timezone('US/Pacific'::text, "time")) > (now() - '30 days'::interval))                     │
          │ Planning Time: 0.379 ms                                                                                                           │
          │ Execution Time: 0.124 ms                                                                                                          │
          └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
      
      Note that to make this work you have to convert "time" to a timestamp (no tz), because date_trunc with timestamptz is not an immutable function and cannot be indexed—I believe this is what you may have been thinking of when you say that timestamptz is non-deterministic.
      • wswope a year ago

        Yeah, crappy example on my end - thank you for fleshing out the details ;).

    • aarondf a year ago

      That's a great use case!

  • mdpye a year ago

    I haven't checked if this works in mysql, but in postgres I would certainly reach for an index on the computed values, which it supports.

    Index the thing you're querying on...

    • aarondf a year ago

      Totally agree! From the article:

      > The optimal indexing strategy always depends on the application, but in general, it's best to have indexes on the conditions you are frequently querying against. [...] This makes redundant conditions useful for queries that are only sometimes run or where indexes can't be easily added to the main conditions.

    • berkle4455 a year ago

      MySQL 8+ also supports functional indexes

deleugpn a year ago

At this point, doesn't it make more sense to just throw away the original where condition and use only the redundant condition? If it's really redundant and makes no different to the result, why keeping the inefficient clause be useful once the new/better clause has been added?

  • zerocrates a year ago

    The strategy here is that you have a real condition which isn't index-compatible, so to improve performance, you add on an extra condition which is index-compatible and is... in the ballpark of the condition you want, specifically containing your entire real desired result set, plus some spillover where it's not quite filtering tightly enough. The idea being that the broader condition that uses the index will get done first, then the slow row-scan will happen over a much smaller set of rows: same results but better performance.

    But, because the index-compatible condition is too broad, you still need to have the "original" one to get the correct results. If your new, "redundant", condition gets the exact correct results then you're not doing this pattern, you're just replacing a query that can't use an index with one that can, and in that case sure, it doesn't make sense to keep the old one around.

  • aarondf a year ago

    The redundant condition is broader than the real condition. Removing the real condition would allow false positives in