Using Redundant Conditions to Unlock Indexes in MySQL

stux 2 years ago

As the article notes,

> Redundant conditions are nice because they require no changes to the database [...] This makes them useful for queries that are only sometimes run or where indexes can't be easily added to the main conditions

but where possible I typically prefer adding an index on the expression that the query is using (if your database supports it) since it expresses your intent more clearly. A redundant condition is more likely to get "optimized away" by other developers or changes in the query planner.

shrdlu2 2 years ago

Yeah, I would think functional/function-based indexes should be mentioned. Add an index or add a redundant column? Almost all of time I would prefer the index.
Of course if your database can't be changed then maybe you can use this trick if there is already a suitable indexed column.
- aarondf 2 years ago
  
  Yeah I could definitely add some links to or descriptions of functional indexes.
  If you're curious, here's a video I did on it: https://planetscale.com/courses/mysql-for-developers/indexes...

karmakaze 2 years ago

The largest query I wrote was for estimating student loan repayments over a large time period. It solved a problem by running a dense 2 page query for 2 minutes in DB2 on a mainframe. The only way it ran at all was with many heuristic bounding conditions that limited the search space for the more specific conditions, as well as applying the most recent PTFs (patches) to increase query complexity limits. It wasn't however at all maintainable so was legacy code from day one.

inkeddeveloper 2 years ago

All code is legacy code from day 1.

berkle4455 2 years ago

> In this example, we could use a range scan instead of the YEAR function to obtain the same result.

The range scan will miss all items with a created_at from 2023-12-31 00:00:01 to 2023-12-31 23:59:59

aarondf 2 years ago

That's true, I forgot to add the time. I'll update! Thanks
slt2021 2 years ago

from 2023-12-31 00:00:00.000 to 2023-12-31 23:59:59.999
- berkle4455 2 years ago
  
  wouldn't '2023-12-31' cast to '2023-12-31 00:00:00.000'? And it's the +1 ms where the cutoff begins?

web3-is-a-scam 2 years ago

Does this technique generally work with postgres as well?

wswope 2 years ago

Yeah, TIMESTAMPTZ expressions don’t index nicely because they’re nondeterministic (since they depend on session time zone). So if you’re running heavy time-series queries, you can speed them up by adding a predicate that’ll use a AT TIME ZONE ‘UTC’ expression, which is more easily indexable.

E.g.:

WHERE event_timestamp >= NOW() - ‘30 days’::INTERVAL AND (event_timestamp AT TIME ZONE ‘UTC’)::DATE >= NOW() - ‘31 days’::INTERVAL

(^This particular example probably has a better solution, but I’ve used the same pattern for DATE_TRUNC expressions and seen dramatic improvements.)

sa46 2 years ago

> Yeah, TIMESTAMPTZ expressions don’t index nicely because they’re nondeterministic
Could you expand on why it's not deterministic? I thought timestamptz was 8-byte microseconds since the epoch. Is the problem that the timestamptz uses the server timezone?
- wswope 2 years ago
  
  Yeah, you've got the right idea. Specifically, functions that operate on TIMESTAMPTZ aren't typically marked as IMMUTABLE (because their output implicitly depends on the connection-level time zone setting), and that IMMUTABLE flag is required for functions used in indexed expressions.
  https://www.postgresql.org/docs/current/xfunc-volatility.htm...

luhn 2 years ago

You are correct that date_trunc will not utilize an index. To the query planner that's an opaque function; it has no idea how to translate that to an index condition.

However, the example you gave works just fine. It's all just UTC in backend and very much deterministic.

    postgres=# create table timetest as select t.time, 'hi' as foo from generate_series('2000-01-01'::timestamptz, '2010-01-01'::timestamptz, '5 minutes'::interval) t(time);
    SELECT 1052065
    postgres=# create index on timetest(time);
    CREATE INDEX
    postgres=# explain analyze select * from timetest where time > now() - '30 days'::interval;
    ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
    │                                                         QUERY PLAN                                                          │
    ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
    │ Index Scan using timetest_time_idx on timetest  (cost=0.43..4.45 rows=1 width=11) (actual time=0.019..0.019 rows=0 loops=1) │
    │   Index Cond: ("time" > (now() - '30 days'::interval))                                                                      │
    │ Planning Time: 1.163 ms                                                                                                     │
    │ Execution Time: 0.081 ms                                                                                                    │
    └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

If you query frequently with date_trunc, you may want to use an expression index:

    postgres=# create index on timetest(date_trunc('day', time at time zone 'US/Pacific'));
    CREATE INDEX
    postgres=# explain analyze select * from timetest where date_trunc('day', time at time zone 'US/Pacific') > now() - '30 days'::interval;
    ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
    │                                                            QUERY PLAN                                                             │
    ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
    │ Index Scan using timetest_date_trunc_idx on timetest  (cost=0.43..4.45 rows=1 width=11) (actual time=0.075..0.076 rows=0 loops=1) │
    │   Index Cond: (date_trunc('day'::text, timezone('US/Pacific'::text, "time")) > (now() - '30 days'::interval))                     │
    │ Planning Time: 0.379 ms                                                                                                           │
    │ Execution Time: 0.124 ms                                                                                                          │
    └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Note that to make this work you have to convert "time" to a timestamp (no tz), because date_trunc with timestamptz is not an immutable function and cannot be indexed—I believe this is what you may have been thinking of when you say that timestamptz is non-deterministic.

wswope 2 years ago

Yeah, crappy example on my end - thank you for fleshing out the details ;).

aarondf 2 years ago

That's a great use case!

mdpye 2 years ago

I haven't checked if this works in mysql, but in postgres I would certainly reach for an index on the computed values, which it supports.
Index the thing you're querying on...
- aarondf 2 years ago
  
  Totally agree! From the article:
  > The optimal indexing strategy always depends on the application, but in general, it's best to have indexes on the conditions you are frequently querying against. [...] This makes redundant conditions useful for queries that are only sometimes run or where indexes can't be easily added to the main conditions.
- berkle4455 2 years ago
  
  MySQL 8+ also supports functional indexes

deleugpn 2 years ago

At this point, doesn't it make more sense to just throw away the original where condition and use only the redundant condition? If it's really redundant and makes no different to the result, why keeping the inefficient clause be useful once the new/better clause has been added?

zerocrates 2 years ago

The strategy here is that you have a real condition which isn't index-compatible, so to improve performance, you add on an extra condition which is index-compatible and is... in the ballpark of the condition you want, specifically containing your entire real desired result set, plus some spillover where it's not quite filtering tightly enough. The idea being that the broader condition that uses the index will get done first, then the slow row-scan will happen over a much smaller set of rows: same results but better performance.
But, because the index-compatible condition is too broad, you still need to have the "original" one to get the correct results. If your new, "redundant", condition gets the exact correct results then you're not doing this pattern, you're just replacing a query that can't use an index with one that can, and in that case sure, it doesn't make sense to keep the old one around.
aarondf 2 years ago

The redundant condition is broader than the real condition. Removing the real condition would allow false positives in

lee101 2 years ago

[dead]