The employee IDs passed from the client application are completely random, i.e. there is no additional attribute that could be used for filtering, such as a department ID? Not even for an approximate selection?
I would recommend benchmarking IN clauses with the cardinality of employee IDs that you need and verifying that you still get good performance. I’m less concerned about whether CrateDB can obtain the rows quickly than about the sheer textual size of the resulting SQL query. Independent of CrateDB specifics, I think a very large IN clause adds significant overhead in compiling the query, transmitting it, parsing it, and so on.
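To get a rough feeling for the textual size alone, before benchmarking against the database, one could build the statement for different numbers of IDs and measure its length. This is only a sketch: the table name `employees`, the column name `employee_id`, and the assumption that the IDs are integers are made up for illustration and not taken from your setup.

```python
import random


def build_in_query(ids):
    # Inline the IDs as literals, so the SQL text grows with every ID.
    # Table and column names are hypothetical placeholders.
    id_list = ", ".join(str(i) for i in ids)
    return f"SELECT * FROM employees WHERE employee_id IN ({id_list})"


for n in (1_000, 10_000, 100_000):
    # Random integer IDs stand in for the IDs sent by the client application.
    ids = random.sample(range(10**9), n)
    sql = build_in_query(ids)
    size_mb = len(sql.encode("utf-8")) / 1e6
    print(f"{n:>7} ids -> {size_mb:.2f} MB of SQL text")
```

This only measures the size of the statement; the actual cost of transmitting and parsing it still has to be measured against your cluster.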
It might be an option to partition the query, meaning not running one huge IN clause but running several queries, each with a subset of the employee IDs (with the subset size depending on the results of your benchmarking).
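Partitioning the query could look roughly like the following sketch. It assumes a driver that accepts `?` placeholders; `run_query` is a hypothetical callable standing in for whatever your client uses to execute a statement, and the table and column names are again made up.

```python
from itertools import islice


def chunked(ids, size):
    """Yield successive lists of at most `size` IDs."""
    it = iter(ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fetch_in_batches(ids, run_query, batch_size=10_000):
    """Run one parameterized IN query per batch and collect all rows.

    `run_query(sql, params)` is a hypothetical stand-in for the driver call
    and is expected to return an iterable of rows.
    """
    rows = []
    for batch in chunked(ids, batch_size):
        # One placeholder per ID in this batch.
        placeholders = ", ".join("?" for _ in batch)
        sql = f"SELECT * FROM employees WHERE employee_id IN ({placeholders})"
        rows.extend(run_query(sql, batch))
    return rows
```

The default `batch_size` here is arbitrary; a sensible value should come out of the benchmarking.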
What type of client application are we talking about? If millions of rows are returned, I assume this is not a directly user-facing application, but some sort of automated further processing? It might even be worth considering an approximate selection of rows in CrateDB and delegating parts of the filtering to the client application: for example, obtaining all employees of a whole department, location, etc. (even if that returns slightly too many rows) and then doing the fine-grained filtering for specific employee IDs in the client application.
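The client-side part of such an approach could be as simple as the sketch below. It assumes the coarse query (for instance a hypothetical `SELECT employee_id, name FROM employees WHERE department_id = ?`) has already returned its rows as tuples, with the employee ID in the first position.

```python
def filter_by_ids(rows, wanted_ids, id_index=0):
    """Keep only the rows whose ID column is in `wanted_ids`."""
    # A set makes each membership check constant time on average,
    # which matters when both the row count and the ID count are large.
    wanted = set(wanted_ids)
    return [row for row in rows if row[id_index] in wanted]


# Rows as they might come back from the coarse, per-department query.
coarse_rows = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
selected = filter_by_ids(coarse_rows, wanted_ids=[1, 3])
```

Whether this pays off depends on how many surplus rows the coarse selection returns compared to the cost of sending the full ID list to CrateDB.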