PostgreSQL: Correct way for subset calculations
I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id
and activity_id
like this
user_id | activity_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2
I want to select all user_id
with activity_id = 1, 2 NOT 3
. In the case above it will be just one result: user_id = 2
. I can do it using SELECT DISTINCT
combined with INTERSECT
and EXCEPT
operators, but it seems to be extremely slow.
From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).
postgresql
add a comment |
I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id
and activity_id
like this
user_id | activity_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2
I want to select all user_id
with activity_id = 1, 2 NOT 3
. In the case above it will be just one result: user_id = 2
. I can do it using SELECT DISTINCT
combined with INTERSECT
and EXCEPT
operators, but it seems to be extremely slow.
From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).
postgresql
Please post your table(s) definition(s), query, and execution plan.
– Colin 't Hart
Jan 14 at 14:46
@a_horse_with_no_nameactivity_id=4
changes nothing in such a case. It's 10.000.000 rows.
– Ximik
Jan 14 at 15:07
When you sayactivity id = 1, 2
do you mean 1 AND 2, or 1 OR 2?
– Evan Carroll
Jan 14 at 16:55
add a comment |
I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id
and activity_id
like this
user_id | activity_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2
I want to select all user_id
with activity_id = 1, 2 NOT 3
. In the case above it will be just one result: user_id = 2
. I can do it using SELECT DISTINCT
combined with INTERSECT
and EXCEPT
operators, but it seems to be extremely slow.
From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).
postgresql
I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id
and activity_id
like this
user_id | activity_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2
I want to select all user_id
with activity_id = 1, 2 NOT 3
. In the case above it will be just one result: user_id = 2
. I can do it using SELECT DISTINCT
combined with INTERSECT
and EXCEPT
operators, but it seems to be extremely slow.
From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).
postgresql
postgresql
edited Jan 14 at 16:56
Evan Carroll
32k969219
32k969219
asked Jan 14 at 13:03
XimikXimik
1085
1085
Please post your table(s) definition(s), query, and execution plan.
– Colin 't Hart
Jan 14 at 14:46
@a_horse_with_no_nameactivity_id=4
changes nothing in such a case. It's 10.000.000 rows.
– Ximik
Jan 14 at 15:07
When you sayactivity id = 1, 2
do you mean 1 AND 2, or 1 OR 2?
– Evan Carroll
Jan 14 at 16:55
add a comment |
Please post your table(s) definition(s), query, and execution plan.
– Colin 't Hart
Jan 14 at 14:46
@a_horse_with_no_nameactivity_id=4
changes nothing in such a case. It's 10.000.000 rows.
– Ximik
Jan 14 at 15:07
When you sayactivity id = 1, 2
do you mean 1 AND 2, or 1 OR 2?
– Evan Carroll
Jan 14 at 16:55
Please post your table(s) definition(s), query, and execution plan.
– Colin 't Hart
Jan 14 at 14:46
Please post your table(s) definition(s), query, and execution plan.
– Colin 't Hart
Jan 14 at 14:46
@a_horse_with_no_name
activity_id=4
changes nothing in such a case. It's 10.000.000 rows.– Ximik
Jan 14 at 15:07
@a_horse_with_no_name
activity_id=4
changes nothing in such a case. It's 10.000.000 rows.– Ximik
Jan 14 at 15:07
When you say
activity id = 1, 2
do you mean 1 AND 2, or 1 OR 2?– Evan Carroll
Jan 14 at 16:55
When you say
activity id = 1, 2
do you mean 1 AND 2, or 1 OR 2?– Evan Carroll
Jan 14 at 16:55
add a comment |
3 Answers
3
active
oldest
votes
You can easily do this with arrays in Postgres:
select user_id, array_agg(activity_id) as activities
from users
group by user_id
having array_agg(activity_id) @> array[1,2]
and not 3 = any(array_agg(activity_id));
The condition array_agg(activity_id) @> array[1,2]
only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id))
removes all those that contain activity_id = 3
If the table contains more than just those two columns, an index on (user_id, activitiy_id)
will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists
condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id)
to enable Postgres to remove unwanted rows efficiently.
On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.
Online example: https://rextester.com/YLN7221
add a comment |
You can first try to rewrite the query using EXISTS
and a regular (B-tree) index on user_id
and activity_id
.
CREATE INDEX elbat_user_id_activity_id
ON elbat (user_id,
activity_id);
SELECT DISTINCT t1.user_id
FROM elbat t1
WHERE EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '1')
AND EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '2')
AND NOT EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '3');
If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.
Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.
– Ximik
Jan 18 at 14:52
add a comment |
You can use bool_or
for this.
SELECT user_id
FROM users
GROUP BY user_id
HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
AND NOT bool_or(activity_id IN (3));
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227083%2fpostgresql-correct-way-for-subset-calculations%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
You can easily do this with arrays in Postgres:
select user_id, array_agg(activity_id) as activities
from users
group by user_id
having array_agg(activity_id) @> array[1,2]
and not 3 = any(array_agg(activity_id));
The condition array_agg(activity_id) @> array[1,2]
only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id))
removes all those that contain activity_id = 3
If the table contains more than just those two columns, an index on (user_id, activitiy_id)
will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists
condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id)
to enable Postgres to remove unwanted rows efficiently.
On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.
Online example: https://rextester.com/YLN7221
add a comment |
You can easily do this with arrays in Postgres:
select user_id, array_agg(activity_id) as activities
from users
group by user_id
having array_agg(activity_id) @> array[1,2]
and not 3 = any(array_agg(activity_id));
The condition array_agg(activity_id) @> array[1,2]
only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id))
removes all those that contain activity_id = 3
If the table contains more than just those two columns, an index on (user_id, activitiy_id)
will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists
condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id)
to enable Postgres to remove unwanted rows efficiently.
On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.
Online example: https://rextester.com/YLN7221
add a comment |
You can easily do this with arrays in Postgres:
select user_id, array_agg(activity_id) as activities
from users
group by user_id
having array_agg(activity_id) @> array[1,2]
and not 3 = any(array_agg(activity_id));
The condition array_agg(activity_id) @> array[1,2]
only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id))
removes all those that contain activity_id = 3
If the table contains more than just those two columns, an index on (user_id, activitiy_id)
will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists
condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id)
to enable Postgres to remove unwanted rows efficiently.
On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.
Online example: https://rextester.com/YLN7221
You can easily do this with arrays in Postgres:
select user_id, array_agg(activity_id) as activities
from users
group by user_id
having array_agg(activity_id) @> array[1,2]
and not 3 = any(array_agg(activity_id));
The condition array_agg(activity_id) @> array[1,2]
only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id))
removes all those that contain activity_id = 3
If the table contains more than just those two columns, an index on (user_id, activitiy_id)
will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists
condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id)
to enable Postgres to remove unwanted rows efficiently.
On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.
Online example: https://rextester.com/YLN7221
edited Jan 14 at 13:53
answered Jan 14 at 13:23
a_horse_with_no_namea_horse_with_no_name
39.4k775112
39.4k775112
add a comment |
add a comment |
You can first try to rewrite the query using EXISTS
and a regular (B-tree) index on user_id
and activity_id
.
CREATE INDEX elbat_user_id_activity_id
ON elbat (user_id,
activity_id);
SELECT DISTINCT t1.user_id
FROM elbat t1
WHERE EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '1')
AND EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '2')
AND NOT EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '3');
If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.
Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.
– Ximik
Jan 18 at 14:52
add a comment |
You can first try to rewrite the query using EXISTS
and a regular (B-tree) index on user_id
and activity_id
.
CREATE INDEX elbat_user_id_activity_id
ON elbat (user_id,
activity_id);
SELECT DISTINCT t1.user_id
FROM elbat t1
WHERE EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '1')
AND EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '2')
AND NOT EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '3');
If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.
Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.
– Ximik
Jan 18 at 14:52
add a comment |
You can first try to rewrite the query using EXISTS
and a regular (B-tree) index on user_id
and activity_id
.
CREATE INDEX elbat_user_id_activity_id
ON elbat (user_id,
activity_id);
SELECT DISTINCT t1.user_id
FROM elbat t1
WHERE EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '1')
AND EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '2')
AND NOT EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '3');
If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.
You can first try to rewrite the query using EXISTS
and a regular (B-tree) index on user_id
and activity_id
.
CREATE INDEX elbat_user_id_activity_id
ON elbat (user_id,
activity_id);
SELECT DISTINCT t1.user_id
FROM elbat t1
WHERE EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '1')
AND EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '2')
AND NOT EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '3');
If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.
answered Jan 14 at 13:19
sticky bitsticky bit
1,899314
1,899314
Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.
– Ximik
Jan 18 at 14:52
add a comment |
Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.
– Ximik
Jan 18 at 14:52
Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.
– Ximik
Jan 18 at 14:52
Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.
– Ximik
Jan 18 at 14:52
add a comment |
You can use bool_or
for this.
SELECT user_id
FROM users
GROUP BY user_id
HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
AND NOT bool_or(activity_id IN (3));
add a comment |
You can use bool_or
for this.
SELECT user_id
FROM users
GROUP BY user_id
HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
AND NOT bool_or(activity_id IN (3));
add a comment |
You can use bool_or
for this.
SELECT user_id
FROM users
GROUP BY user_id
HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
AND NOT bool_or(activity_id IN (3));
You can use bool_or
for this.
SELECT user_id
FROM users
GROUP BY user_id
HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
AND NOT bool_or(activity_id IN (3));
edited Jan 15 at 14:59
answered Jan 14 at 16:54
Evan CarrollEvan Carroll
32k969219
32k969219
add a comment |
add a comment |
Thanks for contributing an answer to Database Administrators Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227083%2fpostgresql-correct-way-for-subset-calculations%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Please post your table(s) definition(s), query, and execution plan.
– Colin 't Hart
Jan 14 at 14:46
@a_horse_with_no_name
activity_id=4
changes nothing in such a case. It's 10.000.000 rows.– Ximik
Jan 14 at 15:07
When you say
activity id = 1, 2
do you mean 1 AND 2, or 1 OR 2?– Evan Carroll
Jan 14 at 16:55