PostgreSQL: Correct way for subset calculations

I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this

user_id | activity_id

---------------------

1       | 1

1       | 2

1       | 3

2       | 1

2       | 2

I want to select all user_id with activity_id = 1, 2 NOT 3. In the case above it will be just one result: user_id = 2. I can do it using SELECT DISTINCT combined with INTERSECT and EXCEPT operators, but it seems to be extremely slow.

From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).

edited Jan 14 at 16:56

Evan Carroll

32k969219

asked Jan 14 at 13:03

Ximik

1085

Please post your table(s) definition(s), query, and execution plan.

– Colin 't Hart
Jan 14 at 14:46

@a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

– Ximik
Jan 14 at 15:07

When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

– Evan Carroll
Jan 14 at 16:55

add a comment |

I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this

user_id | activity_id

---------------------

1       | 1

1       | 2

1       | 3

2       | 1

2       | 2

edited Jan 14 at 16:56

Evan Carroll

32k969219

asked Jan 14 at 13:03

Ximik

1085

Please post your table(s) definition(s), query, and execution plan.

– Colin 't Hart
Jan 14 at 14:46

@a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

– Ximik
Jan 14 at 15:07

When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

– Evan Carroll
Jan 14 at 16:55

add a comment |

I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this

user_id | activity_id

---------------------

1       | 1

1       | 2

1       | 3

2       | 1

2       | 2

edited Jan 14 at 16:56

Evan Carroll

32k969219

asked Jan 14 at 13:03

Ximik

1085

I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this

user_id | activity_id

---------------------

1       | 1

1       | 2

1       | 3

2       | 1

2       | 2

postgresql

edited Jan 14 at 16:56

Evan Carroll

32k969219

asked Jan 14 at 13:03

Ximik

1085

edited Jan 14 at 16:56

Evan Carroll

32k969219

asked Jan 14 at 13:03

Ximik

1085

edited Jan 14 at 16:56

Evan Carroll

32k969219

edited Jan 14 at 16:56

Evan Carroll

32k969219

edited Jan 14 at 16:56

Evan Carroll

32k969219

asked Jan 14 at 13:03

Ximik

1085

asked Jan 14 at 13:03

Ximik

1085

asked Jan 14 at 13:03

Ximik

1085

Please post your table(s) definition(s), query, and execution plan.

– Colin 't Hart
Jan 14 at 14:46

@a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

– Ximik
Jan 14 at 15:07

When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

– Evan Carroll
Jan 14 at 16:55

add a comment |

Please post your table(s) definition(s), query, and execution plan.

– Colin 't Hart
Jan 14 at 14:46

@a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

– Ximik
Jan 14 at 15:07

When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

– Evan Carroll
Jan 14 at 16:55

Please post your table(s) definition(s), query, and execution plan.

– Colin 't Hart
Jan 14 at 14:46

@a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

– Ximik
Jan 14 at 15:07

When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

– Evan Carroll
Jan 14 at 16:55

add a comment |

3 Answers
3

active

oldest

votes

You can easily do this with arrays in Postgres:

select user_id, array_agg(activity_id) as activities

from users

group by user_id

having array_agg(activity_id) @> array[1,2]

   and not 3 = any(array_agg(activity_id));

The condition array_agg(activity_id) @> array[1,2] only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id)) removes all those that contain activity_id = 3

If the table contains more than just those two columns, an index on (user_id, activitiy_id) will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id) to enable Postgres to remove unwanted rows efficiently.

On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.

Online example: https://rextester.com/YLN7221

edited Jan 14 at 13:53

answered Jan 14 at 13:23

a_horse_with_no_name

39.4k775112

add a comment |

You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.

CREATE INDEX elbat_user_id_activity_id

             ON elbat (user_id,

                       activity_id);

SELECT DISTINCT t1.user_id

       FROM elbat t1

       WHERE EXISTS (SELECT *

                            FROM elbat t2

                            WHERE t2.user_id = t1.user_id

                                  AND t2.activity_id = '1')

             AND EXISTS (SELECT *

                                FROM elbat t2

                                WHERE t2.user_id = t1.user_id

                                      AND t2.activity_id = '2')

             AND NOT EXISTS (SELECT *

                                    FROM elbat t2

                                    WHERE t2.user_id = t1.user_id

                                          AND t2.activity_id = '3');

If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.

answered Jan 14 at 13:19

sticky bit

1,899314

Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.

– Ximik
Jan 18 at 14:52

add a comment |

You can use bool_or for this.

SELECT user_id

FROM users

GROUP BY user_id

HAVING bool_or(activity_id IN (1,2))  -- assumes '1, 2' mean '1 OR 2'

  AND NOT bool_or(activity_id IN (3));

edited Jan 15 at 14:59

answered Jan 14 at 16:54

Evan Carroll

32k969219

add a comment |

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227083%2fpostgresql-correct-way-for-subset-calculations%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

You can easily do this with arrays in Postgres:

select user_id, array_agg(activity_id) as activities

from users

group by user_id

having array_agg(activity_id) @> array[1,2]

   and not 3 = any(array_agg(activity_id));

On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.

Online example: https://rextester.com/YLN7221

edited Jan 14 at 13:53

answered Jan 14 at 13:23

a_horse_with_no_name

39.4k775112

add a comment |

You can easily do this with arrays in Postgres:

select user_id, array_agg(activity_id) as activities

from users

group by user_id

having array_agg(activity_id) @> array[1,2]

   and not 3 = any(array_agg(activity_id));

On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.

Online example: https://rextester.com/YLN7221

edited Jan 14 at 13:53

answered Jan 14 at 13:23

a_horse_with_no_name

39.4k775112

add a comment |

You can easily do this with arrays in Postgres:

select user_id, array_agg(activity_id) as activities

from users

group by user_id

having array_agg(activity_id) @> array[1,2]

   and not 3 = any(array_agg(activity_id));

On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.

Online example: https://rextester.com/YLN7221

edited Jan 14 at 13:53

answered Jan 14 at 13:23

a_horse_with_no_name

39.4k775112

You can easily do this with arrays in Postgres:

select user_id, array_agg(activity_id) as activities

from users

group by user_id

having array_agg(activity_id) @> array[1,2]

   and not 3 = any(array_agg(activity_id));

On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.

Online example: https://rextester.com/YLN7221

edited Jan 14 at 13:53

answered Jan 14 at 13:23

a_horse_with_no_name

39.4k775112

edited Jan 14 at 13:53

answered Jan 14 at 13:23

a_horse_with_no_name

39.4k775112

answered Jan 14 at 13:23

a_horse_with_no_name

39.4k775112

answered Jan 14 at 13:23

a_horse_with_no_name

39.4k775112

add a comment |

You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.

CREATE INDEX elbat_user_id_activity_id

             ON elbat (user_id,

                       activity_id);

SELECT DISTINCT t1.user_id

       FROM elbat t1

       WHERE EXISTS (SELECT *

                            FROM elbat t2

                            WHERE t2.user_id = t1.user_id

                                  AND t2.activity_id = '1')

             AND EXISTS (SELECT *

                                FROM elbat t2

                                WHERE t2.user_id = t1.user_id

                                      AND t2.activity_id = '2')

             AND NOT EXISTS (SELECT *

                                    FROM elbat t2

                                    WHERE t2.user_id = t1.user_id

                                          AND t2.activity_id = '3');

If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.

answered Jan 14 at 13:19

sticky bit

1,899314

Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.

– Ximik
Jan 18 at 14:52

add a comment |

You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.

CREATE INDEX elbat_user_id_activity_id

             ON elbat (user_id,

                       activity_id);

SELECT DISTINCT t1.user_id

       FROM elbat t1

       WHERE EXISTS (SELECT *

                            FROM elbat t2

                            WHERE t2.user_id = t1.user_id

                                  AND t2.activity_id = '1')

             AND EXISTS (SELECT *

                                FROM elbat t2

                                WHERE t2.user_id = t1.user_id

                                      AND t2.activity_id = '2')

             AND NOT EXISTS (SELECT *

                                    FROM elbat t2

                                    WHERE t2.user_id = t1.user_id

                                          AND t2.activity_id = '3');

If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.

answered Jan 14 at 13:19

sticky bit

1,899314

Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.

– Ximik
Jan 18 at 14:52

add a comment |

You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.

CREATE INDEX elbat_user_id_activity_id

             ON elbat (user_id,

                       activity_id);

SELECT DISTINCT t1.user_id

       FROM elbat t1

       WHERE EXISTS (SELECT *

                            FROM elbat t2

                            WHERE t2.user_id = t1.user_id

                                  AND t2.activity_id = '1')

             AND EXISTS (SELECT *

                                FROM elbat t2

                                WHERE t2.user_id = t1.user_id

                                      AND t2.activity_id = '2')

             AND NOT EXISTS (SELECT *

                                    FROM elbat t2

                                    WHERE t2.user_id = t1.user_id

                                          AND t2.activity_id = '3');

If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.

answered Jan 14 at 13:19

sticky bit

1,899314

You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.

CREATE INDEX elbat_user_id_activity_id

             ON elbat (user_id,

                       activity_id);

SELECT DISTINCT t1.user_id

       FROM elbat t1

       WHERE EXISTS (SELECT *

                            FROM elbat t2

                            WHERE t2.user_id = t1.user_id

                                  AND t2.activity_id = '1')

             AND EXISTS (SELECT *

                                FROM elbat t2

                                WHERE t2.user_id = t1.user_id

                                      AND t2.activity_id = '2')

             AND NOT EXISTS (SELECT *

                                    FROM elbat t2

                                    WHERE t2.user_id = t1.user_id

                                          AND t2.activity_id = '3');

If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.

answered Jan 14 at 13:19

sticky bit

1,899314

answered Jan 14 at 13:19

sticky bit

1,899314

answered Jan 14 at 13:19

sticky bit

1,899314

answered Jan 14 at 13:19

sticky bit

1,899314

Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.

– Ximik
Jan 18 at 14:52

add a comment |

Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.

– Ximik
Jan 18 at 14:52

Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.

– Ximik
Jan 18 at 14:52

add a comment |

You can use bool_or for this.

SELECT user_id

FROM users

GROUP BY user_id

HAVING bool_or(activity_id IN (1,2))  -- assumes '1, 2' mean '1 OR 2'

  AND NOT bool_or(activity_id IN (3));

edited Jan 15 at 14:59

answered Jan 14 at 16:54

Evan Carroll

32k969219

add a comment |

You can use bool_or for this.

SELECT user_id

FROM users

GROUP BY user_id

HAVING bool_or(activity_id IN (1,2))  -- assumes '1, 2' mean '1 OR 2'

  AND NOT bool_or(activity_id IN (3));

edited Jan 15 at 14:59

answered Jan 14 at 16:54

Evan Carroll

32k969219

add a comment |

You can use bool_or for this.

SELECT user_id

FROM users

GROUP BY user_id

HAVING bool_or(activity_id IN (1,2))  -- assumes '1, 2' mean '1 OR 2'

  AND NOT bool_or(activity_id IN (3));

edited Jan 15 at 14:59

answered Jan 14 at 16:54

Evan Carroll

32k969219

You can use bool_or for this.

SELECT user_id

FROM users

GROUP BY user_id

HAVING bool_or(activity_id IN (1,2))  -- assumes '1, 2' mean '1 OR 2'

  AND NOT bool_or(activity_id IN (3));

edited Jan 15 at 14:59

answered Jan 14 at 16:54

Evan Carroll

32k969219

edited Jan 15 at 14:59

answered Jan 14 at 16:54

Evan Carroll

32k969219

answered Jan 14 at 16:54

Evan Carroll

32k969219

answered Jan 14 at 16:54

Evan Carroll

32k969219

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Database Administrators Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

QkvBiiR7 SPR,s od8p7,LZ49Eft1VIjTu1gfA1,ZD

搜尋此網誌

Dtyjlui