PostgreSQL: Correct way for subset calculations












1















I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this



user_id | activity_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2


I want to select all user_id with activity_id = 1, 2 NOT 3. In the case above it will be just one result: user_id = 2. I can do it using SELECT DISTINCT combined with INTERSECT and EXCEPT operators, but it seems to be extremely slow.



From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).










share|improve this question

























  • Please post your table(s) definition(s), query, and execution plan.

    – Colin 't Hart
    Jan 14 at 14:46











  • @a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

    – Ximik
    Jan 14 at 15:07











  • When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

    – Evan Carroll
    Jan 14 at 16:55
















1















I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this



user_id | activity_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2


I want to select all user_id with activity_id = 1, 2 NOT 3. In the case above it will be just one result: user_id = 2. I can do it using SELECT DISTINCT combined with INTERSECT and EXCEPT operators, but it seems to be extremely slow.



From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).










share|improve this question

























  • Please post your table(s) definition(s), query, and execution plan.

    – Colin 't Hart
    Jan 14 at 14:46











  • @a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

    – Ximik
    Jan 14 at 15:07











  • When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

    – Evan Carroll
    Jan 14 at 16:55














1












1








1








I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this



user_id | activity_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2


I want to select all user_id with activity_id = 1, 2 NOT 3. In the case above it will be just one result: user_id = 2. I can do it using SELECT DISTINCT combined with INTERSECT and EXCEPT operators, but it seems to be extremely slow.



From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).










share|improve this question
















I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this



user_id | activity_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2


I want to select all user_id with activity_id = 1, 2 NOT 3. In the case above it will be just one result: user_id = 2. I can do it using SELECT DISTINCT combined with INTERSECT and EXCEPT operators, but it seems to be extremely slow.



From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).







postgresql






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Jan 14 at 16:56









Evan Carroll

32k969219




32k969219










asked Jan 14 at 13:03









XimikXimik

1085




1085













  • Please post your table(s) definition(s), query, and execution plan.

    – Colin 't Hart
    Jan 14 at 14:46











  • @a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

    – Ximik
    Jan 14 at 15:07











  • When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

    – Evan Carroll
    Jan 14 at 16:55



















  • Please post your table(s) definition(s), query, and execution plan.

    – Colin 't Hart
    Jan 14 at 14:46











  • @a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

    – Ximik
    Jan 14 at 15:07











  • When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

    – Evan Carroll
    Jan 14 at 16:55

















Please post your table(s) definition(s), query, and execution plan.

– Colin 't Hart
Jan 14 at 14:46





Please post your table(s) definition(s), query, and execution plan.

– Colin 't Hart
Jan 14 at 14:46













@a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

– Ximik
Jan 14 at 15:07





@a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

– Ximik
Jan 14 at 15:07













When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

– Evan Carroll
Jan 14 at 16:55





When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

– Evan Carroll
Jan 14 at 16:55










3 Answers
3






active

oldest

votes


















3














You can easily do this with arrays in Postgres:



select user_id, array_agg(activity_id) as activities
from users
group by user_id
having array_agg(activity_id) @> array[1,2]
and not 3 = any(array_agg(activity_id));




The condition array_agg(activity_id) @> array[1,2] only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id)) removes all those that contain activity_id = 3



If the table contains more than just those two columns, an index on (user_id, activitiy_id) will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id) to enable Postgres to remove unwanted rows efficiently.



On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.



Online example: https://rextester.com/YLN7221






share|improve this answer

































    2














    You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.



    CREATE INDEX elbat_user_id_activity_id
    ON elbat (user_id,
    activity_id);





    SELECT DISTINCT t1.user_id
    FROM elbat t1
    WHERE EXISTS (SELECT *
    FROM elbat t2
    WHERE t2.user_id = t1.user_id
    AND t2.activity_id = '1')
    AND EXISTS (SELECT *
    FROM elbat t2
    WHERE t2.user_id = t1.user_id
    AND t2.activity_id = '2')
    AND NOT EXISTS (SELECT *
    FROM elbat t2
    WHERE t2.user_id = t1.user_id
    AND t2.activity_id = '3');


    If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.






    share|improve this answer
























    • Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.

      – Ximik
      Jan 18 at 14:52



















    1














    You can use bool_or for this.



    SELECT user_id
    FROM users
    GROUP BY user_id
    HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
    AND NOT bool_or(activity_id IN (3));





    share|improve this answer

























      Your Answer








      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "182"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227083%2fpostgresql-correct-way-for-subset-calculations%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      3 Answers
      3






      active

      oldest

      votes








      3 Answers
      3






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      3














      You can easily do this with arrays in Postgres:



      select user_id, array_agg(activity_id) as activities
      from users
      group by user_id
      having array_agg(activity_id) @> array[1,2]
      and not 3 = any(array_agg(activity_id));




      The condition array_agg(activity_id) @> array[1,2] only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id)) removes all those that contain activity_id = 3



      If the table contains more than just those two columns, an index on (user_id, activitiy_id) will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id) to enable Postgres to remove unwanted rows efficiently.



      On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.



      Online example: https://rextester.com/YLN7221






      share|improve this answer






























        3














        You can easily do this with arrays in Postgres:



        select user_id, array_agg(activity_id) as activities
        from users
        group by user_id
        having array_agg(activity_id) @> array[1,2]
        and not 3 = any(array_agg(activity_id));




        The condition array_agg(activity_id) @> array[1,2] only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id)) removes all those that contain activity_id = 3



        If the table contains more than just those two columns, an index on (user_id, activitiy_id) will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id) to enable Postgres to remove unwanted rows efficiently.



        On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.



        Online example: https://rextester.com/YLN7221






        share|improve this answer




























          3












          3








          3







          You can easily do this with arrays in Postgres:



          select user_id, array_agg(activity_id) as activities
          from users
          group by user_id
          having array_agg(activity_id) @> array[1,2]
          and not 3 = any(array_agg(activity_id));




          The condition array_agg(activity_id) @> array[1,2] only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id)) removes all those that contain activity_id = 3



          If the table contains more than just those two columns, an index on (user_id, activitiy_id) will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id) to enable Postgres to remove unwanted rows efficiently.



          On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.



          Online example: https://rextester.com/YLN7221






          share|improve this answer















          You can easily do this with arrays in Postgres:



          select user_id, array_agg(activity_id) as activities
          from users
          group by user_id
          having array_agg(activity_id) @> array[1,2]
          and not 3 = any(array_agg(activity_id));




          The condition array_agg(activity_id) @> array[1,2] only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id)) removes all those that contain activity_id = 3



          If the table contains more than just those two columns, an index on (user_id, activitiy_id) will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id) to enable Postgres to remove unwanted rows efficiently.



          On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.



          Online example: https://rextester.com/YLN7221







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Jan 14 at 13:53

























          answered Jan 14 at 13:23









          a_horse_with_no_namea_horse_with_no_name

          39.4k775112




          39.4k775112

























              2














              You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.



              CREATE INDEX elbat_user_id_activity_id
              ON elbat (user_id,
              activity_id);





              SELECT DISTINCT t1.user_id
              FROM elbat t1
              WHERE EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.user_id = t1.user_id
              AND t2.activity_id = '1')
              AND EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.user_id = t1.user_id
              AND t2.activity_id = '2')
              AND NOT EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.user_id = t1.user_id
              AND t2.activity_id = '3');


              If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.






              share|improve this answer
























              • Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.

                – Ximik
                Jan 18 at 14:52
















              2














              You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.



              CREATE INDEX elbat_user_id_activity_id
              ON elbat (user_id,
              activity_id);





              SELECT DISTINCT t1.user_id
              FROM elbat t1
              WHERE EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.user_id = t1.user_id
              AND t2.activity_id = '1')
              AND EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.user_id = t1.user_id
              AND t2.activity_id = '2')
              AND NOT EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.user_id = t1.user_id
              AND t2.activity_id = '3');


              If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.






              share|improve this answer
























              • Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.

                – Ximik
                Jan 18 at 14:52














              2












              2








              2







              You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.



              CREATE INDEX elbat_user_id_activity_id
              ON elbat (user_id,
              activity_id);





              SELECT DISTINCT t1.user_id
              FROM elbat t1
              WHERE EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.user_id = t1.user_id
              AND t2.activity_id = '1')
              AND EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.user_id = t1.user_id
              AND t2.activity_id = '2')
              AND NOT EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.user_id = t1.user_id
              AND t2.activity_id = '3');


              If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.






              share|improve this answer













              You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.



              CREATE INDEX elbat_user_id_activity_id
              ON elbat (user_id,
              activity_id);





              SELECT DISTINCT t1.user_id
              FROM elbat t1
              WHERE EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.user_id = t1.user_id
              AND t2.activity_id = '1')
              AND EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.user_id = t1.user_id
              AND t2.activity_id = '2')
              AND NOT EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.user_id = t1.user_id
              AND t2.activity_id = '3');


              If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.







              share|improve this answer












              share|improve this answer



              share|improve this answer










              answered Jan 14 at 13:19









              sticky bitsticky bit

              1,899314




              1,899314













              • Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.

                – Ximik
                Jan 18 at 14:52



















              • Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.

                – Ximik
                Jan 18 at 14:52

















              Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.

              – Ximik
              Jan 18 at 14:52





              Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks.

              – Ximik
              Jan 18 at 14:52











              1














              You can use bool_or for this.



              SELECT user_id
              FROM users
              GROUP BY user_id
              HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
              AND NOT bool_or(activity_id IN (3));





              share|improve this answer






























                1














                You can use bool_or for this.



                SELECT user_id
                FROM users
                GROUP BY user_id
                HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
                AND NOT bool_or(activity_id IN (3));





                share|improve this answer




























                  1












                  1








                  1







                  You can use bool_or for this.



                  SELECT user_id
                  FROM users
                  GROUP BY user_id
                  HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
                  AND NOT bool_or(activity_id IN (3));





                  share|improve this answer















                  You can use bool_or for this.



                  SELECT user_id
                  FROM users
                  GROUP BY user_id
                  HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
                  AND NOT bool_or(activity_id IN (3));






                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Jan 15 at 14:59

























                  answered Jan 14 at 16:54









                  Evan CarrollEvan Carroll

                  32k969219




                  32k969219






























                      draft saved

                      draft discarded




















































                      Thanks for contributing an answer to Database Administrators Stack Exchange!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function () {
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227083%2fpostgresql-correct-way-for-subset-calculations%23new-answer', 'question_page');
                      }
                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      Human spaceflight

                      Can not write log (Is /dev/pts mounted?) - openpty in Ubuntu-on-Windows?

                      張江高科駅