Which columns should be indexed when all may be used in different search queries?
Background
I'm working on a website for a movie theather chain currently located in four different cities (might expand in the future). They use the same, single-database website for all cities, which means I have to have a column in certain tables which holds the ID of the city that each row belongs to.
Right now I have three different tables:
Cinemas
- Contains each city's cinema (ID and name).
Movies
- Contains all movies that has been/will be shown at the cinema.
Showtimes
- Contains all showtimes for all movies in all cities.
The structure of the Showtimes
table is the following:
Column Name | Column Type | Description
--------------+--------------+---------------
ID | BIGINT | (Primary) Unique ID for each showtime (perhaps unnecessary?)
CinemaID | TINYINT | Foreign key bound to Cinemas.ID
MovieID | BIGINT | Foreign key bound to Movies.ID
Showtime | DATETIME | At what date and time the movie will show
(will contain multiple rows for each movie, i.e. one row for each showtime)
How this table will be used
A user of the website must be able to:
View all current/upcoming movies and showtimes (sorted by date) in the selected city.
Example query (backend):
SELECT MovieID, Showtime FROM Showtimes WHERE CinemaID = ? ORDER BY Showtime
Select a single movie and view all showtimes for that specific title only (in the selected city).
Example query:
SELECT Showtime FROM Showtimes WHERE CinemaID = ? AND MovieID = ? ORDER BY Showtime
Select a single day and view all movies and showtimes for that day only (in the selected city).
Example query:
SELECT MovieID, Showtime FROM Showtimes WHERE CinemaID = ? AND (Showtime BETWEEN [date 12:00 AM] AND [date 12:00 PM])
So naturally I decided that I needed to create indexes for the columns.
Problem
What I'm having trouble with is deciding/determining how to index the columns properly. One index for each column seems quite expensive[1][2] so I started looking into composite indexes, which seems to be the right choice but also led to even more confusion.
From my understanding (based on what I've read) you should add the columns to the index by order of selectivity, making the most selective (I'm guessing that means the most unique/with the most cardinality?) column the first in the composite index[3] (in my case that would be the Showtime
column). The only problem with that is that the index can only be used by the database if the first column is included in the search query[4][5], which it currently isn't in either of my queries.
Question
What kind of index(es) should I apply to my columns in order to cover all usage scenarios? (the last scenario may be omitted, but the first two are required)
Should I use a composite index on all columns, for some columns, or do I need a separate index for each column?
This table is updated at most a few times per week to add new showtimes.
Footnotes
1MySQL indexes - what are the best practices?
2Indexing every column in a table
3How important is the order of columns in indexes? (question)
4How important is the order of columns in indexes? (#2 top-voted answer)
5When should I use a composite index?
mysql database-design index innodb mariadb
add a comment |
Background
I'm working on a website for a movie theather chain currently located in four different cities (might expand in the future). They use the same, single-database website for all cities, which means I have to have a column in certain tables which holds the ID of the city that each row belongs to.
Right now I have three different tables:
Cinemas
- Contains each city's cinema (ID and name).
Movies
- Contains all movies that has been/will be shown at the cinema.
Showtimes
- Contains all showtimes for all movies in all cities.
The structure of the Showtimes
table is the following:
Column Name | Column Type | Description
--------------+--------------+---------------
ID | BIGINT | (Primary) Unique ID for each showtime (perhaps unnecessary?)
CinemaID | TINYINT | Foreign key bound to Cinemas.ID
MovieID | BIGINT | Foreign key bound to Movies.ID
Showtime | DATETIME | At what date and time the movie will show
(will contain multiple rows for each movie, i.e. one row for each showtime)
How this table will be used
A user of the website must be able to:
View all current/upcoming movies and showtimes (sorted by date) in the selected city.
Example query (backend):
SELECT MovieID, Showtime FROM Showtimes WHERE CinemaID = ? ORDER BY Showtime
Select a single movie and view all showtimes for that specific title only (in the selected city).
Example query:
SELECT Showtime FROM Showtimes WHERE CinemaID = ? AND MovieID = ? ORDER BY Showtime
Select a single day and view all movies and showtimes for that day only (in the selected city).
Example query:
SELECT MovieID, Showtime FROM Showtimes WHERE CinemaID = ? AND (Showtime BETWEEN [date 12:00 AM] AND [date 12:00 PM])
So naturally I decided that I needed to create indexes for the columns.
Problem
What I'm having trouble with is deciding/determining how to index the columns properly. One index for each column seems quite expensive[1][2] so I started looking into composite indexes, which seems to be the right choice but also led to even more confusion.
From my understanding (based on what I've read) you should add the columns to the index by order of selectivity, making the most selective (I'm guessing that means the most unique/with the most cardinality?) column the first in the composite index[3] (in my case that would be the Showtime
column). The only problem with that is that the index can only be used by the database if the first column is included in the search query[4][5], which it currently isn't in either of my queries.
Question
What kind of index(es) should I apply to my columns in order to cover all usage scenarios? (the last scenario may be omitted, but the first two are required)
Should I use a composite index on all columns, for some columns, or do I need a separate index for each column?
This table is updated at most a few times per week to add new showtimes.
Footnotes
1MySQL indexes - what are the best practices?
2Indexing every column in a table
3How important is the order of columns in indexes? (question)
4How important is the order of columns in indexes? (#2 top-voted answer)
5When should I use a composite index?
mysql database-design index innodb mariadb
"one row for each showtime" -- What about a cinaplex that has multiple screens with the same movie at the same time?
– Rick James
Feb 6 at 18:52
@RickJames : This chain (which is relatively small) currently only have one screen in each theater. But in case they'd get a second, would it be a problem to add another column for that without an index if we don't need to search for it (only retreive it)?
– Visual Vincent
Feb 6 at 18:56
2
(Upvote for a well-written Question.)
– Rick James
Feb 6 at 19:16
add a comment |
Background
I'm working on a website for a movie theather chain currently located in four different cities (might expand in the future). They use the same, single-database website for all cities, which means I have to have a column in certain tables which holds the ID of the city that each row belongs to.
Right now I have three different tables:
Cinemas
- Contains each city's cinema (ID and name).
Movies
- Contains all movies that has been/will be shown at the cinema.
Showtimes
- Contains all showtimes for all movies in all cities.
The structure of the Showtimes
table is the following:
Column Name | Column Type | Description
--------------+--------------+---------------
ID | BIGINT | (Primary) Unique ID for each showtime (perhaps unnecessary?)
CinemaID | TINYINT | Foreign key bound to Cinemas.ID
MovieID | BIGINT | Foreign key bound to Movies.ID
Showtime | DATETIME | At what date and time the movie will show
(will contain multiple rows for each movie, i.e. one row for each showtime)
How this table will be used
A user of the website must be able to:
View all current/upcoming movies and showtimes (sorted by date) in the selected city.
Example query (backend):
SELECT MovieID, Showtime FROM Showtimes WHERE CinemaID = ? ORDER BY Showtime
Select a single movie and view all showtimes for that specific title only (in the selected city).
Example query:
SELECT Showtime FROM Showtimes WHERE CinemaID = ? AND MovieID = ? ORDER BY Showtime
Select a single day and view all movies and showtimes for that day only (in the selected city).
Example query:
SELECT MovieID, Showtime FROM Showtimes WHERE CinemaID = ? AND (Showtime BETWEEN [date 12:00 AM] AND [date 12:00 PM])
So naturally I decided that I needed to create indexes for the columns.
Problem
What I'm having trouble with is deciding/determining how to index the columns properly. One index for each column seems quite expensive[1][2] so I started looking into composite indexes, which seems to be the right choice but also led to even more confusion.
From my understanding (based on what I've read) you should add the columns to the index by order of selectivity, making the most selective (I'm guessing that means the most unique/with the most cardinality?) column the first in the composite index[3] (in my case that would be the Showtime
column). The only problem with that is that the index can only be used by the database if the first column is included in the search query[4][5], which it currently isn't in either of my queries.
Question
What kind of index(es) should I apply to my columns in order to cover all usage scenarios? (the last scenario may be omitted, but the first two are required)
Should I use a composite index on all columns, for some columns, or do I need a separate index for each column?
This table is updated at most a few times per week to add new showtimes.
Footnotes
1MySQL indexes - what are the best practices?
2Indexing every column in a table
3How important is the order of columns in indexes? (question)
4How important is the order of columns in indexes? (#2 top-voted answer)
5When should I use a composite index?
mysql database-design index innodb mariadb
Background
I'm working on a website for a movie theather chain currently located in four different cities (might expand in the future). They use the same, single-database website for all cities, which means I have to have a column in certain tables which holds the ID of the city that each row belongs to.
Right now I have three different tables:
Cinemas
- Contains each city's cinema (ID and name).
Movies
- Contains all movies that has been/will be shown at the cinema.
Showtimes
- Contains all showtimes for all movies in all cities.
The structure of the Showtimes
table is the following:
Column Name | Column Type | Description
--------------+--------------+---------------
ID | BIGINT | (Primary) Unique ID for each showtime (perhaps unnecessary?)
CinemaID | TINYINT | Foreign key bound to Cinemas.ID
MovieID | BIGINT | Foreign key bound to Movies.ID
Showtime | DATETIME | At what date and time the movie will show
(will contain multiple rows for each movie, i.e. one row for each showtime)
How this table will be used
A user of the website must be able to:
View all current/upcoming movies and showtimes (sorted by date) in the selected city.
Example query (backend):
SELECT MovieID, Showtime FROM Showtimes WHERE CinemaID = ? ORDER BY Showtime
Select a single movie and view all showtimes for that specific title only (in the selected city).
Example query:
SELECT Showtime FROM Showtimes WHERE CinemaID = ? AND MovieID = ? ORDER BY Showtime
Select a single day and view all movies and showtimes for that day only (in the selected city).
Example query:
SELECT MovieID, Showtime FROM Showtimes WHERE CinemaID = ? AND (Showtime BETWEEN [date 12:00 AM] AND [date 12:00 PM])
So naturally I decided that I needed to create indexes for the columns.
Problem
What I'm having trouble with is deciding/determining how to index the columns properly. One index for each column seems quite expensive[1][2] so I started looking into composite indexes, which seems to be the right choice but also led to even more confusion.
From my understanding (based on what I've read) you should add the columns to the index by order of selectivity, making the most selective (I'm guessing that means the most unique/with the most cardinality?) column the first in the composite index[3] (in my case that would be the Showtime
column). The only problem with that is that the index can only be used by the database if the first column is included in the search query[4][5], which it currently isn't in either of my queries.
Question
What kind of index(es) should I apply to my columns in order to cover all usage scenarios? (the last scenario may be omitted, but the first two are required)
Should I use a composite index on all columns, for some columns, or do I need a separate index for each column?
This table is updated at most a few times per week to add new showtimes.
Footnotes
1MySQL indexes - what are the best practices?
2Indexing every column in a table
3How important is the order of columns in indexes? (question)
4How important is the order of columns in indexes? (#2 top-voted answer)
5When should I use a composite index?
mysql database-design index innodb mariadb
mysql database-design index innodb mariadb
edited Feb 6 at 19:36
Visual Vincent
asked Feb 6 at 18:18
Visual VincentVisual Vincent
1287
1287
"one row for each showtime" -- What about a cinaplex that has multiple screens with the same movie at the same time?
– Rick James
Feb 6 at 18:52
@RickJames : This chain (which is relatively small) currently only have one screen in each theater. But in case they'd get a second, would it be a problem to add another column for that without an index if we don't need to search for it (only retreive it)?
– Visual Vincent
Feb 6 at 18:56
2
(Upvote for a well-written Question.)
– Rick James
Feb 6 at 19:16
add a comment |
"one row for each showtime" -- What about a cinaplex that has multiple screens with the same movie at the same time?
– Rick James
Feb 6 at 18:52
@RickJames : This chain (which is relatively small) currently only have one screen in each theater. But in case they'd get a second, would it be a problem to add another column for that without an index if we don't need to search for it (only retreive it)?
– Visual Vincent
Feb 6 at 18:56
2
(Upvote for a well-written Question.)
– Rick James
Feb 6 at 19:16
"one row for each showtime" -- What about a cinaplex that has multiple screens with the same movie at the same time?
– Rick James
Feb 6 at 18:52
"one row for each showtime" -- What about a cinaplex that has multiple screens with the same movie at the same time?
– Rick James
Feb 6 at 18:52
@RickJames : This chain (which is relatively small) currently only have one screen in each theater. But in case they'd get a second, would it be a problem to add another column for that without an index if we don't need to search for it (only retreive it)?
– Visual Vincent
Feb 6 at 18:56
@RickJames : This chain (which is relatively small) currently only have one screen in each theater. But in case they'd get a second, would it be a problem to add another column for that without an index if we don't need to search for it (only retreive it)?
– Visual Vincent
Feb 6 at 18:56
2
2
(Upvote for a well-written Question.)
– Rick James
Feb 6 at 19:16
(Upvote for a well-written Question.)
– Rick James
Feb 6 at 19:16
add a comment |
4 Answers
4
active
oldest
votes
Composite Primary Key
I would define the primary key as a composite key of (CinemaID, MovieID, Showtime)
.
These 3 columns uniquely identify each row, and so having a separate ID
column is not necessary.
Composite (Secondary) Index
With this PK, the only additional index you will need for your queries is (CinemaID, Showtime)
.
Why these indexes?
A good way to think about indexes are used is to think of them as the order of columns in a spreadsheet.
Imagine a spreadsheed with (CinemaID, MovieID, Showtime)
that is sorted by each column successively.
All your queries have CinemaID
present, which means you can quickly find the "section" of the spreadsheet for that CinemaID
. Then, for your query that searches by MovieID
, you can easily find the "subsection" in the 2nd column, where MovieID
matches the searched for value.
As the 3rd column of Showtime
is also sorted, you can imagine how quickly and easily it is to find all those show times for that movie, in that cinema. The DBMS does things in a similar way and can retrieve those results extremely quickly.
As for your other queries, they all start with CinemaID
and then use Showtime
in some manner. They also need the MovieID
in their results.
So, the (CinemaID, Showtime)
index has you covered there. Again, the CinemaID
easily finds the "section" of the spreadsheet (in the analogy), and all possible showtimes (and there are going to be duplicates, assuming there is more than one screen) will be listed in order and are easily searched and/or sorted by those values.
Even better, since your primary key includes MovieID
, that column is included for all secondary indexes after the defined columns (at least for MySQL InnoDB - others engines too, but not necessarily all.)
Think of that is a 3rd column in the "spreadsheet" of our secondary index. The reason the column exists is to have all parts of the primary key available to do a lookup to the main table (aka the clustered index, in InnoDB), if necessary. In this simple case, no lookup is needed, so it is even more efficient as it doesn't require that double lookup.
Using just this primary key and a single secondary index, you should obtain excellent performance on any of the queries you listed.
Afterthought
My assumption that this uniquely identifies each row may NOT be correct if you have a movie showing on multiple screens at the same time. If you want to be able to separately identify those screens, then my solution is not best (I can provide another solution for that situation, please just let me know.)
Thanks for your answer and the thorough explanation! Currently the theather chain has only one screen in each theater. But would it, in case they got a second, be a problem to add another column for that without an index, if we don't need to search for/sort by the screen ID (only retreive it)?
– Visual Vincent
Feb 6 at 18:57
1
Adding a screenID won't be an issue UNLESS you need to have the situation where, at a given cinema, there is the same movie playing on both screens, AND starting at the same time. In that case, the primary key of(CinemaID, MovieID, Showtime)
does not permit you to enter both ScreenIDs, because only one record for each combination will exist. However, that said, and given that you say you don't need to sort by it, you CAN addScreenID
as the 4th column of the primary key. In that case, you can store both identical showings, and still differentiate by screen. So, either way you are fine.
– Willem Renzema
Feb 6 at 20:06
add a comment |
WHERE CinemaID = ? ORDER BY Showtime -- and
WHERE CinemaID = ? AND (Showtime BETWEEN [date 12:00 AM] AND [date 12:00 PM]) -- need:
INDEX(CinemaID, Showtime)
WHERE CinemaID = ? AND MovieID = ? ORDER BY Showtime -- needs:
INDEX(CinemaID, MovieId, Showtime) -- or
INDEX(MovieId, CinemaID, Showtime)
Assuming that the the triple (MovieId, CinemaID, Showtime) is unique, I suggest getting rid of id
and having
PRIMARY KEY(CinemaID, MovieId, Showtime)
INDEX(CinemaID, Showtime)
Are there cases where the main part of the WHERE
is MovieID=...
?
Cinemas - A list of each city and its cinema (ID and name):
SELECT Cinema, CinemaID FROM Cimemas; -- (no index needed)
Movies - A list of movies that has been/will be shown at the cinema.
SELECT DISTINCT MovieID FROM ShowTimes WHERE CinemaID=...
INDEX(CinemaID, MovieID) -- already handled by my proposed PK
Showtimes - A list of all showtimes for all movies in all cities. -- This is a huge output; rethink the requirement for it. That is, think about what the Client will do with it.
Most of these indexes could be deduced from studying indexing cookbook and composite indexing
Note how I did not suggest indexes (single-column vs composite, plus specifics) until I saw the SELECTs
.
"columns to the index by order of selectivity, making the most selective (I'm guessing that means the most unique/with the most cardinality?)" -- No. Selectivity is not the key for designing a composite index. Start with all the =
columns, in any order. (My Cookbook dwells on this topic.)
"index can only be used by the table if the first column is included in the search query" -- Mostly true. Notice how I recommended 2 indexes (remember: the PK is an index). There are cases where an index can be used for GROUP BY
or ORDER BY
, ignoring the WHERE
; but those are rare.
"first in the composite index ... Showtime" -- It is usually counterproductive to put a DATETIME
first in a composite index. In particular, your 3rd query can use both columns of INDEX(CinemaID, Showtime)
, but not both columns of INDEX(Showtime, CinemaID)
. It is easy to see this: Think of writing out two lists of cinemas and showtimes. Have one list sorted first on cinema (a la INDEX(cinema, time)
; have the other sorted on time. Think about which one would have all the rows for a particular cinema over a time range clumped ("clustered") together.
If there is also a ScreenID
due to two screens showing the same movie at the same time, tack it onto the PK. However, all the indexes will need rethinking.
(Sorry, Willem, I wrote my Answer before reading yours -- We say almost the same stuff.)
Sorry for the confusion. The first three bullet points are just a description of what tables I have and what they contain (i.e. "A list of all showtimes for all movies in all cities" means the table contains all the showtimes across the entire website, but they're not output all at once - they are what's queried by the next three bullet points located under "On the site the user must be able to").
– Visual Vincent
Feb 6 at 19:29
Edited question for (hopefully) better clarity. Now, I have a couple questions regarding your answer: 1) "Are there cases where the main part of the WHERE is MovieID=...?" - Do you mean when it's only checkingMovieID
or when it's checkingMovieID
first? (if the former: Not at the moment || if the latter: Also not at the moment, how exactly does it make things different?)
– Visual Vincent
Feb 6 at 19:45
1
@VisualVincent - The order of columns in an index is important. If, for example,ScreenID
is stuck in the middle of the index, but not used in theWHERE
, the rest of that index is useless for thatWHERE
.
– Rick James
Feb 6 at 19:57
1
@VisualVincent - A messy query: "Show all Tuesday matinee showings of G-rated movies." The likely technique is to index some of the stuff, then work harder to filter the rest of criteria.
– Rick James
Feb 6 at 20:06
1
@VisualVincent - Yes,ScreenID
last.
– Rick James
Feb 6 at 20:07
|
show 6 more comments
You are optimizing for size*, on an extremely small table. Since this basically a bridge table, it doesn’t really need a primary key, and even if it did, it wouldn’t need to be a bigint, the movies id doesn’t need to be a bigint either. If you think you need more than a couple of billion movies, then used unsigned but I think it’s unlikely that you’re going to run into that limitation for number of movies anytime soon.
Have an index on each of your primary terms, movie, cinema, showtime. See what your performance is, and if it’s unacceptable add additional index combinations.
AFTER you give it a try with no indexes at all (except possibly a pk on an auto_increment column used as a surrogate key) and a reasonable amount of data, say 200 cinemas X (5 new movies a week x 4 show times daily x 7 x (520 weeks aka 10 years)) = 14,560,000 records and see how it performs.
Frankly at this size, a full table scan shouldn’t be tooo bad.
Size wise a totally covering index is 13 bytes, rounds up to 16, so 16 x 14,560,000 / 1024 is 227,500k aka 227mb aka .277. Times 4 for the table and each of the possible covering index and we are up to almost 1.25 gig for the table and indexes. Note that any combination indexes would be the same size, just have a different weight to the data...
Note that the way to really speed this up, is to put your historical data in another table or use partitioning. In which case if you have the data for the upcoming month, it’s 5 x 4 x 7 x 30 x 200 = 120,000 records, and scanning 120k records should not be a problem. At that point you might just keep it all in memory and use a local dictionary to look things up.
Also note that I am assuming that you expand like crazy, as you currently have only one screen, so probably only 4-6 records a day per cinema, and a low number of cinemas, so a more realistic daily count is probably between 1k and 10k records per month. Totally possible you’d get decent enough performance out of this if you stored the values in an excel spreadsheet which was manually updated daily.
PS for index selectivity, you don’t want the most precise, you want the most coverage. You aren’t looking for a single record, you are looking for a group, which means you want the index to be on the column that returns your group. In particular, this means that if you want to look for movies by day, you should split the date and time into two columns so you can put an index on the date (and possibly one on the time).
*The reason I said you were optimizing for space is because you seem to be worried about the cost of the index. The cost of an index comes in two forms, a slower insert/delete/update and disk space. For a full covering indexes that basically means each index cost just as much as the table does. The time to add a single row to a table with 16 bytes (or 24) is small and non-accumulating (ie basically constant regardless of the number of rows). While disk space accumulates slowly, it does grow.
Thank you for your answer! I'm not optimizing for size (typo?), I'm optimizing for speed. When designing a system I generally have 2-3 things in mind: speed, sustainability and somewhat scalability. Even though I might not need an index right now, I want to for all of those three reasons; I want it to be durable in the future. For the same reasons I use large data types. I know an unsigned INT is more than enough right now, but remember when they said 4 billion IP addresses would be enough?
– Visual Vincent
Feb 7 at 10:24
I realize I'm being rather pessimistic about this, but I highly value speed and sustainability. And compared to processing power, disk space is relatively cheap so I'm okay with a couple of extra GBs for the indexes. -- Having that said, I will of course try out and measure your suggestion, and compare it against the others' to see which is the fastest.
– Visual Vincent
Feb 7 at 10:28
@VisualVincent: I’ve added to the answer explaining why you seem to be optimizing for size. As for data type size. Not really, 4 billion was (and is) enough for a totally different system. A private network, when it went public it was obviously not enough. Really the fact that it lasted so long, shows how over-engineered it was (also note how forward thinking they were, VOIP was considered). 2 billion movies obviously isn’t enough for YouTube, it is for any possible list of movies shown in a theater. Your cinema ID should really be an int, that is an obvious place to allow growth.
– jmoreno
Feb 7 at 11:44
I just compared the indexing described in the first two answers with your suggestion of doing a full table scan (both with no index whatsoever and with a single, separate primary key). With indexes it took approximately 1,5 ms to find a single record and 15-17 ms to find every record for a specific cinema (~3400 records returned), whereas when doing a full table scan it took approximately 200 ms to locate a single record (with a separate primary key).
– Visual Vincent
Feb 8 at 11:51
And I tested with only 398k records in the table, meaning if I would've had 14 million it could have taken up to 7 seconds to find a single record.
– Visual Vincent
Feb 8 at 12:00
|
show 5 more comments
First I would definitely not define the primary key as a composite key of (CinemaID, MovieID, Showtime)
. Think of what that means: if you want to link an entry you always need those data instead of just an id.
When you create a database I would recommend you to stick to the rules of normalization. They were made because of a reason.
I understand that you want to create it as fast and memory-saving as possible, but we have no idea yet were the bottle-neck will be.
Just create the scheme like you would do it perfectly normalized. Then fill it with test data (a few million entries) and see if the performance seriously is affected.
When you do that you also have some numbers like "this query takes 2 times more memory than the others", and then you can have a look at this.
Currently this smells of premature optimization .
Performance in software is like taste in food: you can guess if it is good, but in the end you have to try it anyway.
Thanks for your answer! As I always have access to all cinemas and all movies (and all showtimes, through the UI) I will always know all those key components when wanting to reference a specific showtime. Without knowing at least most of them, there won't be a reason to reference a specific row. Now that I've gotten more feedback (even the opposite of the first answers - no indexes at all) I will try and measure both suggested methods.
– Visual Vincent
Feb 7 at 7:09
Also, I know about the basic normalization rules but I don't see how I would have violated them. Please elaborate.
– Visual Vincent
Feb 7 at 7:10
I can see how it violates the Boyce–Codd normal form without making all columns the primary key like suggested in the first answers, but I don't see how the table could be changed for the better as it's pretty hard to keep a list of showtimes from what it is: a list. As I cannot have a list in a column I need (a) separate table(s) for it, but dividing it into too many tables will instead make it harder to maintain and use.
– Visual Vincent
Feb 7 at 7:30
I just tested both with and without a composite index and the results are mentioned in my comment on jmoreno's answer. Having a composite index is the best way to go.
– Visual Vincent
Feb 8 at 12:24
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f229064%2fwhich-columns-should-be-indexed-when-all-may-be-used-in-different-search-queries%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
4 Answers
4
active
oldest
votes
4 Answers
4
active
oldest
votes
active
oldest
votes
active
oldest
votes
Composite Primary Key
I would define the primary key as a composite key of (CinemaID, MovieID, Showtime)
.
These 3 columns uniquely identify each row, and so having a separate ID
column is not necessary.
Composite (Secondary) Index
With this PK, the only additional index you will need for your queries is (CinemaID, Showtime)
.
Why these indexes?
A good way to think about indexes are used is to think of them as the order of columns in a spreadsheet.
Imagine a spreadsheed with (CinemaID, MovieID, Showtime)
that is sorted by each column successively.
All your queries have CinemaID
present, which means you can quickly find the "section" of the spreadsheet for that CinemaID
. Then, for your query that searches by MovieID
, you can easily find the "subsection" in the 2nd column, where MovieID
matches the searched for value.
As the 3rd column of Showtime
is also sorted, you can imagine how quickly and easily it is to find all those show times for that movie, in that cinema. The DBMS does things in a similar way and can retrieve those results extremely quickly.
As for your other queries, they all start with CinemaID
and then use Showtime
in some manner. They also need the MovieID
in their results.
So, the (CinemaID, Showtime)
index has you covered there. Again, the CinemaID
easily finds the "section" of the spreadsheet (in the analogy), and all possible showtimes (and there are going to be duplicates, assuming there is more than one screen) will be listed in order and are easily searched and/or sorted by those values.
Even better, since your primary key includes MovieID
, that column is included for all secondary indexes after the defined columns (at least for MySQL InnoDB - others engines too, but not necessarily all.)
Think of that is a 3rd column in the "spreadsheet" of our secondary index. The reason the column exists is to have all parts of the primary key available to do a lookup to the main table (aka the clustered index, in InnoDB), if necessary. In this simple case, no lookup is needed, so it is even more efficient as it doesn't require that double lookup.
Using just this primary key and a single secondary index, you should obtain excellent performance on any of the queries you listed.
Afterthought
My assumption that this uniquely identifies each row may NOT be correct if you have a movie showing on multiple screens at the same time. If you want to be able to separately identify those screens, then my solution is not best (I can provide another solution for that situation, please just let me know.)
Thanks for your answer and the thorough explanation! Currently the theather chain has only one screen in each theater. But would it, in case they got a second, be a problem to add another column for that without an index, if we don't need to search for/sort by the screen ID (only retreive it)?
– Visual Vincent
Feb 6 at 18:57
1
Adding a screenID won't be an issue UNLESS you need to have the situation where, at a given cinema, there is the same movie playing on both screens, AND starting at the same time. In that case, the primary key of(CinemaID, MovieID, Showtime)
does not permit you to enter both ScreenIDs, because only one record for each combination will exist. However, that said, and given that you say you don't need to sort by it, you CAN addScreenID
as the 4th column of the primary key. In that case, you can store both identical showings, and still differentiate by screen. So, either way you are fine.
– Willem Renzema
Feb 6 at 20:06
add a comment |
Composite Primary Key
I would define the primary key as a composite key of (CinemaID, MovieID, Showtime)
.
These 3 columns uniquely identify each row, and so having a separate ID
column is not necessary.
Composite (Secondary) Index
With this PK, the only additional index you will need for your queries is (CinemaID, Showtime)
.
Why these indexes?
A good way to think about indexes are used is to think of them as the order of columns in a spreadsheet.
Imagine a spreadsheed with (CinemaID, MovieID, Showtime)
that is sorted by each column successively.
All your queries have CinemaID
present, which means you can quickly find the "section" of the spreadsheet for that CinemaID
. Then, for your query that searches by MovieID
, you can easily find the "subsection" in the 2nd column, where MovieID
matches the searched for value.
As the 3rd column of Showtime
is also sorted, you can imagine how quickly and easily it is to find all those show times for that movie, in that cinema. The DBMS does things in a similar way and can retrieve those results extremely quickly.
As for your other queries, they all start with CinemaID
and then use Showtime
in some manner. They also need the MovieID
in their results.
So, the (CinemaID, Showtime)
index has you covered there. Again, the CinemaID
easily finds the "section" of the spreadsheet (in the analogy), and all possible showtimes (and there are going to be duplicates, assuming there is more than one screen) will be listed in order and are easily searched and/or sorted by those values.
Even better, since your primary key includes MovieID
, that column is included for all secondary indexes after the defined columns (at least for MySQL InnoDB - others engines too, but not necessarily all.)
Think of that is a 3rd column in the "spreadsheet" of our secondary index. The reason the column exists is to have all parts of the primary key available to do a lookup to the main table (aka the clustered index, in InnoDB), if necessary. In this simple case, no lookup is needed, so it is even more efficient as it doesn't require that double lookup.
Using just this primary key and a single secondary index, you should obtain excellent performance on any of the queries you listed.
Afterthought
My assumption that this uniquely identifies each row may NOT be correct if you have a movie showing on multiple screens at the same time. If you want to be able to separately identify those screens, then my solution is not best (I can provide another solution for that situation, please just let me know.)
Thanks for your answer and the thorough explanation! Currently the theather chain has only one screen in each theater. But would it, in case they got a second, be a problem to add another column for that without an index, if we don't need to search for/sort by the screen ID (only retreive it)?
– Visual Vincent
Feb 6 at 18:57
1
Adding a screenID won't be an issue UNLESS you need to have the situation where, at a given cinema, there is the same movie playing on both screens, AND starting at the same time. In that case, the primary key of(CinemaID, MovieID, Showtime)
does not permit you to enter both ScreenIDs, because only one record for each combination will exist. However, that said, and given that you say you don't need to sort by it, you CAN addScreenID
as the 4th column of the primary key. In that case, you can store both identical showings, and still differentiate by screen. So, either way you are fine.
– Willem Renzema
Feb 6 at 20:06
add a comment |
Composite Primary Key
I would define the primary key as a composite key of (CinemaID, MovieID, Showtime)
.
These 3 columns uniquely identify each row, and so having a separate ID
column is not necessary.
Composite (Secondary) Index
With this PK, the only additional index you will need for your queries is (CinemaID, Showtime)
.
Why these indexes?
A good way to think about indexes are used is to think of them as the order of columns in a spreadsheet.
Imagine a spreadsheed with (CinemaID, MovieID, Showtime)
that is sorted by each column successively.
All your queries have CinemaID
present, which means you can quickly find the "section" of the spreadsheet for that CinemaID
. Then, for your query that searches by MovieID
, you can easily find the "subsection" in the 2nd column, where MovieID
matches the searched for value.
As the 3rd column of Showtime
is also sorted, you can imagine how quickly and easily it is to find all those show times for that movie, in that cinema. The DBMS does things in a similar way and can retrieve those results extremely quickly.
As for your other queries, they all start with CinemaID
and then use Showtime
in some manner. They also need the MovieID
in their results.
So, the (CinemaID, Showtime)
index has you covered there. Again, the CinemaID
easily finds the "section" of the spreadsheet (in the analogy), and all possible showtimes (and there are going to be duplicates, assuming there is more than one screen) will be listed in order and are easily searched and/or sorted by those values.
Even better, since your primary key includes MovieID
, that column is included for all secondary indexes after the defined columns (at least for MySQL InnoDB - others engines too, but not necessarily all.)
Think of that is a 3rd column in the "spreadsheet" of our secondary index. The reason the column exists is to have all parts of the primary key available to do a lookup to the main table (aka the clustered index, in InnoDB), if necessary. In this simple case, no lookup is needed, so it is even more efficient as it doesn't require that double lookup.
Using just this primary key and a single secondary index, you should obtain excellent performance on any of the queries you listed.
Afterthought
My assumption that this uniquely identifies each row may NOT be correct if you have a movie showing on multiple screens at the same time. If you want to be able to separately identify those screens, then my solution is not best (I can provide another solution for that situation, please just let me know.)
Composite Primary Key
I would define the primary key as a composite key of (CinemaID, MovieID, Showtime)
.
These 3 columns uniquely identify each row, and so having a separate ID
column is not necessary.
Composite (Secondary) Index
With this PK, the only additional index you will need for your queries is (CinemaID, Showtime)
.
Why these indexes?
A good way to think about indexes are used is to think of them as the order of columns in a spreadsheet.
Imagine a spreadsheed with (CinemaID, MovieID, Showtime)
that is sorted by each column successively.
All your queries have CinemaID
present, which means you can quickly find the "section" of the spreadsheet for that CinemaID
. Then, for your query that searches by MovieID
, you can easily find the "subsection" in the 2nd column, where MovieID
matches the searched for value.
As the 3rd column of Showtime
is also sorted, you can imagine how quickly and easily it is to find all those show times for that movie, in that cinema. The DBMS does things in a similar way and can retrieve those results extremely quickly.
As for your other queries, they all start with CinemaID
and then use Showtime
in some manner. They also need the MovieID
in their results.
So, the (CinemaID, Showtime)
index has you covered there. Again, the CinemaID
easily finds the "section" of the spreadsheet (in the analogy), and all possible showtimes (and there are going to be duplicates, assuming there is more than one screen) will be listed in order and are easily searched and/or sorted by those values.
Even better, since your primary key includes MovieID
, that column is included for all secondary indexes after the defined columns (at least for MySQL InnoDB - others engines too, but not necessarily all.)
Think of that is a 3rd column in the "spreadsheet" of our secondary index. The reason the column exists is to have all parts of the primary key available to do a lookup to the main table (aka the clustered index, in InnoDB), if necessary. In this simple case, no lookup is needed, so it is even more efficient as it doesn't require that double lookup.
Using just this primary key and a single secondary index, you should obtain excellent performance on any of the queries you listed.
Afterthought
My assumption that this uniquely identifies each row may NOT be correct if you have a movie showing on multiple screens at the same time. If you want to be able to separately identify those screens, then my solution is not best (I can provide another solution for that situation, please just let me know.)
edited Feb 6 at 18:49
answered Feb 6 at 18:44
Willem RenzemaWillem Renzema
1,261168
1,261168
Thanks for your answer and the thorough explanation! Currently the theather chain has only one screen in each theater. But would it, in case they got a second, be a problem to add another column for that without an index, if we don't need to search for/sort by the screen ID (only retreive it)?
– Visual Vincent
Feb 6 at 18:57
1
Adding a screenID won't be an issue UNLESS you need to have the situation where, at a given cinema, there is the same movie playing on both screens, AND starting at the same time. In that case, the primary key of(CinemaID, MovieID, Showtime)
does not permit you to enter both ScreenIDs, because only one record for each combination will exist. However, that said, and given that you say you don't need to sort by it, you CAN addScreenID
as the 4th column of the primary key. In that case, you can store both identical showings, and still differentiate by screen. So, either way you are fine.
– Willem Renzema
Feb 6 at 20:06
add a comment |
Thanks for your answer and the thorough explanation! Currently the theather chain has only one screen in each theater. But would it, in case they got a second, be a problem to add another column for that without an index, if we don't need to search for/sort by the screen ID (only retreive it)?
– Visual Vincent
Feb 6 at 18:57
1
Adding a screenID won't be an issue UNLESS you need to have the situation where, at a given cinema, there is the same movie playing on both screens, AND starting at the same time. In that case, the primary key of(CinemaID, MovieID, Showtime)
does not permit you to enter both ScreenIDs, because only one record for each combination will exist. However, that said, and given that you say you don't need to sort by it, you CAN addScreenID
as the 4th column of the primary key. In that case, you can store both identical showings, and still differentiate by screen. So, either way you are fine.
– Willem Renzema
Feb 6 at 20:06
Thanks for your answer and the thorough explanation! Currently the theather chain has only one screen in each theater. But would it, in case they got a second, be a problem to add another column for that without an index, if we don't need to search for/sort by the screen ID (only retreive it)?
– Visual Vincent
Feb 6 at 18:57
Thanks for your answer and the thorough explanation! Currently the theather chain has only one screen in each theater. But would it, in case they got a second, be a problem to add another column for that without an index, if we don't need to search for/sort by the screen ID (only retreive it)?
– Visual Vincent
Feb 6 at 18:57
1
1
Adding a screenID won't be an issue UNLESS you need to have the situation where, at a given cinema, there is the same movie playing on both screens, AND starting at the same time. In that case, the primary key of
(CinemaID, MovieID, Showtime)
does not permit you to enter both ScreenIDs, because only one record for each combination will exist. However, that said, and given that you say you don't need to sort by it, you CAN add ScreenID
as the 4th column of the primary key. In that case, you can store both identical showings, and still differentiate by screen. So, either way you are fine.– Willem Renzema
Feb 6 at 20:06
Adding a screenID won't be an issue UNLESS you need to have the situation where, at a given cinema, there is the same movie playing on both screens, AND starting at the same time. In that case, the primary key of
(CinemaID, MovieID, Showtime)
does not permit you to enter both ScreenIDs, because only one record for each combination will exist. However, that said, and given that you say you don't need to sort by it, you CAN add ScreenID
as the 4th column of the primary key. In that case, you can store both identical showings, and still differentiate by screen. So, either way you are fine.– Willem Renzema
Feb 6 at 20:06
add a comment |
WHERE CinemaID = ? ORDER BY Showtime -- and
WHERE CinemaID = ? AND (Showtime BETWEEN [date 12:00 AM] AND [date 12:00 PM]) -- need:
INDEX(CinemaID, Showtime)
WHERE CinemaID = ? AND MovieID = ? ORDER BY Showtime -- needs:
INDEX(CinemaID, MovieId, Showtime) -- or
INDEX(MovieId, CinemaID, Showtime)
Assuming that the the triple (MovieId, CinemaID, Showtime) is unique, I suggest getting rid of id
and having
PRIMARY KEY(CinemaID, MovieId, Showtime)
INDEX(CinemaID, Showtime)
Are there cases where the main part of the WHERE
is MovieID=...
?
Cinemas - A list of each city and its cinema (ID and name):
SELECT Cinema, CinemaID FROM Cimemas; -- (no index needed)
Movies - A list of movies that has been/will be shown at the cinema.
SELECT DISTINCT MovieID FROM ShowTimes WHERE CinemaID=...
INDEX(CinemaID, MovieID) -- already handled by my proposed PK
Showtimes - A list of all showtimes for all movies in all cities. -- This is a huge output; rethink the requirement for it. That is, think about what the Client will do with it.
Most of these indexes could be deduced from studying indexing cookbook and composite indexing
Note how I did not suggest indexes (single-column vs composite, plus specifics) until I saw the SELECTs
.
"columns to the index by order of selectivity, making the most selective (I'm guessing that means the most unique/with the most cardinality?)" -- No. Selectivity is not the key for designing a composite index. Start with all the =
columns, in any order. (My Cookbook dwells on this topic.)
"index can only be used by the table if the first column is included in the search query" -- Mostly true. Notice how I recommended 2 indexes (remember: the PK is an index). There are cases where an index can be used for GROUP BY
or ORDER BY
, ignoring the WHERE
; but those are rare.
"first in the composite index ... Showtime" -- It is usually counterproductive to put a DATETIME
first in a composite index. In particular, your 3rd query can use both columns of INDEX(CinemaID, Showtime)
, but not both columns of INDEX(Showtime, CinemaID)
. It is easy to see this: Think of writing out two lists of cinemas and showtimes. Have one list sorted first on cinema (a la INDEX(cinema, time)
; have the other sorted on time. Think about which one would have all the rows for a particular cinema over a time range clumped ("clustered") together.
If there is also a ScreenID
due to two screens showing the same movie at the same time, tack it onto the PK. However, all the indexes will need rethinking.
(Sorry, Willem, I wrote my Answer before reading yours -- We say almost the same stuff.)
Sorry for the confusion. The first three bullet points are just a description of what tables I have and what they contain (i.e. "A list of all showtimes for all movies in all cities" means the table contains all the showtimes across the entire website, but they're not output all at once - they are what's queried by the next three bullet points located under "On the site the user must be able to").
– Visual Vincent
Feb 6 at 19:29
Edited question for (hopefully) better clarity. Now, I have a couple questions regarding your answer: 1) "Are there cases where the main part of the WHERE is MovieID=...?" - Do you mean when it's only checkingMovieID
or when it's checkingMovieID
first? (if the former: Not at the moment || if the latter: Also not at the moment, how exactly does it make things different?)
– Visual Vincent
Feb 6 at 19:45
1
@VisualVincent - The order of columns in an index is important. If, for example,ScreenID
is stuck in the middle of the index, but not used in theWHERE
, the rest of that index is useless for thatWHERE
.
– Rick James
Feb 6 at 19:57
1
@VisualVincent - A messy query: "Show all Tuesday matinee showings of G-rated movies." The likely technique is to index some of the stuff, then work harder to filter the rest of criteria.
– Rick James
Feb 6 at 20:06
1
@VisualVincent - Yes,ScreenID
last.
– Rick James
Feb 6 at 20:07
|
show 6 more comments
WHERE CinemaID = ? ORDER BY Showtime -- and
WHERE CinemaID = ? AND (Showtime BETWEEN [date 12:00 AM] AND [date 12:00 PM]) -- need:
INDEX(CinemaID, Showtime)
WHERE CinemaID = ? AND MovieID = ? ORDER BY Showtime -- needs:
INDEX(CinemaID, MovieId, Showtime) -- or
INDEX(MovieId, CinemaID, Showtime)
Assuming that the the triple (MovieId, CinemaID, Showtime) is unique, I suggest getting rid of id
and having
PRIMARY KEY(CinemaID, MovieId, Showtime)
INDEX(CinemaID, Showtime)
Are there cases where the main part of the WHERE
is MovieID=...
?
Cinemas - A list of each city and its cinema (ID and name):
SELECT Cinema, CinemaID FROM Cimemas; -- (no index needed)
Movies - A list of movies that has been/will be shown at the cinema.
SELECT DISTINCT MovieID FROM ShowTimes WHERE CinemaID=...
INDEX(CinemaID, MovieID) -- already handled by my proposed PK
Showtimes - A list of all showtimes for all movies in all cities. -- This is a huge output; rethink the requirement for it. That is, think about what the Client will do with it.
Most of these indexes could be deduced from studying indexing cookbook and composite indexing
Note how I did not suggest indexes (single-column vs composite, plus specifics) until I saw the SELECTs
.
"columns to the index by order of selectivity, making the most selective (I'm guessing that means the most unique/with the most cardinality?)" -- No. Selectivity is not the key for designing a composite index. Start with all the =
columns, in any order. (My Cookbook dwells on this topic.)
"index can only be used by the table if the first column is included in the search query" -- Mostly true. Notice how I recommended 2 indexes (remember: the PK is an index). There are cases where an index can be used for GROUP BY
or ORDER BY
, ignoring the WHERE
; but those are rare.
"first in the composite index ... Showtime" -- It is usually counterproductive to put a DATETIME
first in a composite index. In particular, your 3rd query can use both columns of INDEX(CinemaID, Showtime)
, but not both columns of INDEX(Showtime, CinemaID)
. It is easy to see this: Think of writing out two lists of cinemas and showtimes. Have one list sorted first on cinema (a la INDEX(cinema, time)
; have the other sorted on time. Think about which one would have all the rows for a particular cinema over a time range clumped ("clustered") together.
If there is also a ScreenID
due to two screens showing the same movie at the same time, tack it onto the PK. However, all the indexes will need rethinking.
(Sorry, Willem, I wrote my Answer before reading yours -- We say almost the same stuff.)
Sorry for the confusion. The first three bullet points are just a description of what tables I have and what they contain (i.e. "A list of all showtimes for all movies in all cities" means the table contains all the showtimes across the entire website, but they're not output all at once - they are what's queried by the next three bullet points located under "On the site the user must be able to").
– Visual Vincent
Feb 6 at 19:29
Edited question for (hopefully) better clarity. Now, I have a couple questions regarding your answer: 1) "Are there cases where the main part of the WHERE is MovieID=...?" - Do you mean when it's only checkingMovieID
or when it's checkingMovieID
first? (if the former: Not at the moment || if the latter: Also not at the moment, how exactly does it make things different?)
– Visual Vincent
Feb 6 at 19:45
1
@VisualVincent - The order of columns in an index is important. If, for example,ScreenID
is stuck in the middle of the index, but not used in theWHERE
, the rest of that index is useless for thatWHERE
.
– Rick James
Feb 6 at 19:57
1
@VisualVincent - A messy query: "Show all Tuesday matinee showings of G-rated movies." The likely technique is to index some of the stuff, then work harder to filter the rest of criteria.
– Rick James
Feb 6 at 20:06
1
@VisualVincent - Yes,ScreenID
last.
– Rick James
Feb 6 at 20:07
|
show 6 more comments
WHERE CinemaID = ? ORDER BY Showtime -- and
WHERE CinemaID = ? AND (Showtime BETWEEN [date 12:00 AM] AND [date 12:00 PM]) -- need:
INDEX(CinemaID, Showtime)
WHERE CinemaID = ? AND MovieID = ? ORDER BY Showtime -- needs:
INDEX(CinemaID, MovieId, Showtime) -- or
INDEX(MovieId, CinemaID, Showtime)
Assuming that the the triple (MovieId, CinemaID, Showtime) is unique, I suggest getting rid of id
and having
PRIMARY KEY(CinemaID, MovieId, Showtime)
INDEX(CinemaID, Showtime)
Are there cases where the main part of the WHERE
is MovieID=...
?
Cinemas - A list of each city and its cinema (ID and name):
SELECT Cinema, CinemaID FROM Cimemas; -- (no index needed)
Movies - A list of movies that has been/will be shown at the cinema.
SELECT DISTINCT MovieID FROM ShowTimes WHERE CinemaID=...
INDEX(CinemaID, MovieID) -- already handled by my proposed PK
Showtimes - A list of all showtimes for all movies in all cities. -- This is a huge output; rethink the requirement for it. That is, think about what the Client will do with it.
Most of these indexes could be deduced from studying indexing cookbook and composite indexing
Note how I did not suggest indexes (single-column vs composite, plus specifics) until I saw the SELECTs
.
"columns to the index by order of selectivity, making the most selective (I'm guessing that means the most unique/with the most cardinality?)" -- No. Selectivity is not the key for designing a composite index. Start with all the =
columns, in any order. (My Cookbook dwells on this topic.)
"index can only be used by the table if the first column is included in the search query" -- Mostly true. Notice how I recommended 2 indexes (remember: the PK is an index). There are cases where an index can be used for GROUP BY
or ORDER BY
, ignoring the WHERE
; but those are rare.
"first in the composite index ... Showtime" -- It is usually counterproductive to put a DATETIME
first in a composite index. In particular, your 3rd query can use both columns of INDEX(CinemaID, Showtime)
, but not both columns of INDEX(Showtime, CinemaID)
. It is easy to see this: Think of writing out two lists of cinemas and showtimes. Have one list sorted first on cinema (a la INDEX(cinema, time)
; have the other sorted on time. Think about which one would have all the rows for a particular cinema over a time range clumped ("clustered") together.
If there is also a ScreenID
due to two screens showing the same movie at the same time, tack it onto the PK. However, all the indexes will need rethinking.
(Sorry, Willem, I wrote my Answer before reading yours -- We say almost the same stuff.)
WHERE CinemaID = ? ORDER BY Showtime -- and
WHERE CinemaID = ? AND (Showtime BETWEEN [date 12:00 AM] AND [date 12:00 PM]) -- need:
INDEX(CinemaID, Showtime)
WHERE CinemaID = ? AND MovieID = ? ORDER BY Showtime -- needs:
INDEX(CinemaID, MovieId, Showtime) -- or
INDEX(MovieId, CinemaID, Showtime)
Assuming that the the triple (MovieId, CinemaID, Showtime) is unique, I suggest getting rid of id
and having
PRIMARY KEY(CinemaID, MovieId, Showtime)
INDEX(CinemaID, Showtime)
Are there cases where the main part of the WHERE
is MovieID=...
?
Cinemas - A list of each city and its cinema (ID and name):
SELECT Cinema, CinemaID FROM Cimemas; -- (no index needed)
Movies - A list of movies that has been/will be shown at the cinema.
SELECT DISTINCT MovieID FROM ShowTimes WHERE CinemaID=...
INDEX(CinemaID, MovieID) -- already handled by my proposed PK
Showtimes - A list of all showtimes for all movies in all cities. -- This is a huge output; rethink the requirement for it. That is, think about what the Client will do with it.
Most of these indexes could be deduced from studying indexing cookbook and composite indexing
Note how I did not suggest indexes (single-column vs composite, plus specifics) until I saw the SELECTs
.
"columns to the index by order of selectivity, making the most selective (I'm guessing that means the most unique/with the most cardinality?)" -- No. Selectivity is not the key for designing a composite index. Start with all the =
columns, in any order. (My Cookbook dwells on this topic.)
"index can only be used by the table if the first column is included in the search query" -- Mostly true. Notice how I recommended 2 indexes (remember: the PK is an index). There are cases where an index can be used for GROUP BY
or ORDER BY
, ignoring the WHERE
; but those are rare.
"first in the composite index ... Showtime" -- It is usually counterproductive to put a DATETIME
first in a composite index. In particular, your 3rd query can use both columns of INDEX(CinemaID, Showtime)
, but not both columns of INDEX(Showtime, CinemaID)
. It is easy to see this: Think of writing out two lists of cinemas and showtimes. Have one list sorted first on cinema (a la INDEX(cinema, time)
; have the other sorted on time. Think about which one would have all the rows for a particular cinema over a time range clumped ("clustered") together.
If there is also a ScreenID
due to two screens showing the same movie at the same time, tack it onto the PK. However, all the indexes will need rethinking.
(Sorry, Willem, I wrote my Answer before reading yours -- We say almost the same stuff.)
edited Feb 6 at 20:10
answered Feb 6 at 19:03
Rick JamesRick James
43.7k22259
43.7k22259
Sorry for the confusion. The first three bullet points are just a description of what tables I have and what they contain (i.e. "A list of all showtimes for all movies in all cities" means the table contains all the showtimes across the entire website, but they're not output all at once - they are what's queried by the next three bullet points located under "On the site the user must be able to").
– Visual Vincent
Feb 6 at 19:29
Edited question for (hopefully) better clarity. Now, I have a couple questions regarding your answer: 1) "Are there cases where the main part of the WHERE is MovieID=...?" - Do you mean when it's only checkingMovieID
or when it's checkingMovieID
first? (if the former: Not at the moment || if the latter: Also not at the moment, how exactly does it make things different?)
– Visual Vincent
Feb 6 at 19:45
1
@VisualVincent - The order of columns in an index is important. If, for example,ScreenID
is stuck in the middle of the index, but not used in theWHERE
, the rest of that index is useless for thatWHERE
.
– Rick James
Feb 6 at 19:57
1
@VisualVincent - A messy query: "Show all Tuesday matinee showings of G-rated movies." The likely technique is to index some of the stuff, then work harder to filter the rest of criteria.
– Rick James
Feb 6 at 20:06
1
@VisualVincent - Yes,ScreenID
last.
– Rick James
Feb 6 at 20:07
|
show 6 more comments
Sorry for the confusion. The first three bullet points are just a description of what tables I have and what they contain (i.e. "A list of all showtimes for all movies in all cities" means the table contains all the showtimes across the entire website, but they're not output all at once - they are what's queried by the next three bullet points located under "On the site the user must be able to").
– Visual Vincent
Feb 6 at 19:29
Edited question for (hopefully) better clarity. Now, I have a couple questions regarding your answer: 1) "Are there cases where the main part of the WHERE is MovieID=...?" - Do you mean when it's only checkingMovieID
or when it's checkingMovieID
first? (if the former: Not at the moment || if the latter: Also not at the moment, how exactly does it make things different?)
– Visual Vincent
Feb 6 at 19:45
1
@VisualVincent - The order of columns in an index is important. If, for example,ScreenID
is stuck in the middle of the index, but not used in theWHERE
, the rest of that index is useless for thatWHERE
.
– Rick James
Feb 6 at 19:57
1
@VisualVincent - A messy query: "Show all Tuesday matinee showings of G-rated movies." The likely technique is to index some of the stuff, then work harder to filter the rest of criteria.
– Rick James
Feb 6 at 20:06
1
@VisualVincent - Yes,ScreenID
last.
– Rick James
Feb 6 at 20:07
Sorry for the confusion. The first three bullet points are just a description of what tables I have and what they contain (i.e. "A list of all showtimes for all movies in all cities" means the table contains all the showtimes across the entire website, but they're not output all at once - they are what's queried by the next three bullet points located under "On the site the user must be able to").
– Visual Vincent
Feb 6 at 19:29
Sorry for the confusion. The first three bullet points are just a description of what tables I have and what they contain (i.e. "A list of all showtimes for all movies in all cities" means the table contains all the showtimes across the entire website, but they're not output all at once - they are what's queried by the next three bullet points located under "On the site the user must be able to").
– Visual Vincent
Feb 6 at 19:29
Edited question for (hopefully) better clarity. Now, I have a couple questions regarding your answer: 1) "Are there cases where the main part of the WHERE is MovieID=...?" - Do you mean when it's only checking
MovieID
or when it's checking MovieID
first? (if the former: Not at the moment || if the latter: Also not at the moment, how exactly does it make things different?)– Visual Vincent
Feb 6 at 19:45
Edited question for (hopefully) better clarity. Now, I have a couple questions regarding your answer: 1) "Are there cases where the main part of the WHERE is MovieID=...?" - Do you mean when it's only checking
MovieID
or when it's checking MovieID
first? (if the former: Not at the moment || if the latter: Also not at the moment, how exactly does it make things different?)– Visual Vincent
Feb 6 at 19:45
1
1
@VisualVincent - The order of columns in an index is important. If, for example,
ScreenID
is stuck in the middle of the index, but not used in the WHERE
, the rest of that index is useless for that WHERE
.– Rick James
Feb 6 at 19:57
@VisualVincent - The order of columns in an index is important. If, for example,
ScreenID
is stuck in the middle of the index, but not used in the WHERE
, the rest of that index is useless for that WHERE
.– Rick James
Feb 6 at 19:57
1
1
@VisualVincent - A messy query: "Show all Tuesday matinee showings of G-rated movies." The likely technique is to index some of the stuff, then work harder to filter the rest of criteria.
– Rick James
Feb 6 at 20:06
@VisualVincent - A messy query: "Show all Tuesday matinee showings of G-rated movies." The likely technique is to index some of the stuff, then work harder to filter the rest of criteria.
– Rick James
Feb 6 at 20:06
1
1
@VisualVincent - Yes,
ScreenID
last.– Rick James
Feb 6 at 20:07
@VisualVincent - Yes,
ScreenID
last.– Rick James
Feb 6 at 20:07
|
show 6 more comments
You are optimizing for size*, on an extremely small table. Since this basically a bridge table, it doesn’t really need a primary key, and even if it did, it wouldn’t need to be a bigint, the movies id doesn’t need to be a bigint either. If you think you need more than a couple of billion movies, then used unsigned but I think it’s unlikely that you’re going to run into that limitation for number of movies anytime soon.
Have an index on each of your primary terms, movie, cinema, showtime. See what your performance is, and if it’s unacceptable add additional index combinations.
AFTER you give it a try with no indexes at all (except possibly a pk on an auto_increment column used as a surrogate key) and a reasonable amount of data, say 200 cinemas X (5 new movies a week x 4 show times daily x 7 x (520 weeks aka 10 years)) = 14,560,000 records and see how it performs.
Frankly at this size, a full table scan shouldn’t be tooo bad.
Size wise a totally covering index is 13 bytes, rounds up to 16, so 16 x 14,560,000 / 1024 is 227,500k aka 227mb aka .277. Times 4 for the table and each of the possible covering index and we are up to almost 1.25 gig for the table and indexes. Note that any combination indexes would be the same size, just have a different weight to the data...
Note that the way to really speed this up, is to put your historical data in another table or use partitioning. In which case if you have the data for the upcoming month, it’s 5 x 4 x 7 x 30 x 200 = 120,000 records, and scanning 120k records should not be a problem. At that point you might just keep it all in memory and use a local dictionary to look things up.
Also note that I am assuming that you expand like crazy, as you currently have only one screen, so probably only 4-6 records a day per cinema, and a low number of cinemas, so a more realistic daily count is probably between 1k and 10k records per month. Totally possible you’d get decent enough performance out of this if you stored the values in an excel spreadsheet which was manually updated daily.
PS for index selectivity, you don’t want the most precise, you want the most coverage. You aren’t looking for a single record, you are looking for a group, which means you want the index to be on the column that returns your group. In particular, this means that if you want to look for movies by day, you should split the date and time into two columns so you can put an index on the date (and possibly one on the time).
*The reason I said you were optimizing for space is because you seem to be worried about the cost of the index. The cost of an index comes in two forms, a slower insert/delete/update and disk space. For a full covering indexes that basically means each index cost just as much as the table does. The time to add a single row to a table with 16 bytes (or 24) is small and non-accumulating (ie basically constant regardless of the number of rows). While disk space accumulates slowly, it does grow.
Thank you for your answer! I'm not optimizing for size (typo?), I'm optimizing for speed. When designing a system I generally have 2-3 things in mind: speed, sustainability and somewhat scalability. Even though I might not need an index right now, I want to for all of those three reasons; I want it to be durable in the future. For the same reasons I use large data types. I know an unsigned INT is more than enough right now, but remember when they said 4 billion IP addresses would be enough?
– Visual Vincent
Feb 7 at 10:24
I realize I'm being rather pessimistic about this, but I highly value speed and sustainability. And compared to processing power, disk space is relatively cheap so I'm okay with a couple of extra GBs for the indexes. -- Having that said, I will of course try out and measure your suggestion, and compare it against the others' to see which is the fastest.
– Visual Vincent
Feb 7 at 10:28
@VisualVincent: I’ve added to the answer explaining why you seem to be optimizing for size. As for data type size. Not really, 4 billion was (and is) enough for a totally different system. A private network, when it went public it was obviously not enough. Really the fact that it lasted so long, shows how over-engineered it was (also note how forward thinking they were, VOIP was considered). 2 billion movies obviously isn’t enough for YouTube, it is for any possible list of movies shown in a theater. Your cinema ID should really be an int, that is an obvious place to allow growth.
– jmoreno
Feb 7 at 11:44
I just compared the indexing described in the first two answers with your suggestion of doing a full table scan (both with no index whatsoever and with a single, separate primary key). With indexes it took approximately 1,5 ms to find a single record and 15-17 ms to find every record for a specific cinema (~3400 records returned), whereas when doing a full table scan it took approximately 200 ms to locate a single record (with a separate primary key).
– Visual Vincent
Feb 8 at 11:51
And I tested with only 398k records in the table, meaning if I would've had 14 million it could have taken up to 7 seconds to find a single record.
– Visual Vincent
Feb 8 at 12:00
|
show 5 more comments
You are optimizing for size*, on an extremely small table. Since this basically a bridge table, it doesn’t really need a primary key, and even if it did, it wouldn’t need to be a bigint, the movies id doesn’t need to be a bigint either. If you think you need more than a couple of billion movies, then used unsigned but I think it’s unlikely that you’re going to run into that limitation for number of movies anytime soon.
Have an index on each of your primary terms, movie, cinema, showtime. See what your performance is, and if it’s unacceptable add additional index combinations.
AFTER you give it a try with no indexes at all (except possibly a pk on an auto_increment column used as a surrogate key) and a reasonable amount of data, say 200 cinemas X (5 new movies a week x 4 show times daily x 7 x (520 weeks aka 10 years)) = 14,560,000 records and see how it performs.
Frankly at this size, a full table scan shouldn’t be tooo bad.
Size wise a totally covering index is 13 bytes, rounds up to 16, so 16 x 14,560,000 / 1024 is 227,500k aka 227mb aka .277. Times 4 for the table and each of the possible covering index and we are up to almost 1.25 gig for the table and indexes. Note that any combination indexes would be the same size, just have a different weight to the data...
Note that the way to really speed this up, is to put your historical data in another table or use partitioning. In which case if you have the data for the upcoming month, it’s 5 x 4 x 7 x 30 x 200 = 120,000 records, and scanning 120k records should not be a problem. At that point you might just keep it all in memory and use a local dictionary to look things up.
Also note that I am assuming that you expand like crazy, as you currently have only one screen, so probably only 4-6 records a day per cinema, and a low number of cinemas, so a more realistic daily count is probably between 1k and 10k records per month. Totally possible you’d get decent enough performance out of this if you stored the values in an excel spreadsheet which was manually updated daily.
PS for index selectivity, you don’t want the most precise, you want the most coverage. You aren’t looking for a single record, you are looking for a group, which means you want the index to be on the column that returns your group. In particular, this means that if you want to look for movies by day, you should split the date and time into two columns so you can put an index on the date (and possibly one on the time).
*The reason I said you were optimizing for space is because you seem to be worried about the cost of the index. The cost of an index comes in two forms, a slower insert/delete/update and disk space. For a full covering indexes that basically means each index cost just as much as the table does. The time to add a single row to a table with 16 bytes (or 24) is small and non-accumulating (ie basically constant regardless of the number of rows). While disk space accumulates slowly, it does grow.
Thank you for your answer! I'm not optimizing for size (typo?), I'm optimizing for speed. When designing a system I generally have 2-3 things in mind: speed, sustainability and somewhat scalability. Even though I might not need an index right now, I want to for all of those three reasons; I want it to be durable in the future. For the same reasons I use large data types. I know an unsigned INT is more than enough right now, but remember when they said 4 billion IP addresses would be enough?
– Visual Vincent
Feb 7 at 10:24
I realize I'm being rather pessimistic about this, but I highly value speed and sustainability. And compared to processing power, disk space is relatively cheap so I'm okay with a couple of extra GBs for the indexes. -- Having that said, I will of course try out and measure your suggestion, and compare it against the others' to see which is the fastest.
– Visual Vincent
Feb 7 at 10:28
@VisualVincent: I’ve added to the answer explaining why you seem to be optimizing for size. As for data type size. Not really, 4 billion was (and is) enough for a totally different system. A private network, when it went public it was obviously not enough. Really the fact that it lasted so long, shows how over-engineered it was (also note how forward thinking they were, VOIP was considered). 2 billion movies obviously isn’t enough for YouTube, it is for any possible list of movies shown in a theater. Your cinema ID should really be an int, that is an obvious place to allow growth.
– jmoreno
Feb 7 at 11:44
I just compared the indexing described in the first two answers with your suggestion of doing a full table scan (both with no index whatsoever and with a single, separate primary key). With indexes it took approximately 1,5 ms to find a single record and 15-17 ms to find every record for a specific cinema (~3400 records returned), whereas when doing a full table scan it took approximately 200 ms to locate a single record (with a separate primary key).
– Visual Vincent
Feb 8 at 11:51
And I tested with only 398k records in the table, meaning if I would've had 14 million it could have taken up to 7 seconds to find a single record.
– Visual Vincent
Feb 8 at 12:00
|
show 5 more comments
You are optimizing for size*, on an extremely small table. Since this basically a bridge table, it doesn’t really need a primary key, and even if it did, it wouldn’t need to be a bigint, the movies id doesn’t need to be a bigint either. If you think you need more than a couple of billion movies, then used unsigned but I think it’s unlikely that you’re going to run into that limitation for number of movies anytime soon.
Have an index on each of your primary terms, movie, cinema, showtime. See what your performance is, and if it’s unacceptable add additional index combinations.
AFTER you give it a try with no indexes at all (except possibly a pk on an auto_increment column used as a surrogate key) and a reasonable amount of data, say 200 cinemas X (5 new movies a week x 4 show times daily x 7 x (520 weeks aka 10 years)) = 14,560,000 records and see how it performs.
Frankly at this size, a full table scan shouldn’t be tooo bad.
Size wise a totally covering index is 13 bytes, rounds up to 16, so 16 x 14,560,000 / 1024 is 227,500k aka 227mb aka .277. Times 4 for the table and each of the possible covering index and we are up to almost 1.25 gig for the table and indexes. Note that any combination indexes would be the same size, just have a different weight to the data...
Note that the way to really speed this up, is to put your historical data in another table or use partitioning. In which case if you have the data for the upcoming month, it’s 5 x 4 x 7 x 30 x 200 = 120,000 records, and scanning 120k records should not be a problem. At that point you might just keep it all in memory and use a local dictionary to look things up.
Also note that I am assuming that you expand like crazy, as you currently have only one screen, so probably only 4-6 records a day per cinema, and a low number of cinemas, so a more realistic daily count is probably between 1k and 10k records per month. Totally possible you’d get decent enough performance out of this if you stored the values in an excel spreadsheet which was manually updated daily.
PS for index selectivity, you don’t want the most precise, you want the most coverage. You aren’t looking for a single record, you are looking for a group, which means you want the index to be on the column that returns your group. In particular, this means that if you want to look for movies by day, you should split the date and time into two columns so you can put an index on the date (and possibly one on the time).
*The reason I said you were optimizing for space is because you seem to be worried about the cost of the index. The cost of an index comes in two forms, a slower insert/delete/update and disk space. For a full covering indexes that basically means each index cost just as much as the table does. The time to add a single row to a table with 16 bytes (or 24) is small and non-accumulating (ie basically constant regardless of the number of rows). While disk space accumulates slowly, it does grow.
You are optimizing for size*, on an extremely small table. Since this basically a bridge table, it doesn’t really need a primary key, and even if it did, it wouldn’t need to be a bigint, the movies id doesn’t need to be a bigint either. If you think you need more than a couple of billion movies, then used unsigned but I think it’s unlikely that you’re going to run into that limitation for number of movies anytime soon.
Have an index on each of your primary terms, movie, cinema, showtime. See what your performance is, and if it’s unacceptable add additional index combinations.
AFTER you give it a try with no indexes at all (except possibly a pk on an auto_increment column used as a surrogate key) and a reasonable amount of data, say 200 cinemas X (5 new movies a week x 4 show times daily x 7 x (520 weeks aka 10 years)) = 14,560,000 records and see how it performs.
Frankly at this size, a full table scan shouldn’t be tooo bad.
Size wise a totally covering index is 13 bytes, rounds up to 16, so 16 x 14,560,000 / 1024 is 227,500k aka 227mb aka .277. Times 4 for the table and each of the possible covering index and we are up to almost 1.25 gig for the table and indexes. Note that any combination indexes would be the same size, just have a different weight to the data...
Note that the way to really speed this up, is to put your historical data in another table or use partitioning. In which case if you have the data for the upcoming month, it’s 5 x 4 x 7 x 30 x 200 = 120,000 records, and scanning 120k records should not be a problem. At that point you might just keep it all in memory and use a local dictionary to look things up.
Also note that I am assuming that you expand like crazy, as you currently have only one screen, so probably only 4-6 records a day per cinema, and a low number of cinemas, so a more realistic daily count is probably between 1k and 10k records per month. Totally possible you’d get decent enough performance out of this if you stored the values in an excel spreadsheet which was manually updated daily.
PS for index selectivity, you don’t want the most precise, you want the most coverage. You aren’t looking for a single record, you are looking for a group, which means you want the index to be on the column that returns your group. In particular, this means that if you want to look for movies by day, you should split the date and time into two columns so you can put an index on the date (and possibly one on the time).
*The reason I said you were optimizing for space is because you seem to be worried about the cost of the index. The cost of an index comes in two forms, a slower insert/delete/update and disk space. For a full covering indexes that basically means each index cost just as much as the table does. The time to add a single row to a table with 16 bytes (or 24) is small and non-accumulating (ie basically constant regardless of the number of rows). While disk space accumulates slowly, it does grow.
edited Feb 7 at 11:21
answered Feb 7 at 6:21
jmorenojmoreno
644516
644516
Thank you for your answer! I'm not optimizing for size (typo?), I'm optimizing for speed. When designing a system I generally have 2-3 things in mind: speed, sustainability and somewhat scalability. Even though I might not need an index right now, I want to for all of those three reasons; I want it to be durable in the future. For the same reasons I use large data types. I know an unsigned INT is more than enough right now, but remember when they said 4 billion IP addresses would be enough?
– Visual Vincent
Feb 7 at 10:24
I realize I'm being rather pessimistic about this, but I highly value speed and sustainability. And compared to processing power, disk space is relatively cheap so I'm okay with a couple of extra GBs for the indexes. -- Having that said, I will of course try out and measure your suggestion, and compare it against the others' to see which is the fastest.
– Visual Vincent
Feb 7 at 10:28
@VisualVincent: I’ve added to the answer explaining why you seem to be optimizing for size. As for data type size. Not really, 4 billion was (and is) enough for a totally different system. A private network, when it went public it was obviously not enough. Really the fact that it lasted so long, shows how over-engineered it was (also note how forward thinking they were, VOIP was considered). 2 billion movies obviously isn’t enough for YouTube, it is for any possible list of movies shown in a theater. Your cinema ID should really be an int, that is an obvious place to allow growth.
– jmoreno
Feb 7 at 11:44
I just compared the indexing described in the first two answers with your suggestion of doing a full table scan (both with no index whatsoever and with a single, separate primary key). With indexes it took approximately 1,5 ms to find a single record and 15-17 ms to find every record for a specific cinema (~3400 records returned), whereas when doing a full table scan it took approximately 200 ms to locate a single record (with a separate primary key).
– Visual Vincent
Feb 8 at 11:51
And I tested with only 398k records in the table, meaning if I would've had 14 million it could have taken up to 7 seconds to find a single record.
– Visual Vincent
Feb 8 at 12:00
|
show 5 more comments
Thank you for your answer! I'm not optimizing for size (typo?), I'm optimizing for speed. When designing a system I generally have 2-3 things in mind: speed, sustainability and somewhat scalability. Even though I might not need an index right now, I want to for all of those three reasons; I want it to be durable in the future. For the same reasons I use large data types. I know an unsigned INT is more than enough right now, but remember when they said 4 billion IP addresses would be enough?
– Visual Vincent
Feb 7 at 10:24
I realize I'm being rather pessimistic about this, but I highly value speed and sustainability. And compared to processing power, disk space is relatively cheap so I'm okay with a couple of extra GBs for the indexes. -- Having that said, I will of course try out and measure your suggestion, and compare it against the others' to see which is the fastest.
– Visual Vincent
Feb 7 at 10:28
@VisualVincent: I’ve added to the answer explaining why you seem to be optimizing for size. As for data type size. Not really, 4 billion was (and is) enough for a totally different system. A private network, when it went public it was obviously not enough. Really the fact that it lasted so long, shows how over-engineered it was (also note how forward thinking they were, VOIP was considered). 2 billion movies obviously isn’t enough for YouTube, it is for any possible list of movies shown in a theater. Your cinema ID should really be an int, that is an obvious place to allow growth.
– jmoreno
Feb 7 at 11:44
I just compared the indexing described in the first two answers with your suggestion of doing a full table scan (both with no index whatsoever and with a single, separate primary key). With indexes it took approximately 1,5 ms to find a single record and 15-17 ms to find every record for a specific cinema (~3400 records returned), whereas when doing a full table scan it took approximately 200 ms to locate a single record (with a separate primary key).
– Visual Vincent
Feb 8 at 11:51
And I tested with only 398k records in the table, meaning if I would've had 14 million it could have taken up to 7 seconds to find a single record.
– Visual Vincent
Feb 8 at 12:00
Thank you for your answer! I'm not optimizing for size (typo?), I'm optimizing for speed. When designing a system I generally have 2-3 things in mind: speed, sustainability and somewhat scalability. Even though I might not need an index right now, I want to for all of those three reasons; I want it to be durable in the future. For the same reasons I use large data types. I know an unsigned INT is more than enough right now, but remember when they said 4 billion IP addresses would be enough?
– Visual Vincent
Feb 7 at 10:24
Thank you for your answer! I'm not optimizing for size (typo?), I'm optimizing for speed. When designing a system I generally have 2-3 things in mind: speed, sustainability and somewhat scalability. Even though I might not need an index right now, I want to for all of those three reasons; I want it to be durable in the future. For the same reasons I use large data types. I know an unsigned INT is more than enough right now, but remember when they said 4 billion IP addresses would be enough?
– Visual Vincent
Feb 7 at 10:24
I realize I'm being rather pessimistic about this, but I highly value speed and sustainability. And compared to processing power, disk space is relatively cheap so I'm okay with a couple of extra GBs for the indexes. -- Having that said, I will of course try out and measure your suggestion, and compare it against the others' to see which is the fastest.
– Visual Vincent
Feb 7 at 10:28
I realize I'm being rather pessimistic about this, but I highly value speed and sustainability. And compared to processing power, disk space is relatively cheap so I'm okay with a couple of extra GBs for the indexes. -- Having that said, I will of course try out and measure your suggestion, and compare it against the others' to see which is the fastest.
– Visual Vincent
Feb 7 at 10:28
@VisualVincent: I’ve added to the answer explaining why you seem to be optimizing for size. As for data type size. Not really, 4 billion was (and is) enough for a totally different system. A private network, when it went public it was obviously not enough. Really the fact that it lasted so long, shows how over-engineered it was (also note how forward thinking they were, VOIP was considered). 2 billion movies obviously isn’t enough for YouTube, it is for any possible list of movies shown in a theater. Your cinema ID should really be an int, that is an obvious place to allow growth.
– jmoreno
Feb 7 at 11:44
@VisualVincent: I’ve added to the answer explaining why you seem to be optimizing for size. As for data type size. Not really, 4 billion was (and is) enough for a totally different system. A private network, when it went public it was obviously not enough. Really the fact that it lasted so long, shows how over-engineered it was (also note how forward thinking they were, VOIP was considered). 2 billion movies obviously isn’t enough for YouTube, it is for any possible list of movies shown in a theater. Your cinema ID should really be an int, that is an obvious place to allow growth.
– jmoreno
Feb 7 at 11:44
I just compared the indexing described in the first two answers with your suggestion of doing a full table scan (both with no index whatsoever and with a single, separate primary key). With indexes it took approximately 1,5 ms to find a single record and 15-17 ms to find every record for a specific cinema (~3400 records returned), whereas when doing a full table scan it took approximately 200 ms to locate a single record (with a separate primary key).
– Visual Vincent
Feb 8 at 11:51
I just compared the indexing described in the first two answers with your suggestion of doing a full table scan (both with no index whatsoever and with a single, separate primary key). With indexes it took approximately 1,5 ms to find a single record and 15-17 ms to find every record for a specific cinema (~3400 records returned), whereas when doing a full table scan it took approximately 200 ms to locate a single record (with a separate primary key).
– Visual Vincent
Feb 8 at 11:51
And I tested with only 398k records in the table, meaning if I would've had 14 million it could have taken up to 7 seconds to find a single record.
– Visual Vincent
Feb 8 at 12:00
And I tested with only 398k records in the table, meaning if I would've had 14 million it could have taken up to 7 seconds to find a single record.
– Visual Vincent
Feb 8 at 12:00
|
show 5 more comments
First I would definitely not define the primary key as a composite key of (CinemaID, MovieID, Showtime)
. Think of what that means: if you want to link an entry you always need those data instead of just an id.
When you create a database I would recommend you to stick to the rules of normalization. They were made because of a reason.
I understand that you want to create it as fast and memory-saving as possible, but we have no idea yet were the bottle-neck will be.
Just create the scheme like you would do it perfectly normalized. Then fill it with test data (a few million entries) and see if the performance seriously is affected.
When you do that you also have some numbers like "this query takes 2 times more memory than the others", and then you can have a look at this.
Currently this smells of premature optimization .
Performance in software is like taste in food: you can guess if it is good, but in the end you have to try it anyway.
Thanks for your answer! As I always have access to all cinemas and all movies (and all showtimes, through the UI) I will always know all those key components when wanting to reference a specific showtime. Without knowing at least most of them, there won't be a reason to reference a specific row. Now that I've gotten more feedback (even the opposite of the first answers - no indexes at all) I will try and measure both suggested methods.
– Visual Vincent
Feb 7 at 7:09
Also, I know about the basic normalization rules but I don't see how I would have violated them. Please elaborate.
– Visual Vincent
Feb 7 at 7:10
I can see how it violates the Boyce–Codd normal form without making all columns the primary key like suggested in the first answers, but I don't see how the table could be changed for the better as it's pretty hard to keep a list of showtimes from what it is: a list. As I cannot have a list in a column I need (a) separate table(s) for it, but dividing it into too many tables will instead make it harder to maintain and use.
– Visual Vincent
Feb 7 at 7:30
I just tested both with and without a composite index and the results are mentioned in my comment on jmoreno's answer. Having a composite index is the best way to go.
– Visual Vincent
Feb 8 at 12:24
add a comment |
First I would definitely not define the primary key as a composite key of (CinemaID, MovieID, Showtime)
. Think of what that means: if you want to link an entry you always need those data instead of just an id.
When you create a database I would recommend you to stick to the rules of normalization. They were made because of a reason.
I understand that you want to create it as fast and memory-saving as possible, but we have no idea yet were the bottle-neck will be.
Just create the scheme like you would do it perfectly normalized. Then fill it with test data (a few million entries) and see if the performance seriously is affected.
When you do that you also have some numbers like "this query takes 2 times more memory than the others", and then you can have a look at this.
Currently this smells of premature optimization .
Performance in software is like taste in food: you can guess if it is good, but in the end you have to try it anyway.
Thanks for your answer! As I always have access to all cinemas and all movies (and all showtimes, through the UI) I will always know all those key components when wanting to reference a specific showtime. Without knowing at least most of them, there won't be a reason to reference a specific row. Now that I've gotten more feedback (even the opposite of the first answers - no indexes at all) I will try and measure both suggested methods.
– Visual Vincent
Feb 7 at 7:09
Also, I know about the basic normalization rules but I don't see how I would have violated them. Please elaborate.
– Visual Vincent
Feb 7 at 7:10
I can see how it violates the Boyce–Codd normal form without making all columns the primary key like suggested in the first answers, but I don't see how the table could be changed for the better as it's pretty hard to keep a list of showtimes from what it is: a list. As I cannot have a list in a column I need (a) separate table(s) for it, but dividing it into too many tables will instead make it harder to maintain and use.
– Visual Vincent
Feb 7 at 7:30
I just tested both with and without a composite index and the results are mentioned in my comment on jmoreno's answer. Having a composite index is the best way to go.
– Visual Vincent
Feb 8 at 12:24
add a comment |
First I would definitely not define the primary key as a composite key of (CinemaID, MovieID, Showtime)
. Think of what that means: if you want to link an entry you always need those data instead of just an id.
When you create a database I would recommend you to stick to the rules of normalization. They were made because of a reason.
I understand that you want to create it as fast and memory-saving as possible, but we have no idea yet were the bottle-neck will be.
Just create the scheme like you would do it perfectly normalized. Then fill it with test data (a few million entries) and see if the performance seriously is affected.
When you do that you also have some numbers like "this query takes 2 times more memory than the others", and then you can have a look at this.
Currently this smells of premature optimization .
Performance in software is like taste in food: you can guess if it is good, but in the end you have to try it anyway.
First I would definitely not define the primary key as a composite key of (CinemaID, MovieID, Showtime)
. Think of what that means: if you want to link an entry you always need those data instead of just an id.
When you create a database I would recommend you to stick to the rules of normalization. They were made because of a reason.
I understand that you want to create it as fast and memory-saving as possible, but we have no idea yet were the bottle-neck will be.
Just create the scheme like you would do it perfectly normalized. Then fill it with test data (a few million entries) and see if the performance seriously is affected.
When you do that you also have some numbers like "this query takes 2 times more memory than the others", and then you can have a look at this.
Currently this smells of premature optimization .
Performance in software is like taste in food: you can guess if it is good, but in the end you have to try it anyway.
answered Feb 7 at 2:19
EmarciEmarci
1
1
Thanks for your answer! As I always have access to all cinemas and all movies (and all showtimes, through the UI) I will always know all those key components when wanting to reference a specific showtime. Without knowing at least most of them, there won't be a reason to reference a specific row. Now that I've gotten more feedback (even the opposite of the first answers - no indexes at all) I will try and measure both suggested methods.
– Visual Vincent
Feb 7 at 7:09
Also, I know about the basic normalization rules but I don't see how I would have violated them. Please elaborate.
– Visual Vincent
Feb 7 at 7:10
I can see how it violates the Boyce–Codd normal form without making all columns the primary key like suggested in the first answers, but I don't see how the table could be changed for the better as it's pretty hard to keep a list of showtimes from what it is: a list. As I cannot have a list in a column I need (a) separate table(s) for it, but dividing it into too many tables will instead make it harder to maintain and use.
– Visual Vincent
Feb 7 at 7:30
I just tested both with and without a composite index and the results are mentioned in my comment on jmoreno's answer. Having a composite index is the best way to go.
– Visual Vincent
Feb 8 at 12:24
add a comment |
Thanks for your answer! As I always have access to all cinemas and all movies (and all showtimes, through the UI) I will always know all those key components when wanting to reference a specific showtime. Without knowing at least most of them, there won't be a reason to reference a specific row. Now that I've gotten more feedback (even the opposite of the first answers - no indexes at all) I will try and measure both suggested methods.
– Visual Vincent
Feb 7 at 7:09
Also, I know about the basic normalization rules but I don't see how I would have violated them. Please elaborate.
– Visual Vincent
Feb 7 at 7:10
I can see how it violates the Boyce–Codd normal form without making all columns the primary key like suggested in the first answers, but I don't see how the table could be changed for the better as it's pretty hard to keep a list of showtimes from what it is: a list. As I cannot have a list in a column I need (a) separate table(s) for it, but dividing it into too many tables will instead make it harder to maintain and use.
– Visual Vincent
Feb 7 at 7:30
I just tested both with and without a composite index and the results are mentioned in my comment on jmoreno's answer. Having a composite index is the best way to go.
– Visual Vincent
Feb 8 at 12:24
Thanks for your answer! As I always have access to all cinemas and all movies (and all showtimes, through the UI) I will always know all those key components when wanting to reference a specific showtime. Without knowing at least most of them, there won't be a reason to reference a specific row. Now that I've gotten more feedback (even the opposite of the first answers - no indexes at all) I will try and measure both suggested methods.
– Visual Vincent
Feb 7 at 7:09
Thanks for your answer! As I always have access to all cinemas and all movies (and all showtimes, through the UI) I will always know all those key components when wanting to reference a specific showtime. Without knowing at least most of them, there won't be a reason to reference a specific row. Now that I've gotten more feedback (even the opposite of the first answers - no indexes at all) I will try and measure both suggested methods.
– Visual Vincent
Feb 7 at 7:09
Also, I know about the basic normalization rules but I don't see how I would have violated them. Please elaborate.
– Visual Vincent
Feb 7 at 7:10
Also, I know about the basic normalization rules but I don't see how I would have violated them. Please elaborate.
– Visual Vincent
Feb 7 at 7:10
I can see how it violates the Boyce–Codd normal form without making all columns the primary key like suggested in the first answers, but I don't see how the table could be changed for the better as it's pretty hard to keep a list of showtimes from what it is: a list. As I cannot have a list in a column I need (a) separate table(s) for it, but dividing it into too many tables will instead make it harder to maintain and use.
– Visual Vincent
Feb 7 at 7:30
I can see how it violates the Boyce–Codd normal form without making all columns the primary key like suggested in the first answers, but I don't see how the table could be changed for the better as it's pretty hard to keep a list of showtimes from what it is: a list. As I cannot have a list in a column I need (a) separate table(s) for it, but dividing it into too many tables will instead make it harder to maintain and use.
– Visual Vincent
Feb 7 at 7:30
I just tested both with and without a composite index and the results are mentioned in my comment on jmoreno's answer. Having a composite index is the best way to go.
– Visual Vincent
Feb 8 at 12:24
I just tested both with and without a composite index and the results are mentioned in my comment on jmoreno's answer. Having a composite index is the best way to go.
– Visual Vincent
Feb 8 at 12:24
add a comment |
Thanks for contributing an answer to Database Administrators Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f229064%2fwhich-columns-should-be-indexed-when-all-may-be-used-in-different-search-queries%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
"one row for each showtime" -- What about a cinaplex that has multiple screens with the same movie at the same time?
– Rick James
Feb 6 at 18:52
@RickJames : This chain (which is relatively small) currently only have one screen in each theater. But in case they'd get a second, would it be a problem to add another column for that without an index if we don't need to search for it (only retreive it)?
– Visual Vincent
Feb 6 at 18:56
2
(Upvote for a well-written Question.)
– Rick James
Feb 6 at 19:16