Nested Loop has very low estimates due to skewed data

On SQL Server 2016 SP2 we have a query that has a very low estimate on the nested loop operator. Due to the low estimate this query also spills to tempdb.

If I'm correct SQL Server 2014+ uses Coarse Histogram Estimation to calculate the estimated number of rows on a join.

But when I execute the query, SQL Server uses the density vector to calculate the number of estimated rows.

Is SQL Server only using the Coarse Histogram Estimation if there is no where clause?

Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.

Is there a way to improve the estimations on the nested loop?

Using following code you can reproduce the data:

create table MyTable

(

    id int identity,

    field varchar(50),

    constraint  pk_id primary  key clustered (id)

)

go



create table SkewedTable

(

    id int identity,

    startdate datetime,

    myTableId int,

    remark varchar(50),

    constraint  pk_id primary  key clustered (id)

)



set nocount on



insert into MyTable select top 1000 [name] from master..spt_values

go



insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)

go 1000



insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)

go 



CREATE NONCLUSTERED INDEX [ix_field] ON [dbo].[MyTable]([field] ASC)

go



CREATE NONCLUSTERED INDEX [ix_mytableid] ON [dbo].[SkewedTable]([myTableId] ASC)

go



--95=varchar in sys.messages

set nocount off



;with cte as

( 

    select GETDATE() as startdate ,95 as myTableId, REPLICATE(N'B',FLOOR(RAND()*(40))+1) as remark

    union all

    select * from cte

)

insert into skewedtable select top 40000 * from cte

option(maxrecursion 0)

go



update statistics mytable with fullscan

go



update statistics skewedtable with fullscan

go

edited Jan 22 at 10:16

Paul White♦

52.2k14278451

asked Jan 22 at 9:46

Frederik Vanderhaegen

8331418

add a comment |

On SQL Server 2016 SP2 we have a query that has a very low estimate on the nested loop operator. Due to the low estimate this query also spills to tempdb.

Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.

Is there a way to improve the estimations on the nested loop?

Using following code you can reproduce the data:

create table MyTable

(

    id int identity,

    field varchar(50),

    constraint  pk_id primary  key clustered (id)

)

go



create table SkewedTable

(

    id int identity,

    startdate datetime,

    myTableId int,

    remark varchar(50),

    constraint  pk_id primary  key clustered (id)

)



set nocount on



insert into MyTable select top 1000 [name] from master..spt_values

go



insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)

go 1000



insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)

go 



CREATE NONCLUSTERED INDEX [ix_field] ON [dbo].[MyTable]([field] ASC)

go



CREATE NONCLUSTERED INDEX [ix_mytableid] ON [dbo].[SkewedTable]([myTableId] ASC)

go



--95=varchar in sys.messages

set nocount off



;with cte as

( 

    select GETDATE() as startdate ,95 as myTableId, REPLICATE(N'B',FLOOR(RAND()*(40))+1) as remark

    union all

    select * from cte

)

insert into skewedtable select top 40000 * from cte

option(maxrecursion 0)

go



update statistics mytable with fullscan

go



update statistics skewedtable with fullscan

go

edited Jan 22 at 10:16

Paul White♦

52.2k14278451

asked Jan 22 at 9:46

Frederik Vanderhaegen

8331418

add a comment |

On SQL Server 2016 SP2 we have a query that has a very low estimate on the nested loop operator. Due to the low estimate this query also spills to tempdb.

Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.

Is there a way to improve the estimations on the nested loop?

Using following code you can reproduce the data:

create table MyTable

(

    id int identity,

    field varchar(50),

    constraint  pk_id primary  key clustered (id)

)

go



create table SkewedTable

(

    id int identity,

    startdate datetime,

    myTableId int,

    remark varchar(50),

    constraint  pk_id primary  key clustered (id)

)



set nocount on



insert into MyTable select top 1000 [name] from master..spt_values

go



insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)

go 1000



insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)

go 



CREATE NONCLUSTERED INDEX [ix_field] ON [dbo].[MyTable]([field] ASC)

go



CREATE NONCLUSTERED INDEX [ix_mytableid] ON [dbo].[SkewedTable]([myTableId] ASC)

go



--95=varchar in sys.messages

set nocount off



;with cte as

( 

    select GETDATE() as startdate ,95 as myTableId, REPLICATE(N'B',FLOOR(RAND()*(40))+1) as remark

    union all

    select * from cte

)

insert into skewedtable select top 40000 * from cte

option(maxrecursion 0)

go



update statistics mytable with fullscan

go



update statistics skewedtable with fullscan

go

edited Jan 22 at 10:16

Paul White♦

52.2k14278451

asked Jan 22 at 9:46

Frederik Vanderhaegen

8331418

On SQL Server 2016 SP2 we have a query that has a very low estimate on the nested loop operator. Due to the low estimate this query also spills to tempdb.

Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.

Is there a way to improve the estimations on the nested loop?

Using following code you can reproduce the data:

create table MyTable

(

    id int identity,

    field varchar(50),

    constraint  pk_id primary  key clustered (id)

)

go



create table SkewedTable

(

    id int identity,

    startdate datetime,

    myTableId int,

    remark varchar(50),

    constraint  pk_id primary  key clustered (id)

)



set nocount on



insert into MyTable select top 1000 [name] from master..spt_values

go



insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)

go 1000



insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)

go 



CREATE NONCLUSTERED INDEX [ix_field] ON [dbo].[MyTable]([field] ASC)

go



CREATE NONCLUSTERED INDEX [ix_mytableid] ON [dbo].[SkewedTable]([myTableId] ASC)

go



--95=varchar in sys.messages

set nocount off



;with cte as

( 

    select GETDATE() as startdate ,95 as myTableId, REPLICATE(N'B',FLOOR(RAND()*(40))+1) as remark

    union all

    select * from cte

)

insert into skewedtable select top 40000 * from cte

option(maxrecursion 0)

go



update statistics mytable with fullscan

go



update statistics skewedtable with fullscan

go

sql-server query-performance sql-server-2016 cardinality-estimates

edited Jan 22 at 10:16

Paul White♦

52.2k14278451

asked Jan 22 at 9:46

Frederik Vanderhaegen

8331418

edited Jan 22 at 10:16

Paul White♦

52.2k14278451

asked Jan 22 at 9:46

Frederik Vanderhaegen

8331418

edited Jan 22 at 10:16

Paul White♦

52.2k14278451

edited Jan 22 at 10:16

Paul White♦

52.2k14278451

edited Jan 22 at 10:16

Paul White♦

52.2k14278451

asked Jan 22 at 9:46

Frederik Vanderhaegen

8331418

asked Jan 22 at 9:46

Frederik Vanderhaegen

8331418

asked Jan 22 at 9:46

Frederik Vanderhaegen

8331418

add a comment |

2 Answers
2

active

oldest

votes

Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.

You should find the following filtered statistic helpful:

CREATE STATISTICS [stats id (field=varchar)]

ON dbo.MyTable (id)

WHERE field = 'varchar'

WITH FULLSCAN;

This gives the optimizer information about the distribution of id values that match field = 'varchar', giving a much better selectivity estimate for the join:

Actual execution plan

The execution plan above shows exactly correct estimates with the filtered statistic, leading the optimizer to choose a hash join (for cost reasons).

This distribution information is much more important than the exact method used by the estimator to match the join histograms (fine or coarse alignment), or even the general assumptions (e.g. simple join, base containment).

If you can't do that, your options are broadly as outlined in answer to your previous question Sort spills to tempdb due to varchar(max). My preference would probably be an intermediate temporary table.

edited Jan 22 at 10:53

answered Jan 22 at 10:31

Paul White♦

52.2k14278451

add a comment |

Completely agree with the filtered index, this answer is added to expand on the other option that @PaulWhite mentioned, to use an intermediate temp table and consequentially get rid of the SORT operator

You could add an index or change the existing index:

CREATE INDEX IX_SkewedTable_MytableId_startdate

ON SkewedTable(myTableId,startdate)

INCLUDE(remark);

Insert the values in an intermediate temp table

CREATE TABLE  #temp2(param int);

INSERT INTO  #temp2(param)

SELECT t.id 

FROM mytable t 

WHERE t.field = 'varchar';

enter image description here

Add an index on the temp table

CREATE INDEX IX_ID on #temp2(param);

And then use a CTE to remove the sort operator from the query plan

;WITH CTE AS

(

select TOP(999999999999)

s.myTableId,s.id,s.remark from 

SkewedTable s

order by startdate

)

SELECT s.id , s.remark

from CTE  s

INNER JOIN #temp2 

on s.myTableId = #temp2.param

OPTION(RECOMPILE)

As mentioned by @Forrest to push the sort lower Here

Result:

enter image description here

Which removes the sort operator.

edited Jan 22 at 13:23

answered Jan 22 at 11:28

Randi Vertongen

2,876721

1

Top 100 percent hasn't worked in ages - the optimizer throw it out. You're better off using a number like the int or big int max.

– Erik Darling
Jan 22 at 12:40

@ErikDarling Changed it, thanks for the feedback

– Randi Vertongen
Jan 22 at 13:14

add a comment |

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227747%2fnested-loop-has-very-low-estimates-due-to-skewed-data%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.

You should find the following filtered statistic helpful:

CREATE STATISTICS [stats id (field=varchar)]

ON dbo.MyTable (id)

WHERE field = 'varchar'

WITH FULLSCAN;

This gives the optimizer information about the distribution of id values that match field = 'varchar', giving a much better selectivity estimate for the join:

Actual execution plan

The execution plan above shows exactly correct estimates with the filtered statistic, leading the optimizer to choose a hash join (for cost reasons).

edited Jan 22 at 10:53

answered Jan 22 at 10:31

Paul White♦

52.2k14278451

add a comment |

Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.

You should find the following filtered statistic helpful:

CREATE STATISTICS [stats id (field=varchar)]

ON dbo.MyTable (id)

WHERE field = 'varchar'

WITH FULLSCAN;

This gives the optimizer information about the distribution of id values that match field = 'varchar', giving a much better selectivity estimate for the join:

Actual execution plan

The execution plan above shows exactly correct estimates with the filtered statistic, leading the optimizer to choose a hash join (for cost reasons).

edited Jan 22 at 10:53

answered Jan 22 at 10:31

Paul White♦

52.2k14278451

add a comment |

Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.

You should find the following filtered statistic helpful:

CREATE STATISTICS [stats id (field=varchar)]

ON dbo.MyTable (id)

WHERE field = 'varchar'

WITH FULLSCAN;

This gives the optimizer information about the distribution of id values that match field = 'varchar', giving a much better selectivity estimate for the join:

Actual execution plan

The execution plan above shows exactly correct estimates with the filtered statistic, leading the optimizer to choose a hash join (for cost reasons).

edited Jan 22 at 10:53

answered Jan 22 at 10:31

Paul White♦

52.2k14278451

Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.

You should find the following filtered statistic helpful:

CREATE STATISTICS [stats id (field=varchar)]

ON dbo.MyTable (id)

WHERE field = 'varchar'

WITH FULLSCAN;

This gives the optimizer information about the distribution of id values that match field = 'varchar', giving a much better selectivity estimate for the join:

Actual execution plan

The execution plan above shows exactly correct estimates with the filtered statistic, leading the optimizer to choose a hash join (for cost reasons).

edited Jan 22 at 10:53

answered Jan 22 at 10:31

Paul White♦

52.2k14278451

edited Jan 22 at 10:53

answered Jan 22 at 10:31

Paul White♦

52.2k14278451

answered Jan 22 at 10:31

Paul White♦

52.2k14278451

answered Jan 22 at 10:31

Paul White♦

52.2k14278451

add a comment |

You could add an index or change the existing index:

CREATE INDEX IX_SkewedTable_MytableId_startdate

ON SkewedTable(myTableId,startdate)

INCLUDE(remark);

Insert the values in an intermediate temp table

CREATE TABLE  #temp2(param int);

INSERT INTO  #temp2(param)

SELECT t.id 

FROM mytable t 

WHERE t.field = 'varchar';

enter image description here

Add an index on the temp table

CREATE INDEX IX_ID on #temp2(param);

And then use a CTE to remove the sort operator from the query plan

;WITH CTE AS

(

select TOP(999999999999)

s.myTableId,s.id,s.remark from 

SkewedTable s

order by startdate

)

SELECT s.id , s.remark

from CTE  s

INNER JOIN #temp2 

on s.myTableId = #temp2.param

OPTION(RECOMPILE)

As mentioned by @Forrest to push the sort lower Here

Result:

enter image description here

Which removes the sort operator.

edited Jan 22 at 13:23

answered Jan 22 at 11:28

Randi Vertongen

2,876721

1

Top 100 percent hasn't worked in ages - the optimizer throw it out. You're better off using a number like the int or big int max.

– Erik Darling
Jan 22 at 12:40

@ErikDarling Changed it, thanks for the feedback

– Randi Vertongen
Jan 22 at 13:14

add a comment |

You could add an index or change the existing index:

CREATE INDEX IX_SkewedTable_MytableId_startdate

ON SkewedTable(myTableId,startdate)

INCLUDE(remark);

Insert the values in an intermediate temp table

CREATE TABLE  #temp2(param int);

INSERT INTO  #temp2(param)

SELECT t.id 

FROM mytable t 

WHERE t.field = 'varchar';

enter image description here

Add an index on the temp table

CREATE INDEX IX_ID on #temp2(param);

And then use a CTE to remove the sort operator from the query plan

;WITH CTE AS

(

select TOP(999999999999)

s.myTableId,s.id,s.remark from 

SkewedTable s

order by startdate

)

SELECT s.id , s.remark

from CTE  s

INNER JOIN #temp2 

on s.myTableId = #temp2.param

OPTION(RECOMPILE)

As mentioned by @Forrest to push the sort lower Here

Result:

enter image description here

Which removes the sort operator.

edited Jan 22 at 13:23

answered Jan 22 at 11:28

Randi Vertongen

2,876721

1

Top 100 percent hasn't worked in ages - the optimizer throw it out. You're better off using a number like the int or big int max.

– Erik Darling
Jan 22 at 12:40

@ErikDarling Changed it, thanks for the feedback

– Randi Vertongen
Jan 22 at 13:14

add a comment |

You could add an index or change the existing index:

CREATE INDEX IX_SkewedTable_MytableId_startdate

ON SkewedTable(myTableId,startdate)

INCLUDE(remark);

Insert the values in an intermediate temp table

CREATE TABLE  #temp2(param int);

INSERT INTO  #temp2(param)

SELECT t.id 

FROM mytable t 

WHERE t.field = 'varchar';

enter image description here

Add an index on the temp table

CREATE INDEX IX_ID on #temp2(param);

And then use a CTE to remove the sort operator from the query plan

;WITH CTE AS

(

select TOP(999999999999)

s.myTableId,s.id,s.remark from 

SkewedTable s

order by startdate

)

SELECT s.id , s.remark

from CTE  s

INNER JOIN #temp2 

on s.myTableId = #temp2.param

OPTION(RECOMPILE)

As mentioned by @Forrest to push the sort lower Here

Result:

enter image description here

Which removes the sort operator.

edited Jan 22 at 13:23

answered Jan 22 at 11:28

Randi Vertongen

2,876721

You could add an index or change the existing index:

CREATE INDEX IX_SkewedTable_MytableId_startdate

ON SkewedTable(myTableId,startdate)

INCLUDE(remark);

Insert the values in an intermediate temp table

CREATE TABLE  #temp2(param int);

INSERT INTO  #temp2(param)

SELECT t.id 

FROM mytable t 

WHERE t.field = 'varchar';

enter image description here

Add an index on the temp table

CREATE INDEX IX_ID on #temp2(param);

And then use a CTE to remove the sort operator from the query plan

;WITH CTE AS

(

select TOP(999999999999)

s.myTableId,s.id,s.remark from 

SkewedTable s

order by startdate

)

SELECT s.id , s.remark

from CTE  s

INNER JOIN #temp2 

on s.myTableId = #temp2.param

OPTION(RECOMPILE)

As mentioned by @Forrest to push the sort lower Here

Result:

enter image description here

Which removes the sort operator.

edited Jan 22 at 13:23

answered Jan 22 at 11:28

Randi Vertongen

2,876721

edited Jan 22 at 13:23

answered Jan 22 at 11:28

Randi Vertongen

2,876721

answered Jan 22 at 11:28

Randi Vertongen

2,876721

answered Jan 22 at 11:28

Randi Vertongen

2,876721

1

Top 100 percent hasn't worked in ages - the optimizer throw it out. You're better off using a number like the int or big int max.

– Erik Darling
Jan 22 at 12:40

@ErikDarling Changed it, thanks for the feedback

– Randi Vertongen
Jan 22 at 13:14

add a comment |

1

Top 100 percent hasn't worked in ages - the optimizer throw it out. You're better off using a number like the int or big int max.

– Erik Darling
Jan 22 at 12:40

@ErikDarling Changed it, thanks for the feedback

– Randi Vertongen
Jan 22 at 13:14

Top 100 percent hasn't worked in ages - the optimizer throw it out. You're better off using a number like the int or big int max.

– Erik Darling
Jan 22 at 12:40

@ErikDarling Changed it, thanks for the feedback

– Randi Vertongen
Jan 22 at 13:14

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Database Administrators Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Dtyjlui