“Sequence Duplication Levels” module still fails after pre-processing Illumina data

I want to ask about why the sequence duplication levels are high after I trimmed by using Trimmomatic? I am using the following Trimmomatic operations: HEADCROP = 19 TRAILING = 20 MINLEN = 66.

How can i solve this problem? Thank You.

enter image description here

edited Jan 10 at 16:14

Daniel Standage

2,150327

asked Jan 10 at 14:35

yy97

132

2

$begingroup$
Why do you think this is a problem to begin with?
$endgroup$
– Devon Ryan♦
Jan 10 at 14:37

$begingroup$
I though the cross sign (X) means some kind of error and should be eliminated with certain pre-processing technique (like trimming to solve adapter content error)? I am new to Illumina, thank you for advising :)
$endgroup$
– yy97
Jan 10 at 14:43

2

$begingroup$
What type of sequencing libraries are these? Whole genome, RNA Seq, whole exome, targeted seqeuencing? What genome? also how complex is the library you sequencing.
$endgroup$
– Bioathlete
Jan 10 at 15:27

add a comment |

I want to ask about why the sequence duplication levels are high after I trimmed by using Trimmomatic? I am using the following Trimmomatic operations: HEADCROP = 19 TRAILING = 20 MINLEN = 66.

How can i solve this problem? Thank You.

enter image description here

edited Jan 10 at 16:14

Daniel Standage

2,150327

asked Jan 10 at 14:35

yy97

132

2

$begingroup$
Why do you think this is a problem to begin with?
$endgroup$
– Devon Ryan♦
Jan 10 at 14:37

$begingroup$
I though the cross sign (X) means some kind of error and should be eliminated with certain pre-processing technique (like trimming to solve adapter content error)? I am new to Illumina, thank you for advising :)
$endgroup$
– yy97
Jan 10 at 14:43

2

$begingroup$
What type of sequencing libraries are these? Whole genome, RNA Seq, whole exome, targeted seqeuencing? What genome? also how complex is the library you sequencing.
$endgroup$
– Bioathlete
Jan 10 at 15:27

add a comment |

I want to ask about why the sequence duplication levels are high after I trimmed by using Trimmomatic? I am using the following Trimmomatic operations: HEADCROP = 19 TRAILING = 20 MINLEN = 66.

How can i solve this problem? Thank You.

enter image description here

edited Jan 10 at 16:14

Daniel Standage

2,150327

asked Jan 10 at 14:35

yy97

132

I want to ask about why the sequence duplication levels are high after I trimmed by using Trimmomatic? I am using the following Trimmomatic operations: HEADCROP = 19 TRAILING = 20 MINLEN = 66.

How can i solve this problem? Thank You.

enter image description here

illumina data-preprocessing trimming fastqc

edited Jan 10 at 16:14

Daniel Standage

2,150327

asked Jan 10 at 14:35

yy97

132

edited Jan 10 at 16:14

Daniel Standage

2,150327

asked Jan 10 at 14:35

yy97

132

edited Jan 10 at 16:14

Daniel Standage

2,150327

edited Jan 10 at 16:14

Daniel Standage

2,150327

edited Jan 10 at 16:14

Daniel Standage

2,150327

asked Jan 10 at 14:35

yy97

132

asked Jan 10 at 14:35

yy97

132

asked Jan 10 at 14:35

yy97

132

2

$begingroup$
Why do you think this is a problem to begin with?
$endgroup$
– Devon Ryan♦
Jan 10 at 14:37

$begingroup$
I though the cross sign (X) means some kind of error and should be eliminated with certain pre-processing technique (like trimming to solve adapter content error)? I am new to Illumina, thank you for advising :)
$endgroup$
– yy97
Jan 10 at 14:43

2

$begingroup$
What type of sequencing libraries are these? Whole genome, RNA Seq, whole exome, targeted seqeuencing? What genome? also how complex is the library you sequencing.
$endgroup$
– Bioathlete
Jan 10 at 15:27

add a comment |

2

$begingroup$
Why do you think this is a problem to begin with?
$endgroup$
– Devon Ryan♦
Jan 10 at 14:37

$begingroup$
I though the cross sign (X) means some kind of error and should be eliminated with certain pre-processing technique (like trimming to solve adapter content error)? I am new to Illumina, thank you for advising :)
$endgroup$
– yy97
Jan 10 at 14:43

2

$begingroup$
What type of sequencing libraries are these? Whole genome, RNA Seq, whole exome, targeted seqeuencing? What genome? also how complex is the library you sequencing.
$endgroup$
– Bioathlete
Jan 10 at 15:27

Why do you think this is a problem to begin with?

– Devon Ryan♦
Jan 10 at 14:37

I though the cross sign (X) means some kind of error and should be eliminated with certain pre-processing technique (like trimming to solve adapter content error)? I am new to Illumina, thank you for advising :)

– yy97
Jan 10 at 14:43

What type of sequencing libraries are these? Whole genome, RNA Seq, whole exome, targeted seqeuencing? What genome? also how complex is the library you sequencing.

– Bioathlete
Jan 10 at 15:27

add a comment |

2 Answers
2

active

oldest

votes

To answer your direct question, there are a few reasons why there might be high levels of sequence duplication. From the FastQC help:

The underlying assumption of this module is of a diverse unenriched library. Any deviation from this assumption will naturally generate duplicates and can lead to warnings or errors from this module.

As @DevonRyan mentioned, with certain sequencing protocols such as RNA-Seq, two sequence reads at exactly the same location aren't that uncommon. This isn't a problem with RNA-Seq data, or with Trimmomatic, or with FastQC. It's just that this kind of data violates the assumption, and therefore should be ignored in those circumstances.

PCR duplicates are another possible cause. PCR duplicates can give the false impression of high coverage at a particular locus when in fact it's just a single observed read that has been duplicated many times (see here for more details). PCR duplicates can usually be detected and removed if your analysis involves mapping to a reference genome. But whether this is actually a problem you need to fix depends on what type of data you have and what types of analysis you want to do.

Large numbers of adapter dimers or rRNA may be present in your sample.

But I think it's also important to address how quality control (QC) is run. It can be tempting to run and re-run QC tools like Trimmomatic until all errors go away, but to be blunt these tools cannot think for you. For example, it's possible to get rid of most adapters by aggressively cropping/trimming both ends of each read, but you'll likely throw away a lot of good data that way. You may want to look into Trimmomatic's ILLUMINACLIP operation. It's also may be tempting to crop/trim reads aggressively if there are compositional biases near the beginning or end of the read. In fact, random hexamer priming can cause the Per-base Sequence Content module to fail on almost any RNA-Seq sample. Again, from the FastQC help (emphasis mine):

Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichement of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module.

So in other words, it's best to determine what can cause each FastQC module to fail, investigate whether this is actually a problem for your data set (referring to documentation as needed), and make a deliberate QC plan that addresses the issues that need attention.

answered Jan 10 at 16:12

Daniel Standage

2,150327

add a comment |

FastQC assumes that all samples are for whole genome sequencing and will flag them as failed if they differ too much from that assumption. This will, for example, cause essentially all RNA-seq, ChIP-seq, and ATAC-seq samples to fail in one module or another. This is not any cause for concern and is completely expected. Primarily concern yourself with whether all of your samples are similar in their metrics.

answered Jan 10 at 14:45

Devon Ryan♦

13.4k21539

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "676"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f6786%2fsequence-duplication-levels-module-still-fails-after-pre-processing-illumina-d%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

To answer your direct question, there are a few reasons why there might be high levels of sequence duplication. From the FastQC help:

The underlying assumption of this module is of a diverse unenriched library. Any deviation from this assumption will naturally generate duplicates and can lead to warnings or errors from this module.

As @DevonRyan mentioned, with certain sequencing protocols such as RNA-Seq, two sequence reads at exactly the same location aren't that uncommon. This isn't a problem with RNA-Seq data, or with Trimmomatic, or with FastQC. It's just that this kind of data violates the assumption, and therefore should be ignored in those circumstances.

PCR duplicates are another possible cause. PCR duplicates can give the false impression of high coverage at a particular locus when in fact it's just a single observed read that has been duplicated many times (see here for more details). PCR duplicates can usually be detected and removed if your analysis involves mapping to a reference genome. But whether this is actually a problem you need to fix depends on what type of data you have and what types of analysis you want to do.

Large numbers of adapter dimers or rRNA may be present in your sample.

Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichement of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module.

answered Jan 10 at 16:12

Daniel Standage

2,150327

add a comment |

To answer your direct question, there are a few reasons why there might be high levels of sequence duplication. From the FastQC help:

The underlying assumption of this module is of a diverse unenriched library. Any deviation from this assumption will naturally generate duplicates and can lead to warnings or errors from this module.

As @DevonRyan mentioned, with certain sequencing protocols such as RNA-Seq, two sequence reads at exactly the same location aren't that uncommon. This isn't a problem with RNA-Seq data, or with Trimmomatic, or with FastQC. It's just that this kind of data violates the assumption, and therefore should be ignored in those circumstances.

PCR duplicates are another possible cause. PCR duplicates can give the false impression of high coverage at a particular locus when in fact it's just a single observed read that has been duplicated many times (see here for more details). PCR duplicates can usually be detected and removed if your analysis involves mapping to a reference genome. But whether this is actually a problem you need to fix depends on what type of data you have and what types of analysis you want to do.

Large numbers of adapter dimers or rRNA may be present in your sample.

Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichement of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module.

answered Jan 10 at 16:12

Daniel Standage

2,150327

add a comment |

To answer your direct question, there are a few reasons why there might be high levels of sequence duplication. From the FastQC help:

The underlying assumption of this module is of a diverse unenriched library. Any deviation from this assumption will naturally generate duplicates and can lead to warnings or errors from this module.

As @DevonRyan mentioned, with certain sequencing protocols such as RNA-Seq, two sequence reads at exactly the same location aren't that uncommon. This isn't a problem with RNA-Seq data, or with Trimmomatic, or with FastQC. It's just that this kind of data violates the assumption, and therefore should be ignored in those circumstances.

PCR duplicates are another possible cause. PCR duplicates can give the false impression of high coverage at a particular locus when in fact it's just a single observed read that has been duplicated many times (see here for more details). PCR duplicates can usually be detected and removed if your analysis involves mapping to a reference genome. But whether this is actually a problem you need to fix depends on what type of data you have and what types of analysis you want to do.

Large numbers of adapter dimers or rRNA may be present in your sample.

Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichement of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module.

answered Jan 10 at 16:12

Daniel Standage

2,150327

To answer your direct question, there are a few reasons why there might be high levels of sequence duplication. From the FastQC help:

The underlying assumption of this module is of a diverse unenriched library. Any deviation from this assumption will naturally generate duplicates and can lead to warnings or errors from this module.

As @DevonRyan mentioned, with certain sequencing protocols such as RNA-Seq, two sequence reads at exactly the same location aren't that uncommon. This isn't a problem with RNA-Seq data, or with Trimmomatic, or with FastQC. It's just that this kind of data violates the assumption, and therefore should be ignored in those circumstances.

PCR duplicates are another possible cause. PCR duplicates can give the false impression of high coverage at a particular locus when in fact it's just a single observed read that has been duplicated many times (see here for more details). PCR duplicates can usually be detected and removed if your analysis involves mapping to a reference genome. But whether this is actually a problem you need to fix depends on what type of data you have and what types of analysis you want to do.

Large numbers of adapter dimers or rRNA may be present in your sample.

Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichement of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module.

answered Jan 10 at 16:12

Daniel Standage

2,150327

answered Jan 10 at 16:12

Daniel Standage

2,150327

answered Jan 10 at 16:12

Daniel Standage

2,150327

answered Jan 10 at 16:12

Daniel Standage

2,150327

add a comment |

answered Jan 10 at 14:45

Devon Ryan♦

13.4k21539

add a comment |

answered Jan 10 at 14:45

Devon Ryan♦

13.4k21539

add a comment |

answered Jan 10 at 14:45

Devon Ryan♦

13.4k21539

answered Jan 10 at 14:45

Devon Ryan♦

13.4k21539

answered Jan 10 at 14:45

Devon Ryan♦

13.4k21539

answered Jan 10 at 14:45

Devon Ryan♦

13.4k21539

answered Jan 10 at 14:45

Devon Ryan♦

13.4k21539

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Bioinformatics Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Dtyjlui