“Sequence Duplication Levels” module still fails after pre-processing Illumina data
$begingroup$
I want to ask about why the sequence duplication levels are high after I trimmed by using Trimmomatic? I am using the following Trimmomatic operations: HEADCROP = 19 TRAILING = 20 MINLEN = 66
.
How can i solve this problem? Thank You.
illumina data-preprocessing trimming fastqc
$endgroup$
add a comment |
$begingroup$
I want to ask about why the sequence duplication levels are high after I trimmed by using Trimmomatic? I am using the following Trimmomatic operations: HEADCROP = 19 TRAILING = 20 MINLEN = 66
.
How can i solve this problem? Thank You.
illumina data-preprocessing trimming fastqc
$endgroup$
2
$begingroup$
Why do you think this is a problem to begin with?
$endgroup$
– Devon Ryan♦
Jan 10 at 14:37
$begingroup$
I though the cross sign (X) means some kind of error and should be eliminated with certain pre-processing technique (like trimming to solve adapter content error)? I am new to Illumina, thank you for advising :)
$endgroup$
– yy97
Jan 10 at 14:43
2
$begingroup$
What type of sequencing libraries are these? Whole genome, RNA Seq, whole exome, targeted seqeuencing? What genome? also how complex is the library you sequencing.
$endgroup$
– Bioathlete
Jan 10 at 15:27
add a comment |
$begingroup$
I want to ask about why the sequence duplication levels are high after I trimmed by using Trimmomatic? I am using the following Trimmomatic operations: HEADCROP = 19 TRAILING = 20 MINLEN = 66
.
How can i solve this problem? Thank You.
illumina data-preprocessing trimming fastqc
$endgroup$
I want to ask about why the sequence duplication levels are high after I trimmed by using Trimmomatic? I am using the following Trimmomatic operations: HEADCROP = 19 TRAILING = 20 MINLEN = 66
.
How can i solve this problem? Thank You.
illumina data-preprocessing trimming fastqc
illumina data-preprocessing trimming fastqc
edited Jan 10 at 16:14
Daniel Standage
2,150327
2,150327
asked Jan 10 at 14:35
yy97yy97
132
132
2
$begingroup$
Why do you think this is a problem to begin with?
$endgroup$
– Devon Ryan♦
Jan 10 at 14:37
$begingroup$
I though the cross sign (X) means some kind of error and should be eliminated with certain pre-processing technique (like trimming to solve adapter content error)? I am new to Illumina, thank you for advising :)
$endgroup$
– yy97
Jan 10 at 14:43
2
$begingroup$
What type of sequencing libraries are these? Whole genome, RNA Seq, whole exome, targeted seqeuencing? What genome? also how complex is the library you sequencing.
$endgroup$
– Bioathlete
Jan 10 at 15:27
add a comment |
2
$begingroup$
Why do you think this is a problem to begin with?
$endgroup$
– Devon Ryan♦
Jan 10 at 14:37
$begingroup$
I though the cross sign (X) means some kind of error and should be eliminated with certain pre-processing technique (like trimming to solve adapter content error)? I am new to Illumina, thank you for advising :)
$endgroup$
– yy97
Jan 10 at 14:43
2
$begingroup$
What type of sequencing libraries are these? Whole genome, RNA Seq, whole exome, targeted seqeuencing? What genome? also how complex is the library you sequencing.
$endgroup$
– Bioathlete
Jan 10 at 15:27
2
2
$begingroup$
Why do you think this is a problem to begin with?
$endgroup$
– Devon Ryan♦
Jan 10 at 14:37
$begingroup$
Why do you think this is a problem to begin with?
$endgroup$
– Devon Ryan♦
Jan 10 at 14:37
$begingroup$
I though the cross sign (X) means some kind of error and should be eliminated with certain pre-processing technique (like trimming to solve adapter content error)? I am new to Illumina, thank you for advising :)
$endgroup$
– yy97
Jan 10 at 14:43
$begingroup$
I though the cross sign (X) means some kind of error and should be eliminated with certain pre-processing technique (like trimming to solve adapter content error)? I am new to Illumina, thank you for advising :)
$endgroup$
– yy97
Jan 10 at 14:43
2
2
$begingroup$
What type of sequencing libraries are these? Whole genome, RNA Seq, whole exome, targeted seqeuencing? What genome? also how complex is the library you sequencing.
$endgroup$
– Bioathlete
Jan 10 at 15:27
$begingroup$
What type of sequencing libraries are these? Whole genome, RNA Seq, whole exome, targeted seqeuencing? What genome? also how complex is the library you sequencing.
$endgroup$
– Bioathlete
Jan 10 at 15:27
add a comment |
2 Answers
2
active
oldest
votes
$begingroup$
To answer your direct question, there are a few reasons why there might be high levels of sequence duplication. From the FastQC help:
The underlying assumption of this module is of a diverse unenriched library. Any deviation from this assumption will naturally generate duplicates and can lead to warnings or errors from this module.
- As @DevonRyan mentioned, with certain sequencing protocols such as RNA-Seq, two sequence reads at exactly the same location aren't that uncommon. This isn't a problem with RNA-Seq data, or with Trimmomatic, or with FastQC. It's just that this kind of data violates the assumption, and therefore should be ignored in those circumstances.
- PCR duplicates are another possible cause. PCR duplicates can give the false impression of high coverage at a particular locus when in fact it's just a single observed read that has been duplicated many times (see here for more details). PCR duplicates can usually be detected and removed if your analysis involves mapping to a reference genome. But whether this is actually a problem you need to fix depends on what type of data you have and what types of analysis you want to do.
- Large numbers of adapter dimers or rRNA may be present in your sample.
But I think it's also important to address how quality control (QC) is run. It can be tempting to run and re-run QC tools like Trimmomatic until all errors go away, but to be blunt these tools cannot think for you. For example, it's possible to get rid of most adapters by aggressively cropping/trimming both ends of each read, but you'll likely throw away a lot of good data that way. You may want to look into Trimmomatic's ILLUMINACLIP operation. It's also may be tempting to crop/trim reads aggressively if there are compositional biases near the beginning or end of the read. In fact, random hexamer priming can cause the Per-base Sequence Content module to fail on almost any RNA-Seq sample. Again, from the FastQC help (emphasis mine):
Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichement of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module.
So in other words, it's best to determine what can cause each FastQC module to fail, investigate whether this is actually a problem for your data set (referring to documentation as needed), and make a deliberate QC plan that addresses the issues that need attention.
$endgroup$
add a comment |
$begingroup$
FastQC assumes that all samples are for whole genome sequencing and will flag them as failed if they differ too much from that assumption. This will, for example, cause essentially all RNA-seq, ChIP-seq, and ATAC-seq samples to fail in one module or another. This is not any cause for concern and is completely expected. Primarily concern yourself with whether all of your samples are similar in their metrics.
$endgroup$
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "676"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f6786%2fsequence-duplication-levels-module-still-fails-after-pre-processing-illumina-d%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
To answer your direct question, there are a few reasons why there might be high levels of sequence duplication. From the FastQC help:
The underlying assumption of this module is of a diverse unenriched library. Any deviation from this assumption will naturally generate duplicates and can lead to warnings or errors from this module.
- As @DevonRyan mentioned, with certain sequencing protocols such as RNA-Seq, two sequence reads at exactly the same location aren't that uncommon. This isn't a problem with RNA-Seq data, or with Trimmomatic, or with FastQC. It's just that this kind of data violates the assumption, and therefore should be ignored in those circumstances.
- PCR duplicates are another possible cause. PCR duplicates can give the false impression of high coverage at a particular locus when in fact it's just a single observed read that has been duplicated many times (see here for more details). PCR duplicates can usually be detected and removed if your analysis involves mapping to a reference genome. But whether this is actually a problem you need to fix depends on what type of data you have and what types of analysis you want to do.
- Large numbers of adapter dimers or rRNA may be present in your sample.
But I think it's also important to address how quality control (QC) is run. It can be tempting to run and re-run QC tools like Trimmomatic until all errors go away, but to be blunt these tools cannot think for you. For example, it's possible to get rid of most adapters by aggressively cropping/trimming both ends of each read, but you'll likely throw away a lot of good data that way. You may want to look into Trimmomatic's ILLUMINACLIP operation. It's also may be tempting to crop/trim reads aggressively if there are compositional biases near the beginning or end of the read. In fact, random hexamer priming can cause the Per-base Sequence Content module to fail on almost any RNA-Seq sample. Again, from the FastQC help (emphasis mine):
Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichement of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module.
So in other words, it's best to determine what can cause each FastQC module to fail, investigate whether this is actually a problem for your data set (referring to documentation as needed), and make a deliberate QC plan that addresses the issues that need attention.
$endgroup$
add a comment |
$begingroup$
To answer your direct question, there are a few reasons why there might be high levels of sequence duplication. From the FastQC help:
The underlying assumption of this module is of a diverse unenriched library. Any deviation from this assumption will naturally generate duplicates and can lead to warnings or errors from this module.
- As @DevonRyan mentioned, with certain sequencing protocols such as RNA-Seq, two sequence reads at exactly the same location aren't that uncommon. This isn't a problem with RNA-Seq data, or with Trimmomatic, or with FastQC. It's just that this kind of data violates the assumption, and therefore should be ignored in those circumstances.
- PCR duplicates are another possible cause. PCR duplicates can give the false impression of high coverage at a particular locus when in fact it's just a single observed read that has been duplicated many times (see here for more details). PCR duplicates can usually be detected and removed if your analysis involves mapping to a reference genome. But whether this is actually a problem you need to fix depends on what type of data you have and what types of analysis you want to do.
- Large numbers of adapter dimers or rRNA may be present in your sample.
But I think it's also important to address how quality control (QC) is run. It can be tempting to run and re-run QC tools like Trimmomatic until all errors go away, but to be blunt these tools cannot think for you. For example, it's possible to get rid of most adapters by aggressively cropping/trimming both ends of each read, but you'll likely throw away a lot of good data that way. You may want to look into Trimmomatic's ILLUMINACLIP operation. It's also may be tempting to crop/trim reads aggressively if there are compositional biases near the beginning or end of the read. In fact, random hexamer priming can cause the Per-base Sequence Content module to fail on almost any RNA-Seq sample. Again, from the FastQC help (emphasis mine):
Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichement of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module.
So in other words, it's best to determine what can cause each FastQC module to fail, investigate whether this is actually a problem for your data set (referring to documentation as needed), and make a deliberate QC plan that addresses the issues that need attention.
$endgroup$
add a comment |
$begingroup$
To answer your direct question, there are a few reasons why there might be high levels of sequence duplication. From the FastQC help:
The underlying assumption of this module is of a diverse unenriched library. Any deviation from this assumption will naturally generate duplicates and can lead to warnings or errors from this module.
- As @DevonRyan mentioned, with certain sequencing protocols such as RNA-Seq, two sequence reads at exactly the same location aren't that uncommon. This isn't a problem with RNA-Seq data, or with Trimmomatic, or with FastQC. It's just that this kind of data violates the assumption, and therefore should be ignored in those circumstances.
- PCR duplicates are another possible cause. PCR duplicates can give the false impression of high coverage at a particular locus when in fact it's just a single observed read that has been duplicated many times (see here for more details). PCR duplicates can usually be detected and removed if your analysis involves mapping to a reference genome. But whether this is actually a problem you need to fix depends on what type of data you have and what types of analysis you want to do.
- Large numbers of adapter dimers or rRNA may be present in your sample.
But I think it's also important to address how quality control (QC) is run. It can be tempting to run and re-run QC tools like Trimmomatic until all errors go away, but to be blunt these tools cannot think for you. For example, it's possible to get rid of most adapters by aggressively cropping/trimming both ends of each read, but you'll likely throw away a lot of good data that way. You may want to look into Trimmomatic's ILLUMINACLIP operation. It's also may be tempting to crop/trim reads aggressively if there are compositional biases near the beginning or end of the read. In fact, random hexamer priming can cause the Per-base Sequence Content module to fail on almost any RNA-Seq sample. Again, from the FastQC help (emphasis mine):
Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichement of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module.
So in other words, it's best to determine what can cause each FastQC module to fail, investigate whether this is actually a problem for your data set (referring to documentation as needed), and make a deliberate QC plan that addresses the issues that need attention.
$endgroup$
To answer your direct question, there are a few reasons why there might be high levels of sequence duplication. From the FastQC help:
The underlying assumption of this module is of a diverse unenriched library. Any deviation from this assumption will naturally generate duplicates and can lead to warnings or errors from this module.
- As @DevonRyan mentioned, with certain sequencing protocols such as RNA-Seq, two sequence reads at exactly the same location aren't that uncommon. This isn't a problem with RNA-Seq data, or with Trimmomatic, or with FastQC. It's just that this kind of data violates the assumption, and therefore should be ignored in those circumstances.
- PCR duplicates are another possible cause. PCR duplicates can give the false impression of high coverage at a particular locus when in fact it's just a single observed read that has been duplicated many times (see here for more details). PCR duplicates can usually be detected and removed if your analysis involves mapping to a reference genome. But whether this is actually a problem you need to fix depends on what type of data you have and what types of analysis you want to do.
- Large numbers of adapter dimers or rRNA may be present in your sample.
But I think it's also important to address how quality control (QC) is run. It can be tempting to run and re-run QC tools like Trimmomatic until all errors go away, but to be blunt these tools cannot think for you. For example, it's possible to get rid of most adapters by aggressively cropping/trimming both ends of each read, but you'll likely throw away a lot of good data that way. You may want to look into Trimmomatic's ILLUMINACLIP operation. It's also may be tempting to crop/trim reads aggressively if there are compositional biases near the beginning or end of the read. In fact, random hexamer priming can cause the Per-base Sequence Content module to fail on almost any RNA-Seq sample. Again, from the FastQC help (emphasis mine):
Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichement of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module.
So in other words, it's best to determine what can cause each FastQC module to fail, investigate whether this is actually a problem for your data set (referring to documentation as needed), and make a deliberate QC plan that addresses the issues that need attention.
answered Jan 10 at 16:12
Daniel StandageDaniel Standage
2,150327
2,150327
add a comment |
add a comment |
$begingroup$
FastQC assumes that all samples are for whole genome sequencing and will flag them as failed if they differ too much from that assumption. This will, for example, cause essentially all RNA-seq, ChIP-seq, and ATAC-seq samples to fail in one module or another. This is not any cause for concern and is completely expected. Primarily concern yourself with whether all of your samples are similar in their metrics.
$endgroup$
add a comment |
$begingroup$
FastQC assumes that all samples are for whole genome sequencing and will flag them as failed if they differ too much from that assumption. This will, for example, cause essentially all RNA-seq, ChIP-seq, and ATAC-seq samples to fail in one module or another. This is not any cause for concern and is completely expected. Primarily concern yourself with whether all of your samples are similar in their metrics.
$endgroup$
add a comment |
$begingroup$
FastQC assumes that all samples are for whole genome sequencing and will flag them as failed if they differ too much from that assumption. This will, for example, cause essentially all RNA-seq, ChIP-seq, and ATAC-seq samples to fail in one module or another. This is not any cause for concern and is completely expected. Primarily concern yourself with whether all of your samples are similar in their metrics.
$endgroup$
FastQC assumes that all samples are for whole genome sequencing and will flag them as failed if they differ too much from that assumption. This will, for example, cause essentially all RNA-seq, ChIP-seq, and ATAC-seq samples to fail in one module or another. This is not any cause for concern and is completely expected. Primarily concern yourself with whether all of your samples are similar in their metrics.
answered Jan 10 at 14:45
Devon Ryan♦Devon Ryan
13.4k21539
13.4k21539
add a comment |
add a comment |
Thanks for contributing an answer to Bioinformatics Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f6786%2fsequence-duplication-levels-module-still-fails-after-pre-processing-illumina-d%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
2
$begingroup$
Why do you think this is a problem to begin with?
$endgroup$
– Devon Ryan♦
Jan 10 at 14:37
$begingroup$
I though the cross sign (X) means some kind of error and should be eliminated with certain pre-processing technique (like trimming to solve adapter content error)? I am new to Illumina, thank you for advising :)
$endgroup$
– yy97
Jan 10 at 14:43
2
$begingroup$
What type of sequencing libraries are these? Whole genome, RNA Seq, whole exome, targeted seqeuencing? What genome? also how complex is the library you sequencing.
$endgroup$
– Bioathlete
Jan 10 at 15:27