“Sequence Duplication Levels” module still fails after pre-processing Illumina data












I want to ask why the sequence duplication levels are still high after trimming with Trimmomatic. I am using the following Trimmomatic operations: HEADCROP:19, TRAILING:20, MINLEN:66.

How can I solve this problem? Thank you.








  • Why do you think this is a problem to begin with?
    – Devon Ryan, Jan 10 at 14:37










  • I thought the cross sign (X) meant some kind of error that should be eliminated with a suitable pre-processing technique (like trimming to fix an adapter content error). I am new to Illumina, thank you for the advice :)
    – yy97, Jan 10 at 14:43








  • What type of sequencing libraries are these? Whole genome, RNA-Seq, whole exome, targeted sequencing? What genome? Also, how complex is the library you are sequencing?
    – Bioathlete, Jan 10 at 15:27
















illumina data-preprocessing trimming fastqc














asked Jan 10 at 14:35 by yy97
edited Jan 10 at 16:14 by Daniel Standage








2 Answers
To answer your direct question, there are a few reasons why there might be high levels of sequence duplication. From the FastQC help:

"The underlying assumption of this module is of a diverse unenriched library. Any deviation from this assumption will naturally generate duplicates and can lead to warnings or errors from this module."

  • As @DevonRyan mentioned, with certain sequencing protocols such as RNA-Seq, it isn't uncommon for two reads to come from exactly the same location. This isn't a problem with RNA-Seq data, or with Trimmomatic, or with FastQC. This kind of data simply violates the module's assumption, so the failure can be ignored in those circumstances.

  • PCR duplicates are another possible cause. PCR duplicates can give the false impression of high coverage at a particular locus when in fact it's just a single observed read that has been duplicated many times (see here for more details). PCR duplicates can usually be detected and removed if your analysis involves mapping to a reference genome. But whether this is actually a problem you need to fix depends on what type of data you have and what types of analysis you want to do.

  • Large numbers of adapter dimers or rRNA may be present in your sample.
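To make the idea of "duplication level" concrete, here is a minimal Python sketch. It is not FastQC's actual algorithm, only a simplified illustration that counts exact sequence matches; the function name and the toy reads are made up for this example. It shows how a library dominated by one sequence (as with a highly expressed transcript or PCR duplicates) scores high even though nothing is wrong with the trimming.

```python
from collections import Counter


def duplication_fraction(reads):
    """Return the fraction of reads that are extra copies of an already-seen sequence.

    Simplified illustration only: exact string matching, no sampling or truncation.
    """
    if not reads:
        return 0.0
    counts = Counter(reads)
    # Each distinct sequence contributes one "original"; all other copies are duplicates.
    duplicates = len(reads) - len(counts)
    return duplicates / len(reads)


# Toy data (hypothetical): a diverse library versus one dominated by a single sequence.
diverse = ["ACGTAC", "TTGACA", "GGCATA", "CCATGG"]
enriched = ["ACGTAC"] * 6 + ["TTGACA", "GGCATA"]

print(duplication_fraction(diverse))   # 0.0
print(duplication_fraction(enriched))  # 0.625
```

Trimming the ends of the reads would not change the second number much, because the copies stay identical to each other after trimming.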


But I think it's also important to address how quality control (QC) is run. It can be tempting to run and re-run QC tools like Trimmomatic until all errors go away, but to be blunt, these tools cannot think for you. For example, it's possible to get rid of most adapters by aggressively cropping/trimming both ends of each read, but you'll likely throw away a lot of good data that way. You may want to look into Trimmomatic's ILLUMINACLIP operation. It may also be tempting to crop/trim reads aggressively if there are compositional biases near the beginning or end of the read. In fact, random hexamer priming can cause the Per-base Sequence Content module to fail on almost any RNA-Seq sample. Again, from the FastQC help (emphasis mine):




"Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichment of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module."




In other words, it's best to determine what can cause each FastQC module to fail, investigate whether this is actually a problem for your data set (referring to the documentation as needed), and make a deliberate QC plan that addresses the issues that need attention.
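As one possible starting point for the ILLUMINACLIP suggestion, the sketch below builds (but does not run) a single-end Trimmomatic command line in Python. Several things here are assumptions rather than facts about your data: that the reads are single-end, that a `trimmomatic` wrapper script is on your PATH (otherwise you would call the jar via `java -jar`), that `TruSeq3-SE.fa` is the right adapter file for your library, and that the `2:30:10` thresholds suit it. The file names are placeholders. HEADCROP is left out on purpose to illustrate a less aggressive plan; whether that is appropriate depends on your data.

```python
import shlex


def trimmomatic_se_command(fastq_in, fastq_out, adapters_fasta):
    """Build a single-end Trimmomatic command as an argument list (nothing is executed)."""
    return [
        "trimmomatic", "SE", "-phred33",  # wrapper name is an assumption
        fastq_in, fastq_out,
        # ILLUMINACLIP:<adapter fasta>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
        f"ILLUMINACLIP:{adapters_fasta}:2:30:10",
        "TRAILING:20",  # same value as in the question
        "MINLEN:66",    # same value as in the question
    ]


# Placeholder file names, purely for illustration.
cmd = trimmomatic_se_command("reads.fastq.gz", "reads.trimmed.fastq.gz", "TruSeq3-SE.fa")
print(shlex.join(cmd))
```

Printing the command first makes it easy to review each step before handing the list to `subprocess.run`.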

















answered Jan 10 at 16:12 by Daniel Standage























FastQC assumes that all samples are for whole genome sequencing and will flag them as failed if they differ too much from that assumption. This will, for example, cause essentially all RNA-seq, ChIP-seq, and ATAC-seq samples to fail in one module or another. This is completely expected and no cause for concern. Primarily concern yourself with whether all of your samples are similar in their metrics.
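A rough way to check whether samples are similar is to compare the per-module status that FastQC writes for each sample. The Python sketch below assumes the `summary.txt` layout of one tab-separated line per module (status, module name, file name); the function names and the two toy summaries are made up for illustration. It only compares PASS/WARN/FAIL flags, which is a much coarser check than looking at the underlying plots.

```python
from collections import defaultdict


def parse_summary(text):
    """Map module name -> status (PASS/WARN/FAIL) for one FastQC summary.txt."""
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumed layout: <status>\t<module name>\t<file name>
        status, module, _filename = line.split("\t", 2)
        statuses[module] = status
    return statuses


def inconsistent_modules(summaries):
    """Return the modules whose status differs between samples, sorted by name."""
    seen = defaultdict(set)
    for statuses in summaries.values():
        for module, status in statuses.items():
            seen[module].add(status)
    return sorted(module for module, found in seen.items() if len(found) > 1)


# Toy summaries (hypothetical samples): both fail the duplication module,
# but only sample_b fails Adapter Content.
sample_a = (
    "PASS\tBasic Statistics\ta.fastq\n"
    "FAIL\tSequence Duplication Levels\ta.fastq\n"
    "PASS\tAdapter Content\ta.fastq\n"
)
sample_b = (
    "PASS\tBasic Statistics\tb.fastq\n"
    "FAIL\tSequence Duplication Levels\tb.fastq\n"
    "FAIL\tAdapter Content\tb.fastq\n"
)

summaries = {"a": parse_summary(sample_a), "b": parse_summary(sample_b)}
print(inconsistent_modules(summaries))  # ['Adapter Content']
```

Here the duplication failure is shared by both samples and so is not reported, while the adapter content difference is the one worth a closer look.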

















answered Jan 10 at 14:45 by Devon Ryan





























