Control sequence counts in the FASTQ reads

Summary

Using the TagDust 2 program (http://sourceforge.net/projects/tagdust/), the number of reads matching reference control sequences for the spikes, the rRNAs, and the Nextera linkers are counted, normalised, and saved in a table. This table contains the following columns:

The unit for expression levels is counts per million (CPM). Given that there is a 3′ bias in the RNA-seq protocol used here, it is not advisable to normalise by sequence length in addition.

In the last three runs, the detection levels of spikes and rRNA are lower than in the first two runs (see the plot Overview of the reference sequences detected with TagDust2).

The quantity of primer artefacts (called Nextera below), is constant between runs, at roughly 10 % of the total number of reads. Note that TagDust only inspects the first 32 bases of the reads, but that should be enough, and also avoids calling artefact a read that has a short insert and finishes in the linker region.

Random inspection of file (matching the pattern links_to_fastq/*A10*R1* links_to_fastq/*B04*R1* links_to_fastq/*F09*R1*) for presence of reads aligning to the Mycoplasma hominis genome (NC_013511) did not suggest contamination of the cells.

Datasets

Files

cat Nextera.fa rRNA.fa ArrayControl.nopolyA.fa hpv18.fa hpv18-as.fa > spikes.fa

Extraction

Detection of the spikes using TagDust.

To run this section, install TagDust version 2.06 in the current directory, unpack and compile it. (This script can not use higher version of TagDust because they do not allow for the redirection of the output sequences to /dev/null/.), then, edit the RMarkdown code and remove the eval=FALSE statement. The TagDust counting with 8 CPU cores takes approximately 1 day.

TAGDUST=./tagdust-2.06/tagdust
$TAGDUST | grep Copyright 
SPIKES=spikes.fa 
for FILE in ../DDBJ/*fastq.bz2
do
  $TAGDUST -t 8 -l tagdust -o /dev/null -ref $SPIKES $FILE
done

Loops to create a file called spikes.txt, where each line gives the number of reads matching one given spike in one given file.

for LIBRARY in 1772-062-248 1772-062-249 1772-064-103 1772-067-038 1772-067-039
do
  for ROW in A B C D E F G H
  do
    for COLUMN in 01 02 03 04 05 06 07 08 09 10 11 12
    do
      for READ in 1 2
      do
        grep -e rRNA -e SPIKE -e input -e complex -e Nextera -e HPV tagdust/${LIBRARY}_${ROW}${COLUMN}.${READ}* |
          perl -pe "s/^/$LIBRARY\tR$READ\t$ROW$COLUMN\t/"
      done
    done
  done
done > spikes.txt

head spikes.txt
## grep: tagdust/1772-064-103_A09.1*: No such file or directory
## grep: tagdust/1772-064-103_A09.2*: No such file or directory
## grep: tagdust/1772-064-103_B09.1*: No such file or directory
## grep: tagdust/1772-064-103_B09.2*: No such file or directory
## grep: tagdust/1772-064-103_C09.1*: No such file or directory
## grep: tagdust/1772-064-103_C09.2*: No such file or directory
## grep: tagdust/1772-064-103_D09.1*: No such file or directory
## grep: tagdust/1772-064-103_D09.2*: No such file or directory
## grep: tagdust/1772-064-103_E09.1*: No such file or directory
## grep: tagdust/1772-064-103_E09.2*: No such file or directory
## grep: tagdust/1772-064-103_F09.1*: No such file or directory
## grep: tagdust/1772-064-103_F09.2*: No such file or directory
## grep: tagdust/1772-064-103_G09.1*: No such file or directory
## grep: tagdust/1772-064-103_G09.2*: No such file or directory
## grep: tagdust/1772-064-103_H09.1*: No such file or directory
## grep: tagdust/1772-064-103_H09.2*: No such file or directory
## 1772-062-248 R1  A01 1964004 total input reads
## 1772-062-248 R1  A01 9130    low complexity
## 1772-062-248 R1  A01 1   Nextera_501
## 1772-062-248 R1  A01 217407  Nextera_701
## 1772-062-248 R1  A01 48  Nextera_702
## 1772-062-248 R1  A01 453 Nextera_703
## 1772-062-248 R1  A01 22  Nextera_704
## 1772-062-248 R1  A01 34  Nextera_705
## 1772-062-248 R1  A01 57  Nextera_706
## 1772-062-248 R1  A01 18  Nextera_707

Construction of a table in R

library(reshape)
library(ggplot2)
## Loading required package: methods
spikes <- read.delim('spikes.txt', sep="\t", col.names=c('Run', 'read', 'cell', 'value', 'count'), head=FALSE)
ggplot(
  data=spikes,
  aes(x=count, y=value, colour=read)) + geom_boxplot() + coord_flip() + scale_y_log10('Raw count')