Roberto’s Noreboock: FastQC of libraries with additional WGBS.

The results seems like before, with low Qval at the last positions.
Example of library 0501_R1_001_fastq.gz
/home/roberto/Pictures/Screenshot from 2019-05-03 14-31-08.png

Using this data, Trimming is running with conditions:
java -jar /home/Apps/trimmomatic/Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33 /home/Data_MeOH/Seq_data/0501_R1_001.fastq.gz /home/Data_MeOH/Seq_data/0501_R2_001.fastq.gz 0501_R1_001_paired.fastq.gz 0501_R1_001_unpaired.fastq.gz 0501_R2_001_paired.fastq.gz 0501_R2_001_unpaired.fastq.gz ILLUMINACLIP:TrueSeq3-PE.fa:2:30:10 LEADING:30 TRAILING:30 SLIDINGWINDOW:4:20

I did not consider the min-length option. Last time, considering MINLENGTH:100 was not good. It reduces from 58 million of reads to 35 million…

fastQC before trimming (this library has not the Additional 25% of Seqs)
/home/roberto/Pictures/Screenshot from 2019-05-03 14-37-56.png

After trimming, the result was:
/home/roberto/Pictures/Screenshot from 2019-05-03 14-40-04.png

For this reason, the MINLENGTH is not used.

Roberto’s Notebook: R project

The R project I used for differential expression and plotting is available at:

Roberto’s Notebook: Problem with Stringtie IDs.

Using the program Stringtie for the transcript abundance estimation for libraries from 2 thermal-resistant (TR) and 2 thermal-susceptible (TS) oyster families exposed to oscillatory thermal challenge during 30 days. The program assigned different gene IDs (different from C. gigas gene IDs) specifically in merge step in the output file (/Volumes/toaser/roberto/Hisat_results/stringtie_results/stringtie_merged.gtf) but it has the reference ID (CGI_10000005 for example). As Steven suggested, using finder on stringtie_merged.gtf file, I found the missing CGI gene IDs in the gene expression table (where only 4379 genes had CGI IDs from 60643 total expressed genes).
Doing a test, the stringtie ID “MSTRG.21417” corresponds to DNMT1 gene CGI_10021920 ( It is differentially expressed between TR (samples Os13, Os14, Os15, Os16, Os17 and Os18) and TS families (Os1, Os2, Os3, Os10, Os11 and Os12) at day 30.
DNMT1 dene expression day 30

Where phenotype looks with isoform preferences:
DNMT1 phenotype expression day 30

As is the case for the isoform #2 (considering up to down in the figure) located in:
scaffold1862: transcript_position: 618219-640790 / gene_id “MSTRG.21417”; transcript_id “MSTRG.21417.4”; ref_gene_id “CGI_10021920” with 34 exons is more present in TR. This could suggest… 🙂

Roberto’s Notebook: Differential expression Analysis

After finish the analyses using Hisat program and complemented with stringtie, the differential expression analysis in R using the libraries:
and command lines like:

pheno_data = read.csv(“/Volumes/toaster/roberto/phenodata_day30.csv”)
bg_Cragi = ballgown(dataDir = “/Volumes/toaster/roberto/Hisat_results/stringtie_results/Est_abundance/ballgown30/”, samplePattern = “Os”, pData=pheno_data)
bg_Cragi_filt = subset(bg_Cragi,”rowVars(texpr(bg_Cragi)) >1″,genomesubset=TRUE)

#To look for diff expr between thermal tolerance:

results_transcripts = stattest(bg_Cragi_filt, feature=”transcript”,covariate=”thermal.tolerance”,adjustvars = c(“family”), getFC=TRUE, meas=”FPKM”)
results_genes = stattest(bg_Cragi_filt, feature=”gene”, covariate=”thermal.tolerance”, adjustvars = c(“family”), getFC=TRUE, meas=”FPKM”)

#This is to add gene names and gene IDs to the results_transcripts data frame

results_transcripts = data.frame(geneNames=ballgown::geneNames(bg_Cragi_filt), geneIDs=ballgown::geneIDs(bg_Cragi_filt), results_transcripts)

#Then, to sort the results from the smallest P value to the largest:

results_transcripts = arrange(results_transcripts,pval)
results_genes = arrange(results_genes,pval)

#To write the results to a csv file:

write.csv(results_transcripts, “Cragi_transcript_results_D30.csv”, row.names=FALSE)
write.csv(results_genes, “Cragi_gene_results_D30.csv”, row.names=FALSE)

#To identify transcripts and genes with a q value subset(results_transcripts,results_transcripts$qvalsubset(results_genes,results_genes$qvaltropical= c(‘darkorange’, ‘dodgerblue’, ‘hotpink’, ‘limegreen’, ‘yellow’)

fpkm = texpr(bg_Cragi,meas=”FPKM”)
fpkm = log2(fpkm+1)

Screen Shot 2018-08-07 at 3.24.25 PM

##As an example, the observation of differential expression of particular gene (with its isoforms):

plot(fpkm[4295,] ~ pheno_data$thermal.tolerance, border=c(1,2), main=paste(ballgown::geneNames(bg_Cragi)[4295],’ : ‘, >ballgown::transcriptNames(bg_Cragi)[4295]),pch=19, xlab=”thermal.tolerance”, ylab=’log2(FPKM+1)’)
points(fpkm[4295,] ~ jitter(as.numeric(pheno_data$thermal.tolerance)), col=as.numeric(pheno_data$thermal.tolerance))

#Then, the observation of this gene differentially expressed in a single sample:

plotTranscripts(ballgown::geneIDs(bg_Cragi)[4295], bg_Cragi, main=c(‘Gene MSTRG.3053 in sample Os15’), sample=c(‘Os15’))
For this it was necessary to look for the gene id (MSTRG.3053) in matrix file and add it to the command line.

#As a las step. The visualization of expression of isoforms between resistant and susceptible families and the position of these in the genome:

plotMeans(‘MSTRG.3053’, bg_Cragi_filt,groupvar=”thermal.tolerance”,legend=FALSE)

I am still working in the comparison of data obtained with Hisat/stringtie and the data from Trinity.

Roberto’s Notebook: DNA quality

The concentration of DNA samples extracted from gill tissue had the values:
Screen Shot 2018-08-07 at 5.25.58 PM

Using 100 ng, the integrity of DNA was observed in agarose gels.
Screen Shot 2018-08-07 at 5.19.04 PM

The samples 05R1, 35R1, 52R1 and 59R1 do not present a good DNA integrity and samples 9 to 17 have small RNA presence.
The samples 1-4, 1-5, 1-6, 13-3, 13-4 and 13-5 showed in table, were extracted using the E. Z. N. A. Mollusc DNA Kit and were corroborated in agarose gel. The samples had a good DNA integrity (picture not shown). The rest of DNA samples were extracted using the salts protocol.

Roberto’s Notebook: Gene mapping

Testing the program tophat-2.0.13, the data were mapped with the genome (downloaded from: But there were a problem with the GTF file. Steven looked on tophat page and there where a suggested and faster program (hisat2-2.1.0) than tophat.

The hisat program has been downloaded (at /usr/local/bioinformatics/). The support information (Pertea, M., Kim, D., Pertea, G. M., Leek, J. T., & Salzberg, S. L. (2016). Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature protocols, 11(9), 1650.) revels that the GTF file is needed (and it has been downloaded as the Looking back at Tophat post said: Crassostrea_gigas.GCA_000297895.1.24.gtf). Using this file and the genome it was necessary to extract splice sites and exons.
For the moment, the “creating a HISAT2 index” is running and after, the reads will be mapped again.

Hi everybody!

I am glad to be part of this group.

Well, Working on trinity (trinityrna-2.2.0) specifically in abundance estimation of sequences using RSEM package (Before to run differential expression analyses). I had two files called RSEM.isoforms.results and RSEM.genes.results. Both of them are matrices with values of length, effective length of genes, expected count, TPM (transcripts per million) and FPKM (fragments per kilo base million). The TPM and FPKM values are used by the script to estimate the matrices for differential expression, but, there is a problem generating them using the RSEM.genes.results. The part of the script

Screen Shot 2018-07-12 at 2.55.36 PM

At line 242 and 249, writes “NA” for absent gene IDs and this makes an error to create a matrix used for differential expression analyses because the script just recognize numeric values.

Screen Shot 2018-07-12 at 2.30.17 PM

Screen Shot 2018-07-11 at 2.02.48 PM

But rewriting the script changing NA by 0 helps to create the matrices but differential expression analyses had different results than using RSEM.isoforms.results.

Screen Shot 2018-07-12 at 2.35.33 PM

I can’t see the sense for this part. I mean why should be NA instead of 0? even if it needs numeric values and does this affects the differences using genes.results (where I found 10 differentially expressed genes) and isoforms.results (where I found 384 differentially expressed transcripts)? Now I am trying to have the annotation for those 10 genes and compare their ontology with the isoforms annotation.