Default permutation pass

To perform a permutation based analysis of the example data set using a total of 1,000 permutations, use:

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region 22:17000000-18000000 --permute 1000 --out permutations.default.txt.gz

This produces this output on the screen when everything works correctly:


You can increase the number of permutations to 10,000 using:

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region 22:17000000-18000000 --permute 10000 --out permutations.10K.txt.gz

You can also perform an adaptive permutation pass on the example data set (i.e. with a number of permutations that varies in function of the significance). To run between 100 and 100,000 permutations, use:

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region 22:17000000-18000000 --permute 100 100000 --out permutations.adaptive.txt.gz

Association testing parameters

Associations between genotype dosages and phenotype quantifications are measured with linear regressions (here), similar to the R/lm function. This model assumes that phenotypes are normally distributed. If your phenotype quantifications are not normally distributed, you can force them to match normal distributions N(0, 1) by using:

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region 22:17000000-18000000 --permute 1000 --out permutations.quantile.txt.gz --normal

To change the cis-window size (i.e. the maximal distance spanned by phenotype-variant pairs to be considered for testing) from default value 1e6 bp to 2e6 bp, use:

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region 22:17000000-18000000 --permute 1000 --out permutations.window2Mb.txt.gz --window 2e6

To change the seed of the random number generator, which is particularly useful to replicate an analysis, use:

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region 22:17000000-18000000 --permute 1000 --out permutations.seed.txt.gz --seed 123456789

To add covariates in the linear regressions used for association testing, use:

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region 22:17000000-18000000 --permute 1000 --out permutations.txt.gz --cov covariates.txt.gz

The file covariates.txt.gz needs to be formatted as described here.

Excluding/Including data

To exclude samples, variants, phenotypes or covariates from the analysis, you can use one of these options:

  1. To exclude samples: --exclude-samples file.exc
  2. To exclude variants: --exclude-sites file.exc
  3. To exclude phenotypes: --exclude-phenotypes file.exc
  4. To exclude caovariates: --exclude-covariates file.exc

For instance, if you want to ignore 3 samples when analyzing the example data set, first create a text file containing the IDs of the samples to be excluded, called here file.exc:

UNR1
UNR2
UNR3

Then, run the following command line:

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region 22:17000000-18000000 --permute 1000 --out permutations.sub.txt.gz --exclude-samples file.exc

Similarly to these 4 options for data exclusion, you can also specify the set of samples, variants, phenotypes and covariates to be included in the analysis using the options: --include-samples, --include-sites, --include-phenotypes and --include-covariates, respectively.

Parallelization

As a first way to facilitate parallelization on compute cluster, we developed an option to run the analysis for just a chunk of molecular phenotypes. The region of interest is specified with the standard chr:start-end format. Then, FastQTL extracts all phenotypes in this region, then all genotypes given the specified cis-window size and finally perform the analysis for this data subset. For instance, to run QTL mapping only for molecular phenotypes on chr22 with coordinates between 18Mb and 20Mb, use:

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region chr22:18000000-20000000 --permute 1000 --out permutations.18M20M.txt.gz This option requires both the genotype and phenotype files to be indexed with TABIX!

This strategy is quite similar to what is commonly used for genotype imputation, where only small chunks of data are imputed one at a time in seperate jobs. However in practice, it is usually quite complicated to properly split the genome into a given number of chunks with correct coordinates. To facilitate this, we embedded all coordinates into a chunk-based system such that you only need to specify the chunk index you want to run. Then, splitting the genome into chunks, extraction of data, and analysis are automatically performed. For instance, to run analysis on chunk number 25 when splitting the example data set (i.e. genome) into 30 chunks, just run:

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.chunk25.txt.gz --chunk 25 30

If you want to submit the whole analysis into 30 jobs on your compute cluster, just run:

for j in $(seq 1 30); do
     echo "fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.txt.gz --chunk $j 30" | qsub
done

Here qsub needs to be changed according to the job submission system used (bsub, psub, etc...).

In this simple example, we only split the data into 30 chunks. However, a realistic whole genome analysis would require to split the data in 200, 500 or even 1,000 chunks to fully use the capabilities of a modern compute cluster.

Finally, we also developed a slightly different parallelization option that, this time, allows to generate all required command lines and write into a file. Let take the same example as before, that is splitting the analysis into 10 jobs:

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations --commands 10 commands.10.txt

Now if you look at the file commands.10.txt, you'll get this:

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:17517460-20748406 --region 22:17517460-20748406
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:36424473-39052635 --region 22:36424473-39052635
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:24407642-30163001 --region 22:24407642-30163001
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:42017123-45704850 --region 22:42017123-45704850
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:20792146-22307210 --region 22:20792146-22307210
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:39052641-39915701 --region 22:39052641-39915701
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:30163352-36044443 --region 22:30163352-36044443
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:45809500-51222092 --region 22:45809500-51222092
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:22337213-24322661 --region 22:22337213-24322661
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:39928860-42017101 --region 22:39928860-42017101

Where, region coordinates are automatically determined given the total number of chunks provided. Then, you can submit all these commands on a cluster using for example:

while read c; do
     echo $c | qsub
done | commands.10.txt

Output file format of a permutation pass

Once the analysis completed and all output file collected, you can easily concat and compress all files into a single output file to ease downstream analysis with:

zcat permutations.chunk*.txt.gz | gzip -c > permutations.all.chunks.txt.gz

After having performed a permutation pass on the data and concatenating the output files, you end up with a file with 10 columns and M lines with M being the total number of tested molecular phenotypes. For instance, if you tested 1,000 molecular phenotypes, it means that you'll get 1,000 lines in the output files. Hereafter a short example:

ENSG00000237438.1 7966 0.93436 2871.25 376.072 snp_22_17542810 25350 5.48728e-13 0.000999001 4.38131e-09
ENSG00000177663.8 8165 0.936488 3667.52 392.593 snp_22_17497295 -68549 0.000167489 0.376623 0.361675
ENSG00000183307.3 8279 1.03157 2719.43 369.406 snp_22_17587975 -14282 3.11324e-08 0.000999001 6.78199e-05
ENSG00000069998.8 8466 1.04834 2472.36 358.392 snp_22_17639837 -6340 5.01155e-11 0.000999001 1.26653e-07
ENSG00000093072.10 8531 0.934923 5838.18 406.904 snp_22_17699299 -39826 9.028e-05 0.246753 0.245706

In this file, the 10 columns correspond to:

  1. ID of the tested molecular phenotype (in this particular case, the gene ID)
  2. Number of variants tested in cis for this phenotype
  3. MLE of the shape1 parameter of the Beta distribution
  4. MLE of the shape2 parameter of the Beta distribution
  5. Dummy [To be described later]
  6. ID of the best variant found for this molecular phenotypes (i.e. with the smallest p-value)
  7. Distance between the molecular phenotype - variant pair
  8. The nominal p-value of association that quantifies how significant from 0, the regression coefficient is
  9. The slope associated with the nominal p-value of association [only in version > v2-184]
  10. A first permutation p-value directly obtained from the permutations with the direct method. This is basically a corrected version of the nominal p-value that accounts for the fact that multiple variants are tested per molecular phenotype.
  11. A second permutation p-value obtained via beta approximation. We advice to use this one in any downstream analysis.

Checking that the experiment went well

To check that the beta approximated permutation p-values are well estimated, load the data in R and make the following plot:

R> d = read.table("permutations.all.chunks.txt.gz", hea=F, stringsAsFactors=F)
R> colnames(d) = c("pid", "nvar", "shape1", "shape2", "dummy", "sid", "dist", "npval", "ppval", "bpval")
R> plot(d$ppval, d$bpval, xlab="Direct method", ylab="Beta approximation", main="Check plot")
R> abline(0, 1, col="red")

The expectation is to get all the points along the diagonal as on the plot above. This shows that unsignificant beta pproximated p-values are well estimated and therefore that the estimations went well.

Controlling for multiple phenotypes being tested

Once obtained all permutation p-values for all the tested molecular phenotypes, we need now to account for the fact that many molecular phenotypes are tested whole genome in order to determine the significant QTLs. To do so, we propose here 3 approaches; from the most to the least stringent. First, load the data in R using:

R> d = read.table("permutations.all.chunks.txt.gz", hea=F, stringsAsFactors=F)
R> colnames(d) = c("pid", "nvar", "shape1", "shape2", "dummy", "sid", "dist", "npval", "ppval", "bpval")

Bonferroni correction

Look here for some details on the correction and here for some details on the R function used in this example. To apply the Bonferroni correction on the permutation p-values derived from beta approximation, use:

R> d$bonferroni = p.adjust(d$bpval, method="bonferroni")

Then, you can extract all significant MP-QTL pairs at α=0.05 and write them to a file permutations.all.chunks.bonferroni.txt using:

R> write.table(d[which(d$bonferroni <= 0.05), c(1,6)], "permutations.all.chunks.bonferroni.txt", quote=F, row.names=F, col.names=T)

Benjamini & Hochberg correction

Look here for some details on the correction and here for some details on the R function used in this example. To apply the Benjamini & Hochberg correction on the permutation p-values derived from beta approximation, use:

R> d$bh = p.adjust(d$bpval, method="fdr")

Then, you can extract all significant MP-QTL pairs with a 10% false discovery rate (FDR) for instance and write them to a file permutations.all.chunks.benjamini.txt using:

R> write.table(d[which(d$bh <= 0.10), c(1,6)], "permutations.all.chunks.benjamini.txt", quote=F, row.names=F, col.names=T)

Storey & Tibshirani correction

Look here for some details on the correction and here for some details on the R function used in this example. To install the R/qvalue package on your system, look here. Then to apply the Storey & Tibshirani correction on the permutation p-values derived from beta approximation, use:

R> library(qvalue) R> d$st = qvalue(d$bpval)$qvalues

Then, you can extract all significant MP-QTL pairs with a 10% false discovery rate (FDR) for instance and write them to a file whole_analysis.storey.permutations using:

R> write.table(d[which(d$st <= 0.10), c(1,6)], "permutations.all.chunks.storey.txt", quote=F, row.names=F, col.names=T)