# Default permutation pass

To perform a permutation based analysis of the example data set using a total of 1,000 permutations, use:

```
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region 22:17000000-18000000
```**--permute 1000** --out permutations.default.txt.gz

This produces this output on the screen when everything works correctly:

You can increase the number of permutations to 10,000 using:

```
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region 22:17000000-18000000
```**--permute 10000** --out permutations.10K.txt.gz

You can also perform an adaptive permutation pass on the example data set (i.e. with a number of permutations that varies in function of the significance). To run between 100 and 100,000 permutations, use:

```
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region 22:17000000-18000000
```**--permute 100 100000** --out permutations.adaptive.txt.gz

# Association testing parameters

Associations between genotype dosages and phenotype quantifications are measured with linear regressions (here), similar to the R/lm function. This model assumes that phenotypes are normally distributed. If your phenotype quantifications are not normally distributed, you can force them to match normal distributions N(0, 1) by using:

```
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region 22:17000000-18000000 --permute 1000 --out permutations.quantile.txt.gz
```**--normal**

To change the cis-window size (i.e. the maximal distance spanned by phenotype-variant pairs to be considered for testing) from default value 1e6 bp to 2e6 bp, use:

```
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region 22:17000000-18000000 --permute 1000 --out permutations.window2Mb.txt.gz
```**--window 2e6**

To change the seed of the random number generator, which is particularly useful to replicate an analysis, use:

```
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region 22:17000000-18000000 --permute 1000 --out permutations.seed.txt.gz
```**--seed 123456789**

To add covariates in the linear regressions used for association testing, use:

```
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region 22:17000000-18000000 --permute 1000 --out permutations.txt.gz
```**--cov covariates.txt.gz**

The file **covariates.txt.gz** needs to be formatted as described **here**.

# Excluding/Including data

To exclude samples, variants, phenotypes or covariates from the analysis, you can use one of these options:

- To exclude samples:
*--exclude-samples file.exc* - To exclude variants:
*--exclude-sites file.exc* - To exclude phenotypes:
*--exclude-phenotypes file.exc* - To exclude caovariates:
*--exclude-covariates file.exc*

For instance, if you want to ignore 3 samples when analyzing the example data set, first create a text file containing the IDs of the samples to be excluded, called here *file.exc*:

```
UNR1
```

UNR2

UNR3

Then, run the following command line:

```
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --region 22:17000000-18000000 --permute 1000 --out permutations.sub.txt.gz
```**--exclude-samples file.exc**

Similarly to these 4 options for data exclusion, you can also specify the set of samples, variants, phenotypes and covariates to be included in the analysis using the options: *--include-samples, --include-sites, --include-phenotypes and --include-covariates*, respectively.

# Parallelization

As a first way to facilitate parallelization on compute cluster, we developed an option to run the analysis for just a chunk of molecular phenotypes.
The region of interest is specified with the standard **chr:start-end** format.
Then, FastQTL extracts all phenotypes in this region, then all genotypes given the specified cis-window size and finally perform the analysis for this data subset.
For instance, to run QTL mapping only for molecular phenotypes on chr22 with coordinates between 18Mb and 20Mb, use:

```
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz
```**--region chr22:18000000-20000000** --permute 1000 --out permutations.18M20M.txt.gz

This strategy is quite similar to what is commonly used for genotype imputation, where only small chunks of data are imputed one at a time in seperate jobs. However in practice, it is usually quite complicated to properly split the genome into a given number of chunks with correct coordinates. To facilitate this, we embedded all coordinates into a chunk-based system such that you only need to specify the chunk index you want to run. Then, splitting the genome into chunks, extraction of data, and analysis are automatically performed. For instance, to run analysis on chunk number 25 when splitting the example data set (i.e. genome) into 30 chunks, just run:

```
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.chunk25.txt.gz
```**--chunk 25 30**

If you want to submit the whole analysis into 30 jobs on your compute cluster, just run:

```
for j in
```**$(seq 1 30)**; do

echo "fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.txt.gz **--chunk $j 30**" | qsub

done

Here **qsub** needs to be changed according to the job submission system used (bsub, psub, etc...).

Finally, we also developed a slightly different parallelization option that, this time, allows to generate all required command lines and write into a file. Let take the same example as before, that is splitting the analysis into 10 jobs:

```
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations
```**--commands 10 commands.10.txt**

Now if you look at the file **commands.10.txt**, you'll get this:

```
fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:17517460-20748406 --region 22:17517460-20748406
```

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:36424473-39052635 --region 22:36424473-39052635

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:24407642-30163001 --region 22:24407642-30163001

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:42017123-45704850 --region 22:42017123-45704850

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:20792146-22307210 --region 22:20792146-22307210

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:39052641-39915701 --region 22:39052641-39915701

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:30163352-36044443 --region 22:30163352-36044443

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:45809500-51222092 --region 22:45809500-51222092

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:22337213-24322661 --region 22:22337213-24322661

fastQTL --vcf genotypes.vcf.gz --bed phenotypes.bed.gz --permute 1000 --out permutations.22:39928860-42017101 --region 22:39928860-42017101

Where, region coordinates are automatically determined given the total number of chunks provided. Then, you can submit all these commands on a cluster using for example:

```
while read c; do
```

echo $c | qsub

done **| commands.10.txt**

# Output file format of a **permutation** pass

Once the analysis completed and all output file collected, you can easily concat and compress all files into a single output file to ease downstream analysis with:

```
zcat permutations.chunk*.txt.gz | gzip -c > permutations.all.chunks.txt.gz
```

After having performed a permutation pass on the data and concatenating the output files, you end up with a file with 10 columns and M lines with M being the total number of tested molecular phenotypes. For instance, if you tested 1,000 molecular phenotypes, it means that you'll get 1,000 lines in the output files. Hereafter a short example:

```
ENSG00000237438.1 7966 0.93436 2871.25 376.072 snp_22_17542810 25350 5.48728e-13 0.000999001 4.38131e-09
```

ENSG00000177663.8 8165 0.936488 3667.52 392.593 snp_22_17497295 -68549 0.000167489 0.376623 0.361675

ENSG00000183307.3 8279 1.03157 2719.43 369.406 snp_22_17587975 -14282 3.11324e-08 0.000999001 6.78199e-05

ENSG00000069998.8 8466 1.04834 2472.36 358.392 snp_22_17639837 -6340 5.01155e-11 0.000999001 1.26653e-07

ENSG00000093072.10 8531 0.934923 5838.18 406.904 snp_22_17699299 -39826 9.028e-05 0.246753 0.245706

In this file, the 10 columns correspond to:

- ID of the tested molecular phenotype (in this particular case, the gene ID)
- Number of variants tested in cis for this phenotype
- MLE of the shape1 parameter of the Beta distribution
- MLE of the shape2 parameter of the Beta distribution
- Dummy [To be described later]
- ID of the best variant found for this molecular phenotypes (i.e. with the smallest p-value)
- Distance between the molecular phenotype - variant pair
- The
**nominal**p-value of association that quantifies how significant from 0, the regression coefficient is - The
**slope**associated with the nominal p-value of association [only in version > v2-184] - A first
**permutation**p-value directly obtained from the permutations with the direct method. This is basically a corrected version of the nominal p-value that accounts for the fact that multiple variants are tested per molecular phenotype. - A second
**permutation**p-value obtained via beta approximation. We advice to use this one in any downstream analysis.

# Checking that the experiment went well

To check that the beta approximated permutation p-values are well estimated, load the data in R and make the following plot:

```
R> d = read.table("permutations.all.chunks.txt.gz", hea=F, stringsAsFactors=F)
```

R> colnames(d) = c("pid", "nvar", "shape1", "shape2", "dummy", "sid", "dist", "npval", "ppval", "bpval")

R> plot(d$ppval, d$bpval, xlab="Direct method", ylab="Beta approximation", main="Check plot")

R> abline(0, 1, col="red")

The expectation is to get all the points along the diagonal as on the plot above. This shows that unsignificant beta pproximated p-values are well estimated and therefore that the estimations went well.

# Controlling for multiple phenotypes being tested

Once obtained all permutation p-values for all the tested molecular phenotypes, we need now to account for the fact that many molecular phenotypes are tested whole genome in order to determine the significant QTLs. To do so, we propose here 3 approaches; from the most to the least stringent. First, load the data in R using:

```
R> d = read.table("permutations.all.chunks.txt.gz", hea=F, stringsAsFactors=F)
```

R> colnames(d) = c("pid", "nvar", "shape1", "shape2", "dummy", "sid", "dist", "npval", "ppval", "bpval")

### Bonferroni correction

Look here for some details on the correction and here for some details on the R function used in this example. To apply the Bonferroni correction on the permutation p-values derived from beta approximation, use:

```
```**R>** d$bonferroni = p.adjust(d$bpval, method="bonferroni")

Then, you can extract all significant MP-QTL pairs at α=0.05 and write them to a file **permutations.all.chunks.bonferroni.txt** using:

```
```**R>** write.table(d[which(d$bonferroni <= 0.05), c(1,6)], "permutations.all.chunks.bonferroni.txt", quote=F, row.names=F, col.names=T)

### Benjamini & Hochberg correction

Look here for some details on the correction and here for some details on the R function used in this example. To apply the Benjamini & Hochberg correction on the permutation p-values derived from beta approximation, use:

```
```**R>** d$bh = p.adjust(d$bpval, method="fdr")

Then, you can extract all significant MP-QTL pairs with a 10% false discovery rate (FDR) for instance and write them to a file **permutations.all.chunks.benjamini.txt** using:

```
```**R>** write.table(d[which(d$bh <= 0.10), c(1,6)], "permutations.all.chunks.benjamini.txt", quote=F, row.names=F, col.names=T)

### Storey & Tibshirani correction

Look here for some details on the correction and here for some details on the R function used in this example. To install the R/qvalue package on your system, look here. Then to apply the Storey & Tibshirani correction on the permutation p-values derived from beta approximation, use:

```
```**R>** library(qvalue)
**R>** d$st = qvalue(d$bpval)$qvalues

Then, you can extract all significant MP-QTL pairs with a 10% false discovery rate (FDR) for instance and write them to a file **whole_analysis.storey.permutations** using:

```
```**R>** write.table(d[which(d$st <= 0.10), c(1,6)], "permutations.all.chunks.storey.txt", quote=F, row.names=F, col.names=T)