Usage¶
Here is a Bash script with a usage example:
options=(
--snps Red_blood_cell_count-Harst2012-45_SNPs.gwas
--gene-matrix GeneAtlas2004.gct.gz
--gene-intervals NCBIgenes2013.bed.gz
--snp-intervals TGP2011.bed.gz
--null-snps Lango2010.txt.gz
--out out
--slop 10e3
--threads 4
--null-snpsets 0
--min-observations 100
--max-iterations 1e7
)
snpsea ${options[*]}
SNPsea will test SNPs associated with Red blood cell count for tissue-specific expression of linked genes across 79 human tissues in the Gene Atlas expression matrix. Each tissue will be tested up to 10 million times with matched random SNP sets, or testing will stop for a tissue if 100 matched SNP sets achieve a higher specificity score than the user’s SNPs.
Options¶
All input files may optionally be compressed with
`gzip
<http://www.gzip.org/>`__.
Required¶
--snps ARG Text file with SNP identifiers in the first
column. Instead of a file name, you may use
'randomN' with an integer N for a random SNP list
of length N.
--gene-matrix ARG Gene matrix file in GCT format. The Name column
must contain the same gene identifiers as in
--gene-intervals.
--gene-intervals ARG BED file with gene intervals. The fourth column
must contain the same gene identifiers as in
--gene-matrix.
--snp-intervals ARG BED file with all known SNP intervals. The fourth
column must contain the same SNP identifiers as
in --snps and --null-snps.
--null-snps ARG Text file with names of SNPs to sample when
generating null matched or random SNP sets.
These SNPs must be a subset of --snp-intervals.
--out ARG Create output files in this directory. It will be
created if it does not already exist.
Optional¶
--condition ARG Text file with a list of columns in --gene-matrix
to condition on before calculating p-values. Each
column in --gene-matrix is projected onto each
column listed in this file and its projection is
subtracted.
--slop ARG If a SNP interval overlaps no gene intervals,
extend the SNP interval this many nucleotides
further and try again.
[default: 10000]
--threads ARG Number of threads to use.
[default: 1]
--null-snpsets ARG Test this many null matched SNP sets, so you can
compare your results to a distribution of null
results.
[default: 0]
--min-observations ARG Stop testing a column in --gene-matrix after
observing this many null SNP sets with
specificity scores greater or equal to those
obtained with the SNP set in --snps. Increase
this value to obtain more accurate p-values.
[default: 25]
--max-iterations ARG Maximum number of null SNP sets tested for each
column in --gene-matrix. Increase this value to
resolve smaller p-values.
[default: 10000]
Input File Formats¶
--snps ARG
¶
You must provide one or more comma-separated text files. SNP identifiers
must be listed one per line. Lines starting with #
are skipped. If
the file has no header, the first column is assumed to contain SNP
identifiers. Otherwise, SNPsea looks for a column named (case-sensitive)
SNP
or snp
or name
or marker
.
head Red_blood_cell_count-Harst2012-45_SNPs.gwas
# Harst et al. 2012
# doi:10.1038/nature11677
# PMID: 23222517
# 45 SNPs associated with red blood cell count (RBC) taken from Table 1.
# Positions are on hg19. SNPs are included if $P \le 5e-8$.
CHR POS SNP P
chr1 40069939 rs3916164 3e-10
chr1 158575729 rs857684 4e-16
chr1 199007208 rs7529925 8e-09
chr1 248039451 rs3811444 5e-10
Instead of providing a file with SNPs, you may use “randomN” like this:
--snps random20
to sample 20 random SNPs from the ``–snp-intervals`` file.
--gene-matrix ARG
¶
You must provide a single gene matrix that must be in GCT format.
zcat GeneAtlas2004.gct.gz | cut -f1-4 | head
#1.2
17581 79
Name Description Colorectal_Adenocarcinoma Whole_Blood
1 A1BG 115.5 209.5
2 A2M 85 328.5
9 NAT1 499 1578
10 NAT2 115 114
12 SERPINA3 419.5 387.5
13 AADAC 125 252.5
14 AAMP 2023 942.5
--condition ARG
(Optional)¶
You may provide column names present in the ``–gene-matrix`` file, one per line. The matrix will be conditioned on these columns before the analysis is performed to help you identify secondary signals independent of these columns. Binary (0, 1) matrices will not be conditioned.
head conditions.txt
Whole_Blood
--gene-intervals ARG
¶
You must provide gene intervals in BED format with a fourth column that contains the same gene identifiers as those present in the Name column of the ``–gene-matrix`` GCT file. Only the first four columns are used.
zcat NCBIgenes2013.bed.gz | head
chr1 10003485 10045555 64802 NMNAT1
chr1 100111430 100160096 54873 PALMD
chr1 100163795 100164756 100129320 HMGB3P10
chr1 100174205 100232185 391059 FRRS1
chr1 10027438 10027515 100847055 MIR5697
chr1 100308165 100308317 100270894 RPL39P9
chr1 100315632 100389578 178 AGL
chr1 100433941 100435837 730081 LOC730081
chr1 100435344 100492534 23443 SLC35A3
chr1 100503669 100548932 64645 HIAT1
--snp-intervals ARG
¶
SNP linkage intervals must be specified in BED format and include a fourth column with the SNP identifiers. The linkage intervals assigned to the trait-associated SNPs you provide with ``–snps`` are taken from this file.
zcat TGP2011.bed.gz | head
chr1 0 254996 rs113759966
chr1 0 254996 rs114420996
chr1 0 254996 rs114608975
chr1 0 254996 rs115209712
chr1 0 254996 rs116400033
chr1 0 254996 rs116504101
chr1 0 254996 rs12184306
chr1 0 254996 rs12184307
chr1 0 254996 rs138808727
chr1 0 254996 rs139113303
--null-snps ARG
¶
The null SNPs file must have one SNP identifier per line. Only the first column is used. The identifiers must be a subset of the identifiers in ``–snp-intervals``.
zcat Lango2010.txt.gz | head
rs58108140 chr1 10583
rs180734498 chr1 13302
rs140337953 chr1 30923
rs141149254 chr1 54490
rs2462492 chr1 54676
rs10399749 chr1 55299
rs189727433 chr1 57952
rs149755937 chr1 59040
rs77573425 chr1 61989
rs116440577 chr1 63671
Output Files¶
The usage example shown above produces the following output files:
out/
args.txt
condition_pvalues.txt
null_pvalues.txt
snp_condition_scores.txt
snp_genes.txt
args.txt
¶
The command line arguments needed to reproduce the analysis.
cat args.txt
# SNPsea v1.0.2
--snps Red_blood_cell_count-Harst2012-45_SNPs.gwas
--gene-matrix GeneAtlas2004.gct.gz
--gene-intervals NCBIgenes2013.bed.gz
--snp-intervals TGP2011.bed.gz
--null-snps Lango2010.txt.gz
--out out
--score single
--slop 100000
--threads 8
--null-snpsets 0
--min-observations 100
--max-iterations 10000000
Repeat the analysis:
snpsea --args args.txt
condition_pvalues.txt
¶
The p-values representing enrichment of condition-specificity for the given SNPs.
head condition_pvalues.txt | column -t
condition pvalue nulls_observed nulls_tested
Colorectal_Adenocarcinoma 0.933555 280 300
Whole_Blood 0.521595 156 300
BM-CD33+Myeloid 0.159772 111 700
PB-CD14+Monocytes 0.103264 154 1500
PB-BDCA4+Dentritic_cells 0.0606256 187 3100
PB-CD56+NK_cells 0.194009 135 700
PB-CD4+T_cells 0.428571 128 300
PB-CD8+T_cells 0.531561 159 300
PB-CD19+B_cells 0.226819 158 700
null_pvalues.txt
¶
If the argument for ``–snps`` is the name of a file, the p-values for null matched SNP sets. You can compare these null results to the results for your trait-associated SNPs.
If the argument for ``–snps`` is “randomN” where N is some integer, like “random20” the p-values for random unmatched SNP sets, each with N SNPs.
The fifth column is the replicate index. The number of replicates performed is specified with ``–null-snpsets INT``.
head null_pvalues.txt | column -t
ColorectalAdenocarcinoma 0.056 84 1500 0
WholeBlood 0.236667 71 300 0
BM-CD33+Myeloid 0.55 55 100 0
PB-CD14+Monocytes 0.59 59 100 0
PB-BDCA4+Dentritic_Cells 0.59 59 100 0
PB-CD56+NKCells 0.71 71 100 0
PB-CD4+Tcells 0.383333 115 300 0
PB-CD8+Tcells 0.128571 90 700 0
PB-CD19+Bcells 0.168571 118 700 0
BM-CD105+Endothelial 0.386667 116 300 0
snp_genes.txt
¶
Each SNP’s linkage interval and overlapping genes. If a SNP is not found
in the reference file specified with ``–snp-intervals``, then the
name of the SNP will be listed and the other columns will contain
NA
.
head snp_genes.txt | column -t
chrom start end snp n_genes genes
chr4 55364224 55408999 rs218238 0 NA
chr6 139827777 139844854 rs590856 0 NA
NA NA NA rs99999999 NA NA
chr6 109505894 109651220 rs1008084 2 8763,27244
chr10 71089843 71131638 rs10159477 1 3098
chr2 111807303 111856057 rs10207392 1 55289
chr16 88831494 88903796 rs10445033 4 353,2588,9780,81620
chr7 151396253 151417368 rs10480300 1 51422
chr12 4320955 4336783 rs10849023 2 894,57103
chr15 76129642 76397903 rs11072566 4 26263,92912,123591,145957
snp_condition_scores.txt
¶
Each SNP, condition, gene with greatest specificity to that condition, and score for the SNP-condition pair, adjusted for the number of genes overlapping the given SNP’s linkage interval.
head snp_condition_scores.txt | column -t
snp condition gene score
rs9349204 Colorectal_Adenocarcinoma 10817 0.693027
rs9349204 Whole_Blood 896 0.285864
rs9349204 BM-CD33+Myeloid 896 0.236487
rs9349204 PB-CD14+Monocytes 29964 0.340561
rs9349204 PB-BDCA4+Dentritic_cells 29964 0.411727
rs9349204 PB-CD56+NK_cells 896 0.0356897
rs9349204 PB-CD4+T_cells 896 0.38182
rs9349204 PB-CD8+T_cells 896 0.332008
rs9349204 PB-CD19+B_cells 29964 0.255196