IMPORTANT UPDATE:

MMAP has been retooled to use the BLAS and LAPACK libraries. The current binary is compiled with the Intel Math Kernel Library. To ensure compatibility, only a static executable is currently available. As part of this upgrade some command line options have changed, so scripts written for previous versions will need to be modified. Options that have changed are shown in red. Full documentation and a new webpage are under development. Note that the documentation may change frequently over the next few months as updates, additions and corrections are made. Most of the updated documentation is found in the pdf links below.

MMAP Intel MKL Download

Latest MMAP binary is mmap.2016_12_06.intel.

Pedigree Files

The pedigree file is a comma delimited file with header containing the first 5 tokens:
	PED  	- Pedigree ID
	EGO	- Individual ID
	FA	- Father ID
	MO	- Mother ID
	SEX	- sex (1 = male, 2 = female)
The ID is alphanumeric. MMAP expects IDs to be unique across pedigrees as there is no pedigree information in the phenotype file. Missing parents are coded as 0. Parents that do not have records as individuals are ignored and single missing parents are allowed. Unknown sex is not supported. There are two additional columns (not ordered) that will be interpreted by MMAP, if present.
	MZTWIN	- non-zero integer as group identifier of genetically identical individuals
	COHORT	- non-zero integer that defines a group effect (see section on variance components)
The pedigree is assumed to be in ancestral order, which means parents are listed before offspring. Ancestral order is used for efficient computation of the relationship matrix. MMAP currently enforces ancestral order but allows parents to be listed that do not have their own records. If errors are encountered, MMAP exits and the errors are output to the file pedigree.csv, which can be used to correct the problems.

--ped <csv pedigree file>
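The ancestral-order requirement can be checked before a pedigree file is handed to MMAP. The following is a minimal sketch, not MMAP's own validation: the function name and the three-person pedigree are hypothetical, and it only assumes the PED, EGO, FA, MO, SEX header described above with missing parents coded as 0.

```python
import csv
import io


def check_ancestral_order(rows):
    """Return (EGO, parent) pairs where the parent's own record comes after the child's."""
    position = {row["EGO"]: i for i, row in enumerate(rows)}
    problems = []
    for i, row in enumerate(rows):
        for column in ("FA", "MO"):
            parent = row[column]
            # "0" is a missing parent; parents without their own record are skipped.
            if parent != "0" and parent in position and position[parent] > i:
                problems.append((row["EGO"], parent))
    return problems


# Hypothetical pedigree for illustration only: MOM is listed after her child.
example = """PED,EGO,FA,MO,SEX
1,DAD,0,0,1
1,KID,DAD,MOM,2
1,MOM,0,0,2
"""
rows = list(csv.DictReader(io.StringIO(example)))
problems = check_ancestral_order(rows)
print(problems)  # [('KID', 'MOM')]
```

Moving the MOM record above KID would make the list of problems empty.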

Additional pedigree options

--single_pedigree Instructs MMAP to ignore the pedigree ID and interpret the pedigree file as a single pedigree. This option is useful when creating covariance matrices that include between-pedigree values such as genetic similarity through genotype data.

Example pedigree contains two pedigrees. In the first pedigree there are two sets of genetically identical subjects: MGM and PGF, and 1,2 and 3.

Phenotype Files

Phenotypes are stored in a comma delimited file with header. The first column is assumed to be individual ID (independent of actual header token). Currently only numerical traits and covariates are supported; they are read in as real numbers, but can be coded as integers in the file. Missing data is coded as blank. If individual id is not in the first column, the --phenotype_id option must be used.  

--phenotype_filename <phenotype file> : file to be read in

--trait <trait1> <trait2> ... <traitN> : list of traits to be analyzed. Currently only a single trait is supported. Multitrait analysis is being implemented.

--phenotype_id <header token> : optional; specifies which column contains the individual ID

Example phenotype file contains data for 5 subjects from the pedigree file, with the subject identifier SUBJECT and the trait BMI. Since SUBJECT is not in the first column, --phenotype_id SUBJECT is required. Subject PGM is missing the BMI measurement.

Covariate Files

Covariate files are optional comma delimited files with a header line containing the covariate names. Covariate files also assume the first column contains the individual ID unless --phenotype_id is used. MMAP first searches the phenotype file for the specified covariates, then the covariate files in the order given. Thus, the first instance of a covariate is used if it is present in multiple files.

--covariates <cov1> <cov2> … <covN> A list of covariates to be included in the analysis. If a covariate is not found in the phenotype file and/or covariate files, then the program exits. Individuals missing any covariate value are excluded from the analysis.

--covariate_filename <file1> <file2> … <fileN> Covariates specified by the --covariates option are searched for first in the phenotype file, then sequentially through the list of covariate files. Thus, this option is only needed if covariates are not in the phenotype file. The first file containing the covariate is used, so there is no merging of within-covariate information across files. Covariate files can contain columns with string values (these do not need to be removed), but only covariates coded as numerical are currently supported. Missing data is coded as blank, so it appears as “,,” in the file. Future options will include expanding a covariate into categorical values; for example, a single column containing four seasons would be expanded into 3 covariates. If no covariate files are given, MMAP searches only the file specified by --phenotype_filename. If a covariate is not found, MMAP exits.
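The search order described above (phenotype file first, then each covariate file in turn, first match wins) can be sketched as follows. This is an illustration of the rule only, not MMAP code; the function name and the header contents are hypothetical.

```python
def resolve_covariates(covariates, files):
    """Map each covariate to the first file whose header contains it.

    `files` is an ordered list of (filename, header_tokens) pairs,
    with the phenotype file first.
    """
    source = {}
    for covariate in covariates:
        for filename, header in files:
            if covariate in header:
                source[covariate] = filename
                break  # first match wins; later files are not merged in
        else:
            # Mirrors the documented behaviour of exiting when a covariate is missing.
            raise LookupError(f"covariate {covariate} not found in any file")
    return source


# Hypothetical headers for illustration only.
files = [
    ("pheno.csv", ["MYEGO", "HDL", "AGE"]),
    ("covarA.csv", ["MYEGO", "AGE", "SEX"]),
    ("covarB.csv", ["MYEGO", "BMI"]),
]
source = resolve_covariates(["AGE", "SEX", "BMI"], files)
print(source)
```

Here AGE is taken from pheno.csv even though covarA.csv also contains it, because the phenotype file is searched first.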

Example covariate file contains data for 4 subjects from the pedigree file. The individual identifier SUBJECT must be the same as in the phenotype file. Subject 11 is missing AGE covariate.

NOTES

Genotype Files

MMAP uses binary only files for genotype analysis. For genome-wide association analysis the marker-by-subject (MxS) format is most efficient. MMAP provides utilities to convert comma separated text files with header to binary. The basic MxS format is as follows:
	SNPNAME  	- Marker identifier
	RSNUM	- rsnumber (or second SNP identifier)
	CHR	- numeric,X,Y,XY,MT
	POS	- Position in base pairs
	STRAND	- +,- or blank
	NON-CODED_ALLELE - homozygote has dosage 0
	EFFECT_ALLELE	- homozygote has dosage 2

Only SNPNAME is required to be present; if the other tokens are not found, default values are entered: 0 for CHR and POS, ? for STRAND, and 1 and 2 for NON-CODED_ALLELE and EFFECT_ALLELE, respectively. The file can contain any number of additional columns, but they must come before the list of individual IDs. These additional columns can be included in the genotype file and used as filters or annotation by referencing the token in the header line in the appropriate command.

Genotypes are stored in a variety of formats. Observed data is stored using 16 codes that represent phased and unphased states, partially typed states, and missing values. Thus, observed data is stored in a single byte. Imputed dosages can be stored as one or two bytes by scaling the value, or as a double. One-byte storage multiplies the dosage by 100, giving 2 decimal place accuracy; two-byte storage multiplies by 10,000, giving 4 decimal place accuracy. If the original data is imputed dosage rather than observed genotypes, the appropriate command line option must be added to tell MMAP what data to expect.

MMAP assumes comma-separated files (csv), but also accepts space delimited files when --genotype_space_delimiter is added to the command line. MMAP also accepts gzipped genotype files as input; no additional command line options are necessary.
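The dosage scaling described above can be illustrated with a short sketch. It only shows the arithmetic of the two scale factors; the function names are hypothetical, and rounding to the nearest integer and unsigned storage are assumptions, since MMAP's actual byte layout and missing-value codes are not described here.

```python
import struct

CHAR_SCALE = 100       # one byte, 2 decimal place accuracy
SHORT_SCALE = 10_000   # two bytes, 4 decimal place accuracy


def encode_dosage(dosage, scale):
    """Scale a dosage in [0, 2] to an integer code (rounding is an assumption)."""
    if not 0.0 <= dosage <= 2.0:
        raise ValueError("dosage must lie between 0 and 2")
    return round(dosage * scale)


def decode_dosage(code, scale):
    """Recover the dosage from its integer code."""
    return code / scale


dosage = 1.2345
char_code = encode_dosage(dosage, CHAR_SCALE)
short_code = encode_dosage(dosage, SHORT_SCALE)

# The largest dosage, 2, scales to 200 and 20000, so the codes fit in 1 and 2 bytes.
one_byte = struct.pack("<B", char_code)
two_bytes = struct.pack("<H", short_code)

print(char_code, decode_dosage(char_code, CHAR_SCALE))
print(short_code, decode_dosage(short_code, SHORT_SCALE))
```

The one-byte code loses the third and fourth decimal places of the dosage, while the two-byte code keeps them.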

Example genotype file contains data for 5 subjects from the pedigree file using Affymetrix SNP ids as SNPNAME. The token STRAND is missing so the value will be set to ?. Marker SNP_A-2236359 has no rsnumber and is a deletion. Missing genotype values are coded as 3. There are two additional columns GENE and GROUP that can be included in the binary genotype file if desired. The following two commands are required to convert the text file to binary.

--write_binary_genotype_file --csv_input_filename <input file> --binary_output_filename <output file> converts the input file into MMAP binary format in the output file.

--num_skip_fields The number of columns to skip before the first subject ID. This command is required.

--genotype_dosage_short : stores dosage as 10000*value, so suitable for dosage with 4 decimal place accuracy

--genotype_dosage_char : stores dosage as 100*value, so suitable for dosage with 2 decimal place accuracy

--genotype_dosage_double : stores dosage as value in file

--genotype_space_delimiter : add if genotype file is space rather than comma delimited

--additional_marker_attributes <token1> <type1> … <tokenN> <typeN> The type is C for character string, D for double, or I for integer. This option will include the additional columns from the genotype text file, if present.

--output_marker_attribute <token1> <token2> … <tokenN> The output file for GWAS will contain the standard tokens. Additional columns can be included using this option.

Options for marker analysis

--chromosome <chrA> … <chrN> Analysis is restricted to the chromosomes listed on the command line. Currently limited to autosomes. Non-autosome chromosomes are designated by standard nomenclature: X, Y, XY, and MT.

--genomic_region <chrA> <bp_startA> <bp_endA> … <chrN> <bp_startN> <bp_endN> Analysis is restricted to genomic regions specified by chromosome and bp window. Positions are given in base pairs. --chromosome 4 would be the same as --genomic_region 4 0 5000000.

--marker_set <text file> Analysis is restricted to the set of markers in the marker file. MMAP searches the SNPNAME and RSNUM columns to match the markers listed. The file has no header.

Analysis set

MMAP takes the intersection of subjects in the pedigree and phenotype files, and also the covariate and genotype files if present, to generate the set of subjects used for analysis. Subjects with missing phenotype or covariate values are dropped. MMAP has an option to specify a subject set file that, if present, is also included in the intersection. In the phenotype and covariate examples above, the analysis set for BMI with covariate AGE would be subject F from pedigree 1 and subject 12 from pedigree DM. If the genotype file is included, the analysis set contains only subject F.

--subject_set <input file> Single column file with no header that controls the individuals included in the analysis. This set is intersected with the individuals that have data in the phenotype and covariate files.
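The way the analysis set is formed can be sketched with plain set intersections. This is an illustration of the rule stated above, not MMAP code; the function name, subject IDs and values are hypothetical, and missing values are represented as None.

```python
def analysis_set(pedigree_ids, trait_values, covariate_values=None,
                 genotype_ids=None, subject_set=None):
    """Intersect the subjects available in each input, dropping missing values."""
    keep = set(pedigree_ids)
    # Subjects with a missing trait value are dropped.
    keep &= {sid for sid, value in trait_values.items() if value is not None}
    if covariate_values is not None:
        # Subjects missing any covariate value are dropped.
        keep &= {sid for sid, row in covariate_values.items()
                 if all(value is not None for value in row.values())}
    if genotype_ids is not None:
        keep &= set(genotype_ids)
    if subject_set is not None:
        keep &= set(subject_set)
    return keep


# Hypothetical data for illustration only.
pedigree_ids = ["A", "B", "C", "D"]
trait_values = {"A": 24.1, "B": None, "C": 30.2, "D": 27.5}
covariate_values = {"A": {"AGE": 51}, "C": {"AGE": None}, "D": {"AGE": 40}}
genotype_ids = ["A", "B", "C"]

without_genotypes = analysis_set(pedigree_ids, trait_values, covariate_values)
with_genotypes = analysis_set(pedigree_ids, trait_values, covariate_values, genotype_ids)
print(sorted(without_genotypes), sorted(with_genotypes))  # ['A', 'D'] ['A']
```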

Running MMAP

MMAP requires that the binary relationship matrix be computed before any phenotype or genotype analysis. This matrix is then read in for other analyses.

--compute_binary_relationship_matrix_by_groups <output file> --group_size <value> Computes the pedigree specific relationship matrix (twice kinship) and stores it in binary format to be read during analysis. Pre-computing this matrix is required to avoid recomputation for each analysis. The algorithm computes the matrix by groups to handle the memory requirements of large pedigrees, as the matrix requires NxNx8 bytes of memory, where N is the number of individuals. The binary file size is of order NxNx4 bytes; thus, depending on the application, it may be more efficient to restrict the calculation to phenotyped individuals rather than the full pedigree by adding the --subject_set option. For example, when analyzing a Holstein pedigree of size ~240K with 60K genotyped animals, only the 60K x 60K matrix was generated. For human pedigrees it is recommended to generate the full matrix as they rarely reach this size, even when the --single_pedigree option is used.
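To make "twice kinship" concrete, the sketch below builds a relationship matrix with the standard tabular method, which relies on the pedigree being in ancestral order. It is not MMAP's grouped algorithm and ignores MZTWIN and COHORT; the function name and the five-person pedigree are hypothetical.

```python
def relationship_matrix(pedigree):
    """Additive relationship matrix (twice kinship) by the tabular method.

    `pedigree` is a list of (ego, father, mother) tuples in ancestral order;
    "0" marks a missing parent. Rows and columns follow the order of `pedigree`.
    """
    index = {}  # ID -> row number, filled in as individuals are processed
    n = len(pedigree)
    a = [[0.0] * n for _ in range(n)]
    for i, (ego, father, mother) in enumerate(pedigree):
        f = index.get(father)  # None for "0" or a parent without a record
        m = index.get(mother)
        for j in range(i):
            # Relationship to an earlier individual is the average over the parents.
            value = 0.0
            if f is not None:
                value += 0.5 * a[j][f]
            if m is not None:
                value += 0.5 * a[j][m]
            a[i][j] = a[j][i] = value
        # The diagonal is 1 plus the inbreeding coefficient.
        inbreeding = 0.5 * a[f][m] if f is not None and m is not None else 0.0
        a[i][i] = 1.0 + inbreeding
        index[ego] = i
    return a


# Hypothetical pedigree: two founders, two full sibs, and a child of the sibs.
pedigree = [
    ("DAD", "0", "0"),
    ("MOM", "0", "0"),
    ("SIB1", "DAD", "MOM"),
    ("SIB2", "DAD", "MOM"),
    ("CHILD", "SIB1", "SIB2"),
]
a = relationship_matrix(pedigree)
print(a[2][3])  # full sibs: 0.5
print(a[4][4])  # inbred child: 1.25
```

Because each row only needs the rows of individuals listed earlier, a single pass over an ancestrally ordered pedigree is enough.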

Run the following command to compute the matrix. The default group size is 1000.

mmap --ped <pedfile> --compute_binary_relationship_matrix_by_groups --binary_output_filename <output file> --group_size <value>
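The NxNx8 and NxNx4 figures quoted above give a quick way to judge whether to compute the full matrix or restrict it with a subject set. The helper below is a hypothetical back-of-the-envelope calculation that uses only those two formulas and the pedigree sizes from the Holstein example.

```python
def matrix_sizes(n):
    """Return (memory_bytes, file_bytes) for an N x N relationship matrix."""
    memory_bytes = n * n * 8  # in-memory requirement, NxNx8 bytes
    file_bytes = n * n * 4    # binary file size, of order NxNx4 bytes
    return memory_bytes, file_bytes


for n in (60_000, 240_000):
    memory_bytes, file_bytes = matrix_sizes(n)
    print(f"N={n}: memory ~{memory_bytes / 1e9:.1f} GB, file ~{file_bytes / 1e9:.1f} GB")
```

For the 60K genotyped animals this gives roughly 28.8 GB in memory and 14.4 GB on disk, against roughly 460.8 GB and 230.4 GB for the full 240K pedigree.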

Output Options and Files:

--file_suffix <string> Adds string to output files to prevent clobbering from different analyses in the same directory

--all_output Generates additional output files that contain the likelihood values over the range of h2 values and the transformed phenotype files.

<trait>.<file_suffix>.poly.cov.csv Contains trait statistics, number of observations, h2 estimate and p-value, beta, standard error and p-value of the fixed effects (uses t-test), percent variation the fixed effects account for, and estimates of the total variance, additive variance, and error variance with standard errors. Also included are the standard error of the h2 estimate and the standard error of the standard_deviation estimate (square root of variance estimate). MMAP uses the expected values of the information matrix, so no covariances between fixed and random effects are generated. Computing the matrix at the MLE estimates may be added in the future.

<trait>.<file_suffix>.poly.cor.csv The correlation/covariance between the beta estimators using the Fisher information matrix. The diagonal of the matrix is the standard deviation. Since expected values are used in the information matrix, the covariances between fixed and random effects are assumed to be zero.

<trait>.<file_suffix>.poly.model.csv Contains the individuals used in the analysis, the observed phenotype, covariate values, fitted value and error residual calculated at the maximum likelihood estimate of h2. Other columns can be ignored and may be deleted in future versions. The column ERROR_RESIDUAL represents the residuals adjusted for the polygenic effect. These residuals can be treated as independent for analysis in programs that handle population samples.

<trait>.<file_suffix>.spor.cov.csv Same as poly version, but h2=0, so pedigree structure is ignored.

Example Commands

mmap --ped <pedfile> --read_binary_covariance_file pedigree.bin --trait HDL --phenotype_id MYEGO --phenotype_filename pheno.csv --covariates AGE SEX BMI --covariate_filename covarA.csv covarB.csv --file_suffix BMI

will analyze the trait HDL adjusting for covariates AGE, SEX and BMI. MMAP will look for the trait in pheno.csv, then for the covariates in the files pheno.csv, covarA.csv and covarB.csv. The subject ID is assumed to be MYEGO in all three files. The output files will start with HDL and include BMI in the filename. The relationship matrix will be read from the file pedigree.bin.

mmap --ped <pedfile> --read_binary_covariance_file pedigree.bin --trait HDL --phenotype_id MYEGO --phenotype_filename pheno.csv --covariates AGE SEX --file_suffix NO.BMI --binary_genotype_filename gwas.bin --model add --chromosome 4 X

is similar to the above, except that MMAP expects the covariates to be present in pheno.csv and BMI is dropped as a covariate. The file suffix is NO.BMI, which prevents the previous analysis results from being clobbered. MMAP will also perform marker analysis for chromosomes 4 and X using the additive model.

Working with the binary genotype file

MMAP.genotype.pdf

Importing data from other formats: PLINK, MINIMAC, IMPUTE2, VCF

MMAP.import.export.pdf

Genomic Relationship Matrices and PCs

MMAP.genomic.matrix.pdf

Variance Component Estimation

MMAP.variance.components.pdf

Pedigree and Environmental Covariance Matrices

MMAP.covariance.matrix.pdf

GxG, GxE, ExE Interaction Analysis and Sandwich Estimators

MMAP.interaction.pdf

Score Tests (including SKAT)

MMAP documentation for score tests MMAP.score.tests.pdf

R script to convert MMAP prepScores output into a skatCohort object mmap2seqMeta.R

MMAP scripts and examples for running score tests MMAP.score.test.tar.gz

MMAP snpinfo file used in CHARGE exome chip analysis SNPInfo_HumanExome_12v1_rev5_AnalysisCols_noDups.tab

..........

Acknowledgements

Thanks to Larry Bielak, May Montasser, Ankita Parihar, Lindsay Fernandez-Rhodes, Francesca Pavani, and Laura Yerges-Armstrong for testing, bug reports and feedback.