Computer workflows

Enter the name for this tabbed section: GenomeStudio

last update: 23/02/2012

Introduction

GenomeStudio is software by Illumina used to score SNPs based on the Illumina BeadExpress platform.

Example of a SNP that is typed well. There are three well-defined groups with limited overlap between them. X and O indicate calling errors based on imported pedigree relationships between the samples.

SNP calling is based on the ability to differentiate between three well-defined groups, two homozygotes and the heterozygotes. The less overlap there is between these groups, the better the SNP has been typed.

Populations should be scored separately. Different populations may have inversion differences, which will make mapping hard. If they have no common pedigrees, there is no advantage in scoring them together because no extra information becomes available. In contract, groups may be more obvious if scoring from a single populations is used.

Use all SNPs that you can, even if they are expected to be on the same genomic location. If they are, they can be removed at the mapping stage.

This is an iterative process, where each step affects previous steps and clarifies whether data look ok or not.

Remember to right click and select 'Recalculate statistics for all samples' in the samples table, after removing samples from the analysis.

Insert pedigree information

Pedigree information reveals scoring errors. Those errors may reflect errors in the pedigree file or the scoring, and can be used to clean up the data.

For the D. montana project, the names used in the pedigree file containing phenotypic information were bigger than the names on the eppendorfs containing the DNA, because the full name was too long to fit on eppendorfs. The pedigree name is a letter followed by a number and letters for each generation.

For example K154DBA is a 'K' pedigree (Kuusamo-Finland, the other one would be 'V' for Vancouver-Canada), line 154, F3 generation (as DBA is three letters, one for each generation).

Once pedigree information is added, SNP typing incompatible with the pedigree is indicated on the SNP graph. Any samples run twice are also indicated:

X indicates offspring with PC or PPC[^PCexplanation] errors based on a given pedigree.
O indicates parents with PC or PPC errors.
Square indicates errors in reproducibility, i.e. when the same sample is genotyped twice and the two replicates fall into different groups.

[^PCexplanation]: Parent-Child or Parent-Parent-Child

Replicate handling

Replicate samples give an idea of how well the typing works. Each sample must have an individual name, so the worse repeat is named to include 'repeat' in the end of its name.

Add replicate data

Put all files from a BeadExpress run into a folder
File > Load Additional Samples > Load samples based on folder with intensities

For the D. montana project there are two pedigrees and three GenomeStudio files, one for each pedigree and one containing both. The additional run with replicate data contained individuals from both replicates.

For the individual pedigree files:

Mark the files from the wrong pedigree for exclusion.
Save a copy of the file as.
Tick box to remove samples marked for exclusion.
Right click on each sample in the Samples Table and edit the names

Define the replicate samples. Analysis > Edit Replicates. Either import a text file where each row contains the names of the replicate samples, or define them from the 'Edit Replicates' popup window.

Remove worst replicate

Replicates could have been useful in defining groups to use for calling SNPs because they are expected to show the variance in calling the same sample.

Unfortunately the data were too messy in the D. montana project because the DNA extractions were not high quality. I decided to ignore the duplicate sample information and retained only the duplicate sample with most SNP calls:

Sort by Sample ID in the Samples Table and look for the name of sample with 'repeat' next to it.
Decide which run best by comparing the call frequency of these samples.
Add the sample with the least SNP calls into the group 'worse repeat'. This is a group I made for the purpose.
Right click, choose the 'worse repeat' samples, right click on one of them and 'exclude those samples from the analysis'.
If the originally run sample was the worst, rename it by appending 'worse repeat' to its name, and rename the best repeat to just the sample name, so that it matches the pedigree name.

Add gender information

Gender can be useful in defining groups and identifying sex linked samples.

Create a .csv file with two columns, Sample ID and Gender.

Create csv with sex information

The instructions assume there is an Excel file with the pedigree (ID, mother, father) and sex information, from which the correct columns need to be extracted.

Make sex.csv file

To make the sex.csv file, copy the pedigree name along with the sex column. Copy both the pedigree individuals AND the parental generation, i.e. the generation with information on sex, but unknown parents. Paste to a new text file and convert to CSV[^convertCSVperl]. Add Sample ID,Gender for header.
Import in GenomeStudio File > Import Phenotype Information from File...

[^convertCSVperl]: In the Terminal perl -pe 's/\t/,/g' tabdelimitedFile.txt > sex.csv

Click the 'Column Chooser' button above the Samples Table. Add the Gender column in the Displayed columns box, close to the Sample ID so that sorting by either is easy.

Upon successful import, the Gender column of the samples should be populated. It may not be visible by default, but it is useful to be able to sort by this column. To make it visible using the interface from 'Column Chooser'.

Improve calling

Exclude all individuals with call rates < 0.65 and make clusters automatically

Manually re-cluster each locus

Zero SNPs if:

They look ugly
Have low normalised R (y axis on graph)
Have too low polymorphisms to be useful

The Errors Table. If duplicate samples remain, they are also indicated and count towards the total.

X-linked genes are expected to have more pedigree errors.

The general rule is to move and resize the clusters to minimise the number of pedigree errors. However some errors may be there as a consequence of inherent bad calling in the particular study (for example due to low DNA quality). So also try to define three separate clusters.

Holding Shift, click and drag the centre of a cluster to move it and click and drag on its edges to resize it. The number of pedigree errors is visible in the bottom of the Errors Table.

Remove individuals with low call rates relative to others

Once re-clustering is done, rank by call rate. Remove any individuals with relatively low call rates when compared to other individuals.

Identify sex linkage

Typical indicator of sex linkage is for males to be under-represented amongst the heterozygotes, and have lower than average signal intensity.

To identify potential X linkage of a SNP, sort by Gender and then Het Excess. X-linked SNPs will have large and negative values for Het Excess. Visually, no male heterozygotes should exist on the graph showing the clusters.

If the DNA concentration of all samples was similar, males should have half the signal intensity than females, on X linked markers, because they contain a single copy of the SNP.

Remove mistyped males

Remove individual males that are heterozygous for X-linked loci, as judged by all other male data. If they are in HWE the loci may be autosomal after all.

Use duplicate samples to identify clusters

This approach did not work for the D. montana project. I kept the SNP run with the most calls amongst samples, see Remove worst replicate.

Minimise PPC errors

Exclude X-linked loci when estimating PPC errors.

Remove all individuals with high error rates, or check their pedigrees.

Remove loci with unusually high error rates.

Make a list with the names of X-linked markers

This is used later to modify genotyping information of the males for these markers, to help with genetic map construction.

Add the X-linked markers back to the data

Remove loci with high error rates

If any loci still have high error rates after excluding X-linked markers and pedigree errors, remove them.

Output

In GenomeStudio

Click on 'Report Wizard...'
Choose Final Report > All Samples > 50th percentile GC score ? > Remove excluded samples from report > Remove zeroed SNPs from report > Matrix - do not Include GenCall score > Finish.

Final report 3rd window. Choose 'Matrix'.

In the Terminal/Text editor

Make a SNP scoring file for R:

Delete the first SNP report lines in a text editor.
Transpose the file[^transposeLink] perl transposeTabDelimited.pl inputFile.txt > SNPData.txt.
Add 'ID' in the very beginning before the tab of the first line, to make it compatible for R import.

Make a pedData.txt file. This should be tab-delimited, contain the same name of individuals as the SNPData.txt and contain the headers:

ANIMAL_ID   SIRE_ID DAM_ID

In R

The following commands read the files and combine them based on the column names 'ANIMAL_ID' and 'ID', and export the merged information.

KData<-read.table("/Users/zabameos/Dropbox/GTD/Inbox/kdataManip/Ktransposed.txt", header=T)
VData<-read.table("/Users/zabameos/Dropbox/GTD/Inbox/kdataManip/Vtransposed.txt", header=T)
pedData<-read.table("/Users/zabameos/Dropbox/GTD/Inbox/kdataManip/pedData.txt", header=T)
sexData<-read.table("/Users/zabameos/Dropbox/GTD/Inbox/kdataManip/aPedSex.txt", header=T)
pedSexData=merge(pedData, sexData, by.x="ANIMAL_ID", by.y="Sample_ID", all=F )
mergedKData=merge(pedSexData, KData, by.x="ANIMAL_ID", by.y="ID", all=F )
mergedVData=merge(pedSexData, VData, by.x="ANIMAL_ID", by.y="ID", all=F )
write.table(mergedVData, file="VMerged.txt", quote=F, row.names=F, sep="\t")
write.table(mergedKData, file="KMerged.txt", quote=F, row.names=F, sep="\t")

Back in the Terminal/Text editor

Generate a tab between each allele of a SNP and convert missing data to 0s:

perl -pe 's/\-\-/0\t0/g ; s/\t(\D)(\D)/\t$1\t$2/g' pedigreeMerged.txt > out.txt

In a text editor:

Delete the first rows up to the sample names, the names of which have been scrambled.
Copy the sample names from the pedigreeMerged.txt file.
For the first row (just copied) replace tabs with double tabs for the columns with marker names, and paste on the top of the file.

Check all is well by opening in Excel. The input crigen expects is of the form:

Example of crigen input.
ID	Marker1		Marker2		Marker3		Marker4		Marker5
1	4	4	3	4	2	1	3	1	2	1
2	1	4	4	4	1	2	2	3	6	1
3	1	1	3	4	1	1	1	1	4	2

Change typing of males to make the X-chromosome mappable

Males should all be homozygotes for X markers, which is obvious for some markers in Genomestudio.

These males should be manually made heterozygote with another allele. However it must be one of the DNA letters, e.g. T. Use anything other than what the males already are. The Pedigree file should also contain sex information.

Also take the opportunity to zero any heterozygote males.

Format the pedigree for crimap

Crimap does not accept letters on the pedigree names. Originally I used a perl script, however it did not result in unique names. I then manually changed the names by find and replace for all individuals in Excel to make sure the names are unique.

This also allowed to visualise how many of the parents are not scored, because I renamed sequentially based on the offspring row, so if some offspring were not there but were parents of other individuals, they would have kept their original names in the parent columns. I then sequentially renamed the parent columns.

It is useful to get a feel of the data by doing it manually, but it is much faster to do it in R.

Rename factors to numbers in R

First collect all columns from the pedigree files into a single column and sort alphabetically in Excel. Output as text as pedigreeNames.txt, no header is needed. Also copy the first three columns of the excel file into a correctVPed.txt file. Import both in R.

pedData<-read.table("/Users/zabameos/Desktop/temp/correctVPed.txt", header=T)
pedigreeNames<-read.table("/Users/zabameos/Desktop/temp/pedigreeNames.txt", header=F)

The following code renames the individual names to numbers and applies the naming to the pedigree file.

oldNames <- unique(pedigreeNames$V1)
newNames <- seq(1:length(oldNames))
nameData <- data.frame(oldNames, newNames)

newID = merge(pedData, nameData, by.x="ANIMAL_ID", by.y="oldNames", all=F )
names(newID)[names(newID)=="newNames"] = "newID"
newSireID = merge(newID, nameData, by.x="SIRE_ID", by.y="oldNames", all=F )
names(newSireID)[names(newSireID)=="newNames"] = "newSireID"
newDamID = merge(newSireID, nameData, by.x="DAM_ID", by.y="oldNames", all=F )
names(newDamID)[names(newDamID)=="newNames"] = "newDamID"

allPedigreeNames <- data.frame(newDamID)
write.table(allPedigreeNames, file="allPedigreeNames.txt", quote=F, row.names=F, sep="\t")

The output file contains the original input pedigree and the new names.

Replace the number that replaced 0 in the original pedigree with 0.
Sort it by ID in Excel and make sure the Excel file is sorted by ID. Copy and paste the new names onto the old ones in Excel.
Copy all data to a text file[^linuxBreak] and save with a .gen extension.

Congratulations! You have a file that can be used by Crigen.

Unexplained errors - Thoughts

Program bugs

It is not possible to unmark a sequence, so be careful of what markings you use.

Once some sequences are removed from the analysis, they need to be manually processed again. I encountered this when excluding the X linked markers for identifying real pedigree errors: they have to be clustered again manually.

General

Pedigree names were complicated to keep track of, and resulted in duplicate shortened names on the eppendorfs which were tricky to track down and had to be excluded. A better approach would have been to just name the individuals sequentially.

Male-biased failure

Example where male samples failed to amplify, while some female samples worked.

In some cases, mostly in the Vancouver pedigree (40 occasions compared to 8 in Oulanka) the SNPs of males would fail, while some female SNPs would work.

One possible explanation is that these are sex-linked markers and females have more targets that can amplify, and are therefore biased to work. In any case most of the samples did not work, so the SNPs were considered unscorable.

Action taken - SNPs considered unscorable, as some females would also fail. For example in the figure only 68 + 13 samples worked, so most samples failed.

Above-average pedigree errors

Examples where there were more pedigree errors than an average SNP for the particular pedigree. Moving the groups did not result in fewer errors in these cases.

Both Vancouver and Oulanka had some pedigree errors even after carefully checking the pedigrees for duplicates, wrong names etc. The number of errors was similar for most SNPs and can be traced to its source. However some SNPs showed an uncharacteristically high number of pedigree errors. These may sometimes be removed by changing the group assignment manually.

Example where pedigree errors have resulted from bad group calling.

There was usually no alternative group calling that reduced the number of errors. One possible source is multiple targets to the same SNP, for example if the SNPs are on transposable elements.

One possible explanation is that sequences with these SNPs are repetitive DNA, so there were multiple targets for them in the genome.

Action taken - Removed from the analysis.

Heterozygote-biased pedigree errors

Example where most pedigree errors occur amongst heterozygotes.

For some samples most errors seemed to be on female heterozygotes (if there were male heterozygote errors the SNP might be sex-linked).

One possibility is that those samples are sex-linked, but were not called as such because there are data on only one homozygote. However in many cases they included females. The samples are particularly dodgy if one of the homozygotes is completely missing.

Action taken - Removed from the analysis.

X-linked marker deficit

In D. montana the X is the largest chromosome so about 25% of the markers were expected to be on it. Much fewer markers are X-linked. This may be because SNPs were called from both sexes, so there would be 75% coverage on the X compared to the autosomes, which may have biased against X-markers due to low reads achieved.

Be conservative

The golden rule of mapping is that 'No data is better than bad data'.

Example of conservative scoring. The old scoring tries to minimise pedigree errors while scoring as many individuals as possible, while the conservative scoring is trying to minimise pedigree errors.

Pedigree errors show only some cases of bad SNP calls. Assuming that there are no pedigree errors[^pedigreeErrors], because the pedigree file has been checked specifically for them, the errors seen are due to wrong assignment of SNPs to groups. In general for the first attempts at mapping it is better to make the groups smaller and exclude individuals further away from the region showing pedigree errors. This ensures that the individuals with wrong genotypes are minimised.

Keep only pedigree individuals kept in scoring

Some individuals failed to score, even though there is pedigree information in them. It is a good idea to manually check which these individuals are to detect any mistakes.

In Excel, copy the original pedigree below the names of the scored samples. Make the original pedigree bold and sort by individual ID.

Cut and pasted any parents of rows that existed in the scored pedigree (they are one row below or above). This is a manual step but important to identify any leftover mistakes in the pedigree structure, for example the same individual ID having multiple parent pairs.

This needs to be done once. The final output is the pedData.txt, which can be used to automatically add pedigree information to other files, such as SNP scoring from GenomeStudio or individual phenotypic information from other text files.

[^pedigreeErrors]: Real pedigree errors can be picked up during mapping.

[^transposeLink]: Use the perl script transposeTabDelimited.pl. It is possible to transpose in Excel too (select all) copy > paste special > (tick) transpose

[^linuxBreak]: Don't save from Excel, because the .gen file needs to have Linux line format line endings.

Enter the name for this tabbed section: Crimap

last update: 16/10/2013

pdf version

Background

The original Crimap is out of date. It was first written when allozymes were the prevalent molecular markers. Consequently, it has some problems dealing with large pedigrees and large numbers of markers that article available for analysis today.

To deal with this we use the Crigen suite of programs, which prepare text files to make Crimap more capable and efficient with handling modern datasets which are larger and more complicated.

An improved version of Crimap was produced by Jill Maddox and Ian Evans (2.504) which is generally better than the old. However the Crigen suite of programs is not compatible with the improved Crimap.

Consequently, analysis starts with Crigen, then the old Crimap, then continues with Crigen (Autogroup) to produce chromosome-specific files, and then moves to the new Crimap which is better for the analysis.

Michael D Grosz and Monsanto have been providing the Crigen code to researchers to serve the community, and have kindly granted permission to distribute the Crigen programs through this site.

Jill Maddox and Ian Evans, who are still developing the improved Crimap, provide it for free through the animal genome website. Downloading requires to fill a short form with information on how it is used. It is important to provide as much information as possible, to help with making sure it works properly with current experimental designs, and to help with funding which is easier if a large user base can be demonstrated.

It is best to compile Crimap on the particular machine or OS used, to make sure it works with the available memory efficiently. For example it is possible fora pre-compiled version to crash when it runs out of memory, but a version compiled on the same machine to work fine. In practice, a pre-compiled version should work well enough, except if it crashes, in which case you should compile your own version which may not crash.

In practice I start the analysis with Crimap 2.4 and Crigen on Ubuntu Linux, running on a virtual machine on MacOSX because I could not compile the old programs in a modern machine. I then continue to the Crimap 2.5 which can be compiled on MacOSX (see Installation Instructions)

All commands have more options with which to run, however the following steps can be used as a backbone of the analysis. The full options are available from the three manuals that come with Crimap.

Workflow

Prepare a .gen file

The general format of a .gen file is shown in the table.

ANIMAL_ID   SIRE_ID DAM_ID  Marker1 Marker2 Marker3   
1   0   0   A   C   A   T   A   A   
2   0   0   A   C   A   G   A   C   
3   1   2   A   C   A   G   A   C

Individuals must be named with numbers.

Sex-linked markers will be homozygotes for one sex (for example in males in Drosophila). Edit their genotyping so they are heterozygous, with one allele not present in females. For example a locus which is scored as either A or T and in males is either AA or TT should be changed in males to become AC or TC. Alternatively, the non existent allele can be set to -1.

Full instructions on how to prepare a .gen file from GenomeStudio are at the my GenomeStudio workflow.

Generate a Twopoint file for Autogroup

Separate the pedigree into families of about 50 individuals for 3 generations. These files are easier for Crimap to work with.

./crigen -g myData.gen -o 10 -size 50 -gen 3

./crimap 10 prepare

choose no no no and then option 7 (twopoint) yes yes

Note, segmentation errors can be reported during the 'prepare' command. I found that they were resolved by choosing Linux encoding for line endings in the .gen file. This is the default way text files are saved from BBEdit or Textmate.

![Save as window in TextMate. Choose 'LF (recommended)' for line encoding.][figLineEndingEncoding]

Alternatively, it is possible to convert line endings to Linux format in the terminal using

r '\r' '\n' < in.txt > out.txt

next, run twopoint

./crimap 10 twopoint > chr10.tpt

The output of this command differs between the new and old crimap versions. The old version provides the sex-average lod score while the new version provides the sex-specific lod score. Up to now we should be running the old version because the Autogroup command (below) expects its input in the old format.

chr10.tpt shows all two-way linkages between markers. Consult the lod scores (end of each line). They give a feel of how linked the markers are. The file can be consulted to find markers that are not linked, to be assigned to different chromosomes.

Identify obvious errors

Investigate the individuals that cause many inheritance errors by cross-referencing to their scoring in GenomeStudio. If the errors cannot be reduced, mark the individuals as un-scored for the error-prone markers (manually in the genotyping file). If the same individual causes errors with many markers consider removing the individual from the analysis.

An easy way to identify pedigree errors from a genotyping file is to use GenotypeChecker by the Roslin institute. The .gen file that Crimap uses needs to be converted to work with GenotypeChecker.

Update 131217 - A newer program VIPER has been developers by Roslin Institute, which is a little different but gives more information about errors in a compact form. I have not tried it out but it is probably worth trying out.

Prepare files for Genotype Checker

The following instructions generate files for Genotype Checker using R. They are adapted from a script provided by Jon Slate. The paths assume you are working on a Mac.

The two input files and the associated R script are available here, but also described in the text below.

Input files

We need two input files per chromosome:

A genotype file which looks like

Progeny Sire    Dam Sex MarkerA.1   MarkerA.2   MarkerB.1   MarkerB.2   
98  0   0   M   A   G   A   G
168 98  1   F   G   G   G   G
169 57  100 F   A   G   A   G
170 57  100 F   G   G   A   G
99  0   0   F   A   G   A   G

and a map file which looks like

marker  position
MarkerA 0.000
MarkerB 43.987

In R

Read in the genotype file

genotype_file<-read.table("/Users/username/Desktop/genotype_file.txt", header=T)

Retain just the pedigree information

GenotypeCheckerPedFile <- genotype_file[,1:4]

Import map data and add a marker order column to them

example.map <- read.table('/Users/username/Desktop/exampleMap.txt', header=T, sep="\t")
maporder <- example.map

Store the number of markers in a variable

m <- (ncol(genotype_file)-4) * 0.5

Store marker order in a column

maporder$maporder <- seq (1,m,1)

Make the genotype file for genotype checker

GenotypeCheckerMarkerFile <- genotype_file[,c(1,5,6)]
GenotypeCheckerMarkerFile[,4] <- maporder[1,1]
GenotypeCheckerMarkerFile <- GenotypeCheckerMarkerFile [,c(1,4,2,3)]
colnames(GenotypeCheckerMarkerFile) <- c("ID","Locus","Allele1","Allele2")

for (i in 2:m) {
    GenotypeCheckerTemp <- genotype_file[,c(1,(2*i+3),(2*i+4))]
    GenotypeCheckerTemp[,4] <- maporder[i,1]
    GenotypeCheckerTemp <- GenotypeCheckerTemp[,c(1,4,2,3)]
    colnames(GenotypeCheckerTemp) <- c("ID","Locus","Allele1","Allele2")
    GenotypeCheckerMarkerFile <- rbind (GenotypeCheckerMarkerFile,GenotypeCheckerTemp)
}

Find and replace unknown genotypes with ? required by genotype checker

GenotypeCheckerMarkerFile$Allele1 <- gsub ("^0","?", GenotypeCheckerMarkerFile$Allele1)
GenotypeCheckerMarkerFile$Allele2 <- gsub ("^0","?", GenotypeCheckerMarkerFile$Allele2)

Export data

write.table (GenotypeCheckerMarkerFile, '/Users/username/Desktop/GenotypeCheckerMarkerFile', col.names=F, sep="\t",quote=F,row.names=F)
write.table (GenotypeCheckerPedFile, '/Users/username/Desktop/GenotypeCheckerPedFile', col.names=F, sep="\t",quote=F,row.names=F)

Crimap pedigree error reporting

After using GenotypeChecker, this section is not as useful, but I have kept it in case someone finds it useful.

The crimap command above outputs inheritance errors in the terminal window. If they are too many, they do not fit in the terminal window even if scrolling up. The following steps recover them and use them to make a manageable text file which can be used to identify particularly error-prone individuals, which may reflect pedigree errors or bad DNA preparation for those individuals.

Run the script command in the Terminal. This outputs all Terminal text to a typescript.txt file. Run the commands as normal and then run exit. For example:

Enter the name for this tabbed section: Spike song analysis

Download

Enter the name for this tabbed section: PerlPrimer installation

How to install perlprimer

Instructions edited from here.

This worked on 30/09/2014 for two independent computers.

It is important to follow the steps in order. Makefiles made when one of the preceding steps was not performed will not work even after the step is performed. In that case, delete the folder and start by unzipping the downloaded file of the particular program.

1. Start with Perl >= 5.8.1.

My mac had 5.16.2

perl -v

2. Install libwww-perl.

Download

Run the following, as in the readme file

perl Makefile.PL  
make  
make test
sudo make install

3. Install Xcode from the app store

Once downloaded, run it once so it installs stuff.

4. Install X11

Through xquartz. Run the dmg file installer.

Restart the computer.

5. Install Perl/Tk from CPAN

Download latest release

perl Makefile.PL  
make  
make test
sudo make install

6. Install Spidey executable from NCBI

Download

Add it to the command path (e.g. /usr/local/bin)

7. Download Perlprimer

Perlprimer source (.zip archive).

To run, go to the Perlprimer folder and run

perl perlprimer.pl

It should open a GUI.