VCF to Lep-MAP

The following steps take a .vcf file with a full sib pedigree and turn it into a .linkage file for Lep-MAP.

It was originally written to analyse .vcf files from SEX-DETector, but instructions to analyse .vcf files from stacks are also included.

VCF filtering

Some steps below depend on the .vcf file, this is indicated next to the relevant command.

The .vcf from stacks v1.41 looked like this ('stacks.vcf').

\##fileformat=VCFv4.2  
\##fileDate=20160725  
\##source="Stacks v1.41"  
\##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">  
\##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">  
\##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">  
\##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">  
\#CHROM POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  id1 id2 id3  
un  2   3   C   T   .   PASS    NS=60;AF=0.300  GT:DP   ./.:0   ./.:0   ./.:0  
un  95  65  T   G   .   PASS    NS=59;AF=0.466  GT:DP   1/0:32  1/0:21  1/0:41

The .vcf from SEX-DETector looked like this ('V4.vcf').

\##fileformat=VCFv4.0  
\##source=reads2snp  
\##phasing=unphased  
\##FILTER=<ID=multi,Description="At least one individual shows more than 2 alleles">  
\##FILTER=<ID=unres,Description="Insufficient sequencing depth or uncertain genotyping in all individuals">  
\##FILTER=<ID=para,Description="Suspicion of paralogy">  
\##INFO=<ID=N,Number=1,Type=Integer,Description="Number of genotyped individuals">  
\##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">  
\#CHROM POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  id1 id2 id3  
comp308_c0_seq1|m.78    300 .   G   A   .   PASS    .   GT  0|1 0|1 0|0  
comp768_c0_seq1|m.214   275 .   T   C   .   PASS    .   GT  0|0 0|1 0|0

You need vcftools and Lep-MAP. The commands below should work if you replace the text myFile to your .vcf file name.

Filter the vcf to only keep good SNPs and remove indels.

vcftools --vcf myFile.vcf --min-alleles 2 --max-alleles 2 --max-missing 1 --mac 5 --remove-indels -c --recode > myFile_filtered.vcf

VCF to linkage file

Look at the header of the vcf to extract the individual names in the correct order after transposition (perl script at the end of this page).

head myFile_filtered.vcf | perl transposeTabDelimited.pl

The result will contain the individual names, copy and paste in a new text file to serve as a skeleton for the required pedigree file, which needs to be written by hand. It will look something like this:

F1_e  
F1_d  
F1_c  
F1_b  
F1_a  
father1  
mother1

In a text editor, add the following columns. Name the resulting file myFile.ped. Make sure to include the header with a #.

\# family   id  father  mother  sex(1M,2F)  phenotype(need one column at least, it can be 0)  
1   7   2   1   2   0  
1   6   2   1   2   0  
1   5   2   1   2   0  
1   4   2   1   2   0  
1   3   2   1   2   0  
1   2   0   0   1   0  
1   1   0   0   2   0

The first line is not read because of the # symbol.
If there is more than 1 family, add another number in the first column.
Unkown parents are marked with 0.
Males are 1, females are 2.
The phenotype column can be 0 as it is not used by Lep-MAP, but it must have a value.

Next, replace the .vcf genotyping encoding with Lep-MAP expectation, shorten contig names and remove header lines.

For V4.vcf

perl -pe 's/1\|1/2 2/g ; s/0\|1/1 2/g ; s/1\|0/2 1/g ; s/0\|0/1 1/g ; s/\|// ; s/.\|./0 0/g ; s/comp// ; s/seq/s/ ; s/\t/\_/' <(grep -v '#' myFile_filtered.vcf) > myFile_temp.txt

For stacks.vcf

perl -pe 's/\//\|/g ; s/\:\d+\t/\t/g ; s/\:\d+\n/\n/ ; s/1\|1/2 2/g ; s/0\|1/1 2/g ; s/1\|0/2 1/g ; s/0\|0/1 1/g ; s/.\|./0 0/g ;  s/\|// ; s/comp// ; s/seq/s/' <(grep -v '#' myFile_filtered.vcf) > myFile_temp.txt

Combine columns 1 and 9+ (V4.vcd) from the resulting file into one file, and transpose it.

For V4.vcf

paste <(cut -f 1 myFile_temp.txt) <(cut -f 9- myFile_temp.txt) | perl -pe 's/ /__/g' | perl transposeTabDelimited.pl | perl -pe 's/__/ /g' > myFile_temp2.txt

Same for stacks.vcf

paste <(cut -f 1 myFile_temp.txt) <(cut -f 10- myFile_temp.txt) | perl -pe 's/ /__/g' | perl transposeTabDelimited.pl | perl -pe 's/__/ /g' > myFile_temp2.txt

Paste pedigree and data with commented '#' top line to make a linkage file

paste myFile.ped myFile_temp2.txt > myFile.linkage

You should be done! Check it looks as expected

cut -f 1,2,3,4,5,6,7,8,9 myFile.linkage

The output should look like this (first 3 markers shown)

1   7   2   1   2   0   1 2 1 2 1 2  
1   6   2   1   2   0   1 2 1 2 1 2  
1   5   2   1   2   0   1 2 2 2 2 2  
1   4   2   1   2   0   1 2 2 2 2 2  
1   3   2   1   2   0   1 2 2 2 2 2  
1   2   0   0   1   0   1 2 2 2 2 2  
1   1   0   0   2   0   2 2 2 2 2 2

If SEX-DETector was run on multiple families

The .vcf from the different families need to be combined into a single .vcf file. This is not straightforward, because occasionally there is a different minor allele in the reference, in different families. If the families are small, it is also possible for tri-allelic SNPs to appear bi-allelic but at different variants, in the different families, and these SNPs need to be manually removed.

Suppose the families are called m6f4 and m3f5.

Make headers for files

less m6f4_filtered.vcf

Manually copy the individual names to a text editor, replace gaps with tabs, paste in Excel, transpose, paste back to a text editor, select columns and add _1 to end of names, enter and paste again to add _2 to names, paste to excel to transpose. Save as m6f4_header.txt. It should look like this (example with 3 individuals):

mR_1    mR_2    mM_1    mM_2    mI_1    mI_2

Make tped files

vcftools --plink-tped --vcf m6f4_filtered.vcf

Manually paste resulting tped file in Excel, remove columns 1, 3, 4 and add the header already made. Save as m6f4_SNPs.txt. The first three lines look like this (for 3 individuals):

ID  mR_1    mR_2    mM_1    mM_2    mI_1    mI_2  
comp308_c0_seq1|m.78:300    G   A   G   A   G   G  
comp768_c0_seq1|m.214:275   T   T   T   C   T   T  
comp1073_c0_seq1|m.289:210  A   A   A   G   A   A

Repeat for the other family (m5f3).

Once both files are ready, work in R

Read files

m5f3 <- read.table("m5f3_SNPs.txt", header=T)
str(m5f3)
m6f4 <- read.table("m6f4_SNPs.txt", header=T)
str(m6f4)
merged = merge(m6f4, m5f3, by.x="ID", by.y="ID", all=F)
write.table(merged, file="common_SNPs.txt", quote=F, row.names=F, sep="\t")

Markers with more than 2 alleles, or markers causing too long map distances, can be removed manually from the common_SNPs.txt file

Replace the tab between two alleles with spaces. Manually change the code according to the number of individuals.

awk '{print $1,"\t",$2,"\ ",$3,"\t",$4,"\ ",$5,"\t",$6,"\ ",$7,"\t", sep="\t"}' common_SNPs.txt | perl -pe 's/   /zaba/g ; s/ \t /\t/g' | perl transposeTabDelimited.pl | perl -pe 's/zaba/ /g' > common_SNPs_reformatted.txt

Should result in something like this:

ID  comp10005_c0_seq1|m.23150:473   comp10006_c0_seq1|m.4496:1446   comp10006_c0_seq1|m.4497:1008  
mR_1 mR_2       T C     C T     G A  
mM_1.x mM_2.x   T C     C T     G A  
mI_1 mI_2       T C     C C     G G  
mG_1.x mG_2.x   T C     C C     G G

Start making a pedigree file

cut -f1 common_SNPs_reformatted.txt > common_pedigree.txt

Manually turn into a pedigree file (the following, without the first column)

\#ignoreorigID  family  id  father  mother  sex(1M,2F)  phenotype(0)
R   1   5   1   2   1   0  
M.x 1   6   1   2   1   0  
I   1   7   1   2   1   0  
G.x 1   8   1   2   1   0  
F.x 1   9   1   2   1   0  
E.x 1   10  1   2   1   0  
D.x 1   11  1   2   1   0  
C.x 1   12  1   2   1   0  
A   1   13  1   2   1   0  
Q   1   14  1   2   2   0  
P   1   15  1   2   2   0  
O   1   16  1   2   2   0  
N   1   17  1   2   2   0  
L.x 1   18  1   2   2   0  
K   1   19  1   2   2   0  
J.x 1   20  1   2   2   0  
B   1   21  1   2   2   0  
6   1   1   0   0   1   0  
4   1   2   0   0   2   0  
O   2   22  3   4   1   0  
N   2   23  3   4   1   0  
M.y 2   24  3   4   1   0  
K   2   25  3   4   1   0  
H   2   26  3   4   1   0  
G.y 2   27  3   4   1   0  
F.y 2   28  3   4   1   0  
E.y 2   29  3   4   1   0  
D.y 2   30  3   4   1   0  
C.y 2   31  3   4   1   0  
B   2   32  3   4   1   0  
L.y 2   33  3   4   2   0  
J.y 2   34  3   4   2   0  
I   2   35  3   4   2   0  
A   2   36  3   4   2   0  
5   2   3   0   0   1   0  
3   2   4   0   0   2   0

remove first line from common_SNPs_reformatted.txt into new file common_marker_names.txt.

combine the two into a .ped file

paste common_pedigree.txt <(cut -f 2- common_SNPs_reformatted.txt) > out.ped

make map file

perl transposeTabDelimited.pl common_marker_names.txt > out.map

manually turn into map file in the following format (in Excel, the last column is a number for each marker, this is the number the marker name is converted to in Lep-MAP).

1   comp10005_c0_seq1|m.23150:473   0   1  
1   comp10006_c0_seq1|m.4496:1446   0   2  
1   comp10006_c0_seq1|m.4497:1008   0   3  
1   comp10006_c0_seq1|m.4497:1020   0   4

Download plink and use it on the out.map file.

./plink --file out --recode12 --tab

Works! Makes a out.ped file. Just replace -9 with 0 (phenotype column) and rename to combinedSNPs.linkage.

Also make a file with just the marker names and the corresponding Lep-MAP number.

cat <(echo -e "ID\tid_number") <(cut -f2,4 out.map) > marker_names.txt

Lep-MAP

You can now run Lep-MAP. I put the files in a project_name folder in the Lep-MAP directory, hence the ../bin part of the code, which differs to the Lep-MAP wiki.

java -cp ../bin Filtering data=myFile.linkage dataTolerance=0.01 > myFile_f.linkage

java -cp ../bin SeparateChromosomes data=myFile_f.linkage lodLimit=6 sizeLimit=4 > map1.txt

java -cp ../bin JoinSingles map1.txt data=myFile_f.linkage lodLimit=3 > map1_js.txt

for i in `seq 1 8`; do java -cp ../bin OrderMarkers data=myFile_f.linkage map=map1_js.txt chromosome=$i >order.SS$i.txt 2>order.SS$i.err & done

for i in `seq 1 8`; do java -cp ../bin OrderMarkers data=myFile_f.linkage map=map1_js.txt sexAveraged=1 chromosome=$i >order.SA$i.txt 2>order.SA$i.err & done

If you want to remove misbehaving markers, use removeMarkers=markername1 markername2

java -cp ../bin/ OrderMarkers map=map1_js.txt data=combinedSNPs._f.linkage sexAveraged=1 removeMarkers=5130 chromosome=8 > order8.SArm.txt

Combine all chromosome maps into one file, adding the filename in the first column of each line, remove lines without map location information, rename the column with the filename to just the LG number. Then make a file with only the information to use for mapping (LG, id and position (either common, male or female)) which is also added as a header. Do this for the sex specific and the sex averaged maps.

for i in $(ls | grep order | grep SA.txt); do awk '{print FILENAME,$0}' $i;done | grep -v COUNT | grep -v java | grep -v male | grep -v likelihood | perl -pe 's/order// ; s/.SA.txt /\t/' > LOD10_SA.txt

cat <(echo -e "LG\tid\tcpos") LOD10_SA.txt | cut -f 1-3 > LOD10_SA_data.txt

for i in $(ls | grep order | grep SS.txt); do awk '{print FILENAME,$0}' $i;done | grep -v COUNT | grep -v java | grep -v male | grep -v likelihood | perl -pe 's/order// ; s/.SS.txt /\t/' > LOD10_SS.txt

cat <(echo -e "LG\tid\tmpos\tfpos") LOD10_SS.txt | cut -f 1-4 > LOD10_SS_data.txt

Join the sex averaged and common files based on column 2 in both (id, file1 col2) and only keep LG (file1 col1), common position (file 1 col3), male position (file2 col3) and female position (file2 col4). If there are lines that are not joined, print NA. This only works if you first sort based on the columns to join by of the input files. Sort the final file by column 2 (LG) and then column 3 (common position).

join -1 2 -2 2 -o 1.2,1.1,1.3,2.3,2.4 -e NA <(sort -k2 LOD10_SA_data.txt) <(sort -k2 LOD10_SS_data.txt) | sort -k2 -k3 -n > LOD10_all.txt

Obtaining pictures

The aim is to make a .mct file that is used to draw maps in MapChart.

Shorten marker names, and remove SNP position in a separate column. Then change header to have 3 columns.

perl -pe 's/comp// ; s/_c/_/ ; s/_seq/_/ ; s/\|m./_/ ; s/\:/\t/' marker_names.txt > marker_renames_temp.txt
cat <(echo -e "ID\tSNP\tid_number") <(tail -n +2 marker_renames_temp.txt) > marker_renames.txt

Combine columns to make male, female, and combined sexes map files.

join -1 1 -2 3 -o 2.1,1.2,1.3 <(sort -k1 LOD10_all.txt) <(sort -k3 marker_renames.txt) | sort -u | sort -k2 -k3 -n > LOD10_common.map

join -1 1 -2 3 -o 2.1,1.2,1.4 <(sort -k1 LOD10_all.txt) <(sort -k3 marker_renames.txt) | sort -u | sort -k2 -k3 -n > LOD10_male.map

join -1 1 -2 3 -o 2.1,1.2,1.5 <(sort -k1 LOD10_all.txt) <(sort -k3 marker_renames.txt) | sort -u | sort -k2 -k3 -n > LOD10_female.map

Suppose you have a file that lists contigs of interest. Each contig is represented in a line, and a second column has a new name for the contig to attract attention. For example the file is called findlist.txt and looks like:

17869_0_1_9812  17869_0_1_9812_SEXLINKED  
18630_0_1_10248 18630_0_1_10248_SEXLINKED

Replace the names of the contigs of interest in the .map files with their attention grabbing name of the second column of the findlist.txt file.

for file in $(ls *.map)
do
awk 'NR==FNR {a[$1]=$2;next} {for ( i in a) gsub(i,a[i])}1' findlist.txt $file > temp.txt
mv  temp.txt $file
done

Manually turn each .map file into a .mct file.

Separate the LGs by one empty line.
Replace the _SEXLINKED with a single UB (underline bold) in the end of the line they appear.
Remove column 2 which shows the LG number.
Add the following line before each LG (LG name, start position, end position)

chrom "Fem LG 1" S=0 E=maxLength
Save as .mct file and open in MapChart.

The .emf pictures MapChart produces are best processed by

Open in Microsoft Word
save as pdf
open pdf in Photoshop and combine in single picture

transposeTabDelimited.pl

#!/bin/perl -w
#
# SO 1729824

# Script found at http://stackoverflow.com/questions/1729824/transpose-a-file-in-bash

use strict;

my(%data);        # main storage
my($maxcol) = 0;
my($rownum) = 0;
while (<>)
{
    my(@row) = split /\s+/;
    my($colnum) = 0;
    foreach my $val (@row)
    {
        $data{$rownum}{$colnum++} = $val;
    }
    $rownum++;
    $maxcol = $colnum if $colnum > $maxcol;
}

my $maxrow = $rownum;
for (my $col = 0; $col < $maxcol; $col++)
{
    for (my $row = 0; $row < $maxrow; $row++)
    {
        printf "%s%s", ($row == 0) ? "" : "\t",
                defined $data{$row}{$col} ? $data{$row}{$col} : "";
    }
    print "\n";
}

Tag Mapping

For Academics