VCF to Lep-MAP
The following steps take a .vcf file with a full sib pedigree and turn it into a .linkage file for Lep-MAP.
It was originally written to analyse .vcf files from SEX-DETector, but instructions to analyse .vcf files from stacks are also included.
VCF filtering
Some steps below depend on the .vcf file, this is indicated next to the relevant command.
The .vcf from stacks v1.41 looked like this ('stacks.vcf').
\##fileformat=VCFv4.2
\##fileDate=20160725
\##source="Stacks v1.41"
\##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
\##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
\##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
\##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
\#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT id1 id2 id3
un 2 3 C T . PASS NS=60;AF=0.300 GT:DP ./.:0 ./.:0 ./.:0
un 95 65 T G . PASS NS=59;AF=0.466 GT:DP 1/0:32 1/0:21 1/0:41
The .vcf from SEX-DETector looked like this ('V4.vcf').
\##fileformat=VCFv4.0
\##source=reads2snp
\##phasing=unphased
\##FILTER=<ID=multi,Description="At least one individual shows more than 2 alleles">
\##FILTER=<ID=unres,Description="Insufficient sequencing depth or uncertain genotyping in all individuals">
\##FILTER=<ID=para,Description="Suspicion of paralogy">
\##INFO=<ID=N,Number=1,Type=Integer,Description="Number of genotyped individuals">
\##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
\#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT id1 id2 id3
comp308_c0_seq1|m.78 300 . G A . PASS . GT 0|1 0|1 0|0
comp768_c0_seq1|m.214 275 . T C . PASS . GT 0|0 0|1 0|0
You need vcftools and Lep-MAP. The commands below should work if you replace the text myFile
to your .vcf file name.
Filter the vcf to only keep good SNPs and remove indels.
vcftools --vcf myFile.vcf --min-alleles 2 --max-alleles 2 --max-missing 1 --mac 5 --remove-indels -c --recode > myFile_filtered.vcf
VCF to linkage file
Look at the header of the vcf to extract the individual names in the correct order after transposition (perl script at the end of this page).
head myFile_filtered.vcf | perl transposeTabDelimited.pl
The result will contain the individual names, copy and paste in a new text file to serve as a skeleton for the required pedigree file, which needs to be written by hand. It will look something like this:
F1_e
F1_d
F1_c
F1_b
F1_a
father1
mother1
In a text editor, add the following columns. Name the resulting file myFile.ped
. Make sure to include the header with a #
.
\# family id father mother sex(1M,2F) phenotype(need one column at least, it can be 0)
1 7 2 1 2 0
1 6 2 1 2 0
1 5 2 1 2 0
1 4 2 1 2 0
1 3 2 1 2 0
1 2 0 0 1 0
1 1 0 0 2 0
- The first line is not read because of the
#
symbol. - If there is more than 1 family, add another number in the first column.
- Unkown parents are marked with 0.
- Males are 1, females are 2.
- The phenotype column can be 0 as it is not used by Lep-MAP, but it must have a value.
Next, replace the .vcf genotyping encoding with Lep-MAP expectation, shorten contig names and remove header lines.
For V4.vcf
perl -pe 's/1\|1/2 2/g ; s/0\|1/1 2/g ; s/1\|0/2 1/g ; s/0\|0/1 1/g ; s/\|// ; s/.\|./0 0/g ; s/comp// ; s/seq/s/ ; s/\t/\_/' <(grep -v '#' myFile_filtered.vcf) > myFile_temp.txt
For stacks.vcf
perl -pe 's/\//\|/g ; s/\:\d+\t/\t/g ; s/\:\d+\n/\n/ ; s/1\|1/2 2/g ; s/0\|1/1 2/g ; s/1\|0/2 1/g ; s/0\|0/1 1/g ; s/.\|./0 0/g ; s/\|// ; s/comp// ; s/seq/s/' <(grep -v '#' myFile_filtered.vcf) > myFile_temp.txt
Combine columns 1 and 9+ (V4.vcd) from the resulting file into one file, and transpose it.
For V4.vcf
paste <(cut -f 1 myFile_temp.txt) <(cut -f 9- myFile_temp.txt) | perl -pe 's/ /__/g' | perl transposeTabDelimited.pl | perl -pe 's/__/ /g' > myFile_temp2.txt
Same for stacks.vcf
paste <(cut -f 1 myFile_temp.txt) <(cut -f 10- myFile_temp.txt) | perl -pe 's/ /__/g' | perl transposeTabDelimited.pl | perl -pe 's/__/ /g' > myFile_temp2.txt
Paste pedigree and data with commented '#' top line to make a linkage file
paste myFile.ped myFile_temp2.txt > myFile.linkage
You should be done! Check it looks as expected
cut -f 1,2,3,4,5,6,7,8,9 myFile.linkage
The output should look like this (first 3 markers shown)
1 7 2 1 2 0 1 2 1 2 1 2
1 6 2 1 2 0 1 2 1 2 1 2
1 5 2 1 2 0 1 2 2 2 2 2
1 4 2 1 2 0 1 2 2 2 2 2
1 3 2 1 2 0 1 2 2 2 2 2
1 2 0 0 1 0 1 2 2 2 2 2
1 1 0 0 2 0 2 2 2 2 2 2
If SEX-DETector was run on multiple families
The .vcf from the different families need to be combined into a single .vcf file. This is not straightforward, because occasionally there is a different minor allele in the reference, in different families. If the families are small, it is also possible for tri-allelic SNPs to appear bi-allelic but at different variants, in the different families, and these SNPs need to be manually removed.
Suppose the families are called m6f4 and m3f5.
Make headers for files
less m6f4_filtered.vcf
Manually copy the individual names to a text editor, replace gaps with tabs, paste in Excel, transpose, paste back to a text editor, select columns and add _1 to end of names, enter and paste again to add _2 to names, paste to excel to transpose. Save as m6f4_header.txt
. It should look like this (example with 3 individuals):
mR_1 mR_2 mM_1 mM_2 mI_1 mI_2
Make tped files
vcftools --plink-tped --vcf m6f4_filtered.vcf
Manually paste resulting tped file in Excel, remove columns 1, 3, 4 and add the header already made. Save as m6f4_SNPs.txt
. The first three lines look like this (for 3 individuals):
ID mR_1 mR_2 mM_1 mM_2 mI_1 mI_2
comp308_c0_seq1|m.78:300 G A G A G G
comp768_c0_seq1|m.214:275 T T T C T T
comp1073_c0_seq1|m.289:210 A A A G A A
Repeat for the other family (m5f3).
Once both files are ready, work in R
Read files
m5f3 <- read.table("m5f3_SNPs.txt", header=T)
str(m5f3)
m6f4 <- read.table("m6f4_SNPs.txt", header=T)
str(m6f4)
merged = merge(m6f4, m5f3, by.x="ID", by.y="ID", all=F)
write.table(merged, file="common_SNPs.txt", quote=F, row.names=F, sep="\t")
Markers with more than 2 alleles, or markers causing too long map distances, can be removed manually from the common_SNPs.txt
file
Replace the tab between two alleles with spaces. Manually change the code according to the number of individuals.
awk '{print $1,"\t",$2,"\ ",$3,"\t",$4,"\ ",$5,"\t",$6,"\ ",$7,"\t", sep="\t"}' common_SNPs.txt | perl -pe 's/ /zaba/g ; s/ \t /\t/g' | perl transposeTabDelimited.pl | perl -pe 's/zaba/ /g' > common_SNPs_reformatted.txt
Should result in something like this:
ID comp10005_c0_seq1|m.23150:473 comp10006_c0_seq1|m.4496:1446 comp10006_c0_seq1|m.4497:1008
mR_1 mR_2 T C C T G A
mM_1.x mM_2.x T C C T G A
mI_1 mI_2 T C C C G G
mG_1.x mG_2.x T C C C G G
Start making a pedigree file
cut -f1 common_SNPs_reformatted.txt > common_pedigree.txt
Manually turn into a pedigree file (the following, without the first column)
\#ignoreorigID family id father mother sex(1M,2F) phenotype(0)
R 1 5 1 2 1 0
M.x 1 6 1 2 1 0
I 1 7 1 2 1 0
G.x 1 8 1 2 1 0
F.x 1 9 1 2 1 0
E.x 1 10 1 2 1 0
D.x 1 11 1 2 1 0
C.x 1 12 1 2 1 0
A 1 13 1 2 1 0
Q 1 14 1 2 2 0
P 1 15 1 2 2 0
O 1 16 1 2 2 0
N 1 17 1 2 2 0
L.x 1 18 1 2 2 0
K 1 19 1 2 2 0
J.x 1 20 1 2 2 0
B 1 21 1 2 2 0
6 1 1 0 0 1 0
4 1 2 0 0 2 0
O 2 22 3 4 1 0
N 2 23 3 4 1 0
M.y 2 24 3 4 1 0
K 2 25 3 4 1 0
H 2 26 3 4 1 0
G.y 2 27 3 4 1 0
F.y 2 28 3 4 1 0
E.y 2 29 3 4 1 0
D.y 2 30 3 4 1 0
C.y 2 31 3 4 1 0
B 2 32 3 4 1 0
L.y 2 33 3 4 2 0
J.y 2 34 3 4 2 0
I 2 35 3 4 2 0
A 2 36 3 4 2 0
5 2 3 0 0 1 0
3 2 4 0 0 2 0
remove first line from common_SNPs_reformatted.txt
into new file common_marker_names.txt
.
combine the two into a .ped file
paste common_pedigree.txt <(cut -f 2- common_SNPs_reformatted.txt) > out.ped
make map file
perl transposeTabDelimited.pl common_marker_names.txt > out.map
manually turn into map file in the following format (in Excel, the last column is a number for each marker, this is the number the marker name is converted to in Lep-MAP).
1 comp10005_c0_seq1|m.23150:473 0 1
1 comp10006_c0_seq1|m.4496:1446 0 2
1 comp10006_c0_seq1|m.4497:1008 0 3
1 comp10006_c0_seq1|m.4497:1020 0 4
Download plink and use it on the out.map
file.
./plink --file out --recode12 --tab
Works! Makes a out.ped
file. Just replace -9 with 0 (phenotype column) and rename to combinedSNPs.linkage
.
Also make a file with just the marker names and the corresponding Lep-MAP number.
cat <(echo -e "ID\tid_number") <(cut -f2,4 out.map) > marker_names.txt
Lep-MAP
You can now run Lep-MAP. I put the files in a project_name folder in the Lep-MAP directory, hence the ../bin
part of the code, which differs to the Lep-MAP wiki.
java -cp ../bin Filtering data=myFile.linkage dataTolerance=0.01 > myFile_f.linkage
java -cp ../bin SeparateChromosomes data=myFile_f.linkage lodLimit=6 sizeLimit=4 > map1.txt
java -cp ../bin JoinSingles map1.txt data=myFile_f.linkage lodLimit=3 > map1_js.txt
for i in `seq 1 8`; do java -cp ../bin OrderMarkers data=myFile_f.linkage map=map1_js.txt chromosome=$i >order.SS$i.txt 2>order.SS$i.err & done
for i in `seq 1 8`; do java -cp ../bin OrderMarkers data=myFile_f.linkage map=map1_js.txt sexAveraged=1 chromosome=$i >order.SA$i.txt 2>order.SA$i.err & done
If you want to remove misbehaving markers, use removeMarkers=markername1 markername2
java -cp ../bin/ OrderMarkers map=map1_js.txt data=combinedSNPs._f.linkage sexAveraged=1 removeMarkers=5130 chromosome=8 > order8.SArm.txt
Combine all chromosome maps into one file, adding the filename in the first column of each line, remove lines without map location information, rename the column with the filename to just the LG number. Then make a file with only the information to use for mapping (LG, id and position (either common, male or female)) which is also added as a header. Do this for the sex specific and the sex averaged maps.
for i in $(ls | grep order | grep SA.txt); do awk '{print FILENAME,$0}' $i;done | grep -v COUNT | grep -v java | grep -v male | grep -v likelihood | perl -pe 's/order// ; s/.SA.txt /\t/' > LOD10_SA.txt
cat <(echo -e "LG\tid\tcpos") LOD10_SA.txt | cut -f 1-3 > LOD10_SA_data.txt
for i in $(ls | grep order | grep SS.txt); do awk '{print FILENAME,$0}' $i;done | grep -v COUNT | grep -v java | grep -v male | grep -v likelihood | perl -pe 's/order// ; s/.SS.txt /\t/' > LOD10_SS.txt
cat <(echo -e "LG\tid\tmpos\tfpos") LOD10_SS.txt | cut -f 1-4 > LOD10_SS_data.txt
Join the sex averaged and common files based on column 2 in both (id, file1 col2) and only keep LG (file1 col1), common position (file 1 col3), male position (file2 col3) and female position (file2 col4). If there are lines that are not joined, print NA. This only works if you first sort based on the columns to join by of the input files. Sort the final file by column 2 (LG) and then column 3 (common position).
join -1 2 -2 2 -o 1.2,1.1,1.3,2.3,2.4 -e NA <(sort -k2 LOD10_SA_data.txt) <(sort -k2 LOD10_SS_data.txt) | sort -k2 -k3 -n > LOD10_all.txt
Obtaining pictures
The aim is to make a .mct
file that is used to draw maps in MapChart.
Shorten marker names, and remove SNP position in a separate column. Then change header to have 3 columns.
perl -pe 's/comp// ; s/_c/_/ ; s/_seq/_/ ; s/\|m./_/ ; s/\:/\t/' marker_names.txt > marker_renames_temp.txt
cat <(echo -e "ID\tSNP\tid_number") <(tail -n +2 marker_renames_temp.txt) > marker_renames.txt
Combine columns to make male, female, and combined sexes map files.
join -1 1 -2 3 -o 2.1,1.2,1.3 <(sort -k1 LOD10_all.txt) <(sort -k3 marker_renames.txt) | sort -u | sort -k2 -k3 -n > LOD10_common.map
join -1 1 -2 3 -o 2.1,1.2,1.4 <(sort -k1 LOD10_all.txt) <(sort -k3 marker_renames.txt) | sort -u | sort -k2 -k3 -n > LOD10_male.map
join -1 1 -2 3 -o 2.1,1.2,1.5 <(sort -k1 LOD10_all.txt) <(sort -k3 marker_renames.txt) | sort -u | sort -k2 -k3 -n > LOD10_female.map
Suppose you have a file that lists contigs of interest. Each contig is represented in a line, and a second column has a new name for the contig to attract attention. For example the file is called findlist.txt
and looks like:
17869_0_1_9812 17869_0_1_9812_SEXLINKED
18630_0_1_10248 18630_0_1_10248_SEXLINKED
Replace the names of the contigs of interest in the .map
files with their attention grabbing name of the second column of the findlist.txt
file.
for file in $(ls *.map)
do
awk 'NR==FNR {a[$1]=$2;next} {for ( i in a) gsub(i,a[i])}1' findlist.txt $file > temp.txt
mv temp.txt $file
done
Manually turn each .map
file into a .mct
file.
- Separate the LGs by one empty line.
- Replace the _SEXLINKED with a single UB (underline bold) in the end of the line they appear.
- Remove column 2 which shows the LG number.
Add the following line before each LG (LG name, start position, end position)
chrom "Fem LG 1" S=0 E=maxLength
Save as
.mct
file and open in MapChart.
The .emf
pictures MapChart produces are best processed by
- Open in Microsoft Word
- save as pdf
- open pdf in Photoshop and combine in single picture
transposeTabDelimited.pl
#!/bin/perl -w
#
# SO 1729824
# Script found at http://stackoverflow.com/questions/1729824/transpose-a-file-in-bash
use strict;
my(%data); # main storage
my($maxcol) = 0;
my($rownum) = 0;
while (<>)
{
my(@row) = split /\s+/;
my($colnum) = 0;
foreach my $val (@row)
{
$data{$rownum}{$colnum++} = $val;
}
$rownum++;
$maxcol = $colnum if $colnum > $maxcol;
}
my $maxrow = $rownum;
for (my $col = 0; $col < $maxcol; $col++)
{
for (my $row = 0; $row < $maxrow; $row++)
{
printf "%s%s", ($row == 0) ? "" : "\t",
defined $data{$row}{$col} ? $data{$row}{$col} : "";
}
print "\n";
}