General Genomic Information

You’ll need to know which cutter (restriction enzyme) you are using so that you can use the matching barcode and splitting scripts. Each sample/file will end up labeled with a Biotin Plate barcode and an individual Well barcode.

These will look something like (for Sbf1):



The two cutters we primarily use are:

  • Six cutter (5' - CTGCA_G - 3') Pst1
  • Eight cutter (5' - CCTGCA_GG - 3') Sbf1

Phred Quality (Q) Scores

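The initial files are .fastq files, a format that stores a quality (Q) score for every base alongside the sequence.

  • there are 4 lines per read (sequence record): header, sequence, a "+" separator, and the quality string
  • each score is encoded as a single ASCII character; from lowest to highest the order is symbols->numbers->CAPletters->lowercaseletters
  • Q represents the quality score of each nucleotide generated by automated DNA sequencing
    • Q = -10 log10(P), where P = base calling error probability
    • Q50 would be 99.999% accuracy; Q20 would be 99%; Q10 would be 90%
    • The most common encoding is Phred+33 (the ASCII code of the character minus 33 gives Q)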
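
The link between a quality character, its Q score, and the base-calling accuracy can be sketched in a few lines of bash. This is a minimal sketch that assumes the Phred+33 encoding; the function names are hypothetical and are not part of any pipeline script.

```shell
#!/usr/bin/env bash

# Convert one FASTQ quality character to its Q score (assumes Phred+33 encoding).
phred_char_to_q() {
    local ascii
    ascii=$(printf '%d' "'$1")   # ASCII code of the character
    echo $(( ascii - 33 ))
}

# Convert a Q score to base-call accuracy in percent: 100 * (1 - 10^(-Q/10)).
q_to_accuracy() {
    awk -v q="$1" 'BEGIN { printf "%.3f\n", 100 * (1 - 10 ^ (-q / 10)) }'
}

phred_char_to_q 'I'    # I is ASCII 73, so this prints 40
q_to_accuracy 20       # prints 99.000
```

A quality string made mostly of uppercase letters near "I" therefore indicates bases called with roughly Q40, while characters near "+" (Q10) indicate about one error in ten calls.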

Add Metadata/SiteNames (Optional)

If you want to add metadata to your filenames, you can do that here, but it is optional. A metadata file should contain the name of each individual, its location on the plate, etc. This part can be done now or later, but be absolutely sure that your individual names, their plate locations (A1, B1, etc.), and the appropriate barcodes are all in sync. For example, the barcodes might be ordered A1 A2 A3 (across the plate row) while your sample names are ordered A1 B1 C1 (down the plate column).
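
The row-versus-column mismatch described above is easy to see by listing the wells in both orders. The following is a minimal sketch that assumes a standard 96-well plate (rows A to H, columns 1 to 12); the function names are hypothetical.

```shell
#!/usr/bin/env bash

# List the wells across each plate row: A1 A2 ... A12 B1 B2 ...
wells_across_rows() {
    local row col
    for row in A B C D E F G H; do
        for col in 1 2 3 4 5 6 7 8 9 10 11 12; do
            echo "${row}${col}"
        done
    done
}

# List the wells down each plate column: A1 B1 ... H1 A2 B2 ...
wells_down_columns() {
    local row col
    for col in 1 2 3 4 5 6 7 8 9 10 11 12; do
        for row in A B C D E F G H; do
            echo "${row}${col}"
        done
    done
}

wells_across_rows | head -n 3    # A1 A2 A3
wells_down_columns | head -n 3   # A1 B1 C1
```

Comparing the well column of your metadata file against one of these lists (for example with diff) is a quick way to confirm which order the file actually uses before any files are renamed.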

# use this script to replace the barcode in each fastq file name with the sample/indiv name

pop=$1 # this is the metadata file with well number, old label, new label, description, barcode (can include header)
n=$(wc -l "${pop}" | awk '{print $1}') # number of lines in the metadata file
x=2 # first data line; change to 1 if no header

## another option instead of counting lines with n:
#while [ $x -le 96 ] #This can be adjusted based on number of files

while [ $x -le ${n} ]
do
        str=$(sed -n "${x}p" "${pop}") # line x of the metadata file, which contains columns of barcodes/names

        # pull each tab-separated column on its own so a description with spaces cannot shift the columns
        c1=$(echo "${str}" | awk -F"\t" '{print $1}') ## Well Number
        c2=$(echo "${str}" | awk -F"\t" '{print $2}') ## Old Labels
        c3=$(echo "${str}" | awk -F"\t" '{print $3}') ## New Labels
        c4=$(echo "${str}" | awk -F"\t" '{print $4}') ## Description
        c5=$(echo "${str}" | awk -F"\t" '{print $5}') ## Barcode
        # need to fix these two lines to match your specific fastq file name and metadata cols
        mv SOMM065_RA_GG${c5}TGCAGG.fastq ${c3}_RA.fastq
        mv SOMM065_RB_GG${c5}TGCAGG.fastq ${c3}_RB.fastq

        x=$(( $x + 1 )) # loop to next line
done
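
Because mv cannot be undone easily, it may be worth previewing the renames first. The following is a minimal sketch of a dry run, not part of the original script: it assumes a tab-separated metadata file with one header line, the new label in column 3 and the barcode in column 5, labels without spaces, and the same SOMM065 file name pattern used in the mv lines above. The function name check_renames is hypothetical.

```shell
#!/usr/bin/env bash

# Print the planned renames without moving anything, flagging missing source files.
# $1 = metadata file (tab-separated, with header), $2 = file name prefix (assumed SOMM065)
check_renames() {
    local meta=$1 prefix=${2:-SOMM065} label barcode end src
    # awk splits on tabs only, so a description containing spaces is harmless
    awk -F'\t' 'NR > 1 { print $3, $5 }' "$meta" |
    while read -r label barcode; do
        for end in RA RB; do
            src="${prefix}_${end}_GG${barcode}TGCAGG.fastq"
            if [ -e "$src" ]; then
                echo "OK      $src -> ${label}_${end}.fastq"
            else
                echo "MISSING $src"
            fi
        done
    done
}
```

Running something like check_renames metadatafile | grep MISSING in the directory that holds the fastq files lists every barcode that has no matching file, which usually points to a barcode or plate-order mismatch in the metadata.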

  1. Replace barcode with name of Indiv using Metadata file and script
    • sbatch -t 2880 -p high Barcode_to_Name metadatafile (2880 is the time limit in minutes)
  2. Concatenate each individual’s files into one RA and one RB file per indiv (if there are multiple files).
    • sbatch -t 2880 -p high with your concatenation script. You might need to hardcode the metadata file into this script.
  3. Keep the concatenated files and remove all others. The files you keep should be named something like NAME1_RA.fastq and NAME1_RB.fastq
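
The concatenation in step 2 can be sketched as a small bash function. The concatenation script itself is not shown in this document, so this is only an illustration under assumptions: the function name concat_fastq and the input file names in the usage note are hypothetical.

```shell
#!/usr/bin/env bash

# Concatenate several fastq files for one individual into a single output file,
# then check that the result holds a whole number of 4-line records.
# $1 = output file, remaining arguments = input fastq files
concat_fastq() {
    local out=$1 lines
    shift
    cat "$@" > "$out" || return 1
    lines=$(( $(wc -l < "$out") ))
    if [ $(( lines % 4 )) -ne 0 ]; then
        echo "warning: $out has $lines lines, not a multiple of 4" >&2
        return 1
    fi
    echo "$out: $(( lines / 4 )) reads"
}
```

For example, concat_fastq NAME1_RA.fastq NAME1_run1_RA.fastq NAME1_run2_RA.fastq would build the RA file for one individual; the same call with the RB inputs builds the RB file. Only delete the input files after the reported read count matches what you expect.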