Meet the Phyllo Phages
We in the lab of Dr. Delesalle and Dr. Krukonis have spent much of our summer getting closely acquainted with a set of six viruses informally dubbed the Phyllo phages. No, you will not find these tiny parasites infecting the bacteria inhabiting your baklava, though it’s not a coincidence that they share a name with flaky filo. Both words are derived from the Greek φύλλο, meaning leaf; as the name hints, these phages were found on the leaves of a horse chestnut tree, and were isolated by Dr. Britt Koskella at the University of Exeter. Unlike the phages of Bacillus subtilis and Mycobacterium smegmatis that our lab has worked with in the past, the Phyllo phages infect bacteria of the class Gammaproteobacteria, including the genera Erwinia and Pseudomonas, which count among their members the pathogens that cause tree diseases such as fire blight and bleeding canker. As part of our collaboration with Dr. Koskella, we are tasked with assembling and annotating the Phyllo phage genomes.
Putting the puzzle together
When a genome is sequenced, it is not sequenced in one long piece. Rather, it is sequenced in many small pieces, called reads, that are each a few hundred base pairs long. We can get up to a million of these reads from across the phage genome, and a program called GS de Novo Assembler, or Newbler, aligns the overlapping sections of the reads into one or a few long, contiguous sequences (contigs). The assembly is then viewed and edited in Consed. We join contigs, extend or trim the sequence, and eliminate contigs that do not belong to the phage until we have the whole genome and only the genome.
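To make the idea of joining overlapping reads concrete, here is a toy sketch in Python. It is not how Newbler works internally; it is a simple greedy merge that assumes error-free reads all taken from the same strand, and the function names and the `min_len` cutoff are our own invention for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b.

    Returns 0 if the best match is shorter than min_len.
    """
    max_k = min(len(a), len(b))
    for k in range(max_k, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Toy model only: assumes no sequencing errors and a single strand.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)  # (overlap length, index of left, index of right)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs  # nothing left to join
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

Reads that share no sufficient overlap with anything else are left as separate contigs, which mirrors the situation in which we have to decide by hand whether a stray contig belongs to the phage.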
We then determine what kind of ends the genome has: phage genomes can be circular, linear, or have terminal repeats. PAUSE (Pile-up Analysis Using Starts & Ends) is an online program that can help determine end types and where terminal repeats would be. Once we get to this stage, we have a complete genome that is ready for further analysis. Within our sample set, genome lengths range from under 40,000 to over 260,000 base pairs, attesting to the great diversity of bacteriophages as a group. However, genome length alone cannot tell us much, so it is necessary to annotate each genome and categorize its features in the context of previously characterized phages.
Annotation involves using a set of databases and programs to make predictions about the size and function of genes. DNA Master, developed by Dr. Jeffrey Lawrence for the annotation of bacterial genomes, is central to our work, as it allows us to visualize and make notes on individual genes. For each open reading frame (i.e., a string of nucleotides between a start and a stop codon), we first determine whether or not it is a potential gene by analyzing its coding potential, a metric easily visualized using web-based GeneMark. Next, we pick through the possible start codons, looking at the gap or overlap with the adjacent gene and the proximity to a ribosome binding site. In addition to the nucleotide analysis, the corresponding amino acid sequence for each gene is run through online tools to predict protein function. BLAST (Basic Local Alignment Search Tool) suggests the function of a query sequence by aligning it against the NCBI reference sequence database, which contains upward of 52,000,000 protein sequences. The other program we use for protein prediction is HHpred, which draws from several smaller databases and performs a similar alignment, but also takes into account biochemical properties of the amino acids when determining structural similarity. Because there is such a wide variety of phages and they are not very widely studied, we are not yet able to assign a function to many genes. A complete annotation, usually the result of a week or more of laborious data analysis, consists of a list of genes, each with a precise position in the genome and a clearly noted function, if one can be assigned at all.
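For readers who like to see the idea in code, the following Python sketch finds open reading frames as defined above and lists every candidate start codon inside each one. It is only an illustration under stated assumptions: it scans the forward strand alone, treats ATG, GTG, and TTG as possible starts, and ignores coding potential and ribosome binding sites, all of which matter in a real annotation. The function name and the `min_codons` cutoff are hypothetical and are not part of DNA Master or GeneMark.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed set of bacterial/phage start codons


def find_orfs(seq, min_codons=3):
    """Find forward-strand ORFs in all three reading frames.

    Returns a list of (start, end, candidate_starts) tuples, where start is
    the most upstream start codon, end is just past the stop codon, and
    candidate_starts lists every in-frame start codon before that stop.
    Positions are 0-based; end is exclusive. The stop codon is counted
    toward min_codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        starts = []  # in-frame start codons seen since the last stop
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in STARTS:
                starts.append(pos)
            elif codon in STOPS:
                if starts and (pos + 3 - starts[0]) // 3 >= min_codons:
                    orfs.append((starts[0], pos + 3, list(starts)))
                starts = []  # begin a fresh ORF after each stop
    return orfs


# One ORF with two possible starts (ATG at 2, GTG at 8), stop codon TAA.
print(find_orfs("CCATGAAAGTGCCCTAAGG"))
```

The list of candidate starts is the part we "pick through" by hand: each position implies a different gene length and a different gap or overlap with the neighboring gene.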