Molecular Pilot Studies Well Along
Molecular Project Update #1
14 October 2006
Sequencing has been underway since before the 1 January 2006 start date. From three to 26 genes are now in progress or nearing completion for 195 species. These exemplars are spread across most of the superfamilies (see status matrix).
Our initial efforts focus on several pilot studies aimed at optimizing the design for the bulk of the study. The problem we face can be posed as follows. Leptree has been given sizeable but finite resources to get a first-estimate phylogeny across the lepidopteran families and superfamilies. How do we spread our effort across genes and taxa to get the greatest return on this investment? To answer, we first need to define “greatest return” more specifically.
A reasonable criterion would be something like this: by the end of the project, we want to maximize a quantity consisting of the total number of strongly resolved nodes, with each node weighted by its inclusiveness and possibly by other aspects of “importance.” An example of the latter might be a node which, though not at a high taxonomic level, is critical to mapping the origins of a feature of very general interest, such as the ultrasound-detecting “ears” seen in a number of families.
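To make the criterion above concrete, here is a minimal sketch of how such a weighted count might be computed. It is an illustration only, not the project's actual scoring method: the `Node` fields, the support threshold of 0.8, and the example numbers are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One internal node of an estimated tree (hypothetical record layout)."""
    name: str
    support: float           # assumed: a support value scaled 0.0 to 1.0
    inclusiveness: int       # assumed: number of exemplars under the node
    importance: float = 1.0  # optional extra weight for nodes of special interest


def design_score(nodes, support_threshold=0.8):
    """Sum inclusiveness * importance over nodes that count as strongly resolved.

    The threshold is an arbitrary illustrative cutoff, not a project standard.
    """
    return sum(
        n.inclusiveness * n.importance
        for n in nodes
        if n.support >= support_threshold
    )


# Invented example values, for illustration only.
example = [
    Node("superfamily-level node", support=0.95, inclusiveness=40),
    Node("weakly supported deep node", support=0.55, inclusiveness=120),
    Node("origin of tympanal ears", support=0.90, inclusiveness=6, importance=3.0),
]

score = design_score(example)
print(score)
```

Under these assumptions, the weakly supported deep node contributes nothing despite its high inclusiveness, while the small but "important" node counts for three times its size. Comparing this score across candidate sampling designs is one way the question of "greatest return" could be framed.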
There’s really no way to know the best design in advance, though it is very unlikely to be homogeneous sampling of genes or exemplars across families. Thus, empirical exploration is needed.
For example, in some groups we probably won’t need 26 genes to get strong resolution; doing fewer genes could thus conserve resources to be used on other problems. Conversely, for other tree regions it could become obvious early on that even 26 genes aren’t going to provide strong resolution. We might therefore want to classify these as “too hard for now” and move on, rather than let such a problem eat up a disproportionate share of resources. But how much initial evidence is required to distinguish “easy” from “hard” problems?
In all parts of the tree, we face the much-discussed question of the optimal ratio of gene to exemplar sampling. In addition, while we have strong preliminary evidence that all of our gene regions will provide useful “signal” in Lepidoptera, most have not been applied previously to systematics, so we don’t know much about their relative utility at different depths in the phylogeny.
To answer these questions, we’ve got three pilot studies underway in different parts of the tree, assessing the relative effectiveness of individual genes, and of gene versus taxon sampling. We’ll use the results to pick a best-guess gene and exemplar sampling plan for a first pass across the families (sampling most subfamilies), then focus additional sampling where it will help the most.
Two of the pilot studies build on a previous project, in which we sequenced about 70 species representing all families and most subfamilies of Bombycoidea, plus all but one of the other superfamilies of Macrolepidoptera, for four or five of our genes totaling about 6000 bp. (To see exactly which genes and taxa are involved, go to the ‘Set’ column in the Status Matrix and look for entries marked A and A’ for bombycoids and B for macroleps.) Unfortunately, neither bombycoid nor macrolepidopteran relationships look like easy problems. Families and superfamilies are mostly strongly supported by these data, but relationships among them mostly are not.
In the search for increased resolution, we are, first, sequencing the other 21 genes in selected bombycoids (taxa marked A’ in the Status Matrix). The bombycoids are already relatively densely sampled, so this will be mainly a test of the effect of adding more genes.
Second, using just the initial five genes, we are greatly expanding taxon sampling across the macroleps and putative outgroups (B’ taxa in the Status Matrix). This will be a test of the efficacy of increased taxon sampling, with gene sampling held constant.
In the third pilot project (taxon set C in the Status Matrix) we are sequencing all 26 genes for 32 species spread across the lower Ditrysia, to better characterize the relative utility of the individual genes at greater depths in the phylogeny.
We hope to complete sequencing for the pilot projects within a month, and to complete data set assembly and initial analysis by early next year. In the meantime, as the latter steps are very time-consuming, we have begun the survey across families, using three gene regions judged from extensive previous evidence to be safe bets for inclusion in the first-pass gene sample. These are CAD, DDC and Period. The exemplars in progress so far are labeled C’, D, E and F in the Status Matrix.
