Apoditrysia analysis
APODITRYSIA PILOT STUDY ANALYSIS
This study explores what happens if, using a limited but not unreasonable taxon sample, we direct more genes than anyone has previously at what is possibly the hardest problem in lepidopteran systematics (lower ditrysian relationships). How much resolution do we achieve, and what can we learn that would help us to improve resolution in the future? [If the paper were to include the obtectomera study, the aim could be broadened to an initial try at the hardest problem, ditrysian phylogeny, using two complementary approaches - two very different ratios of taxon versus gene sampling.]
Here are some initial thoughts about further analyses, organized around a series of questions that we might want to pose in a publication.
1. How much did we learn about lower ditrysian phylogeny?
In addition to the ML + bootstrap analyses, we could perhaps use the Approximately Unbiased test of Shimodaira to test whether the data provide significant support for some of the novel hypotheses that our analyses appear to support, such as:
non-monophyly of Zygaenoidea sensu lato;
non-monophyly of Apoditrysia (refers to position of Gelechioidea);
non-monophyly of Sesioidea (=Sesiidae + Castniidae).
2. Which and how many genes support the groupings that are resolved to any plausible degree [following up on Jerry’s suggestion; in this case, I would say “plausible” equals >50% bootstrap]? Do conflicts between individual genes correlate with any part of the lack of resolution?
3. To what degree can the resolution of individual nodes in the full data set be predicted from a quasi-independent estimate, based on analysis of five genes only from the same taxa, of the ratio of terminal to internal branch lengths? This would parallel an analysis to be done for the bombycoid data set. It would also provide a measure of stability of groupings with change in gene sample.
4. How do the individual genes vary in properties potentially affecting phylogenetic utility, including tractability for amplification and sequencing, base composition and heterogeneity, and in actual contribution to phylogeny resolution? These analyses would be directed at helping to guide gene choice in future studies.
If we were to include the Obtectomera in the same publication, I would see adding tests of additional specific hypotheses, including:
non-monophyly of Drepanoidea;
non-monophyly of Macrolepidoptera.
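Question 2 above could be tallied along the following lines. This is a minimal sketch, assuming we already have a bootstrap percentage for each node from each single-gene analysis; the function name, the data layout, and the gene and node labels are all hypothetical, and the numbers are made up purely for illustration.

```python
PLAUSIBLE_BP = 50  # "plausible" taken as >50% bootstrap, as proposed in question 2


def genes_supporting(per_gene_bp, threshold=PLAUSIBLE_BP):
    """Return, for each node, the sorted list of genes whose bootstrap exceeds threshold."""
    support = {}
    for gene, node_bp in per_gene_bp.items():
        for node, bp in node_bp.items():
            support.setdefault(node, [])
            if bp > threshold:
                support[node].append(gene)
    return {node: sorted(genes) for node, genes in support.items()}


# Made-up single-gene bootstrap values (hypothetical gene and node labels).
per_gene_bp = {
    "gene1": {"nodeA": 72, "nodeB": 10},
    "gene2": {"nodeA": 55, "nodeB": 50},
    "gene3": {"nodeA": 31, "nodeB": 64},
}

support = genes_supporting(per_gene_bp)
for node, genes in sorted(support.items()):
    print(node, len(genes), genes)
```

Nodes supported by few or no individual genes would be the natural places to look for conflict among genes.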
Apoditrysia+Obtectomera
I support Charlie’s suggestion to combine both the Apoditrysian and Obtectomeran projects into a single manuscript. I think the comparison between the two data sets is an important avenue to explore, in addition to discussing relationships within the groups. This would offset the concerns I have with the lack of a Pyraloidea specimen in the Apoditrysian data set.
Comments from Amanda and Susan: prologue
After the last group meeting, Susan and Amanda had a number of concerns, both about the meeting itself, and the conclusions reached. They were kind enough to offer me the chance to address these privately. After an initial exchange with them, however, I felt that it would be more productive to bring this discussion to the group. With their permission I will provide a synopsis and extended excerpt of their comments, and subsequently (at some point) give my reply. I am sincerely grateful to these esteemed colleagues for investing the time and emotional energy to candidly express their concerns to me, which can’t have been easy inasmuch as the issues come down in part to things that I should have done, and I believe that discussing them will help us make progress. I myself have already had a number of new thoughts about the project as a result of considering their views.
Comments from Susan and Amanda
In a very nice cover letter, Susan said that she and Amanda had found the last meeting somewhat frustrating, in part because it was difficult to hear everybody, and to follow a fast-paced, complex conversation involving so many people without visual cues. In contrast, the people most directly involved in the molecular study were all in the same room. She hoped that they had not inadvertently offended anyone by sometimes jumping into pauses in the conversation, in the attempt to make sure they were participating. (I assured her I saw no evidence of this.) She then shared with me a statement of post-meeting thoughts that Amanda and she had drafted. I thought it would be appropriate to reproduce most of that draft here.
After posting this, I will give Amanda and Susan a chance to object to anything I’ve said so far, before giving my response.
“Kudos to Jerry and everyone else for getting so far on the data compilation and analytical results on the pilot studies. It’s great to know that we have three very impressive datasets to work with. We have still been absorbing the results from the meeting and will be discussing the recently sent Obtectomera results before the next.
For our next meeting, we have two agenda items that we wish to add. These items did not occur to us until after our phone conversation. First, we ask that an agenda be circulated ahead of time for all meetings and second, that we revisit the Apoditrysia sampling scheme.
1. The need for an agenda. The speaker phone is an adequate but imperfect communication device. Every meeting, we struggle to hear most people speak. We are trying to hear, understand, and provide feedback under less than ideal circumstances. The last meeting was particularly frustrating, because we were surprised by the focus on circumscribing manuscripts and were unprepared for the direction of the discussion. Susan has reviewed all communications from Charlie regarding the agenda for the 15 August meeting. None of them say anything about manuscript generation. We had been under the impression that our conversation was to decide how to proceed next with the study, not to vote that data sets were complete and discuss how to write them up.
We have had [time] to review the data and the outcome of the conversations after the fact. We have come to the following conclusions (which leads to item #2 for the agenda):
Bombycoid study. We agree that the Bombycoidea dataset presents a great opportunity for an analytical manuscript, and wholeheartedly support moving forward on that project. We are comfortable with the vocal support we provided last meeting on this manuscript.
Apoditrysia study. We are less enthusiastic about our initial support of writing up the apoditrysia study. At the time, we agreed that holding up a paper to add one more species would be unwise, even if it was the only representative of the 4th largest superfamily of Lepidoptera, Pyraloidea. During the discussion, it was stated that none of the Obtectomera had been completed for more than 5 genes. At the time, we were convinced by the arguments that there was a plethora of data and it was time to get moving on the papers.
After the conversation with MD ended, we pulled the taxon status matrix from the Leptree website – just as a reviewer of our paper would do – to ask how many genes were currently available for pyraloids. Our concern is that a savvy reviewer would go through the same process we did, see this unexplained hole in the data set, and reject the paper based on the fact that pyraloids are available – but not included in our taxon sampling. Missing Pyraloidea, in our estimation, is a huge Achilles’ heel of this data set.
Here is what we see, based on the website:
Seven taxa included in this data set lack the full gene complement. As indicated on the tree and website: Aididae, and 6 Tortricidae (Cnsp, Crs, Anne, Aesp, Basp, Eusp) are incomplete. Aididae is missing 7 genes, while the six tortricids have only 5 genes. According to the website, 4 pyraloid specimens are as complete as the Aididae species (missing only 7 genes), and all have more genes sequenced (are more complete) than the included tortricid species.
We understand that the status matrix may not reflect the actual status of the sequences (in progress covers too many stages), which is why we would like clarification before our next meeting about the pyraloid sequences (i.e., How far along is any one species?) We recommend that the distinction between “in progress” and “complete” be added to the status matrix to avoid confusing us and reviewers. Clearly Aidids must be “complete” for all genes included, yet they are coded the same color as the pyraloids which are not.
Again, it is our opinion that even if this dataset is spun as a methodological paper, the lack of the pyraloid representative is problematic. We understand the urgency of getting these studies written up, but we feel that judicious addition of one species would greatly improve the data set and we risk having the paper rejected if we do not. If we absolutely cannot add one, then we will have to deal with their absence and justify including torts with only five genes. Right now, we are at a loss as to how to justify this sampling scheme.
We look forward to further discussions about how to proceed with these studies. We will be reviewing the Obtectomera analyses sent late last week. Susan has a student’s preliminary oral exam Wednesday (12:30-3:30) and will arrive late, consequently. She will write up her thoughts and send them ahead of time.
Sincerely,
Amanda and Susan
Current Apoditrysia Study sampling scheme:
Macrolepidoptera
Bombycoidea/Brahmaeidae
Lasiocampoidea/Lasiocampidae
Geometroidea/Geometridae
Noctuoidea/Noctuidae
Mimallonoidea/Mimallonidae
Obtectomera (Non-Macrolepidoptera)
None
Apoditrysia (non-Obtectomera)
Zygaenoidea/Limacodidae, Dalceridae, Lacturidae, Zygaenidae,
Megalopygidae, Aididae, Cyclotornidae, Epipyropidae
Sesioidea/Castniidae, Sesiidae
Cossoidea/Cossidae
Tortricoidea/Tortricidae
Pterophoroidea/Pterophoridae
Alucitoidea/Alucitidae
Choreutoidea/Choreutidae
Ditrysia/Non-Apoditrysia
Gelechioidea/Gelechiidae, Cosmopterigidae
Yponomeutoidea/Yponomeutidae
Gracillarioidea/Gracillariidae
Tineoidea/Tineidae”
pyraloid sequences?
I acknowledge Susan and Amanda’s point as to what a savvy reviewer for Systematic Biology might say about taxon sampling, and, indeed, I probably agree. One solution would be to avoid Systematic Biology and go for a “lesser” (definitely in quotes) journal, so that we can avoid wasting time and effort redoing what’s already been accomplished.
status matrix
I agree about adding another status code between in progress and complete. We had originally planned for “complete” to mean available in GenBank. Just let me know what the new status should be called and it shouldn’t be difficult to add it.
The use of the status matrix by a reviewer in this way would be, in my mind, pretty impressive. If discrepancies are noted, we can always point out that the status matrix is never more than a week out of date from the lab (right?), while analyses included only data available by a certain date. Or something like that.
Missing pyraloids
I actually don’t agree about the absence of pyraloids from the apoditrysian analysis being a big problem. There is a reason that the sampling plan we discussed and put on the web last year did not include them. This study focuses on basal ditrysian divergences. From this point of view, obtectomerans are just one lineage among many; all we need from them is a plausible sample. The situation is quite analogous to our inclusion of only one or two ditrysians in multiple papers we published about basal lepidopteran relationships. No reviewer ever complained about this. This is why I don’t think not having a pyraloid is a major issue for this analysis. Our sample of obtectomerans is quite reasonable, if imperfect, for the aims of this particular study. And, it has the advantage of having 26 genes. I think reviewers can understand this. I wouldn’t let this issue affect our journal choice.
Study sets text/status matrix mismatch
After spending almost two hours with Amanda reviewing what was done and said, I can only add that I/we understood the following:
Set A = Bombycoids
Set B = Obtect/Macrolep
Set C = lower Ditrysia (a.k.a. Apoditrysia + outgroups)
I had assumed, incorrectly, that all the studies built on one another. That is, a subset of taxa from datasets A & B would be included in the analysis of set C. It escaped both of us that pyraloids were not part of set C. Also, the status matrix does not indicate which taxa from these two sets are currently included in C.
Upon rereading Molecular progress memo 1 from last October, I now see where the assertion originates that we all “knew” the sampling scheme for all 3 studies. Unfortunately, it doesn’t change the reality of how we interpreted the original memo and discussion.
The results of the Obtectomeran pilot study (5 genes) support the interpretation that 1) we can recover families and some (many) superfamilies, and 2) the apo/obtecto/macrolep categories are possibly artificial constructs, as none of these groupings receives significant support.
If the focus/scientific question for all three studies is not phylogeny, then we are fine. We can demonstrate that among-superfamily relationships cannot be recovered whether it is 5 genes or 26 genes. I agree with Jerry that I don’t see the apoditrysia as a stand-alone in syst bio, consequently.
A note on the most recent dataset: bombycoids and lasiocampids represent nearly 1/3 of the taxa. We recommend that taxon sampling of these lineages should be reduced to reflect the sampling of comparable superfamilies (Zygaenoids, Papilionoids, Tortricoids). Given their species richness, Bombycoids should be at the same proportion as Gracillariids if the question focuses on the reality of the higher categories of Minet (i.e., Obtectomera, Apoditrysia, etc).
Amanda and I would like to see the results of such an analysis before we try to discuss writing up the Obtectomeran data set. Perhaps we could have those results by the time everyone is back from Berlin.
Sorry this comment is so long - congrats on making it through this missive!
when to exclude taxa from an analysis
In her foregoing comment, Susan has made an interesting proposal about taxon sampling. As I read it, she is asserting that we should prune taxa from the Obtectomera data set until superfamilies are represented proportionally to their species diversities, because this will result in more accurate phylogenetic inference. This is a new idea to me, and I have found it very useful to contemplate the questions that arise from it.
However, I can’t think of any reason why Susan’s assertion should be true, or of any empirical evidence that supports it. I am therefore sincerely hoping that others can set me straight.
To be sure, I can think of two reasons why a reduced taxon set would sometimes give a better result. One is that having too many taxa can impede the search for the optimal tree. That would be a strong reason to cut down the representation of bombycoids, and in fact Jerry has already done this for calculation of the best ML trees for the Obtectomera; the new number of bombycoids + lasiocampids is 20, as you will see from the new outputs to be sent out.
The other reason is that tree structure can sometimes be disrupted by adding a taxon that is very isolated from the remaining sample, for example a very distant outgroup. In this circumstance one might prefer not to include a taxon for which the data have been gathered.
However, I do not see any reason why removing data for “over-represented” groups should improve phylogenetic accuracy. In fact, I would expect the opposite.
Here is why.
The best deep-level phylogeny should result if the groundplan of each of the constituent lineages - say superfamilies - has been estimated as accurately as possible.
Accuracy of groundplan estimates, in turn, will depend on the adequacy of sampling within superfamilies.
I do not see how potentially decreasing the accuracy of groundplan estimation for some superfamilies, by leaving out some of our data, would increase the accuracy of among-superfamily phylogeny estimation. For example, I don’t see how removing taxa from one superfamily could improve the groundplan estimates for others.
Instead, I would expect that overall accuracy would be maximized by getting as much information about each individual superfamily groundplan - through taxon sampling - as we can.
Thus, I predict that more taxon sampling should in general be better, even if “disproportionately” spread across superfamilies.
I also don’t think it would be easy to figure out what proportional representation should mean. Species diversity is not the only type of diversity difference among superfamilies which could affect phylogeny inference. For example, superfamilies can also differ in their average level of character state divergence. So, it would be difficult to defend any particular sub-sampling scheme on these grounds; one would need to experiment.
In sum, I can see doing experiments on the effect of deleting data from “over-represented” superfamilies, but at present I don’t see a reason why such sub-sampling should be a guiding principle for our analyses.
I am not asserting that there aren’t preferable ways to distribute sampling effort across superfamilies. In general, we will want more samples from the ones with greater internal divergence. But we won’t always be able to get as many as we want of some groups, and for various reasons we may have more than we “need” for others. This, by itself, should not be a reason for pruning taxa.
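If we did want to run the sub-sampling experiment, one way to build the pruned taxon sets is sketched below. It assumes that "proportional" means proportional to species richness, with at least one exemplar kept per group; that rule, the function names, the group labels, and all the counts are hypothetical placeholders, not our actual sampling or real species numbers.

```python
import random


def proportional_quota(richness, total):
    """Target exemplars per group, proportional to richness, with at least one each."""
    grand_total = sum(richness.values())
    return {g: max(1, round(total * n / grand_total)) for g, n in richness.items()}


def subsample(sampled, quota, seed=0):
    """Randomly keep at most quota[group] exemplars from each group."""
    rng = random.Random(seed)  # fixed seed so a replicate can be reproduced
    kept = {}
    for group, taxa in sampled.items():
        k = min(len(taxa), quota[group])
        kept[group] = sorted(rng.sample(taxa, k))
    return kept


# Placeholder richness values and exemplar lists, invented for illustration.
richness = {"GroupA": 6000, "GroupB": 3000, "GroupC": 1000}
sampled = {
    "GroupA": ["a1", "a2", "a3", "a4"],
    "GroupB": ["b%d" % i for i in range(1, 11)],
    "GroupC": ["c1", "c2"],
}

quota = proportional_quota(richness, total=10)
kept = subsample(sampled, quota)
print(quota)
print({group: len(taxa) for group, taxa in kept.items()})
```

Repeating the pruning with different seeds and rerunning the tree search on each pruned set would show whether support for among-superfamily nodes actually changes. Note that an under-sampled group (GroupA here) cannot be brought up to its quota by pruning alone.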
Thoughts from others?

Highly unequal gene numbers may reduce resolution
I compared the resolution within Tortricidae between the Apoditrysia study, where two of nine exemplars are sequenced for 27 genes, the rest for five genes, and the Obtectomera study, where all taxa are restricted to five genes only.
In the Apoditrysia analysis, there are three tortricid nodes with BP of 84, 56 and 59% respectively. In the Obtectomera study (5 genes only), these same nodes have BP of 99, 81 and 73%.
This suggests to me that we need to look more closely into the effect of unequal gene sampling. What fraction of taxa have to have large gene samples, for there to be positive instead of negative effects on resolution of increased gene sampling?
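As a quick check on the size of the difference, the bootstrap values quoted above can be compared node by node. This is only arithmetic on the three reported pairs, not a statistical test, and the variable names are mine.

```python
# Bootstrap percentages (BP) for the same three tortricid nodes, as quoted above.
apoditrysia_bp = [84, 56, 59]  # two exemplars with the large gene set, the rest with five genes
obtectomera_bp = [99, 81, 73]  # every taxon restricted to five genes

# Gain in BP at each node when all taxa carry the same five genes.
differences = [o - a for a, o in zip(apoditrysia_bp, obtectomera_bp)]
mean_difference = sum(differences) / len(differences)

print(differences)      # [15, 25, 14]
print(mean_difference)  # 18.0
```

All three nodes are higher in the five-gene analysis, by 18 percentage points on average.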