Mechanical Turk

Yesterday morning, my last day at the Lake Arrowhead Microbial Genomics Conference, I saw a tweet from Holly Bik (@Dr_Bik) about a talk she was attending at a Phenomics conference about the sociology of Amazon’s Mechanical Turk Web Service. What is Mechanical Turk, you may ask? Well, what’s really funny is that just minutes before I had answered that question for Ben Tully (@phantomBugs) in describing how I used it to help with my dissertation research.

Mechanical Turk allows you to crowdsource little tasks that are easy for humans, but not computers. For example, if you need to write a short caption for 1000 photos at about 1 minute per photo, that would take you about 2 full days of work. Or, you can upload those photos to Mechanical Turk, along with some instructions about how to write each caption. Each photo becomes a little job, sent out to all the workers on Mechanical Turk. You offer to pay $.03 per job, and then you sit back with a glass of wine and watch the World Series of Poker. Some hours later, all of your work is done, and you did none of it. Sure, you are out $30, but hopefully your time is worth more than $15/day.
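
For anyone who wants to plug in their own numbers, the arithmetic in that example can be sketched in a few lines of Python. This is only a rough illustration: the 8-hour working day is an assumption, the function names are made up, and any fees Amazon charges on top of the worker payments are ignored.

```python
# Back-of-the-envelope numbers for the photo-caption example.
# The 8-hour workday is an assumption; the other figures come from the text.

def diy_days(n_tasks, minutes_per_task, hours_per_day=8.0):
    """Working days needed to do every task yourself."""
    return n_tasks * minutes_per_task / 60.0 / hours_per_day

def turk_cost(n_tasks, price_per_task):
    """Total payout to workers in dollars (ignores any platform fees)."""
    return n_tasks * price_per_task

days = diy_days(1000, 1)      # 1000 photos at about 1 minute each
cost = turk_cost(1000, 0.03)  # $.03 per job
print(f"Doing it myself: about {days:.1f} working days")
print(f"Crowdsourcing it: ${cost:.2f}, or about ${cost / days:.2f} per day saved")
```

With an 8-hour day this comes out at a little over two days of work against $30 in payments, which is where the roughly $15/day figure comes from.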

You are provided with the worker ID for each job. You can spot-check each worker’s work and, if you do not like it, you can reject all of it and decline to pay them; those photos then go back into the work queue.

So, how did I use this for my dissertation research? I wanted to look at environmental correlates of horizontal gene transfer (HGT). HGT is the exchange of DNA between different species, and it is fairly common among microbes. One potentially important mechanism of DNA transfer is the uptake by a cell of DNA that is floating about freely in the environment (transformation). If the environment is inhospitable to the DNA molecule, then the probability of transfer by transformation should be quite low. In particular, I wanted to ask whether organisms that live in very low pH environments experience a lower incidence of HGT.

I can predict the incidence of HGT for an organism directly from the genome sequence, so all I need is to find out at what pH that organism grows. That should be straightforward because every time a genome is submitted to a public database, the submitter will include all of the associated environmental data (or metadata) that is available, and since that submitter grew the organism in culture in the laboratory, he or she must know at which pH it best grows. Right?

Now, because I am who I am, I want to do this analysis in a phylogenetic context, using some Phylogenetic Comparative Method (I should talk about this more in another post.) In particular, I opted to use Felsenstein’s Independent Contrast method as implemented in Phylocom. I built a reference phylogeny for ~800 bacteria and archaea which have genome sequences available (this part is currently in revision) and then I “looked up” the optimum growth pH for each of them. This look up process should be straightforward, too, because the data are submitted to a searchable database, like at NCBI or the JGI’s IMG. Right?

Well, nothing that should be straightforward ever is in my academic life. I was able to get the pH data from the IMG by grabbing all of the web pages with the metadata on them and parsing them with a little perl script. But when I did that, I only got pH data for ~100 of the ~800 organisms in my reference phylogeny. For any given organism, if I spent a couple of minutes poking around in the literature, I could easily find the optimum growth pH, so it’s not like it’s not out there. But, I couldn’t automate the “poking around” process, and after spending two full days of work, I had only retrieved pH data for about 200 additional organisms. Because I was down to the wire in terms of a dissertation submission deadline, and because I consider my time fairly valuable, I just couldn’t bring myself to keep at it.

I was complaining vociferously about those ~600 people who couldn’t be bothered to spend their couple of minutes to include pH data with their genome submissions. Russell Neches (@ryneches), who seemed to really get a kick out of my uncharacteristic vociferousness, suggested that Mechanical Turk might work for me.

I created a job (called a HIT): “Given the name of an organism, report its optimum growth pH”

My job looked like this:

In retrospect, maybe I could have come up with some better instructions, but I think for most things, these should work. I paid $.08 per job, so I spent $40. I rejected a lot of work because I got answers like: 37.5 or comments like “I couldn’t find it.” There was one worker who commented, “This was a cool HIT.” And that worker did a lot of jobs and seemed to do good work. I focused my spot-checking on values at the extremes, since most things live at a neutral pH.
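
The kind of triage described above (throw out answers that cannot be a pH, then look hardest at the extremes) could be automated as a first pass. The following is a minimal sketch, not what was actually run: the function names are hypothetical, and the cutoffs of pH 5 and pH 9 for "extreme" are arbitrary assumptions.

```python
# Hypothetical sanity checks for submitted answers. The names and the
# cutoffs (pH 5 and 9 for "extreme") are illustrative assumptions.

def parse_ph(answer):
    """Return the answer as a float if it is a plausible pH, else None."""
    try:
        value = float(answer.strip())
    except ValueError:
        return None  # free text such as "I couldn't find it"
    if not 0.0 <= value <= 14.0:
        return None  # e.g. 37.5, which is outside the usual 0 to 14 scale
    return value

def triage(answer, low=5.0, high=9.0):
    """Sort an answer into 'reject', 'spot-check', or 'accept'."""
    value = parse_ph(answer)
    if value is None:
        return "reject"
    if value < low or value > high:
        return "spot-check"  # extremes are rarer, so check them by hand
    return "accept"

for raw in ["7.0", "37.5", "I couldn't find it", "2.5"]:
    print(raw, "->", triage(raw))
```

A filter like this only catches answers that are obviously not a pH. A plausible-looking number that is wrong for some other reason would sail straight through.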

But, here’s what I found: If you follow the first approach that I suggest and google the organism name+optimum+pH+growth, sometimes, you would see something that looked right in the google search results, but was actually referring to enzyme activity rather than growth conditions.

I can tell that this is not the information I’m looking for, but I don’t really expect a Mechanical Turk worker to be so discerning, especially not for $.08. There were a few other common errors that I think it would have been difficult to avoid, even with clearer instructions. I could have submitted some of the jobs in duplicate so that I could check the workers against each other, but I suspect that this type of mistake would be made by anyone without a fair degree of specialized knowledge on the subject (who is not likely to be doing this sort of work).
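
For what it’s worth, checking duplicate jobs against each other is simple to sketch. The example below is hypothetical throughout: the data layout (organism, worker ID, reported pH), the placeholder organism names and values, and the 0.5 pH-unit tolerance are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical layout: each submission is (organism, worker_id, reported_pH).
# The 0.5 pH-unit tolerance is an arbitrary assumption.

def find_disagreements(submissions, tolerance=0.5):
    """Return organisms whose duplicate answers differ by more than tolerance."""
    by_organism = defaultdict(list)
    for organism, _worker, ph in submissions:
        by_organism[organism].append(ph)
    return sorted(
        organism
        for organism, values in by_organism.items()
        if len(values) > 1 and max(values) - min(values) > tolerance
    )

# Made-up placeholder data, not real results.
example = [
    ("Organism A", "worker1", 7.0),
    ("Organism A", "worker2", 7.2),
    ("Organism B", "worker1", 2.0),
    ("Organism B", "worker3", 6.5),
]
print(find_disagreements(example))
```

This flags only the organisms where workers disagree. If two workers both copy the same enzyme-activity value from the search results, they will agree with each other and the error goes unnoticed, which is exactly the weakness described above.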

So, in the end, I spent a lot of time double-checking the results. I don’t know if I saved any time by doing it this way, but it was FUN! I will definitely keep it in the back of my mind, and hope to be able to use it again someday!

Fly Hunt: Chapel Hill

Corbin Jones, at UNC, was ridiculously generous with his scarce free time while I was in town. He took me and Shelah on a hike through the forest near his house. It was a very cool place. We walked a couple of miles (maybe?) on this loop through the forest, setting out bait along the way. Early the next morning, Corbin and I went back through to collect the traps. Unfortunately, it had rained quite hard the night before, so most of the traps were flooded and useless. Fortunately, it had rained quite hard the night before, so I was witness to a lovely thunderstorm. For those who don’t know, the San Francisco Bay Area does not have thunderstorms. I see a streak of lightning about once a year here, and I’m not sure that I’ve ever heard thunder. I took a video of the storm with my camera. I love how the thunder goes on forever. I turned out all of the lights, opened my doors, and enjoyed the storm with my hookah.

Anyway, Corbin and I collected the traps and we also found a ton of Drosophila feeding on some mushrooms. I’m posting a bunch of pictures of them here, so that someone will tell me what kind of mushrooms these are. The little black insects on the underside are some kind of wasp, but the red-eyed buggers on top are my Drosophila.

I also couldn’t get over the poison ivy in this forest. I grew up around poison ivy, but I’ve never seen anything like this. Apparently, it grows like a giant, red, hairy vine all over these trees. Dead vines were everywhere!

I also saw a lot of really cool insects here, as well as a lizard with a blue tail and a copperhead snake! (I know that’s not an insect there on the right, by the way.)

There were two insects that were really bizarre. One of them is this black freaky thing on the ground (below left) and one was this cryptic little bug that I saw getting attacked by a spider when I was trying to take a close-up picture of the hairy poison ivy. The spider kept running at him and he kept waving his front legs around in the air at the spider. His legs looked like medieval weapons, like maces. Anybody know what these are?

NESCent – play

I’m pretty sure that I would not be welcome back at the Eisen lab if I wasn’t the instigator of at least one social gathering during this ten-day compu-phylo-info-mindo-bendo-juggernaut. So, I asked Bill to announce on Friday that I was hosting a party that evening at my apartment. BYOB, I’ll bring the hookah!! Here are some pictures from the party. If any NESCent folks are reading this and want more pictures, I’ll post them all on Bill’s NESCent web page soon, or I’ll send you a link to an online photo album.
I had a hard time deciding whether or not I should admit that I can’t remember everyone’s name. Well, I can’t. Sorry! I’m guessing in some cases, so someone please correct me if I get it wrong.

Here, left to right, we have Colleen, Derrick, Karen, Kathleen, and Omar.

Santiago is showing Shelah the cheating way to blow smoke rings with the hookah. Then, he shows off…

Rutger regales Bill with tales of European history.

Taika and Joseph marvel at how intelligent the Americans are.

Jason and I smokin’ hookah with Lauren.

Taika and Libby.

I hosted a second party. It wasn’t as well-attended, but by this time we were all more comfortable with each other, so it was fun. Some of us were more comfortable than others…

Rutger tried to fulfill Libby’s Peter Murphy fantasy.

NESCent – work

I’m back in the Bay Area now. Exhausted, but safe and sound. The phyloinformatics course at NESCent was great. Dave Maddison was our instructor for the first two days. He gave us a nice overview of Mesquite’s capabilities. At first, I was pretty excited – thinking that I was going to be writing new modules for Mesquite to do all sorts of cool things – but I know now that I probably won’t. Mesquite modules are written in Java, and I don’t know Java, and based on what I heard from the folks in the Java track of this course, it’s not straightforward to write these modules even if you do know Java. But, Mesquite is pretty powerful as it is, so I’m glad to know more about it, and creating macros in Mesquite is very easy, and it can be run from the command line (although I wish we’d spent more time on this) so I will definitely keep it in mind for future analyses. Also, we used Eclipse to run Mesquite, so I was introduced to that as well.

For the next two days, Sergei Pond and Spencer Muse introduced us to HYPHY. We learned how to partition a multiple sequence alignment, define substitution models, perform relative rate tests, and simulate trees for bootstrapping. Sergei introduced us to the HYPHY batch programming language at the end of the first day. The second day, I took off. I really needed a day off at this point, and since we were in class from 9am-9pm every day for ten days, I had to just pick a day, and this was it. So, I missed out on the nuts and bolts of the HYPHY batch language, but I feel comfortable enough with HYPHY now to use it when I need it.

Then, the class split into two tracks, Java and Perl. The Java folks went off to do whatever it is that Java people do 😉 and Jason Stajich and Rutger Vos spent the next four days with us installing BioPerl on our laptops.

Just kidding! It didn’t take four days to do the installation. But, it was a headache. I’ve spent days trying to get BioPerl running on my computer in the past, so I must admit to the sweet, sweet schadenfreude I felt watching the experts struggle with it. Finally, though, I can use BioPerl modules in my perl scripts – yay! Jason walked us through many of the features in BioPerl. Rutger introduced us to Bio::Phylo, which is a powerful tool for analyzing/manipulating phylogenetic trees. Invaluable, all of it.

For the final two days of the course, Hilmar Lapp and Bill Piel, the course organizers, taught us how to use PostgreSQL and how to use the DBI module to interface with databases in our perl scripts.

There’s no way that in ten days I could walk away from this course a fully-functional bioinformatician, but I really do feel more empowered to become one. And, my future efforts to write the kinds of analysis pipelines that I’ve struggled with in the past are going to be much more efficient.


Yesterday, I met with Kathryn Radke, the master graduate advisor, who helped me sort out some details. First, she informed me that the Micro web site was completely wrong about the elective requirements, so instead of having to take 3 elective courses, I only have to take one. Nice. That makes my coursework requirement quite minimal: a total of five courses. I could easily finish my coursework in the first year and then start working on the Master’s in Biostatistics. We’ll see. We also talked about the possibility that I might be allowed to enroll as a part-time student. This would be nice because then it’s much less hassle with the Human Resources department when it comes to paying my tuition. Also, it just makes sense, since I’ll technically have a full-time job. I wrote a request that Dr. Radke has offered to take to “the board” or whoever, and she said that she would recommend that they approve my request.

I’ve also asked to have one of the core course requirements waived because it covers exactly what I’ve been doing for the last five+ years. As Jonathan said, I could teach the class! Also, it’s only offered at 7:30 am!! But, it’s our little secret that that is a big concern for me…

Today, I’m meeting with Scott Dawson. I thought I’d be meeting with David Begun, too, but he hasn’t gotten back to me about a specific time. I may just drop by, since I’ll be in town already.

more on rotations

It’s funny to me that I would have been perfectly happy not doing rotations at all, but since I am, you’d think I was choosing a partner for life. Four times! Nevertheless, here’s an update: I met with Ian Korf, and I’ve emailed Scott Dawson and David Begun. Next week, I’ll meet with Scott and David to talk about possible projects.

So, that leaves one more. Katie Pollard, I think. I haven’t talked to her yet. I’m not sure why not. I’m also thinking that I want to get a Master’s in Biostatistics while I’m at it. I can’t decide if that means I should definitely do a rotation with a statistician or if I should do one with a microbiologist since I’ll be getting plenty of statistics. I’m thinking I should talk to Mitch Singer and/or David Mills and/or Kate Scow. Yeah, so I guess I haven’t really narrowed it down that much. Crap. I may be thinking too much about this.

droppin’ like names

I’m planning a trip this summer that I’ve dubbed the “Fly Hunt.” Really, I’m going to NESCent to take a 10-day course on Phyloinformatics. I figure if I have to go from the west coast to the east coast and back, that I should drive. I’m aware (only because I’m often told) that this is an unusual response to a need to travel 6000 miles, but I feel lucky to have the opportunity. I’m going to collect flies (Drosophila) along the way.

So, I’m looking for people along the way who will take me on a local collecting trip. I don’t really know one species of fly from another, and if not an orchard, winery, or fruit stand, I don’t know where to find them. So, I’ll need help. Also, I have a portable dissecting microscope, and I hope to have every other necessary item with me in the field, but it’ll be nice to have a fly lab that I can make use of for the day.

Yesterday, I met with Michael Turelli, who was incredibly helpful in finding drosophilists who can help me in the field. I have a good list of names, and I will be dropping his name when I’m asking these folks for help. I am amazed at how few people know anything about the natural history of the fruit fly. It seems to me that if you want to study insect ecology, well, you might as well study the ecology of the insect about which we know more than any other! I don’t think they’re completely intractable as field subjects, they are cosmopolitan, and they exhibit niche specialization that puts finches and cichlids to shame. Oh well, it’s not like I’m going to trade in the pipette and computer for a Westy, a good net, and a notepad anytime soon. . . Probably not, anyway.

decisions, decisions…

I met with Jonathan today to talk about my future. It went well, I think. I don’t have to grab some crappy piece of the GEBA pie. I might be able to work on Drosophila gut bugs after all, or I might do more with metagenomic simulations. I’ll have to flesh that idea out a little more, but like I said, I do like metagenomics. And, I like the fact that I can sit down over a long weekend and read the entire body of metagenomics literature. Maybe I’ll do that this weekend… um, no. I’ll be wakeboarding and barbecuing!

Now, I’m going to start writing up the simulation project. I’d like to have that finished before classes start (that’s late September.) Speaking of classes, I’ve narrowed down my rotation options to David Mills, Scott Dawson, Ian Korf (if I can go with non-micro faculty) and Katie Pollard (ditto). If I can’t choose faculty outside of the MGG, then I’ll ask Mitch Singer and Kate Scow.


I’ve agreed to change my primary focus for my PhD project. Instead of developing the Drosophila gut microbiome as a model system to study the interaction of microbial communities with their environments/hosts, I’m going to do something else. So much for having a year’s head start on my dissertation research. Oh well – I’m sure it’ll be for the best. And, it makes sense that if I’m going to remain employed at the JGI, that my PhD project should be related to DOE/JGI programs. I still want to do that “Drosophila project” though. Jonathan’s still interested in doing it, but we need to write a grant before we do much (any?) more. I think I’ll propose a meeting with Deborah and/or Artyom. Maybe we can all write the grant together.

I’m also working on this simulated environmental sample. That’s coming along well, but the thing that I like best about it is that it’ll have a small work:impact ratio. Make a few libraries, write a nice paper. Not a PhD project. I do like metagenomics, though.

So, that leaves GEBA, or Genomic Encyclopedia of Bacteria and Archaea. I must say, for the record, that I think it’s unfortunate that this particular acronym seems to be sticking. I like GEM, Genomic Encyclopedia of Microorganisms, MUCH better. Both because “gem” is easier on the ears and because I know they’re eventually going to be sequencing some small eukaryotic genomes under the GEBA umbrella.

Anyway, GEBA sounds like my nightmare version of a project: open-ended descriptive science, of which I feel no sense of ownership. And, no field work! I know that it will be good for my career and I know that there are a lot of interesting components. I just need to find my niche. Fortunately, I have chosen my advisor well, and I’m confident that we’ll figure it out.


I’ve got to decide in which labs to rotate this year. I figure I’ll keep track of my options here. I’m not sure if I should choose labs because I think they’re doing cool, interesting stuff or if I should choose labs because I think I might be able to learn a useful technique.

1. John Roth
I am particularly interested in his work on adaptive mutation.
2. Scott Dawson
He’s a super-nice guy, and he’ll teach me some useful microscopy stuff. And, more than one person said that I should do a rotation with him.
3. Mitch Singer
Myxococcus seems like a very cool organism. Mitch seems like a nice, enthusiastic guy.
4. Ian Korf
This seems like a no-brainer. Especially if I’m going to be working on the GEBA project.
5. Katie Pollard
Statistics is (are?) fun! (and useful)
6. David Mills
I love wine!
7. Artyom Kopp
I feel like this is cheating a little (because I’ve already been working with him.) But, this might be a nice way to keep working on Drosophila for a while.
8. Wolf-Dietrich Heyer
9. Stefan Wuertz
If I can do a rotation outside of Microbiology, then I should definitely talk to him!
10. Michael Syvanen
Horizontal gene transfer.
11. David Begun