Hadley Wickham writes:
The best way to impact the world as a data scientist or statistician is to be useful. This column gives my advice on being useful:
• Write code
• Work in the open
• Tell the world
(There are lots of other ways to be useful, but this is my path.)
Write code
Every modern statistical and data analysis problem needs code to solve it. Don’t learn just the basics of programming; spend some time gaining mastery. Improving your programming skills pays off because code is a force multiplier: once you’ve solved a problem, code allows you to solve it much faster in the future. As your programming skill increases, the generality of your solutions improves: you solve not just the precise problem you encountered, but a wider class of related problems (in this way programming skill is very much like mathematical skill). Finally, sharing your code with others allows them to benefit from your experience.
I’m partial to R packages as a way of writing and distributing code. R packages are great because they include not just R code, but also documentation, sample data, compiled C/C++ code, and tests. R packages are easily accessible by millions of R users because getting your code onto their computer is just a single function call (devtools::install_github() or install.packages()).
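As a rough sketch of what that single function call looks like in practice, the snippet below wraps install.packages() in a small helper that installs a package only when it is missing. The helper name install_if_missing and the "user/repo" placeholder are assumptions made for illustration; they are not part of base R or devtools.

```r
# Hypothetical helper (not part of base R or devtools): install a package
# from CRAN only if it is not already available on this machine.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call downloads nothing.
install_if_missing("stats")

# For a released package on CRAN:
#   install_if_missing("devtools")
# For a development version hosted on GitHub ("user/repo" is a placeholder):
#   devtools::install_github("user/repo")
```

The two commented calls need a network connection, and install_github() additionally requires that devtools is already installed.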
Work in the open
Writing code is much easier if you don’t do it alone. Your goal should not be to strive in solitude for years before releasing the perfect package. Instead, work in the open, publishing not only the final product but every intermediate stage. If you work this way, you’ll get feedback much earlier in the process, and your motivation will remain high because you know people care.
There are two keys to working in the open. First, release your code with an open source license. There are many licenses to pick from, but try not to get bogged down in the details. I recommend starting at http://choosealicense.com, which summarizes the most important licenses. Second, learn to use Git and GitHub. Git is a vital tool for collaboration, and GitHub gives your code a home on the web where others can easily view it, report bugs, and suggest improvements.
Once you’ve got some code that does something useful, you need to show people how to use it. Start by describing it in text. If you’re writing an R package, write a vignette, a long-form document that describes how to apply your package to solve real problems.
The key to effective teaching is to put yourself in the mind of a novice. Always start with the motivation: why should someone care about your package? What awesome things does your package make easy? Show some examples of the cool stuff, and then dive into the details. I find that writing about my code improves its quality: it forces me to recognize the rough edges, inconsistencies and missing special cases.
If you have the opportunity to teach in person, take it! When you teach in person, you can only cover a small fraction of the material that you’ve written about, but it is incredibly helpful because it gives immediate feedback on what’s hard to understand and what is easy.
Tell the world
It doesn’t matter how great your work is if no one knows about it. If you want to have an impact on the world, you need to think about marketing. While many academics think marketing is a dirty word, it’s not actually about tricking people into using your tools. Instead, it’s about making other people’s lives easier by letting them know about your awesome tools.
There’s lots to say about marketing and I’m certainly no expert. But I think the most important thing to remember is that it’s not about you. It doesn’t matter how many hours you’ve spent developing the software, or how many awards you’ve won, or how fantastically wonderful your code is. Instead, get out of the picture, and explain why using your code will make life easier. Kathy Sierra has a great blog post about this in the context of talks: Presentation skills considered harmful (http://seriouspony.com/blog/2013/10/4/presentation-skills-considered-harmful). (In fact, I’d strongly encourage you to read every article on her old and new blogs.)
Concretely, I think the best way to let people know about your work is to post updates on a blog and on Twitter.
There are many great role models who are applying these principles every day. Here are a handful who I’m particularly impressed by:
• The Simply Statistics group has a great blog and an active Twitter account, @simplystats. They have been teaching thousands of people about statistics and data science with their Coursera courses. I also love Jeff Leek’s open guides to sharing data, writing packages, reviewing papers and more (https://github.com/jtleek).
• rOpenSci is a community of scientists who are developing R packages to make open science easier. To date, they have published over 30 packages to CRAN. They also organize ‘hackathons’ and tutorials to help scientists get better at programming and data analysis.
If you want to have an impact on the world, start applying these principles today!