I agree it's a tricky problem. As you said, the time duration can be stochastic, so you can't really do an accurate test. On the other hand, it's also tricky to set an appropriate baseline, i.e., the target against which the performance of the new code is compared. Benchmarking in general is hard because there are often too many factors that affect performance in different ways. Take the string-processing functions in this post as an example: their performance depends a lot on the size of the input. For small documents, I don't think there will be much difference. To benchmark fairly and meaningfully, we would need to know the common size of knitr input documents (well, a "common size" may not exist, and perhaps I should say the "statistical distribution of sizes").
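Just to illustrate what I mean by the stochastic timing and the size dependence, here is a minimal sketch (in Python rather than R, and `strip_comments` is a made-up stand-in for a string-processing function, not anything from knitr): repeat the measurement several times and report a summary, because any single run can be noisy, and do it at several input sizes, because the ranking of two implementations may change with size.

```python
import statistics
import timeit

def strip_comments(lines):
    # Hypothetical string-processing step: drop comment lines.
    return [line for line in lines if not line.lstrip().startswith("#")]

for n in (10, 1_000, 100_000):
    doc = ["x <- 1  # a comment", "y <- x + 1"] * (n // 2)
    # Each run's duration is stochastic, so repeat the measurement
    # and report the minimum and the median rather than a single number.
    runs = timeit.repeat(lambda: strip_comments(doc), number=5, repeat=7)
    print(f"n={n:>6}: min={min(runs):.6f}s  median={statistics.median(runs):.6f}s")
```

Even this toy setup shows the problem: the numbers for n=10 are dominated by noise, so a "speedup" measured only on small inputs may not mean much.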
BTW, you may have heard this from Donald Knuth:
[...] premature optimization is the root of all evil (or at least most of it) in programming.