We run into the same kind of challenges testing programs with IO in Coq.io.
To stay pure, we keep the IOs uninterpreted until compilation to an OCaml program. We define the tests on the program with uninterpreted IOs. For clarity of the tests, we use an interactive debugger (reusing the tactic mode of Coq) to step through the IO operations. The main advantage of using Coq is that the tests can be made symbolic, thus covering a larger number of cases (if not all the cases). A simple example is explained here: http://coq.io/getting_started.html#use-cases (coincidentally, this is the same example as in this blog post).
If your tests fail for any of the reasons the author leads with, and don't get fixed straight away with some soul searching on the part of your developers, you need to start hiring some engineers.
Hi, I have personal experience working with some of the best software engineers on the planet for the last decade, and despite everyone's best efforts, when a unit test suite grows to thousands and thousands of tests, there will always be some that are flaky. Perhaps they call into a code path which was unwittingly refactored to access a local database. Perhaps they read the current time for some time-dependent string formatting. Perhaps they mutate some global state and forget to clean it up. Perhaps they call into some code that opens a TCP connection to some third-party service.
Every large test suite I've seen in every language at every company I've worked with has intermittent or flaky tests. Maybe each test on average is 99.9% reliable. But across the entire suite, you will have a low but persistent rate of intermittent test failures. Then you need a team of people whose job it is to triage, investigate, and either fix or ignore those failures. This is a complete waste of human capacity, and it's not a very career-enriching task either. Haskell's ability to define restricted subsets of general IO is extremely powerful and sidesteps this entire problem.
The power of the technique in the article is that your tests are _guaranteed_ reliable by the type system, because it is no longer possible to accidentally call into some general IO action. Code under test must go through the World API, which is fully and deterministically mocked in tests.
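Concretely, the shape of the technique is something like this -- a minimal sketch, with illustrative names (the post's actual World API differs in its details):

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import Control.Monad.State

-- A restricted effect vocabulary: code under test can ONLY use these
-- operations. (Names here are illustrative, not the post's real API.)
class Monad m => World m where
    readLine  :: m String
    writeLine :: String -> m ()

-- Production: interpret World as real IO.
instance World IO where
    readLine  = getLine
    writeLine = putStrLn

-- Tests: a pure, deterministic interpreter over scripted input/output.
newtype Fake a = Fake (State ([String], [String]) a)
    deriving (Functor, Applicative, Monad)

instance World Fake where
    readLine = Fake $ do
        (is, os) <- get
        case is of
            (i:rest) -> do put (rest, os); return i
            []       -> error "Fake: ran out of scripted input"
    writeLine s = Fake $ modify (\(is, os) -> (is, os ++ [s]))

runFake :: [String] -> Fake a -> (a, [String])
runFake input (Fake m) =
    let (x, (_, out)) = runState m (input, []) in (x, out)

-- Code under test never mentions IO, so it *cannot* sneak in
-- time, randomness, sockets, or anything else.
greet :: World m => m ()
greet = do
    name <- readLine
    writeLine ("Hello, " ++ name)
```

`runFake ["Alice"] greet` evaluates to `((), ["Hello, Alice"])`, every single run, because nothing in `greet`'s type permits any other effect.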
I have worked with huge test suites. If your test suite contains tests that fail for the reasons given, and the result is a shrug, your test suite becomes useless very quickly, and then triaging which tests are allowed to fail, which are real problems, and so on becomes a much bigger waste of time and treasure. Making a big deal out of it when tests are added that mutate global state or call 3rd party APIs is an investment that pays back very quickly. Shrugging and saying "98% of the tests passed, that's probably good enough" means your test suite itself is a waste of time and resources and you could be getting the same result by just doing a quick happy path minimal set.
Agreed with that. In a world where you can't statically enforce that tests are deterministic, you absolutely must be diligent. However, it still happens.
An example of "innocent intermittence" recently came up at my current employer: there was a test that invoked a web service twice, verifying idempotence, and asserted that the two responses were identical. This worked almost all of the time, but when the two service invocations straddled a minute boundary, some output text that included a rendered timestamp (rounded to the minute) differed between the two responses, causing the test to fail.
It's extremely hard to get an engineering organization to just inherently know not to call, say, datetime.now() in their code, especially when the tests pass most of the time.
The problem is that developers do not know that they mustn't call datetime.now() and there will always be a developer who doesn't know this, either because people have a tough time knowing everything, or just because you're hiring new people.
Wouldn't it be nice if you could just reject such patterns within the first millisecond of a developer's first interaction with the compiler?
> An example of "innocent intermittence" recently came up
> at my current employer: there was a test that invoked a
> web service twice, verifying idempotence. It then
> asserted the responses were identical.
This is not innocent intermittence, this is a bad test. No-one should get in trouble for it, but I would certainly expect senior developers to get involved to find out:
- Why is idempotence being tested for on the response side rather than on the execution end? If the call is truly idempotent, then a changing date-stamp means the test has captured the information envelope as well as the data payload, and is testing that - which is sloppy. You'd probably be upset to see the HTTP headers being tested in the same string, so why other pieces of the envelope?
- If a test is calling a "web service", then it's either an integration test (so why's it testing for idempotence, rather than testing the envelope?) or it's a unit test gone awry.
- Are there other tests that do this too?
- Could this all be abstracted out into a sensible library to make it easier for more junior developers to not make these mistakes?
The moment you start accepting less than 100% test pass rates as being OK, the moment madness descends and you lose much of the value of your test suite.
The trickiness comes when you have to write these mocks, and in whether the mocks are based on correct assumptions.
Granted, Haskell's type system goes a long way to helping with that (often a function of type A -> B only has a handful of possible definitions given sufficiently complex types A and B), but it's still worth using actual production-use code paths when feasible.
EDIT: Where I work, we have a large amount of tests, but we have pretty much never deployed an instance of our application without having every test pass.
The idea of having 100% of your tests pass is a lot easier when it's the default, and when you're doing more continuous integration.
If you're using a language like Haskell, your business logic will already be pretty separate from the ickiness of the IO layer, so most of your tests will be deterministic and nice by default.
If you're using a messier language... well, there's a lot of test suites out there. If you're using something that's mainly Django/Rails-y stuff, you don't have many excuses not to be 100% passing by default. Most major frameworks/languages offer extensive mocking tools and the like to avoid non-deterministic behavior in tests.
> Haskell's ability to define restricted subsets of general IO is extremely powerful and sidesteps this entire problem.
Except it doesn't, precisely where that problem is most annoying. For example, suppose you want to test a scheduler, or any mechanism that places a timeout on an operation. You could abstract away the entire timing mechanism, but that is exactly what you're trying to test: maybe the developer used some system function wrong, and is now waiting 5 seconds instead of 5 milliseconds?
This approach (which is the same as the mocking approach -- there's nothing special about Haskell here) doesn't do too well when you want to test timing, and those tests are usually the most brittle (hosted continuous integration services often introduce really long pauses). Of course, you could weave virtual time services throughout your code, but then you have the third-party library problem again.
Mocking (or IO isolating as you call it in Haskell) ends up producing really good unit-tests, but often the more important tests span more than one small unit, and making those non-flaky is very hard.
In theory, you could replace all references to time in a JVM language -- even in libraries -- using an agent or a bootstrap classloader in order to make larger tests run predictably, but I'm not aware of people going to such great lengths (maybe Google does it; I know, for example, that when it comes to accessing the filesystem, they replace the JVM's filesystem provider with an in-memory file-system[1] when they don't mock, but I don't know how they deal with clocks).
EDIT: Apparently somebody has done just that as a research project[2]. It's actually given me an idea for a nice afternoon-project.
I get the feeling you're spouting very strong opinions without ever having actually written real production code in Haskell. Because what you're saying is nonsense.
Anything you can do in an imperative language you can do in Haskell - the only difference is that Haskell allows you to define subsets of IO that are enforced by the type system.
Mocking time is crucial for testing timeouts. You don't want your test to actually take five seconds. You want to rapidly test the code in the case that a timeout occurs before completion of the action and vice versa. You want those tests to be 100% deterministic and fast. With the World approach described you can get that.
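A sketch of what that looks like (the class and names here are made up for illustration, not a real library API): the clock is an effect the code under test requests, and the test instance is just a number the test controls.

```haskell
{-# LANGUAGE FlexibleInstances #-}
import Control.Monad.State

-- Illustrative sketch: time access goes through a class,
-- so tests can control the clock precisely.
class Monad m => HasClock m where
    currentMillis :: m Integer

-- Code under test: has the deadline passed?
timedOut :: HasClock m => Integer -> Integer -> m Bool
timedOut startedAt timeoutMs = do
    now <- currentMillis
    return (now - startedAt >= timeoutMs)

-- Test instance: the "clock" is just a number the test sets.
instance HasClock (State Integer) where
    currentMillis = get

-- Testing a five-second timeout takes zero wall-clock time:
--   evalState (timedOut 0 5000) (4999 :: Integer)  ==> False
--   evalState (timedOut 0 5000) (5000 :: Integer)  ==> True
```

Both branches of the timeout logic get exercised instantly and deterministically, with no sleeping and no sensitivity to a slow CI machine.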
Also, there is nothing about the technique described that limits you to unit tests. IMVU tests entire web services with this approach and it works well. "World" in that case includes APIs for accessing Redis, MySQL, the customer's session, time, concurrency, and so on. They're all perfectly faked out in tests, and the tests are 100% reliable.
Of course I haven't written production code in Haskell, and I'm not saying that there's stuff you can't do in Haskell.
So far, I've said two things. One is that the technique described in the post, which "uses the power of Haskell", is actually trivially done (and has been used by countless projects for years) in almost all imperative languages, and less intrusively (it doesn't require introducing new types and works with third-party code). I don't know much about Ruby and Python, but the JVM also allows you to define subsets of IO (even restricted by actual files) that are enforced by the runtime, or subsets of IO operations that will be enforced by the compiler. This technique has been used in imperative languages for a very long time. That's not an opinion but a fact.
The second thing I've said is an opinion, and is unrelated to a specific programming language. I said that a lot of very small, fully-faked unit-tests only get you so far, and it's fairly easy to get those tests to run 100% predictably. The more troublesome tests are those that aren't faked, and they are crucial for exposing many bugs. That's it. I don't think I've said anything too controversial.
Thanks for the clarification. But statements like this are why I think you're spouting crap:
"is actually trivially done (and has been used by countless projects for years) in almost all imperative languages, and less intrusively"
Is simply not true. In a JavaScript test framework, drawing from top-of-mind, recent examples, you cannot prevent the code under test from replacing document.location.
In Java, as you gave in another example in this thread, you can't statically prevent access to the RNG or to the current time.
In Python, there's nothing stopping any code from calling into a C module that has arbitrary effects.
Haskell is unique among the languages I've used in production in allowing deterministic testability as a static, enforced-by-the-compiler guarantee.
I partially agree with your second comment, and I do think other forms of testing (such as fuzz testing, integration tests with third parties, acceptance tests) are valuable, but the technique described can be used in much larger tests than simple unit tests. As I mentioned, IMVU uses it for testing entire web service request handlers.
I get the feeling some people think that World is onerous and expensive to set up, but in practice you just write all of your code in World and it's not boilerplatey or complicated at all. The framework around World is set up once and never touched again.
* Disclaimer: I no longer work at IMVU, but I super super super miss Haskell in production.
I don't know too much about JS or Python (I'm mostly C/C++/Java/Clojure/Matlab), but there's nothing to stop you from accessing C in Haskell either, nor from using unsafePerformIO. And while in Java you can't prevent access to random numbers etc. statically, you can certainly do it at runtime (though in Java, too, you can call C, and then all bets are off).
> Haskell is unique among the languages I've used in production in allowing deterministic testability as a static, enforced-by-the-compiler guarantee.
Oh, that's true. It's just that if your tests are guaranteed to be deterministic, I don't see why the fact that the property is enforced statically is so important.
Personally, I think Haskell sacrifices a lot for the sake of static guarantees (including the ability to reason about programs not before they run, but during and after). Also, its reliance on Curry-Howard for verification is limited and narrow-sighted, and its insistence on referential transparency everywhere is interesting yet misguided. I'm now preparing a talk for the Curry On/ECOOP conference where I'll show -- among other things -- how all monads can be easily translated to imperative constructs (I won't go into details here, but what you need is continuations and continuation-global variables) in a way that makes them more understandable and compose more naturally than in PFP (i.e. there's no need for monad transformers). Even though this model does not equate the subroutine with the mathematical function -- as PFP does -- it is no less verifiable than the PFP approach. But I digress...
You can prevent access to `unsafePerformIO` and other unsafe language features using "Safe" Haskell (e.g. -XSafe). This happens at compile time. Functions like unsafePerformIO (IO a -> a), pure C FFI functions, overlapping instances, and TemplateHaskell are not allowed. Nothing is stopping you from accessing C, but you can force FFI functions to reside in IO.
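For instance, a minimal sketch: a module compiled under Safe Haskell is rejected by GHC at compile time if it tries to import an unsafe module.

```haskell
{-# LANGUAGE Safe #-}

-- With -XSafe (or this pragma), GHC rejects unsafe features at compile
-- time. Uncommenting the next line makes this module fail to compile,
-- because System.IO.Unsafe is not a Safe module:
-- import System.IO.Unsafe (unsafePerformIO)

double :: Int -> Int
double x = x * 2
```

The pure code still compiles and runs as normal; only the escape hatches are closed off.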
Haskell lets us have generic "mocks", which is -way- better than any kind of expectation-based mock. (admittedly, the example in andy's blog post doesn't show the power of this technique very well)
But my experience with mocks is that expectation-based mocks are fragile and make it easy to write tests that pass but are fundamentally wrong; if you just say you expect XYZ to come out of the database, even if your code gets refactored to not insert XYZ into the database earlier, the test will pass despite the code being completely broken.
With the haskell World approach, you don't have that - the fake database is fake because it's in memory (and thus very fast and uses a blank state for each test), but it's actually implementing the same semantics as the real database.
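Roughly like this (an illustrative sketch, not IMVU's real World API or schema): the fake database is a Map held in State, with the same insert/lookup semantics as the real store.

```haskell
{-# LANGUAGE FlexibleInstances #-}
import Control.Monad.State
import qualified Data.Map as M

-- Illustrative sketch: a tiny key-value "database" effect
-- that the code under test is restricted to.
class Monad m => Database m where
    dbInsert :: String -> String -> m ()
    dbLookup :: String -> m (Maybe String)

-- The fake database is just a Map held in State: fast, blank for
-- every test, but implementing the same insert/lookup semantics.
instance Database (State (M.Map String String)) where
    dbInsert k v = modify (M.insert k v)
    dbLookup k   = gets (M.lookup k)

-- Code under test, oblivious to which database it's talking to.
rename :: Database m => String -> String -> m Bool
rename key newName = do
    existing <- dbLookup key
    case existing of
        Nothing -> return False
        Just _  -> do dbInsert key newName; return True
```

In a test, `evalState (dbInsert "u1" "Ann" >> rename "u1" "Anne" >> dbLookup "u1") (M.empty :: M.Map String String)` gives `Just "Anne"` with no real database in sight.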
Some of the other IO stuff we've encapsulated is just expectation-based mocks, and as you said that's not new, but it's still clean and doesn't require any changes to the code under test to make it mockable, which we had to do back when we used mocks in our PHP.
In-memory implementations are very useful, especially for larger tests, but, again, I don't see anything in Haskell that makes it easier. For example, on the JVM tests often use Google's Jimfs[1] in-memory file system (the JVM has pluggable file systems implemented as Java interfaces), and for relational databases they use H2[2]. None of this requires any code changes, perhaps thanks to the JVM's extreme pluggability. The JVM uses SPIs for almost everything, from encryption to filesystems to sockets -- you can fake the entire OS (almost) in Java, and what you can't fake easily, you can still fake with an agent (unrelated, but starting in Java 9, even the JIT is pluggable, so you can control machine-code generation at any granularity you want). Just since my conversation here with chadaustin, I've written a small library that uses an agent to make the entire JVM use a virtual clock[3] -- again, with no code changes necessary in the tested code.
How do you verify that your mock behaves exactly as the real database? The type system certainly helps with function signatures, but what about behavior?
> For example, suppose you want to test a scheduler, or any mechanism that places a timeout on an operation. You could abstract away the entire timing mechanism, but that is exactly what you're trying to test: maybe the developer used some system function wrong, and is now waiting 5 seconds instead of 5 milliseconds?
Time is an IO effect, so it's true that mocking it away removes that test. I'd treat that kind of "do I understand the API?" issue as an integration test, and split the problem up a little:
- "Mock" the timeout calls, to allow as many tests as possible to run outside IO
- Write tests where the operation always succeeds (eg. "Just x")
- Write tests where the operation always times out (eg. "Nothing")
- Write tests which do a combination of success/timeout and a bunch of different interleavings (QuickCheck would probably be useful here)
- Write some small integration tests, eg. give a timeout of "5" to a process that sleeps for 5 seconds, it should time out; give a timeout of "5" to a process that sleeps for 4 seconds, it should succeed; and so on.
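The first few bullets might look like this in practice (a sketch; the class name and `withTimeout` are made up for illustration, not a real API):

```haskell
{-# LANGUAGE FlexibleInstances #-}
import Control.Monad.State

-- The timeout primitive lives in a class, so tests can fake it.
class Monad m => Timeouts m where
    withTimeout :: Int -> m a -> m (Maybe a)

-- Code under test: retry once on timeout, then give up.
fetchWithRetry :: Timeouts m => m String -> m (Maybe String)
fetchWithRetry op = do
    r <- withTimeout 5000 op
    maybe (withTimeout 5000 op) (return . Just) r

-- Fake: a script of Bools decides whether each withTimeout call
-- succeeds or times out, covering the interleavings above.
instance Timeouts (State [Bool]) where
    withTimeout _ action = do
        outcomes <- get
        case outcomes of
            (ok:rest) -> do
                put rest
                if ok then Just <$> action else return Nothing
            [] -> return Nothing

-- evalState (fetchWithRetry (return "payload")) [True]          ==> Just "payload"
-- evalState (fetchWithRetry (return "payload")) [False, True]   ==> Just "payload"
-- evalState (fetchWithRetry (return "payload")) [False, False]  ==> Nothing
```

QuickCheck could then generate the `[Bool]` scripts to explore success/timeout interleavings exhaustively, leaving only the "did I use the system timer API correctly?" question to a handful of small integration tests.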
> Mocking (or IO isolating as you call it in Haskell) ends up producing really good unit-tests, but often the more important tests span more than one small unit, and making those non-flaky is very hard.
Haskell's IO isolation causes the vast majority of a program to be defined outside IO. You're right that the trickiest parts are often in IO, but at least those are pretty much separate from any complex calculation/logic.
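The typical shape, in miniature (a toy sketch, not from the article): all the logic lives in pure functions, with a thin IO shell on top.

```haskell
-- Pure core: all of the logic, trivially and deterministically testable.
summarize :: [Int] -> String
summarize xs = "n=" ++ show (length xs) ++ ", total=" ++ show (sum xs)

-- Thin IO shell: the only part that touches the outside world.
run :: IO ()
run = do
    contents <- getContents
    putStrLn (summarize (map read (lines contents)))
```

Tests hammer `summarize` directly; only `run` needs any IO-level testing at all.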
> but at least those are pretty much separate from any complex calculation/logic.
Yes, but in imperative languages (with meta-programming) you can disentangle mixed IO and logic even without changing the code (or if it's in a third-party library). The advantage is that it allows you (not always, but sometimes) to support even larger tests in this way.
Getting those tiny, one-function unit-tests to run predictably is not hard in any language; you just make them smaller and smaller and mock everything else. The tricky bit is making larger tests run predictably, and a flexible, hackable runtime has the advantage there. That's precisely where meta-programming shines.
> in imperative languages (with meta-programming) you can disentangle mixed IO and logic even without changing the code
I don't understand what you mean by "mixed", "disentangle" and "change".
I would say "mixed IO and logic" is when one block of code describes both how to calculate some value and which side-effects to perform. "Disentangling" these would mean separating the description of the calculation from the description of the side-effects. Doing that "without changing the code" doesn't make sense to me.
What I mean is that if your code (or a library you call) mixes logic with IO, meta-programming lets you capture the IO calls in a mock (like turning it into an IO monad), which you can then query and verify, without changing the code and the API's it's using.
E.g. on the JVM you do that with Mockito (similar mocking libraries exist for other languages/runtimes). Say you have a (silly) method that does computation and IO:
import static org.mockito.Mockito.*;

// The method under test: does some computation, then IO.
static void printDouble(int n, PrintStream out) {
    out.println("Result: " + (n * 2));
}

@Test
public void testPrintDouble() {
    PrintStream out = mock(PrintStream.class);
    printDouble(3, out);
    verify(out).println("Result: 6");
}
No output is actually done. The call to println (which may be deep in some library printDouble calls, rather than directly in the tested code) is captured by the mock and verified. We've essentially replaced any object that does direct IO with something akin to an IO monad.
We can query the mock in more sophisticated ways, too. Here's a very contrived example using Mockito's InOrder (printToBoth is a stand-in for whatever code under test writes to the two streams):

PrintStream out1 = mock(PrintStream.class);
PrintStream out2 = mock(PrintStream.class);
printToBoth(out1, out2);
InOrder inOrder = inOrder(out1, out2);
inOrder.verify(out1).println("first");
inOrder.verify(out2).println("second");

This contrived example also verifies the relative ordering of IO done on two separate streams. Again, no API change is necessary, and the actual IO calls may be deep in some library.
OK, so you're on about a physical/temporal separation of IO and logic: the logical calculations can be performed separately to the IO actions, and vice versa.
I view that as pretty trivial though; it's just a case of switching out the language's IO primitives, which are effectively free variables. With most "large" languages (Java, Haskell, etc.), the implementations are usually gnarly enough to make such substitution a significant engineering accomplishment, but that's an artefact of the situation, not the approach. As an analogy, the task of fitting corrective optics to the Hubble space telescope was only difficult because it was the Hubble space telescope; fitting adaptive optics is trivial enough that bespectacled children do it every day.
In light of this, I think that disentangling mixed-up definitions is a much more important and interesting problem. It's non-trivial to extract pure computations from imperative code (whether it's written in Java, or Haskell, or whatever), yet there is a great potential benefit for improving code in terms of understandability, modularity, re-use, maintenance, etc. The fact that pure code is easier to test, or that side-effects can be substituted for the purposes of tests, is really a minor point in comparison. They are workarounds for running code which has too many responsibilities. Much better to avoid the problem in the first place by writing separate code for separate concerns.
Yes, I agree that it's better to write good code than bad code :)
Clear separation of concerns is an important aspect of good code, and every language teaches this practice. I don't think Haskell has any advantage there.
I also think -- and this is probably a point of contention -- that the way Haskell separates pure computation from side-effects is a bit arbitrary. Haskell defines computation in a way similar to lambda calculus, which designates as impure any change to something which may then affect a function's result not through its arguments. In short, it equates the mathematical notion of the function with the software notion of the subroutine. Haskell subroutines are mathematical functions, period (well, except for unsafe etc.). I don't think that this is the best way to describe code, or to define separation of concerns (It also makes Haskell jump through hoops when it deals with things that are natural for computation but don't fit well with the notion of a mathematical function). A good separation of concerns in Haskell is not necessarily the best organization for Java/Python.
I don't know, other than good supervision by an experienced developer :) This one, though, isn't that good, IMO. Perhaps future effect systems will show us the way.
So let me try to see if I get it. You change the signature of your I/O functions to return a self-defined monad called World. To make sure this works in production, you make IO an instance of World so it can be run by your regular main.
Then for the test suite, you make an instance of World that keeps a state monad and mock implementations of the I/O functions that return/change that state instead of having nondeterministic side effects.
The whole process seems logical, but the Haskell is rather ugly. Surely someone has written a DSL or library to make working like this a little easier? Especially defining the FakeState monad with S.modify calls looks like a pain.
The thing is that while the implementation of the fake can get ugly, that's effectively library code that's written once - the actual application code is incredibly clean since all you have to do to it is tag it as being World instead of IO.
Likewise, for tests, the actual interface in practice is clean - we have a few fake-World-only setup functions, and when testing things like the database we can just use the regular insert interface to do our setup.
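For reference, the State type at the heart of FakeState is, stripped down, just this (a from-scratch sketch of what Control.Monad.State provides):

```haskell
-- The essence of the State monad: a function from an input state
-- to a result paired with an output state.
newtype State s a = State { runState :: s -> (a, s) }

-- Read the current state.
get :: State s s
get = State (\s -> (s, s))

-- Apply a function to the current state (this is what S.modify does).
modify :: (s -> s) -> State s ()
modify f = State (\s -> ((), f s))
```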
Where s is the type of your state and a is the type of your result. It's basically a function that takes the current state and gives you a result as well as a new state.
To those interested, the exact same technique is achievable in imperative languages[1] -- only with less effort and in a less intrusive way -- with mocks.
[1]: That support at least some notion of meta-programming/reflection.
I think you've misunderstood the implications of the technique described.
It's superficially similar to mocks, but the real power is that it defines _restricted_ effects, so that it is _impossible_ for the code under test to access, say, the current system time or to print to stdout. It is _only_ allowed to access APIs which are fully replaced in tests.
The real benefit is that the type system guarantees these tests are not flaky or intermittent - in unit tests they are guaranteed pure and deterministic.
This technique isn't applicable to languages like Python or C++ or JavaScript where any computation is free to have arbitrary side effects. In Haskell you can restrict computations to subsets of side effects, which is an enormously powerful technique, and is, for example, why Haskell's STM implementation is so great (and simple) compared to languages with unrestricted effects.
First, let's separate two kinds of effects: memory effects and IO. While PFP treats memory effects as side-effects separate from program logic, imperative languages do not (either approach has its pros and cons). As the post talks about IO, let's restrict ourselves to that.
If you believe that the main contribution of this approach is absolutely preventing any kind of uncaptured IO (I think it is extremely valuable even without this language-supervised restriction), then this, too, is trivially possible in, say, Java (or all other JVM languages). Just install a security manager in your tests, and it will make sure you don't accidentally access IO by bypassing mocks.
This would still have the advantage of not writing your program in any special way to accommodate this technique and it would apply to third-party libraries, too.
I didn't know you could use Java SecurityManager to implement a similar system. That's cool. Do you know anyone who does that in their test suite? I'd love to chat with them.
Sadly, this technique, like any blacklist, doesn't work for types of IO that can't be prevented, like mutations to document.location.
The technique described in the article is effectively a named whitelist of IO operations (called World) that, say, all HTTP request handlers are restricted to.
> Do you know anyone who does that in their test suite? I'd love to chat with them.
I do. Not often, as I don't have a lot of IO in my tests, but I've found the security manager useful for that purpose from time to time. I also use it to help our users enforce global contracts, such as prohibiting IO or any blocking code in fork-join computations.
That's cool. I chatted with a friend about this too and they said they've used SecurityManager in this way. But one thing they said is that SecurityManager can't be used to restrict things like accessing the current time. Do you know if that's true or not?
That's true -- out of the box. As with all things JVM, this, too, is very pliable. You can create a new permission -- "access clock" -- and then it is quite easy to use an agent to inject a call for the security manager to check for this permission whenever the clock is accessed.
In fact, since we started this discussion, I've written a small library that uses an agent to fake all clock accesses on the JVM with a user-supplied virtual clock, which can be set globally or per-thread. Injecting a security check is even easier.
> Also, if tests take a while to run, it's nice to get the feedback from the compile step, rather than the run step.
True. We do that in Java, too. The idea is that you don't want to let methods use any "global" IO access (like System.out or new FileOutputStream) -- only IO streams that are passed in as arguments (which can then be mocked) -- so you use an annotation processor[1] to prohibit any such use; it's automatically picked up, loaded and used by all IDEs to mark any use of "prohibited operations" as errors while you type. Square/Google's Dagger DI library uses the same approach to make sure -- at compile time, with IDE support -- that all dependencies are satisfied.
Again, this works without requiring any use of new/non-standard APIs.
[1]: It's a class that gets run during one of the compilation phases, like a poor man's macro system.
You can, but in Haskell you can enlist the compiler's help to prove that everything is completely airtight.
This property continues to hold, even as the application and the tests are modified by developers who do not even fully know what sort of effects they should otherwise be concerned with.
For instance, most programming platforms give you pretty ready access to a pseudo-random number generator. This is great until some well-intentioned new hire checks in code someplace that uses it to instrument something on 1 out of every 1,000 runs.
We can't have this problem in Haskell and we don't have to spend any ongoing code review hours on it. It's gone.
Yep, it's also great that we can enlist the runtime's help to do the same thing in Java (or any other JVM language) and support use of third-party libraries (the Haskell approach places a pretty severe burden on the code) and make it even more airtight (IIANM you can't prevent unsafePerformIO): We just disable IO by enabling the security manager with a "no-IO" policy[1]. So it's less intrusive, more widely applicable and more airtight than the Haskell approach.
True, the security manager does not limit access to the random-number generator, but this, too, can be done with an extra bit of one-time effort, by injecting a security check into the random number generator's seed generation method (we still want to allow fixed seeds like StdGen).
[1]: Or a finer-grained policy of "no IO except for logging".
Why, for testing. The calls to IO are replaced with calls to mocks which are then used to verify IO behavior. It's like replacing the IO monad with a type that doesn't actually do IO, only it doesn't require the code to change, so it works even for third-party code.
You can then use the security manager during the test to verify that all IO calls are indeed mocked.