Original question: What's so bad about Lazy I/O?
I’ve generally heard that production code should avoid using Lazy I/O. My question is, why? Is it ever OK to use Lazy I/O outside of just toying around? And what makes the alternatives (e.g. enumerators) better?
Lazy IO has the problem that releasing whatever resource you have acquired is somewhat unpredictable, as it depends on how your program consumes the data — its “demand pattern”. Once your program drops the last reference to the resource, the GC will eventually run and release that resource.
Lazy streams are a very convenient style to program in. This is why shell pipes are so fun and popular.
However, if resources are constrained (as in high-performance scenarios, or production environments that expect to scale to the limits of the machine) relying on the GC to clean up can be an insufficient guarantee.
Sometimes you have to release resources eagerly, in order to improve scalability.
So what are the alternatives to lazy IO that don't mean giving up on incremental processing (since processing everything in one batch would consume too many resources)? Well, we have foldl-based processing, aka iteratees or enumerators, introduced by Oleg Kiselyov in the late 2000s and since popularized by a number of networking-based projects.
Instead of processing data as lazy streams, or in one huge batch, we instead abstract over chunk-based strict processing, with guaranteed finalization of the resource once the last chunk is read. That’s the essence of iteratee-based programming, and one that offers very nice resource constraints.
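The core idea can be sketched with nothing but base: fold a strict accumulator over a file one piece at a time, so only the current piece is resident and the handle is released deterministically by withFile. (foldLines and the demo file name are illustrative here, not taken from any iteratee library; a real iteratee works over byte chunks rather than lines.)

```haskell
import System.IO

-- Minimal sketch of chunked strict processing (here "chunk" = one line):
-- the accumulator is forced at every step, and withFile guarantees the
-- handle is closed as soon as the last chunk has been read.
foldLines :: (a -> String -> a) -> a -> FilePath -> IO a
foldLines step z path = withFile path ReadMode (go z)
  where
    go acc h = do
      eof <- hIsEOF h
      if eof
        then return acc
        else do
          line <- hGetLine h
          let acc' = step acc line
          acc' `seq` go acc' h

main :: IO ()
main = do
  writeFile "demo.txt" "one\ntwo\nthree\n"
  n <- foldLines (\c _ -> c + 1) (0 :: Int) "demo.txt"
  print n  -- 3
```

Note that nothing here depends on laziness: the consumer only ever sees one strict chunk, which is exactly the resource bound iteratees enforce.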
The downside of iteratee-based IO is that it has a somewhat awkward programming model (roughly analogous to event-based programming, versus nice thread-based control). It is definitely an advanced technique, in any programming language. And for the vast majority of programming problems, lazy IO is entirely satisfactory. However, if you will be opening many files, or talking on many sockets, or otherwise using many simultaneous resources, an iteratee (or enumerator) approach might make sense.
Dons has provided a very good answer, but he’s left out what is (for me) one of the most compelling features of iteratees: they make it easier to reason about space management because old data must be explicitly retained. Consider:
```haskell
average :: [Float] -> Float
average xs = sum xs / fromIntegral (length xs)
```
This is a well-known space leak: the entire list xs must be retained in memory because it is traversed twice, once to compute sum xs and again to compute length xs. It's possible to make an efficient consumer by combining both traversals into a single fold:
```haskell
import Data.List (foldl')

average2 :: [Float] -> Float
average2 = uncurry (/) . foldl' step (0, 0)
  where
    -- Forcing both components keeps the accumulator free of thunks;
    -- with a plain foldl and a lazy pair this would build up thunks instead.
    step (sumT, n) x =
      let s' = sumT + x
          n' = n + 1
      in s' `seq` n' `seq` (s', n')
```
But it’s somewhat inconvenient to have to do this for every stream processor. There are some generalizations (Conal Elliott – Beautiful Fold Zipping), but they don’t seem to have caught on. However, iteratees can get you a similar level of expression.
```haskell
aveIter = uncurry (/) <$> I.zip I.sum I.length
```
This isn’t as efficient as a fold because the list is still iterated over multiple times, however it’s collected in chunks so old data can be efficiently garbage collected. In order to break that property, it’s necessary to explicitly retain the entire input, such as with stream2list:
```haskell
badAveIter = (\xs -> sum xs / fromIntegral (length xs)) <$> I.stream2list
```
The state of iteratees as a programming model is a work in progress, but it's much better than even a year ago. We're learning which combinators are useful (e.g. enumWith) and which are less so, with the result that built-in iteratees and combinators provide continually more expressivity.
That said, Dons is correct that they’re an advanced technique; I certainly wouldn’t use them for every I/O problem.
I use lazy I/O in production code all the time. It’s only a problem in certain circumstances, like Don mentioned. But for just reading a few files it works fine.
Another problem with lazy IO that hasn’t been mentioned so far is that it has surprising behaviour. In a normal Haskell program, it can sometimes be difficult to predict when each part of your program is evaluated, but fortunately due to purity it really doesn’t matter unless you have performance problems. When lazy IO is introduced, the evaluation order of your code actually has an effect on its meaning, so changes that you’re used to thinking of as harmless can cause you genuine problems.
As an example, here’s a question about code that looks reasonable but is made more confusing by deferred IO: withFile vs. openFile
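The surprise is easy to reproduce in a few lines (demo2.txt and the two "versions" are illustrative): the only difference between them is whether the length is demanded before or after hClose, yet the results differ.

```haskell
import System.IO

main :: IO ()
main = do
  writeFile "demo2.txt" "abcdef"

  -- Version A: demand the contents before closing the handle.
  hA <- openFile "demo2.txt" ReadMode
  sA <- hGetContents hA
  print (length sA)  -- 6
  hClose hA

  -- Version B: an apparently harmless reordering. hGetContents is lazy,
  -- so nothing has been read by the time hClose runs, and the string is
  -- silently truncated to whatever was read so far -- here, nothing.
  hB <- openFile "demo2.txt" ReadMode
  sB <- hGetContents hB
  hClose hB
  print (length sB)  -- 0
```

In pure code, swapping those two statements could not change the value printed; with lazy IO it does, with no error or warning.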
These problems aren’t invariably fatal, but it’s another thing to think about, and a sufficiently severe headache that I personally avoid lazy IO unless there’s a real problem with doing all the work upfront.
Update: Recently on haskell-cafe, Oleg Kiselyov showed that unsafeInterleaveST (which is used for implementing lazy IO within the ST monad) is really unsafe: it breaks equational reasoning. He shows that it lets you construct

```haskell
bad_ctx :: ((Bool,Bool) -> Bool) -> Bool
```

such that

```
> bad_ctx (\(x,y) -> x == y)
True
> bad_ctx (\(x,y) -> y == x)
False
```

even though == is commutative.
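A construction close to the one posted on haskell-cafe looks like this (a sketch using the lazy ST monad; the exact code in Oleg's message may differ in details). The write to the STRef happens as a side effect of *evaluating* x, so whichever component the calling context forces first determines what y sees:

```haskell
import Control.Monad.ST.Lazy (runST)
import Control.Monad.ST.Lazy.Unsafe (unsafeInterleaveST)
import Data.STRef.Lazy

bad_ctx :: ((Bool, Bool) -> Bool) -> Bool
bad_ctx body = body (runST comp)
  where
    comp = do
      r <- newSTRef False
      -- x performs a write as a side effect of being evaluated:
      x <- unsafeInterleaveST (writeSTRef r True >> return True)
      -- y reads r lazily, whenever it is demanded:
      y <- readSTRef r
      return (x, y)

main :: IO ()
main = do
  print (bad_ctx (\(x, y) -> x == y))  -- forces x first: the write runs, y reads True
  print (bad_ctx (\(x, y) -> y == x))  -- forces y first: y reads False before the write
```

The evaluation order of the pure-looking expression x == y now determines its value, which is exactly the breakdown of equational reasoning being claimed.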
Another problem with lazy IO: The actual IO operation can be deferred until it’s too late, for example after the file is closed. Quoting from Haskell Wiki – Problems with lazy IO:
For example, a common beginner mistake is to close a file before one has finished reading it:
```haskell
wrong = do
  fileData <- withFile "test.txt" ReadMode hGetContents
  putStr fileData
```
The problem is withFile closes the handle before fileData is forced. The correct way is to pass all the code to withFile:
```haskell
right = withFile "test.txt" ReadMode $ \handle -> do
  fileData <- hGetContents handle
  putStr fileData
```
Here, the data is consumed before withFile finishes.
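When the data has to outlive withFile, another option is to force it completely inside the callback, e.g. with evaluate from Control.Exception (readWhole and demo3.txt are illustrative names):

```haskell
import System.IO
import Control.Exception (evaluate)

-- Force the whole string while the handle is still open; after that it
-- is safe to return it past the end of withFile.
readWhole :: FilePath -> IO String
readWhole path = withFile path ReadMode $ \h -> do
  s <- hGetContents h
  _ <- evaluate (length s)  -- demand every character before h is closed
  return s

main :: IO ()
main = do
  writeFile "demo3.txt" "forced before close\n"
  s <- readWhole "demo3.txt"
  putStr s
```

Of course, this gives up incrementality entirely: the file is wholly in memory, which is fine for small files but is exactly the trade-off the iteratee discussion above is about.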
This is often unexpected and an easy-to-make error.
See also: Three examples of problems with Lazy I/O.