I'd love to learn the details of how cluster rendering works. How do you break down the task? There's much more dependency between components (light, physics, and materials) than in the web server world.
Light does not interact with itself under normal circumstances, so you can render each pixel completely separately once you have the environment.
In raytracing you render in reverse starting from the "eye", not forward from the lights, so each pixel of each frame can be parallelized. (With perhaps a small amount of mixing at the end to handle aliasing/quantization errors.)
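A toy sketch of why this parallelizes so cleanly: each pixel's ray only reads the (immutable) scene, so pixels can be farmed out to a pool with no coordination. Everything here is hypothetical illustration, not any particular renderer's API.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 4, 4  # toy image size

def shade(pixel):
    """Stand-in for tracing one ray backward from the eye through a pixel.
    It depends only on the pixel coordinates (and, in a real renderer,
    a read-only scene description), so every pixel is independent."""
    x, y = pixel
    return (x, y, (x * 16 + y * 16) % 256)  # made-up shade value

pixels = [(x, y) for y in range(HEIGHT) for x in range(WIDTH)]
with ThreadPoolExecutor() as pool:
    image = list(pool.map(shade, pixels))  # order-preserving, fully parallel
```

Because `shade` has no shared mutable state, the only serial steps left are building the scene up front and assembling the finished pixels at the end.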
It's done on frames because that's how renderers work. You build a description of a single image and the renderer renders it. The renderer will generate threads which will all render a 'bucket', which is just a square block of pixels in the image.
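A minimal sketch of that bucket split (sizes and names are illustrative): carve the image into square tiles, clamping at the right and bottom edges, and hand each tile to a worker thread.

```python
def buckets(width, height, size):
    """Split a width x height image into square tiles ('buckets').
    Each tile is (x, y, tile_width, tile_height); edge tiles are
    clamped so the grid exactly covers the image."""
    tiles = []
    for y in range(0, height, size):
        for x in range(0, width, size):
            tiles.append((x, y, min(size, width - x), min(size, height - y)))
    return tiles
```

Each worker thread then pops tiles off this list until the frame is done, which also load-balances naturally: fast threads just take more buckets.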
It's possible to parallelise this across several processes on different machines, but this is obviously less efficient because you'll have to do all the render startup tasks (reading resource files, building acceleration structures) multiple times.
I can't go into any specifics, but the general approach that most places use these days is to assign complete frames to each node on the renderfarm. The nodes are multi-core and threads divvy up the work of rendering the image in tiles. That's not to say that production renderers can't do multi-process renders too, but you don't see it nearly as often as multi-threaded renders. Remember that it can often take several minutes just to read scenes in; even if you split up an image into individual pixels and gave each node just a single pixel to do there'd still be a limit to how quickly you could get a frame back. So if you parallelize across your cluster on frames, then yes, there's usually several hours latency for a frame sequence, but you also get much better throughput.
By the way, if you haven't heard of Blinn's Law [0], you may find it interesting.
This is not true. Frames are queued to computers with free cores and memory. Splitting a single frame across multiple computers is a brute-force option that is rarely used. When you get to that point on a show, it's because render times are huge, and often because renders have failed and need to be forced through during the day to make a deadline. It's a telltale sign of a disaster show.
On an individual box frames are split up into tiles and the tiles are rendered on individual cpu cores.
I don't know what you mean by "a few" frames per day, but it sounds like you have a misconception about how it works. Typically many, many thousands of frames are rendered every day, and most of those are rendered multiple times as artists iterate.
The unit of parallelization for renders is always the frame. On a machine with multiple cores each core will render tiles in parallel, but across machines jobs are split up by frame. This is partly because it's the simplest type of distribution for people to understand (machine goes bad? You lose one frame, not arbitrary pixels in an image), but also because it's more efficient for a single machine to read all the data for a single frame than for multiple machines to request the same data repeatedly across the network. I think people tend to underestimate just how massive the geometry and scene description files are for a typical feature, and how much of the work involves managing storage and network efficiency.
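A bare-bones sketch of that frame-level distribution (node names and scheduling policy are invented for illustration; real farm schedulers also weigh memory, priority, and failures): whole frames go to nodes, so each node reads the scene once and a dead node costs complete frames, not scattered pixels.

```python
from collections import defaultdict

def assign_frames(frames, nodes):
    """Round-robin whole frames across render nodes. Each node loads
    the (huge) scene data once per frame it owns; no two nodes ever
    share responsibility for pixels within a frame."""
    schedule = defaultdict(list)
    for i, frame in enumerate(frames):
        schedule[nodes[i % len(nodes)]].append(frame)
    return dict(schedule)
```

For example, `assign_frames(range(1, 6), ["node-a", "node-b"])` gives node-a frames 1, 3, 5 and node-b frames 2, 4. If node-b dies, you requeue exactly frames 2 and 4.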
They don't do final renders immediately. They render previews with expensive features like global illumination turned off.
Rendering a single frame across multiple machines sounds wasteful. They would have to load the exact same textures and models for a single frame across all of them. When batch rendering, it would be more efficient to do that work just once per frame.