I feel like I'm going insane. Everyone says 512x512 should work with 8gb but when I do it I get:
CUDA out of memory. Tried to allocate 3.00 GiB (GPU 0; 8.00 GiB total capacity; 5.62 GiB already allocated; 0 bytes free; 5.74 GiB reserved in total by PyTorch)
Any ideas? I have a 3060 Ti with 8 GB of VRAM...
with 448x448 I get:
CUDA out of memory. Tried to allocate 902.00 MiB (GPU 0; 8.00 GiB total capacity; 6.73 GiB already allocated; 0 bytes free; 6.86 GiB reserved in total by PyTorch)
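The numbers in those two messages can be pulled apart to see where the 8 GiB is going. Below is a rough sketch in plain Python; the function name `summarize_oom` and the "unaccounted" label are mine, and the regex assumes the message layout shown above. "Unaccounted" is just total minus PyTorch's reserved pool minus free, so it is presumably memory held by something other than PyTorch's allocator (the CUDA context, the desktop, other apps), not a precise measurement.

```python
import re

# Unit multipliers, expressed in GiB.
_TO_GIB = {"GiB": 1.0, "MiB": 1.0 / 1024, "KiB": 1.0 / 1024**2, "bytes": 1.0 / 1024**3}

# Matches the message layout quoted in this thread (an assumption: other
# PyTorch versions may word the message differently).
_PATTERN = re.compile(
    r"Tried to allocate (?P<req>[\d.]+) (?P<req_u>\w+) "
    r"\(GPU \d+; (?P<total>[\d.]+) (?P<total_u>\w+) total capacity; "
    r"(?P<alloc>[\d.]+) (?P<alloc_u>\w+) already allocated; "
    r"(?P<free>[\d.]+) (?P<free_u>\w+) free; "
    r"(?P<res>[\d.]+) (?P<res_u>\w+) reserved"
)


def summarize_oom(message):
    """Return the figures from a CUDA OOM message, all converted to GiB."""
    m = _PATTERN.search(message)
    if m is None:
        raise ValueError("not a recognised CUDA OOM message")

    def gib(name):
        return float(m[name]) * _TO_GIB[m[name + "_u"]]

    total, reserved, free = gib("total"), gib("res"), gib("free")
    return {
        "requested": gib("req"),
        "allocated": gib("alloc"),
        "reserved": reserved,
        "total": total,
        # Neither in PyTorch's pool nor free: held by something else.
        "unaccounted": total - reserved - free,
    }


MSG_512 = ("CUDA out of memory. Tried to allocate 3.00 GiB (GPU 0; 8.00 GiB total "
           "capacity; 5.62 GiB already allocated; 0 bytes free; 5.74 GiB reserved "
           "in total by PyTorch)")
MSG_448 = ("CUDA out of memory. Tried to allocate 902.00 MiB (GPU 0; 8.00 GiB total "
           "capacity; 6.73 GiB already allocated; 0 bytes free; 6.86 GiB reserved "
           "in total by PyTorch)")

for label, msg in (("512x512", MSG_512), ("448x448", MSG_448)):
    info = summarize_oom(msg)
    print(label, {k: round(v, 2) for k, v in info.items()})
```

In both messages the reserved pool is well under 8 GiB while free is 0 bytes, which suggests a chunk of the card is not available to PyTorch at all.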
I've been trying to get the basujindal fork to work, but it seems to be putting all the work on the CPU. I've been running the example txt2img prompt for 30 minutes now and it's still not finished. It has reserved 4 GB of GPU memory, but the GPU doesn't appear to be doing any work; only the CPU is.
I've now done everything I could to constrain the memory usage of the original SD repo. I was finally able to get it to run, and it produced green squares as output :(
What I did:
- scripts/txt2img.py, function load_model_from_config, line 63: changed model.cuda() to model.cuda().half()
- removed invisible watermarking
- reduced n_samples to 1
- reduced resolution to 256x256
- removed sfw filter
Just can't get it to work and it's not producing an error message or anything that I could debug it with.
Your model is overflowing/underflowing in half precision and generating NaNs. I got it working with the memory-optimised setup, an increased resolution (a multiple of 32: 384 x 384) and full precision, all while staying within 4 GB.
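To illustrate what overflow/underflow to NaN means here, this is a small stdlib-only sketch of half-precision rounding. It says nothing about where in Stable Diffusion the problem occurs; the helper `to_half` is hypothetical and simply round-trips a value through the IEEE 754 half format that Python's `struct` module supports.

```python
import math
import struct


def to_half(x):
    """Round a Python float to IEEE 754 half precision and back.

    struct raises OverflowError for values too large for fp16; a hardware
    conversion would produce +/-inf instead, so that is emulated here.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


big = to_half(70000.0)   # above the fp16 maximum of 65504, becomes inf
tiny = to_half(1e-8)     # below the smallest fp16 subnormal (~6e-8), becomes 0.0

nan_from_overflow = big - big     # inf - inf
nan_from_underflow = tiny * big   # 0 * inf

print(big, tiny, nan_from_overflow, nan_from_underflow)
```

Once a single NaN appears in an activation it propagates through every later operation, which is why the failure shows up as a garbage image rather than an error message.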
Which is so silly, since ML models should be the most portable thing in the world. It's just a series of math operations, not a bunch of OS/hardware-specific API calls or something like that. By now we should be at a stage where each ML model is boiled down to a simple executable with zero dependencies.
Agree 100% and I spend a fair amount of time wondering why this hasn't happened. I built piet-gpu-hal because I couldn't find any abstraction layer over compute shaders that supports precompiled shaders. A motivated person absolutely could write shaders to do all the operations needed by Stable Diffusion, and ship a binary in the megabyte range (obviously not counting the models themselves). That would support Metal, Vulkan, and D3D12. The only thing holding this back is a will to build it.
This is the part that tensorflow is really good at, while just about everything else lags behind. The tf saved model is the graph plus weights, and is super easy to just load up and run. (Also, tflite for mobile...)
But one of the tricky parts with Stable Diffusion is that people are trying to get it to run on lighter hardware, which is basically another engineering problem, where simple APIs typically won't expose the kind of internals people want to mess around with.
My laptop takes about 6 seconds per iteration so it's significantly slower, but if you're willing to wait I bet you'll have a much easier time plugging more RAM into your system than adding VRAM.
I've been running it fine on my 3060 Ti, then again I don't have any monitors connected so the full 8GB is free. Check VRAM usage, I'm guessing you don't have 8GB free, more like 5-6GB, since you have monitors connected.
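One way to check how much VRAM is actually free before launching is to ask `nvidia-smi`. A minimal sketch, assuming an NVIDIA driver with `nvidia-smi` on the PATH; the function names are made up for illustration, and with the `nounits` option the values are reported in MiB.

```python
import subprocess

# Query used and total memory per GPU as bare CSV numbers (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse lines like '2300, 8192' into (used, total, free) tuples in MiB."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total, total - used))
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return one (used, total, free) tuple per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_csv(out)


# Illustrative input only, not a real measurement:
print(parse_memory_csv("2300, 8192\n"))
```

Calling `query_gpu_memory()` with nothing else running shows how much of the card the desktop and other apps are already holding.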
Also, you could try Visions of Chaos and use the Mode > Machine Learning > Text-to-Image > Stable Diffusion. It also has tons of other AI tools e.g. image-to-text captioning, diffusion model training, mandelbrot, music, and a ton more. The dev(s) push out updates almost every day.
Warning: you will first need to go through the 12 steps of Machine Learning setup[0]. It will then download 300-400 GB of models, since it has scripts for pretty much every latent diffusion out there. Some of them, e.g. Disco Diffusion, I find still give more interesting results, and you can get much higher res on a 3060 Ti. You also have a TON more parameters to play with, and you can train your own models and load those in (which I've been doing the past few weeks using my photography to get away from using unlicensed imagery :)
Oh sorry, I guess I need to mention that you need to put the text encoder on the CPU (or precompute the text embedding somehow). (I'm using a custom codebase to make that possible; idk how trivial that is to achieve with StableDiffusionPipeline.) Only the unet and vae should be on the GPU.
For your case with 8 GB you shouldn’t need to do either of those things (run it all on the GPU); just make sure you have batch size 1 and are using the fp16 version.
On my 3070 I get that error unless I set my batch size to 1. My typical setup is to do six batches of one and it works fine (although I minimize the number of visible things on my screen while it's running). This reliably produces one image every 7-8 seconds.
For some reason, no idea why, this problem went away when I set n_samples to 1 and scale to 10.0 or less. Why these parameters would impact memory usage, I don’t know, but the image quality seems fine, afaict.
n_samples is the batching number. Total memory used scales like "Model Mem Size + n_samples * Batch Mem Size". The memory needed for a batch is smaller than the model but not trivial.
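That scaling rule is easy to play with as arithmetic. A small sketch of it follows; the function names and the example sizes are hypothetical placeholders, not measured values for Stable Diffusion.

```python
def estimated_memory_gib(model_gib, per_sample_gib, n_samples):
    """Total = model size + n_samples * memory per sample, as in the rule above."""
    return model_gib + n_samples * per_sample_gib


def max_n_samples(free_gib, model_gib, per_sample_gib):
    """Largest n_samples that fits in free_gib under the same linear rule."""
    if per_sample_gib <= 0:
        raise ValueError("per_sample_gib must be positive")
    headroom = free_gib - model_gib
    return max(0, int(headroom // per_sample_gib))


# Made-up sizes, purely to show the shape of the calculation:
print(estimated_memory_gib(model_gib=4.0, per_sample_gib=1.5, n_samples=2))
print(max_n_samples(free_gib=8.0, model_gib=4.0, per_sample_gib=1.5))
```

The per-sample term also grows with resolution, so the rule only helps when comparing runs at the same image size.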