Hacker News

Even with just 4 GB of VRAM, Stable Diffusion will run fine if you go for 448x448 instead (basically the same quality).
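
A rough way to see why a slightly smaller image helps so much. This is only a sketch, assuming the 8x VAE downsampling of Stable Diffusion v1 and assuming the largest self-attention layer attends over every latent position, so its score matrix grows with the square of the token count. The function names are made up for illustration.

```python
def latent_tokens(height: int, width: int, downsample: int = 8) -> int:
    """Number of latent positions for an image, assuming 8x downsampling."""
    return (height // downsample) * (width // downsample)


def attention_entries(height: int, width: int) -> int:
    """Entries in one tokens-by-tokens attention score matrix (per head, per sample)."""
    tokens = latent_tokens(height, width)
    return tokens * tokens


if __name__ == "__main__":
    baseline = attention_entries(512, 512)
    for side in (512, 448, 384, 256):
        ratio = attention_entries(side, side) / baseline
        print(f"{side}x{side}: {latent_tokens(side, side)} tokens, "
              f"{ratio:.2f}x the attention entries of 512x512")
```

Under these assumptions 448x448 keeps about 77% of the pixels of 512x512 but needs only roughly 59% of the attention entries.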


I feel like I'm going insane. Everyone says 512x512 should work with 8gb but when I do it I get:

    CUDA out of memory. Tried to allocate 3.00 GiB (GPU 0; 8.00 GiB total capacity; 5.62 GiB already allocated; 0 bytes free; 5.74 GiB reserved in total by PyTorch)
Any ideas? I have a 3060 Ti with 8 GB of VRAM...

with 448x448 I get:

    CUDA out of memory. Tried to allocate 902.00 MiB (GPU 0; 8.00 GiB total capacity; 6.73 GiB already allocated; 0 bytes free; 6.86 GiB reserved in total by PyTorch)
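
One way to read these messages is to pull the numbers out and see how much VRAM is neither free nor reserved by PyTorch, i.e. held by something else (other processes, the display, CUDA context overhead). A minimal sketch using only the standard library; `parse_oom` and `outside_pytorch_gib` are hypothetical helper names, and the parser assumes the exact message layout quoted above.

```python
import re

# Conversion factors to GiB for the units that appear in the message.
UNIT_TO_GIB = {"GiB": 1.0, "MiB": 1 / 1024, "KiB": 1 / 1024**2, "bytes": 1 / 1024**3}
# Order in which the sizes appear in the message quoted above.
FIELDS = ("requested", "total", "allocated", "free", "reserved")


def parse_oom(message: str) -> dict:
    """Extract the five sizes from a CUDA OOM message, converted to GiB."""
    sizes = re.findall(r"([\d.]+) (GiB|MiB|KiB|bytes)", message)
    if len(sizes) != len(FIELDS):
        raise ValueError("unexpected message layout")
    return {name: float(value) * UNIT_TO_GIB[unit]
            for name, (value, unit) in zip(FIELDS, sizes)}


def outside_pytorch_gib(message: str) -> float:
    """VRAM that is neither free nor reserved by PyTorch's allocator."""
    info = parse_oom(message)
    return info["total"] - info["free"] - info["reserved"]


if __name__ == "__main__":
    msg = ("CUDA out of memory. Tried to allocate 3.00 GiB (GPU 0; 8.00 GiB total "
           "capacity; 5.62 GiB already allocated; 0 bytes free; 5.74 GiB reserved "
           "in total by PyTorch)")
    print(f"{outside_pytorch_gib(msg):.2f} GiB is held outside PyTorch")
```

For the 512x512 message that works out to about 2.26 GiB that PyTorch never got to use, so the card is effectively smaller than 8 GiB for this process.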


Use half-precision floats and/or one of the optimized forks:

https://github.com/basujindal/stable-diffusion

https://github.com/neonsecret/stable-diffusion

Or the hlky webui, which is optimized too.

http://rentry.co/kretard


I've been trying to get the basujindal fork to work, but it seems to be putting all the work on the CPU. I've been running the example txt2img prompt for 30 minutes now and it's still not finished. It has reserved 4 GB of GPU memory, but the GPU doesn't appear to be doing any work; only the CPU is.


Use the original SD repo, but modify txt2img.py according to:

https://github.com/CompVis/stable-diffusion/issues/86#issuec...


I've now done everything I could to constrain the memory usage of the original SD repo. I was finally able to get it to run, and it produced green squares as output :(

What I did:

- in scripts/txt2img.py, function load_model_from_config (line 63), changed model.cuda() to model.cuda().half()

- removed invisible watermarking

- reduced n_samples to 1

- reduced resolution to 256x256

- removed sfw filter

Just can't get it to work and it's not producing an error message or anything that I could debug it with.


Your model is overflowing/underflowing and generating NaNs. I got it working with memory optimisations, a higher resolution (a multiple of 32, e.g. 384x384) and full precision, all while staying within 4 GB.
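
To illustrate the overflow/underflow part: a minimal sketch of how half precision turns ordinary numbers into inf, 0 and eventually NaN. It uses the standard library's `struct` half-float format instead of PyTorch, and `to_fp16` is a made-up helper that mimics rounding a value to fp16. It is not what the SD code does internally.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the fp16 range; fp16 arithmetic
        # on a GPU would produce +/-inf here instead.
        return math.copysign(math.inf, x)


if __name__ == "__main__":
    print(to_fp16(65504.0))                      # largest finite fp16 value
    print(to_fp16(70000.0))                      # too large: becomes inf
    print(to_fp16(1e-8))                         # too small: becomes 0.0
    print(to_fp16(70000.0) - to_fp16(70000.0))   # inf - inf gives nan
```

Once a single NaN shows up in an activation, it propagates through every later operation, which would fit an image that comes out as one flat colour with no error message.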


> I feel like I'm going insane.

That's the world of running machine learning models for you. Why would anything ever work the first time, right? Or at least the 10th time...


Which is so silly since ML models should be the most portable thing in the world. It's just a series of math operations, not a bunch of OS/hardware specific API calls or something like that. We should be at a stage where each ML model is boiled down to a simple executable with zero dependencies at this point.


Agree 100% and I spend a fair amount of time wondering why this hasn't happened. I built piet-gpu-hal because I couldn't find any abstraction layer over compute shaders that supports precompiled shaders. A motivated person absolutely could write shaders to do all the operations needed by Stable Diffusion, and ship a binary in the megabyte range (obviously not counting the models themselves). That would support Metal, Vulkan, and D3D12. The only thing holding this back is a will to build it.


This is the part that TensorFlow is really good at, while just about everything else lags behind. The TF SavedModel is the graph plus weights, and is super easy to just load up and run. (Also, TFLite for mobile...)

But one of the tricky parts with Stable Diffusion is that people are trying to get it to run on lighter hardware, which is basically another engineering problem where simple APIs typically won't expose the kind of internals people want to mess around with.


Others may have reduced the batch size (n_samples) to reduce the memory load. A lower batch size will significantly help with the memory consumption.

This comment: https://news.ycombinator.com/item?id=32710550 talks about running SD with 8GiB of VRAM and mentions needing to reduce this parameter to 1 to get it to output right.


This helped and I finally generated something larger than 256x256 :D thanks


If you're okay waiting a while longer and have plenty of RAM, https://github.com/bes-dev/stable_diffusion.openvino has a somewhat CPU-optimized version as well that relies on system memory rather than VRAM.

My laptop takes about 6 seconds per iteration so it's significantly slower, but if you're willing to wait I bet you'll have a much easier time plugging more RAM into your system than adding VRAM.


I've been running it fine on my 3060 Ti, then again I don't have any monitors connected so the full 8GB is free. Check VRAM usage, I'm guessing you don't have 8GB free, more like 5-6GB, since you have monitors connected.

Also, you could try Visions of Chaos and use Mode > Machine Learning > Text-to-Image > Stable Diffusion. It also has tons of other AI tools, e.g. image-to-text captioning, diffusion model training, Mandelbrot, music, and a ton more. The dev(s) push out updates almost every day.

Warning: You will first need to go through the 12 steps of Machine Learning setup[0], then it will download 300-400 GB of models, since it has scripts for pretty much every latent diffusion model out there. Some of them, e.g. Disco Diffusion, I find still give more interesting results, and you can get much higher res on a 3060 Ti. You also have a TON more parameters to play with, not to mention you can train your own models and load those in (which I've been doing the past few weeks using my photography to get away from using unlicensed imagery :)

[0] https://softology.pro/tutorials/tensorflow/tensorflow.htm


Oh sorry, I guess I need to mention that you need to put the text encoder on the CPU (or precompute the text embedding somehow). (I'm using a custom codebase to make that possible; idk how trivial that is to achieve with StableDiffusionPipeline.) Only the UNet and VAE should be on the GPU.

For your case with 8 GB you shouldn't need to do either of those things (run it all on the GPU); just make sure you have batch size 1 and are using the fp16 version.


On my 3070 I get that error unless I set my batch size to 1. My typical setup is to do six batches of one and it works fine (although I minimize the number of visible things on my screen while it's running). This reliably produces one image every 7-8 seconds.


Be aware that Python processes don't always terminate correctly when you keyboard-interrupt out while using PyTorch.

Make sure you kill all python processes before restarting or some of your VRAM will be in use.

You can check with nvidia-smi how much VRAM is currently in use, and by which processes.
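
If you'd rather script that check, here is a minimal sketch that asks nvidia-smi for the compute processes on the GPU and filters for leftover Python ones. It assumes nvidia-smi is on your PATH; the helper names are made up, and on some setups (e.g. Windows) the per-process memory column may come back as not available, which the parser maps to None.

```python
import subprocess

# Query flags for nvidia-smi's CSV output: one line per compute process.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text: str) -> list:
    """Parse 'pid, name, used_mib' lines into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        used = used.strip()
        rows.append({
            "pid": int(pid),
            "name": name.strip(),
            # Memory is reported in MiB; anything non-numeric becomes None.
            "used_mib": int(used) if used.isdigit() else None,
        })
    return rows


def python_gpu_processes() -> list:
    """Run nvidia-smi and return the compute processes that look like Python."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [row for row in parse_compute_apps(out) if "python" in row["name"].lower()]
```

Calling `python_gpu_processes()` before restarting a run should return an empty list; anything it does return is a candidate for killing by PID.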


For some reason — no idea why — this problem went away when I set n_samples to 1 and scale to 10.0 or less. Why these parameters would impact memory usage, I don’t know, but the image quality seems fine, afaict.


n_samples is the batching number. Total memory used scales like "Model Mem Size + n_samples * Batch Mem Size". The memory needed for a batch is smaller than the model but not trivial.
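
That rule of thumb as a tiny sketch. The GiB figures in the example are invented placeholders, not measurements of Stable Diffusion, and both function names are hypothetical.

```python
def estimated_vram_gib(model_gib: float, per_sample_gib: float, n_samples: int) -> float:
    """Model memory plus a per-sample cost multiplied by the batch size."""
    return model_gib + n_samples * per_sample_gib


def max_n_samples(free_gib: float, model_gib: float, per_sample_gib: float) -> int:
    """Largest batch size that fits in free_gib under the same linear model."""
    if free_gib < model_gib:
        return 0  # not even the model fits
    return int((free_gib - model_gib) // per_sample_gib)


if __name__ == "__main__":
    # Placeholder numbers: a 4 GiB model and 1.5 GiB of working memory per sample.
    for n in (1, 2, 3):
        print(n, "samples:", estimated_vram_gib(4.0, 1.5, n), "GiB")
    print("fits in 8.0 GiB free:", max_n_samples(8.0, 4.0, 1.5))
    print("fits in 5.5 GiB free:", max_n_samples(5.5, 4.0, 1.5))
```

With those placeholder numbers, a card with 5.5 GiB actually free fits only one sample at a time, which matches the experience of people who had to drop n_samples to 1.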


How much VRAM is your GPU using before you start Stable Diffusion? You can check with 'nvidia-smi' in a terminal.

The unoptimized release works on my 2070 with 8 GB of VRAM.
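
A small sketch for getting that "before you start" number from a script: it asks nvidia-smi for used and total memory per GPU and works out what is left. It assumes nvidia-smi is on your PATH, and the function names are made up.

```python
import subprocess

# Query flags for nvidia-smi's CSV output: one 'used, total' line per GPU, in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list:
    """Turn 'used, total' lines into (used_mib, total_mib, free_mib) tuples."""
    result = []
    for line in csv_text.strip().splitlines():
        used, total = (int(part) for part in line.split(","))
        result.append((used, total, total - used))
    return result


def gpu_memory() -> list:
    """Run nvidia-smi and return (used, total, free) in MiB for each GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)
```

If the first number is already a couple of thousand MiB with nothing running, that is your desktop and other apps, and it comes straight out of what Stable Diffusion can use.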



