Discussion:
[Bug-wget] Wget running out of memory
Giovanni Porta
2018-09-24 05:38:35 UTC
Hello all,

For the past week or so, I've been attempting to mirror a website with Wget. However, after a couple of days of downloading (approximately 38 GB so far), Wget eventually exhausts all system memory and swap, and the process gets killed. The server I'm using has 2 GB of RAM and 2 GB of swap.

I'm using Ubuntu 16.04, initially with Wget 1.17.1, but after reading this bug report I have also compiled and tried the newest version, 1.19.5: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=642563

My luck has not changed since updating, and I'm at a loss as to what else to try. Surely it isn't normal for memory usage to climb like this?

These are the parameters I'm using to download:
wget --load-cookies cookies.txt --warc-file="site" --mirror --convert-links --adjust-extension --page-requisites --random-wait --accept-regex ".*(ubb=cfrm)|(ubb=postlist)|(ubb=showflat)|(images)|(styles)|(ubb_js).*" --restrict-file-names=windows "https://example.com"

Here is the log from dmesg leading up to the process being killed: https://pastebin.com/gR4cGQdA

Any ideas?

Thanks.
Gio
Tim Rühsen
2018-09-24 13:31:53 UTC
Hi Giovanni,
Post by Giovanni Porta
Hello all,
For the past week or so, I've been attempting to mirror a website with Wget. However, after a couple of days of downloading (approximately 38 GB so far), Wget eventually exhausts all system memory and swap, and the process gets killed. The server I'm using has 2 GB of RAM and 2 GB of swap.
I'm using Ubuntu 16.04, initially with Wget 1.17.1, but after reading this bug report I have also compiled and tried the newest version, 1.19.5: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=642563
My luck has not changed since updating, and I'm at a loss as to what else to try. Surely it isn't normal for memory usage to climb like this?
It surely isn't "normal", but then your use case isn't exactly normal
either (*days* of downloading, 38 GB).

Recursive downloads require Wget to keep every downloaded URL in
memory, so that it does not download the same pages again and again.
Additionally, --convert-links needs extra memory to track the data it
collects while parsing each page. And if the server sends new cookies
for every visited page, those add yet more memory consumption, since
they are all kept in memory as well.

All of this adds up over time, and 2 GB simply isn't enough for your task.
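
If you want to confirm that it is Wget's own memory that keeps
growing, a rough monitoring loop along these lines should do (this is
just a sketch and assumes a single wget process on the machine; the
log file name is arbitrary):

  # Log wget's resident set size (RSS, in KiB) once a minute.
  # Adjust the -C selector if more than one wget is running.
  while sleep 60; do
      date
      ps -o pid,rss,etime,args -C wget
  done >> wget-memory.log

The RSS column should show the same steady climb you observed before
the OOM killer steps in.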
Post by Giovanni Porta
Any ideas?
- get more RAM
- split that one huge download into several smaller ones (sketch below)
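
For the second point, here is a rough, untested sketch of how the
split could look, reusing your exact options (the site-partN WARC
names are just placeholders). Each run keeps the forum-index patterns
(cfrm, postlist) so that links to the heavier content can still be
discovered, and adds one content pattern at a time; the pattern list
is simply taken from your --accept-regex:

  i=0
  for extra in 'ubb=showflat' 'images' 'styles' 'ubb_js'; do
      i=$((i+1))
      # --mirror implies timestamping, so files already fetched in an
      # earlier run are not downloaded again.
      wget --load-cookies cookies.txt --warc-file="site-part$i" \
           --mirror --convert-links --adjust-extension --page-requisites \
           --random-wait --restrict-file-names=windows \
           --accept-regex ".*(ubb=cfrm)|(ubb=postlist)|($extra).*" \
           "https://example.com"
  done

Each run then has far fewer URLs to remember, which should keep the
in-memory bookkeeping smaller. The trade-offs: --convert-links only
sees what its own run downloaded, so some links between the parts may
not be rewritten, and you end up with one WARC per part instead of a
single archive.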

That's all I can come up with :-)

Regards, Tim
