Discussion:
[Bug-wget] wget in a 'dynamic' pipe
Paul Wagner
2018-07-19 15:24:06 UTC
Dear wgetters,

apologies if this has been asked before.

I'm using wget to download DASH media files, i.e. a number of URLs in
the form domain.com/path/segment_1.mp4, domain.com/path/segment_2.mp4,
..., which represent chunks of audio or video, and which are to be
combined to form the whole programme. I used to call individual
instances of wget for each chunk and combine them, which was dead slow.
Now I tried

{ i=1; while [[ $i != 100 ]]; do echo "http://domain.com/path/segment_$((i++)).mp4"; done; } \
  | wget -O foo.mp4 -i -

which works like a charm *as long as the 'generator process' is finite*,
i.e. the loop is actually bounded as in the example. The problem is
that it would be much easier if I could let the loop run forever, let
wget fetch whatever is there, and have it fail once the counter reaches
a segment number that is no longer available, which would in turn
terminate the whole pipe. It turns out that

{ i=1; while true; do echo "http://domain.com/path/segment_$((i++)).mp4"; done; } \
  | wget -O foo.mp4 -i -

hangs in the sense that the first process loops forever while wget
doesn't even start retrieving. Am I right in assuming that wget
waits until the file specified by -i has been fully written? Is
there any way to change this behaviour?

Any help appreciated. (I'm using wget 1.19.1 under Cygwin.)

Kind regards,

Paul
Tim Rühsen
2018-07-19 15:35:29 UTC
Post by Paul Wagner
Dear wgetters,
apologies if this has been asked before.
I'm using wget to download DASH media files, i.e. a number of URLs in
the form domain.com/path/segment_1.mp4, domain.com/path/segment_2.mp4,
..., which represent chunks of audio or video, and which are to be
combined to form the whole programme. I used to call individual
instances of wget for each chunk and combine them, which was dead slow.
Now I tried
  { i=1; while [[ $i != 100 ]]; do echo "http://domain.com/path/segment_$((i++)).mp4"; done; } \
    | wget -O foo.mp4 -i -
which works like a charm *as long as the 'generator process' is finite*,
i.e. the loop is actually bounded as in the example. The problem is
that it would be much easier if I could let the loop run forever, let
wget fetch whatever is there, and have it fail once the counter reaches
a segment number that is no longer available, which would in turn
terminate the whole pipe. It turns out that
  { i=1; while true; do echo "http://domain.com/path/segment_$((i++)).mp4"; done; } \
    | wget -O foo.mp4 -i -
hangs in the sense that the first process loops forever while wget
doesn't even start retrieving. Am I right in assuming that wget
waits until the file specified by -i has been fully written? Is
there any way to change this behaviour?
Any help appreciated. (I'm using wget 1.19.1 under Cygwin.)
Hi Paul,

Wget2 behaves the way you need here: you can feed it URLs from an endless
loop without it hanging.
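
For example, something along these lines should do it (untested sketch
using the placeholder host and path from your example, and assuming
wget2 accepts -O and -i - the same way wget does):

  { i=1; while true; do echo "http://domain.com/path/segment_$((i++)).mp4"; done; } \
      | wget2 -O foo.mp4 -i -

wget2 starts retrieving as the URLs arrive on stdin instead of waiting
for end-of-file.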

It should build under Cygwin without problems, though my last test was a
while ago.

See https://gitlab.com/gnuwget/wget2

Latest tarball is
https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz

or latest git
git clone https://gitlab.com/gnuwget/wget2.git
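
Building from the tarball should be the usual autotools procedure (rough
sketch only, not verified recently; a git checkout needs additional
preparation steps, see the repository's README):

  tar xf wget2-1.99.1.tar.gz
  cd wget2-1.99.1
  ./configure --prefix="$HOME/wget2"   # local prefix, no admin rights needed
  make
  make install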


Regards, Tim
Dale R. Worley
2018-09-11 03:34:07 UTC
Post by Paul Wagner
Now I tried
{ i=1; while [[ $i != 100 ]]; do echo
"http://domain.com/path/segment_$((i++)).mp4"; done } | wget -O foo.mp4
-i -
which works like a charm *as long as the 'generator process' is finite*,
i.e. the loop is actually programmed as in the example. The problem is
that it would be much easier if I could let the loop run forever, let
wget get whatever is there and then fail after the counter extends to a
segment number not available anymore, which would in turn fail the whole
pipe.
Good God, this finally motivates me to learn about Bash coprocesses.

I think the answer is something like this:

coproc wget -O foo.mp4 -i -

i=1
while true
do
    rm -f foo.mp4
    echo "http://domain.com/path/segment_$((i++)).mp4" >&$wget[1]
    sleep 5
    # The only way to test for non-existence of the URL is whether the
    # output file exists.
    [[ ! -e foo.mp4 ]] && break
    # Do whatever you already do to wait for foo.mp4 to be completed and
    # then use it.
done

# Close wget's input.
exec $wget[1]<&-
# Wait for it to finish.
wait $wget_pid

Dale
Tim Rühsen
2018-09-11 07:43:12 UTC
Post by Dale R. Worley
Post by Paul Wagner
Now I tried
{ i=1; while [[ $i != 100 ]]; do echo
"http://domain.com/path/segment_$((i++)).mp4"; done } | wget -O foo.mp4
-i -
which works like a charm *as long as the 'generator process' is finite*,
i.e. the loop is actually programmed as in the example. The problem is
that it would be much easier if I could let the loop run forever, let
wget get whatever is there and then fail after the counter extends to a
segment number not available anymore, which would in turn fail the whole
pipe.
Good God, this finally motivates me to learn about Bash coprocesses.
coproc wget -O foo.mp4 -i -
i=1
while true
do
    rm -f foo.mp4
    echo "http://domain.com/path/segment_$((i++)).mp4" >&$wget[1]
    sleep 5
    # The only way to test for non-existence of the URL is whether the
    # output file exists.
    [[ ! -e foo.mp4 ]] && break
    # Do whatever you already do to wait for foo.mp4 to be completed and
    # then use it.
done
# Close wget's input.
exec $wget[1]<&-
# Wait for it to finish.
wait $wget_pid
Dale
Thanks for the pointer to coproc, never heard of it ;-) (That means I
never had a problem that needed coproc).

Anyways, copy&pasting the script results in a file '[1]' with bash 4.4.23.

Also, wget -i - waits to start downloading until stdin has been closed.
How can you circumvent that?

Regards, Tim
Dale R. Worley
2018-09-12 01:51:35 UTC
Post by Tim Rühsen
Thanks for the pointer to coproc, never heard of it ;-) (That means I
never had a problem that needed coproc).
Anyways, copy&pasting the script results in a file '[1]' with bash 4.4.23.
Yeah, I'm not surprised there are bugs in it.
Post by Tim Rühsen
Also, wget -i - waits to start downloading until stdin has been closed.
How can you circumvent that?
The more I think about the original problem, the more puzzled I am. The
OP said that starting wget for each URL took a long time, but my
experience is that starting processes is quite quick. (I once modified
tar to compress each file individually with gzip before writing it to an
Exabyte tape. Even on a processor much slower than modern ones, the
writing was not delayed by starting a process for each file written.)

I suspect the delay is not starting wget but establishing the initial
HTTP connection to the server.

Probably a better approach to the problem is to download the files in
batches of N consecutive URLs, where N is large enough that the HTTP
startup time is well below the total download time. Process each
batch with a separate invocation of wget, and exit the loop when an
attempted batch doesn't create any new downloaded files (or the last
file in the batch doesn't exist), indicating there are no more files to
download. A rough sketch follows.
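
Something like this, as an untested sketch (N and the host/path are
placeholders; without -O, wget writes each segment to its own
segment_<i>.mp4, which can then be concatenated in numeric order as
before):

  N=20
  start=1
  while true
  do
      # Generate one batch of N consecutive URLs and fetch them with a
      # single wget invocation; each segment lands in segment_<i>.mp4.
      for ((i = start; i < start + N; i++))
      do
          echo "http://domain.com/path/segment_${i}.mp4"
      done | wget -i -
      # If the last file of the batch was not created, we ran past the
      # final segment, so stop.
      [[ -e "segment_$((start + N - 1)).mp4" ]] || break
      start=$((start + N))
  done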

Dale
Paul Wagner
2018-09-12 05:25:42 UTC
Dear all,
Post by Dale R. Worley
Post by Tim Rühsen
Thanks for the pointer to coproc, never heard of it ;-) (That means I
never had a problem that needed coproc).
Anyways, copy&pasting the script results in a file '[1]' with bash 4.4.23.
Yeah, I'm not surprised there are bugs in it.
Post by Tim Rühsen
Also, wget -i - waits to start downloading until stdin has been closed.
How can you circumvent that?
The more I think about the original problem, the more puzzled I am. The
OP said that starting wget for each URL took a long time, but my
experience is that starting processes is quite quick. (I once modified
tar to compress each file individually with gzip before writing it to an
Exabyte tape. Even on a processor much slower than modern ones, the
writing was not delayed by starting a process for each file written.)
I suspect the delay is not starting wget but establishing the initial
HTTP connection to the server.
That's what the OP thinks, too. I attributed the slow startup to DNS
resolution.
Post by Dale R. Worley
Probably a better approach to the problem is to download the files in
batches of N consecutive URLs, where N is large enough that the HTTP
startup time is well below the total download time. Process each
batch with a separate invocation of wget, and exit the loop when an
attempted batch doesn't create any new downloaded files (or the last
file in the batch doesn't exist), indicating there are no more files to
download.
Neat idea. In the end, I solved it by estimating the number of chunks
from the total running time and the duration of each chunk. But thanks
for giving it a thought!
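
For the record, it looked roughly like this, with made-up numbers; the
host and path are the placeholders from my first mail:

  total=2700   # programme length in seconds (made-up number)
  chunk=6      # duration of one segment in seconds (made-up number)
  n=$(( (total + chunk - 1) / chunk ))   # round up: number of segments
  for ((i = 1; i <= n; i++))
  do
      echo "http://domain.com/path/segment_${i}.mp4"
  done | wget -O foo.mp4 -i -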

Regards,

Paul
Dale R. Worley
2018-09-13 03:21:06 UTC
Post by Paul Wagner
That's what the OP thinks, too. I attributed the slow startup to DNS
resolution.
Depending on your circumstances, one way to fix that is to set up a local
caching-only DNS server and direct ordinary processes to use it. Then
the first lookup is expensive, but the caching server saves the
resolution and answers later queries very quickly.
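
For instance, on a Debian-style Linux box a minimal caching-only dnsmasq
setup looks roughly like this (sketch only; package names and config
paths vary, and under Cygwin you would more likely point at a caching
resolver running elsewhere on your network):

  # install dnsmasq and have it answer and cache DNS on the loopback address
  sudo apt-get install dnsmasq
  printf 'listen-address=127.0.0.1\ncache-size=1000\n' \
      | sudo tee /etc/dnsmasq.d/local-cache.conf
  sudo systemctl restart dnsmasq
  # point the system resolver at the local cache (resolv.conf may be
  # managed by another tool on your system; adjust accordingly)
  echo 'nameserver 127.0.0.1' | sudo tee /etc/resolv.conf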

Dale
