Discussion:
wget through proxy is slow (Connection:Keep-Alive ignored?)
Yazeed Hamid
2008-01-07 16:36:54 UTC
Hi to all.
Thank you very much for your efforts and a happy new year to all.

I have been using wget in a bash script to measure website response times
over different proxy configurations.

/usr/bin/time -f "%e Seconds" /usr/bin/wget -pEkq \
    --delete-after --proxy=$switch "${URL[$index]}"

I'm using wget version 1.10.2 on Cygwin running on Windows Vista
(v6.0.6000 Build 6000).
When there is a proxy to go through, the corresponding proxy address:port is
exported to the environment (export http_proxy=$proxy).
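
A minimal sketch of the measurement loop described above, for readers who want to reproduce it. The function names, the WGET override, and the example addresses are hypothetical and not taken from the original script. It uses date +%s, so it only resolves whole seconds; /usr/bin/time -f "%e" gives finer resolution where it is available.

```shell
# Time one page fetch. $1 is the proxy switch ("on" or "off"), $2 is the URL.
# WGET may be set to another command for a dry run; it defaults to wget.
time_fetch() {
    switch=$1
    url=$2
    start=$(date +%s)
    "${WGET:-wget}" -pEkq --delete-after --proxy="$switch" "$url"
    end=$(date +%s)
    echo "$switch $url $((end - start)) seconds"
}

# Fetch every URL once without and once with the proxy.
# $1 is the proxy address (exported as http_proxy), the rest are URLs.
measure_all() {
    http_proxy=$1
    export http_proxy
    shift
    for url in "$@"; do
        for switch in off on; do
            time_fetch "$switch" "$url"
        done
    done
}

# Example call (placeholder addresses):
#   measure_all http://proxy.example.com:3128 http://www.example.com/
```

Each output line holds the switch, the URL, and the elapsed seconds, so the two configurations can be compared side by side.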

The problem is, when wget is working through a proxy, it doesn't seem to
reuse the existing HTTP connection like it does when --proxy=off is set.
When there is no proxy, all objects referenced in an HTML file are fetched
over the same connection. On the other hand, when I am going through a
proxy, each object is fetched over a new connection, although the
Connection: Keep-Alive header is present in both the HTTP request and
response messages.

As a result, the measured response time through a proxy is much greater
than that over a direct connection (no proxy).

$ /usr/bin/time -f "%e Seconds" wget -pdEk --delete-after --proxy=off -o
log-no-proxy-w-debug www.mcafee.com
3.56 Seconds

$ /usr/bin/time -f "%e Seconds" wget -pdEk --delete-after --proxy=on -o
log-thru-proxy-w-debug www.mcafee.com
20.42 Seconds
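
One way to check the connection reuse from the two debug logs is to count how often wget opened a connection versus reused one. This is only a sketch: the function name is made up, and it assumes the logs contain lines with "Connecting to" and "Reusing existing connection to", which may differ between wget versions.

```shell
# Count new and reused connections in a wget log file ($1).
# The two message strings are assumptions about the log wording.
count_connections() {
    log=$1
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    new=$(grep -c 'Connecting to ' "$log" || true)
    reused=$(grep -c 'Reusing existing connection to ' "$log" || true)
    echo "$log: new=$new reused=$reused"
}

# Example calls with the log names used above:
#   count_connections log-no-proxy-w-debug
#   count_connections log-thru-proxy-w-debug
```

If the proxy run shows roughly one new connection per fetched object and few or no reuses, that would match the behaviour described.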

How can I make wget honor the Connection: Keep-Alive header when going
through a proxy, and thus reuse the existing connection like it does when
connecting directly to the web server?
In an earlier post, I read something about modifying the wget source. Is
this the only solution?

Kindly see attached debug files. Thank you very much for all your efforts
and all the great work.
Micah Cowan
2008-01-07 18:22:51 UTC
Post by Yazeed Hamid
How can I make wget make sense of the Connection: Keep-Alive message
when going through
a proxy and thus reuse the existing connection like it does when
directly connecting to the web server?
In an early post, I read something about modifying the source of wget,
is this the only solution?
Probably.

It does look like you're right that Wget is dropping the connection;
it'd be more certain if I could see tcpdump output indicating which side
closed the connection first; but IIRC the debug messages are different
when the remote side closes the connection first (I haven't time to
check the source to see for myself just now).
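
A sketch of how such a capture could be narrowed down to the packets that end a connection, on a machine where tcpdump is available. The helper name and the example proxy address and port are hypothetical. The filter keeps only FIN and RST segments exchanged with the proxy, so the source address of the first one for each connection shows which side closed it.

```shell
# Build a tcpdump filter for FIN/RST segments to or from the proxy.
# $1 is the proxy host, $2 is the proxy port.
fin_rst_filter() {
    printf 'host %s and tcp port %s and (tcp[tcpflags] & (tcp-fin|tcp-rst) != 0)' "$1" "$2"
}

# Example call (usually needs root; host and port are placeholders):
#   tcpdump -n "$(fin_rst_filter proxy.example.com 3128)"
```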

I'll file an issue for this, but it's not likely to be addressed anytime
soon.

You mentioned another message that talked about modifying the source; do
you have a reference to it? If someone has already found an appropriate
patch for this, it can be in much quicker than if you wait for me to get
around to it. ;)

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
L Walsh
2008-01-07 21:14:50 UTC
Post by Yazeed Hamid
I'm using wget version 1.10.2 on cygwin running on Windows Vista (v
6.0.6000 Build 6000).
When there is a proxy to go through, the corresponding proxy
address:port is exported to the environment (export http_proxy=$proxy)
The problem is, when wget is working through a proxy, it doesn't seem to
reuse the existing
http connection like it does when --proxy=off is set. When there is no
proxy, all objects
referenced in an html file are fetched over the same connection. On the
other hand, when I am
going through a proxy, each object is fetched in a new connection
although the Connection: Keep-Alive header
is both in the http request and response messages.
As a result, the measured response time through a proxy is very
much greater than that through direct connection <no proxy>.
----
I just tried both of your examples to mcafee.com (taking 3.56 and 20.42
seconds). My proxy server is on a linux machine, so I had to run my tests
Post by Yazeed Hamid
time wget -pdEk --delete-after --proxy=off -o log-no-proxy-w-debug www.mcafee.com
Setting --html-extension (htmlextension) to 1;
Setting --convert-links (convertlinks) to 1
Setting --delete-after (deleteafter) to 1
Setting --proxy (useproxy) to off
Setting --output-file (logfile) to log-no-proxy-w-debug
2.62sec 0.01usr 0.02sys (1.44% cpu)
Post by Yazeed Hamid
time wget -pdEk --delete-after --proxy=on -o log-no-proxy-w-debug www.mcafee.com
Setting --html-extension (htmlextension) to 1
Setting --convert-links (convertlinks) to 1
Setting --delete-after (deleteafter) to 1
Setting --proxy (useproxy) to on
Setting --output-file (logfile) to log-no-proxy-w-debug
2.56sec 0.02usr 0.03sys (2.14% cpu)

You are running on Windows. MS networking isn't known for its
speed -- especially on open/close operations, but I wouldn't think it
would be that bad. They deliberately put in slowdowns on non-server
editions of Windows starting in XP-SP2 on opening some types
of network connections -- that could be part of the cause -- but
again, a 17 second delay seems unreasonable.

It also depends on what your proxy server does. I know
squid has parameters (persistent_request_timeout, client_lifetime,
pconn_timeout) to set the timeout for re-usable connections.
While the defaults in a standard 'squid' setup are reasonable,
you didn't specify what proxy you were using, nor do we know how
it is configured.
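
If the proxy does turn out to be squid, a quick way to see whether any of those three directives have been changed is to search its configuration file for them. This is a sketch only: the function name is made up and the path in the example is an assumption that varies between installations. Directives that are absent from the file fall back to squid's built-in defaults.

```shell
# Print the active (uncommented) persistent-connection timeout settings
# from a squid configuration file ($1).
show_pconn_settings() {
    grep -E '^[[:space:]]*(persistent_request_timeout|client_lifetime|pconn_timeout)[[:space:]]' "$1"
}

# Example call (the path is an assumption):
#   show_pconn_settings /etc/squid/squid.conf
```

No output means none of the three is set explicitly.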

For what it's worth -- I tried the wget statement on
my Windows XP box. I only tested the 'with-proxy' case, since
my Windows box isn't on the external net (it has to go through
the Linux proxy). It came out with times similar to those
run on the proxy machine: 2.68sec 0.04usr 0.10sys (5.72% cpu)

I'd check the proxy. My Linux wget is 1.10.1, and my Windows
wget (under Cygwin) is 1.10.2.

Good luck.
Yazeed Hamid
2008-01-10 23:51:48 UTC
Thank you for your concern. I have attempted the same test on a Linux box
(Red Hat Enterprise Linux 5) on the same network, and the result was
similar (even a little slower). I must check our proxy configuration.
Thank you once again.
