Discussion:
[Bug-wget] Planning update to wget. Should I upstream it?
Richard Thomas
2018-08-22 18:21:36 UTC
Hi, hope this is the correct way to do this.

I want to be able to download a web page and all its prerequisites
and turn it into a single multipart/related file. Now, this requires
identifying and changing URLs which, as most members of this list are
no doubt aware, is a thorny problem. Fortunately, wget already does
this as part of its -p and -k options. Unfortunately, amazingly
useful as that is, its output is difficult to use for what I want.

So I am planning to add this functionality directly to wget. Either
I'll rewrite the links and filenames so that it's easy to piece
together a multipart/related file from the output, or I'll have wget
generate the multipart/related file itself (probably the latter, or
maybe both).
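
To make the idea concrete, here is a rough sketch of the kind of
bundling I have in mind, done outside wget against a finished -p
download tree. It uses Python's standard email package; the paths
are made up, and the Content-Location handling glosses over exactly
the URL-mapping problem mentioned above:

#!/usr/bin/env python3
"""Sketch: bundle a finished `wget -p` tree into one multipart/related file."""
import mimetypes
import pathlib
from email.message import EmailMessage

def bundle(root_html, download_dir, out_path):
    msg = EmailMessage()
    # The root HTML document becomes the root part of the bundle.
    msg.set_content(pathlib.Path(root_html).read_text(errors="replace"),
                    subtype="html")
    msg.make_related()
    # Each prerequisite (image, stylesheet, ...) becomes a further part.
    for path in pathlib.Path(download_dir).rglob("*"):
        if not path.is_file() or path == pathlib.Path(root_html):
            continue
        ctype = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        maintype, subtype = ctype.split("/", 1)
        msg.add_related(path.read_bytes(), maintype=maintype,
                        subtype=subtype, filename=path.name)
        # A real implementation would set this to the original URL,
        # which is exactly the rewriting problem described above.
        msg.get_payload()[-1]["Content-Location"] = str(path)
    pathlib.Path(out_path).write_bytes(bytes(msg))

if __name__ == "__main__":
    # Hypothetical paths from a "wget -p http://example.com/" run.
    bundle("example.com/index.html", "example.com", "page.mht")

Doing the same inside wget would amount to emitting each downloaded
file as a part like this instead of (or as well as) writing it to
disk.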

I was just wondering whether I should try to feed this back into the
project, if there's any interest. Also, any suggestions on ways I can
make this as useful as possible are welcome.

Rich
Tim Rühsen
2018-08-23 07:56:13 UTC
Hi Richard,
Post by Richard Thomas
[...]
I was just wondering whether I should try to feed this back into the
project, if there's any interest. Also, any suggestions on ways I can
make this as useful as possible are welcome.
Feedback is what the project lives on :-)

Your goal sounds interesting; what do you need it for?

We are currently developing wget2 and have decided that we will
maintain wget 1.x, but new development should go into wget2 only.

Please see https://gitlab.com/gnuwget/wget2 for further information.

To jump in quickly, examine src/wget.c, function _convert_links().
You could copy and paste it and adapt it to your needs. Then add a
new option (see src/options.c) and call the new function instead of
_convert_links().

To contribute non-trivial work, you have to assign the copyright of your
code to the FSF. Here is the standard intro/howto :-)

---------

We at GNU try to enforce software freedom through a copyleft
license (the GPL)[0]. However, to enforce said license, someone needs
to take proactive action when violations are found. Hence, we assign
the copyright of the code to the FSF, allowing them to act against
anyone who violates the license of the code you have written.

We, the maintainers of GNU Wget, hereby request that you assign the
copyright of the contributions you have previously made, and of any
future contributions, to the FSF.

Should you have any questions, please feel free to reply to this
mail. We will be glad to answer them and help you out.

When you are ready to sign the copyright assignment documents, kindly
copy the text after the marker in this email, fill it out and send it
to ***@gnu.org

[0]: https://www.gnu.org/licenses/why-assign.en.html

Please email the following information to ***@gnu.org, and we
will send you the assignment form for your past and future changes.

Please use your full legal name (in ASCII characters) as the subject
line of the message.
----------------------------------------------------------------------
REQUEST: SEND FORM FOR PAST AND FUTURE CHANGES

[What is the name of the program or package you're contributing to?]


[Did you copy any files or text written by someone else in these changes?
Even if that material is free software, we need to know about it.]


[Do you have an employer who might have a basis to claim to own
your changes? Do you attend a school which might make such a claim?]


[For the copyright registration, what country are you a citizen of?]


[What year were you born?]


[Please write your email address here.]


[Please write your postal address here.]





[Which files have you changed so far, and which new files have you written
so far?]

With best regards, Tim
Richard Thomas
2018-08-23 21:08:58 UTC
Post by Tim Rühsen
Feedback is what the project lives on :-)
Your goal sounds interesting; what do you need it for?
Well, it's fairly trivial and there might be a better way, but...

What I am looking to do is retrieve and store pages from eBay.
Several times a month, I buy electronic components and modules from
eBay. Often, the specs and instructions for these items are on the
page itself. Also, when the items arrive they are often labelled
cryptically, and if I haven't been diligent about sorting them, a
trip to the page is often the best way to identify what exactly I
have found in one of my boxes of wonders.

Now, eBay makes it hard to get to these pages after a few months
(though they are still there), and after a number of years they
become completely inaccessible. So what I have been doing, towards
the end of the year, is grabbing all the item numbers from pages
which are about to expire and pushing them through wget. Thus I get
a mirrored version of all my purchased items. Unfortunately, this
process is far from perfect. Some items still disappear, and
sometimes it seems that the vendor has repurposed the item number
and a different item is on the page.

So my goal is to have procmail trigger a process when I receive an
order confirmation from eBay: go and retrieve the relevant page, then
send it to myself as an email, so that I have a permanent record of
the complete info on that page from close to the time I placed the
order.
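
Concretely, the triggered step would look something like this (again
just a sketch; the item URL pattern, the addresses and the local mail
setup are all placeholders):

#!/usr/bin/env python3
"""Sketch: fetch an eBay item page and mail it to myself."""
import smtplib
import subprocess
from email.message import EmailMessage

def archive_item(item_number):
    url = "https://www.ebay.com/itm/" + item_number  # placeholder pattern
    # -p fetches the page prerequisites, -k rewrites links to the
    # local copies, -P sets the download directory.
    subprocess.run(["wget", "-p", "-k", "-P", "/tmp/ebay-archive", url],
                   check=True)
    msg = EmailMessage()
    msg["Subject"] = "eBay item " + item_number
    msg["From"] = "me@example.org"  # placeholder addresses
    msg["To"] = "me@example.org"
    msg.set_content("Archived copy of " + url)
    # Here the download tree would be bundled into a single
    # multipart/related message, which is the part I want wget
    # itself to handle.
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)

procmail would just extract the item number from the confirmation
mail and call this script.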

I imagine that it might be useful for other purposes too. And I
recall hearing that Stallman would read web pages by emailing them to
himself. Presumably that was text-only, though.

Rich
