Discussion:
[Bug-wget] [bug #53322] Add option to let page-requisites bypass no-parent
David
2018-03-11 11:43:27 UTC
URL:
<http://savannah.gnu.org/bugs/?53322>

Summary: Add option to let page-requisites bypass no-parent
Project: GNU Wget
Submitted by: mcdado
Submitted on: Sun 11 Mar 2018 11:43:25 AM UTC
Category: Feature Request
Severity: 3 - Normal
Priority: 5 - Normal
Status: None
Privacy: Public
Assigned to: None
Originator Name:
Originator Email:
Open/Closed: Open
Discussion Lock: Any
Release: None
Operating System: None
Reproducibility: Every Time
Fixed Release: None
Planned Release: None
Regression: None
Work Required: None
Patch Included: None

_______________________________________________________

Details:

When using `--no-parent` and `--page-requisites`, if the page requires images
or other requisites from the same domain but higher in the hierarchy, wget will
not download those requisites because of the `no-parent` option. Since
`no-parent` is useful for web pages (without it you download a whole site,
especially with `--mirror`), it would be good to have an extra option that
allows downloading page requisites that are higher in the hierarchy.
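The requested rule could be sketched roughly as follows. This is a minimal illustration in Python, not wget code; `should_download`, `allow_requisite_parents`, and the example URLs are all hypothetical names invented for this sketch:

```python
from urllib.parse import urlparse

def should_download(url, start_url, is_requisite, allow_requisite_parents=True):
    """Accept a URL under --no-parent rules, but optionally let page
    requisites from the same host bypass the parent restriction."""
    start = urlparse(start_url)
    target = urlparse(url)
    if target.netloc != start.netloc:
        return False  # other hosts are governed by separate options (-H, -D)
    start_dir = start.path.rsplit("/", 1)[0] + "/"
    if target.path.startswith(start_dir):
        return True  # inside the normal --no-parent scope
    # Higher in the hierarchy: only allowed for page requisites.
    return is_requisite and allow_requisite_parents

# A stylesheet above the start directory is fetched only as a requisite:
should_download("http://example.com/css/site.css",
                "http://example.com/book/index.html", is_requisite=True)   # True
should_download("http://example.com/other/page.html",
                "http://example.com/book/index.html", is_requisite=False)  # False
```

The point of the sketch is that the parent check is relaxed only for requisites, so ordinary links above the start directory are still rejected.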

_______________________________________________________

Reply to this item at:

<http://savannah.gnu.org/bugs/?53322>

_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/
Tim Ruehsen
2018-04-04 09:47:12 UTC
Follow-up Comment #1, bug #53322 (project wget):

Why don't you just leave out --no-parent then?

--page-requisites alone should do what you want.

_______________________________________________________

David
2018-04-04 10:54:35 UTC
Follow-up Comment #2, bug #53322 (project wget):

Well, I don't want to download a whole site!

Example:
wget --recursive --page-requisites http://www.oreilly.com/openbook/osfreesoft/book/

It will start downloading everything that is linked on the page, not just the
current directory.

_______________________________________________________

Tim Ruehsen
2018-04-04 12:20:40 UTC
Follow-up Comment #3, bug #53322 (project wget):

Then you won't need --recursive.
And looking at that page: the links all go to different domains, so you will
need -H as well.

wget -H --page-requisites http://www.oreilly.com/openbook/osfreesoft/book/

does the job for me:

FINISHED --2018-04-04 14:17:29--
Total wall clock time: 15s
Downloaded: 41 files, 569K in 0.6s (888 KB/s)



_______________________________________________________

David
2018-04-04 23:01:01 UTC
Follow-up Comment #4, bug #53322 (project wget):

Thanks for your answer, but it's missing the point of the original question.

I basically want to download everything under the current folder (recursive,
no parent), but I also want to download CSS/JS files that are on the same
domain but higher in the hierarchy (page requisites, recursive).

Do you understand what I mean? For example, in the previous example I'd like
to download the CSS/JS and the linked PDF files.

_______________________________________________________

Tim Ruehsen
2018-04-05 07:54:44 UTC
Follow-up Comment #5, bug #53322 (project wget):

Maybe I was a bit unclear, sorry for that.

Your document (http://www.oreilly.com/openbook/osfreesoft/book/) contains no
reference to any other resource on the same domain (www.oreilly.com).

So to download anything with --page-requisites, you'll need -H.

Without it, you just get your index.html and that's it. You didn't request
anything else.

So what *exactly* do you want? Please make a list of the documents/resources
you would like to see in the end. That also gives us a potential test case if
it comes to implementing a new feature.


_______________________________________________________

Tim Ruehsen
2018-04-05 07:58:13 UTC
Follow-up Comment #6, bug #53322 (project wget):

Little correction: "...is no reference..." should read "...is no page
requisite...".


_______________________________________________________

David
2018-04-07 23:41:09 UTC
Follow-up Comment #7, bug #53322 (project wget):

Okay, I've got an example:

wget --recursive --page-requisites --convert-links --adjust-extension https://addyosmani.com/resources/essentialjsdesignpatterns/book/

I would like to download everything under this path:
"/resources/essentialjsdesignpatterns/book/". In this example index.html is
the only page, but if there were a second HTML page linked from the index, it
would be downloaded too.

Plus any page requisites from other paths, like "../../../cdn-cgi/", but not
other pages, like HTML documents in "/blog/".
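Applied to this concrete example, the requested accept/reject rule might look like the following sketch. The `accept` helper and the `second.html` page are hypothetical illustrations, not wget code; the cdn-cgi path is the one from the site above:

```python
from urllib.parse import urljoin, urlparse

START = "https://addyosmani.com/resources/essentialjsdesignpatterns/book/"
START_DIR = urlparse(START).path  # "/resources/essentialjsdesignpatterns/book/"

def accept(link, is_requisite):
    """Resolve a link against START and apply the requested rule."""
    p = urlparse(urljoin(START, link))
    if p.netloc != urlparse(START).netloc:
        return False           # other hosts are out of scope here
    if p.path.startswith(START_DIR):
        return True            # any page or requisite under book/
    return is_requisite        # above book/: requisites only, no pages

# A second HTML page under book/ would be crawled:
accept("second.html", is_requisite=False)                          # True
# The CDN script above book/ is fetched because it is a requisite:
accept("../../../cdn-cgi/scripts/d07b1474/cloudflare-static/"
       "email-decode.min.js", is_requisite=True)                   # True
# An ordinary page elsewhere on the site is skipped:
accept("/blog/post.html", is_requisite=False)                      # False
```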

_______________________________________________________

Tim Ruehsen
2018-04-08 10:33:28 UTC
Follow-up Comment #8, bug #53322 (project wget):

Just remove --recursive and you get what you want.
Keep in mind that wget doesn't run JavaScript, so dynamically created
requisites cannot be downloaded.

$ tree addyosmani.com/
addyosmani.com/
├── cdn-cgi
│   └── scripts
│       └── d07b1474
│           └── cloudflare-static
│               └── email-decode.min.js
└── resources
    └── essentialjsdesignpatterns
        ├── book
        │   ├── images
        │   │   ├── base.png
        │   │   └── ns1.png
        │   ├── index.html
        │   ├── scripts
        │   │   └── vendor.js
        │   └── styles
        │       └── vendor.css
        └── cover
            └── cover.jpg

11 directories, 7 files


_______________________________________________________

David
2018-04-15 09:40:44 UTC
Follow-up Comment #9, bug #53322 (project wget):

I must have missed something, because using `--recursive --page-requisites
--no-parent` does indeed do what I want. Strange, I could swear it behaved
differently before.

_______________________________________________________

Darshit Shah
2018-11-13 00:10:28 UTC
Update of bug #53322 (project wget):

Status: None => Invalid
Open/Closed: Open => Closed


_______________________________________________________
