wget file-writing error with Japanese characters
Jamie Zawinski
2006-07-11 20:29:24 UTC
wget 1.10.2
MacOS 10.4.7 Intel

I'm trying to download a file whose URL contains Japanese characters.

If I specify -O, it is able to download the data; but if wget is
picking the file name itself, it is unable to write the file
("invalid argument"). Neither --restrict-file-names=unix nor
--restrict-file-names=windows affects it.

I guess wget and the OS disagree about what characters can go in file
names?

I also tried setting $LANG and $LOCALE to "C" to no effect.

This is a default HFS+ file system, running in the American English
locale.

% wget -d 'http://somehost/~somewhere/music/Dir%20en%20grey/%e9%ac%bc%e8%91%ac/01%20%e9%ac%bc%e7%9c%bc%20-kigan-.m4a'
DEBUG output created by Wget 1.10.2 on darwin8.6.1.

--13:20:52-- http://somehost/~somewhere/music/Dir%20en%20grey/%e9%ac%bc%e8%91%ac/01%20%e9%ac%bc%e7%9c%bc%20-kigan-.m4a
=> `01 鬼ç%9C¼ -kigan-.m4a'
Resolving [...]
Caching [...]
Connecting to [...]|:80... connected.
Created socket 4.
Releasing 0x00507520 (new refcount 1).

---request begin---
GET /~somewhere/music/Dir%20en%20grey/%e9%ac%bc%e8%91%ac/01%20%e9%ac%bc%e7%9c%bc%20-kigan-.m4a HTTP/1.0
Accept: */*
Authorization: Basic [...]
Host: [...]
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Date: Tue, 11 Jul 2006 20:19:36 GMT
Server: Apache/1.3.33 (Darwin) mod_perl/1.29
Last-Modified: Thu, 22 Dec 2005 20:03:49 GMT
ETag: "1517d-5ab8c7-43ab06a5"
Accept-Ranges: bytes
Content-Length: 5945543
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: audio/mp4a-latm

---response end---
200 OK
Registered socket 4 for persistent reuse.
Length: 5,945,543 (5.7M) [audio/mp4a-latm]
01 鬼ç%9C¼ -kigan-.m4a: Invalid argument
Disabling further reuse of socket 4.
Closed fd 4

Cannot write to `01 鬼ç%9C¼ -kigan-.m4a' (Invalid argument).
Exit 1

"touch" also fails with that file name:

touch: 01 鬼ç%9C¼ -kigan-.m4a: Invalid argument



--
Jamie Zawinski ***@jwz.org http://www.jwz.org/
***@dnalounge.com http://www.dnalounge.com/
http://jwz.livejournal.com/
Hrvoje Niksic
2006-07-11 21:13:55 UTC
Post by Jamie Zawinski
If I specify -O, it is able to download the data; but if wget is
picking the file name itself, it is unable to write the file
("invalid argument"). Neither --restrict-file-names=unix nor
--restrict-file-names=windows affects it.
It could be that your system expects UTF-8 in file names and rejects
what it figures are invalid UTF-8 sequences. In the general case I
suspect it's impossible to portably guess the file name charset the
file system supports. I thought Unix wouldn't be picky about 8-bit
chars at least in the 160-255 range, but that was apparently too
optimistic.
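
A minimal Python sketch of that hypothesis, not wget's actual code. It assumes, as the wget manual describes for --restrict-file-names=unix, that bytes in the ranges 0-31 and 128-159 are percent-escaped while other high bytes pass through. The function names are hypothetical. Applied to the last path component from the debug log, the escaping splits a three-byte UTF-8 sequence, which would explain the `%9C` in the rejected name:

```python
from urllib.parse import unquote_to_bytes


def escape_control_bytes(name: bytes) -> bytes:
    """Percent-escape bytes 0-31 and 128-159; copy every other byte as-is.

    Hypothetical stand-in for the unix-mode escaping, for illustration only.
    """
    out = bytearray()
    for b in name:
        if b < 0x20 or 0x80 <= b <= 0x9F:
            out += b"%%%02X" % b
        else:
            out.append(b)
    return bytes(out)


def is_valid_utf8(name: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Last path component of the URL in the report, percent-decoded to raw bytes.
raw = unquote_to_bytes("01%20%e9%ac%bc%e7%9c%bc%20-kigan-.m4a")
escaped = escape_control_bytes(raw)

print(is_valid_utf8(raw))      # the decoded URL bytes
print(is_valid_utf8(escaped))  # the same bytes after escaping 0x9C
```

The second character is encoded as E7 9C BC; once 0x9C is replaced by the ASCII text "%9C", the lead byte E7 is no longer followed by a continuation byte, so a file system that insists on valid UTF-8 has grounds to reject the name.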

Maybe we should add something like --restrict-file-names=ascii. It
could be used on brain-damaged file systems and ensure that only
printable ascii chars (32-126) can be used in file names, in addition
to the restrictions of the operating system (so
--restrict-file-names=windows,ascii would also work). In the same
vein, "utf-8" could check for valid UTF-8 sequences.
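
A rough sketch of what the proposed "ascii" mode could do, in Python for brevity. This is an illustration of the idea only: the function name is hypothetical, the option does not exist in wget 1.10.2, and whether a literal "%" should itself be escaped is left open here.

```python
def restrict_to_ascii(name: bytes) -> str:
    """Keep printable ASCII (32-126) and percent-escape every other byte."""
    return "".join(
        chr(b) if 32 <= b <= 126 else "%%%02X" % b
        for b in name
    )


# Raw bytes of the file name from the report (UTF-8 for the two kanji).
print(restrict_to_ascii(b"01 \xe9\xac\xbc\xe7\x9c\xbc -kigan-.m4a"))
```

Because every byte of a multi-byte sequence is escaped, not just some of them, the result is plain ASCII and should be acceptable to any file system, at the cost of an unreadable name.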
Jamie Zawinski
2006-07-11 23:26:38 UTC
Post by Hrvoje Niksic
It could be that your system expects UTF-8 in file names and rejects
what it figures are invalid UTF-8 sequences.
I think that's true: MacOS / HFS+ seems to expect file names to be
"decomposed unicode" in UTF-8. (I gather that means that accents are
separated from their characters.) In the past I've managed to
convert Unicode strings to usable file names by doing something like
this in Perl:

use Encode;
use Unicode::Normalize;

$file = Encode::encode("UTF8",
    Unicode::Normalize::decompose($unicode_string));

Also, http://developer.apple.com/qa/qa2001/qa1173.html
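
For comparison, roughly the same conversion sketched in Python. It assumes standard Unicode NFD is close enough; the decomposition HFS+ applies is Apple's own variant and may differ from NFD for some character ranges, so treat this as an approximation. The function name is hypothetical.

```python
import unicodedata


def to_decomposed_utf8(name: str) -> bytes:
    """Decompose a Unicode string (NFD) and encode the result as UTF-8."""
    return unicodedata.normalize("NFD", name).encode("utf-8")


# A precomposed e-acute becomes "e" followed by a combining acute accent.
print(to_decomposed_utf8("\u00e9"))

# The two kanji from the file name have no decomposition, so the bytes
# match the percent-encoded sequence in the URL (e9 ac bc e7 9c bc).
print(to_decomposed_utf8("\u9b3c\u773c"))
```

For the file name in this thread, decomposition changes nothing; the relevant part is that the bytes stay an intact UTF-8 sequence.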

