August 12, 2020
By: Kevin

File encodings: transcoding in JS and downloading the result locally

  1. Files have an encoding
  2. Encodings can be converted
  3. Transcoding in JS with FileReader
  4. The question

Files have an encoding

Link

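After downloading, file reports the encoding as ISO-8859, and cat prints garbage.

cat is a very simple tool: it does not decode anything, it just writes the file's bytes to the terminal, and the terminal interprets them in its own encoding (usually UTF-8). ISO-8859 is the family of Latin character sets; file presumably reports it because it sees high bytes that are not valid UTF-8 and does not recognise them as GBK.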

file 345201317205065728.csv
345201317205065728.csv: ISO-8859 text, with CRLF line terminators

cat 345201317205065728.csv
���,Ա������,Ա������,ǰ�³�,���³�,��Χ,��Χ,���,�䳤(��),��Χ,�ϱ�Χ
1,����-������ɾ��,,�ο���Χ1~150,45~125,75~180,60~170,30~65,15~90,,20~70

To look at the raw bytes, use whatever tool you prefer; here the file is opened in vim, followed by :%!xxd

00000000: d0f2 bac5 2cd4 b1b9 a4d0 d5c3 fb2c d4b1  ....,........,..   <= each Chinese character takes 2 bytes in this encoding,
00000010: b9a4 b9a4 bac5 2cc7 b0d2 c2b3 a42c baf3  ......,......,..      which is typical of GBK/GB2312
00000020: d2c2 b3a4 2cd0 d8ce a72c d1fc cea7 2cbc  ....,....,....,.
00000030: e7bf ed2c d0e4 b3a4 28d3 d229 2cb0 dace  ...,....(..),...
00000040: a72c c9cf b1db cea7 0d0a 312c c0fd d7d3  .,........1,....
00000050: 2dba f3c6 dac7 ebc9 beb3 fd2c 2cb2 cebf  -..........,,...
00000060: bcb7 b6ce a731 7e31 3530 2c34 357e 3132  .....1~150,45~12
00000070: 352c 3735 7e31 3830 2c36 307e 3137 302c  5,75~180,60~170,
00000080: 3330 7e36 352c 3135 7e39 302c 2c32 307e  30~65,15~90,,20~
00000090: 3730 0d0a                                70..
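
To see why the terminal shows garbage, it can help to decode the first four bytes of the dump under two different assumptions. The following is a minimal JavaScript sketch, not part of the original workflow. It assumes a runtime whose TextDecoder supports the gbk label (browsers do; Node.js needs a full-ICU build, which recent official releases ship by default). The variable names are made up for illustration.

```javascript
// First four bytes of the dump above: d0 f2 ba c5
const head = new Uint8Array([0xd0, 0xf2, 0xba, 0xc5]);

// Decoded as GBK, every two bytes form one Chinese character.
const asGbk = new TextDecoder('gbk').decode(head);

// Decoded as UTF-8 (what most terminals assume), the bytes are not valid
// sequences, so the decoder emits U+FFFD replacement characters instead.
const asUtf8 = new TextDecoder('utf-8').decode(head);

console.log(asGbk);                     // 序号
console.log(asUtf8.includes('\uFFFD')); // true
```

The same bytes give readable text or replacement characters depending only on which encoding the reader assumes; the file itself is not damaged.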

Encodings can be converted

iconv -f gb2312 -t utf-8 345201317205065728.csv > u.csv

file u.csv
u.csv: UTF-8 Unicode text, with CRLF line terminators

cat u.csv
序号,员工姓名,员工工号,前衣长,后衣长,胸围,腰围,肩宽,袖长(右),摆围,上臂围
1,例子-后期请删除,,参考范围1~150,45~125,75~180,60~170,30~65,15~90,,20~70

The raw bytes again. Note that this file has no BOM; the BOM does not matter here, the dump is purely for illustration.

 00000000: e5ba 8fe5 8fb7 2ce5 9198 e5b7 a5e5 a793  ......,......... <= 3 bytes per Chinese character, typical of UTF-8; note the recurring lead byte e5
 00000010: e590 8d2c e591 98e5 b7a5 e5b7 a5e5 8fb7  ...,............
 00000020: 2ce5 898d e8a1 a3e9 95bf 2ce5 908e e8a1  ,.........,.....
 00000030: a3e9 95bf 2ce8 83b8 e59b b42c e885 b0e5  ....,......,....
 00000040: 9bb4 2ce8 82a9 e5ae bd2c e8a2 96e9 95bf  ..,......,......
 00000050: 28e5 8fb3 292c e691 86e5 9bb4 2ce4 b88a  (...),......,...
 00000060: e887 82e5 9bb4 0d0a 312c e4be 8be5 ad90  ........1,......
 00000070: 2de5 908e e69c 9fe8 afb7 e588 a0e9 99a4  -...............
 00000080: 2c2c e58f 82e8 8083 e88c 83e5 9bb4 317e  ,,............1~
 00000090: 3135 302c 3435 7e31 3235 2c37 357e 3138  150,45~125,75~18
 000000a0: 302c 3630 7e31 3730 2c33 307e 3635 2c31  0,60~170,30~65,1
 000000b0: 357e 3930 2c2c 3230 7e37 300d 0a         5~90,,20~70..
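
The three-bytes-per-character pattern can be reproduced directly. This is a small JavaScript sketch using the standard TextEncoder; the helper name utf8Hex is hypothetical and only serves the illustration.

```javascript
// Encode a string as UTF-8 and format the bytes as space-separated hex.
function utf8Hex(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join(' ');
}

console.log(utf8Hex('序'));    // e5 ba 8f
console.log(utf8Hex('序号,')); // e5 ba 8f e5 8f b7 2c
console.log(utf8Hex('A'));     // 41
```

The second line matches the first seven bytes of the dump above. Code points from U+0800 to U+FFFF, which includes common Chinese characters, take three bytes in UTF-8, while ASCII stays at one byte.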

Transcoding in JS with FileReader

The conversion can also happen in the browser. Note that this time I deliberately add a BOM to the output file.

(ns show.core.async
  (:require [ajax.core :as ajax]
            [ajax.protocols :refer [-body]]))

(defn container
  "Klipse container element."
  []
  js/klipse-container)

(defn create-a-add-click!
  "Create a new <a> element, append it to the current Klipse container, and click it to trigger the download."
  [msg]
  (let [el (container)
        a  (.createElement js/document "a")]
    (.setAttribute a "href" (str "data:text/plain;charset=utf-8,%EF%BB%BF" (js/encodeURIComponent msg)))
    (.setAttribute a "download" "my.csv")
    (.appendChild el a)
    (.click a)))

 
(def reader "A FileReader, which lets us name the source encoding when reading." (js/FileReader.))

(set! (.-onload reader) (fn [e]
                          (prn (.-result reader))
                          (create-a-add-click! (.-result reader))))

(ajax/GET "https://aidingshan-qiniu.3vyd.com/store-message/345201317205065728.csv"
            {:response-format {:type :blob
                               :read -body}
             :handler (fn [body]
                        (.readAsText reader body "GB2312"))})   
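
For readers more at home in plain JavaScript, the two core steps (decode GBK, then build a BOM-prefixed data URI) can be sketched without the DOM. This is not the author's code: gbkBytesToCsvDataUri is a hypothetical helper, and it uses TextDecoder in place of FileReader, again assuming the runtime supports the gbk label.

```javascript
// Hypothetical helper: turn GBK-encoded bytes into a data: URI whose
// payload is UTF-8 with a leading BOM.
function gbkBytesToCsvDataUri(bytes) {
  // Step 1: decode the GBK bytes into a JavaScript string.
  const text = new TextDecoder('gbk').decode(bytes);
  // Step 2: percent-encode the string as UTF-8 and prepend the BOM (EF BB BF).
  return 'data:text/plain;charset=utf-8,%EF%BB%BF' + encodeURIComponent(text);
}

// "序号," in GBK, taken from the first hex dump.
const sample = new Uint8Array([0xd0, 0xf2, 0xba, 0xc5, 0x2c]);
const uri = gbkBytesToCsvDataUri(sample);
console.log(uri);
// data:text/plain;charset=utf-8,%EF%BB%BF%E5%BA%8F%E5%8F%B7%2C
```

In a browser, the bytes could come from fetch(url) followed by arrayBuffer(), and the URI would go into the href of an <a download> element, as the ClojureScript above does.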

This time the file has a BOM:

file u-with-bom.csv
u-with-bom.csv: UTF-8 Unicode (with BOM) text, with CRLF line terminators

Verifying the raw bytes:

 00000000: efbb bfe5 ba8f e58f b72c e591 98e5 b7a5  .........,...... <= note the efbbbf added at the front
 00000010: e5a7 93e5 908d 2ce5 9198 e5b7 a5e5 b7a5  ......,.........
 00000020: e58f b72c e589 8de8 a1a3 e995 bf2c e590  ...,.........,..
 00000030: 8ee8 a1a3 e995 bf2c e883 b8e5 9bb4 2ce8  .......,......,.
 00000040: 85b0 e59b b42c e882 a9e5 aebd 2ce8 a296  .....,......,...
 00000050: e995 bf28 e58f b329 2ce6 9186 e59b b42c  ...(...),......,
 00000060: e4b8 8ae8 8782 e59b b40d 0a31 2ce4 be8b  ...........1,...
 00000070: e5ad 902d e590 8ee6 9c9f e8af b7e5 88a0  ...-............
 00000080: e999 a42c 2ce5 8f82 e880 83e8 8c83 e59b  ...,,...........
 00000090: b431 7e31 3530 2c34 357e 3132 352c 3735  .1~150,45~125,75
 000000a0: 7e31 3830 2c36 307e 3137 302c 3330 7e36  ~180,60~170,30~6
 000000b0: 352c 3135 7e39 302c 2c32 307e 3730 0d0a  5,15~90,,20~70..
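
The same check can be done in code. This is a minimal sketch with a hypothetical helper, hasUtf8Bom, that looks for the three BOM bytes at the start of a byte array.

```javascript
// Return true when the byte array starts with the UTF-8 BOM (EF BB BF).
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 &&
    bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf;
}

// "序" in UTF-8, with and without a leading BOM.
const withBom = new Uint8Array([0xef, 0xbb, 0xbf, 0xe5, 0xba, 0x8f]);
const withoutBom = new Uint8Array([0xe5, 0xba, 0x8f]);

console.log(hasUtf8Bom(withBom));    // true
console.log(hasUtf8Bom(withoutBom)); // false

// TextDecoder strips a leading BOM by default, so both decode to "序".
console.log(new TextDecoder('utf-8').decode(withBom) === '序'); // true
```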

If our question were "how do I convert a GB2312 CSV file to UTF-8 and download it?", then it is solved. But... is that really our question?

The question

If the CSV file were UTF-8 encoded in the first place, we would not need to take such a long detour, would we?

Tags: clojurescript