Red Creation Technology Blog: Using Regular Expressions (regex) in clj & cljs

June 12, 2020

By: kevin

Using Regular Expressions (regex) in clj & cljs

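regex
How do you use regex in clj & cljs?
What is the underlying implementation of regex in clj and cljs?
Some matching rules (modes)

i: case-insensitive
m: multi-line mode
s: single-line mode
+? *? non-greedy (lazy) mode

Use cases

Matching
Finding
Replacing
Splitting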

regex

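In both Clojure and ClojureScript, a regular expression is a string literal prefixed with #, for example: #"regex".

The implementations differ greatly, the usage differs only subtly, and the syntax is the same. Unless noted otherwise, every pattern below works in both clj and cljs.

This article covers how to use regular expressions in Clojure. Regular expression syntax itself is out of scope, so please consult other references for that.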

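How do you use regex in clj & cljs?

Via a literal: regex literals differ between languages. Clojure uses #"xxx", a string prefixed with a hash sign, and the reader turns it into a regex automatically.
You can also build one dynamically with a function: (re-pattern "cat") ;=> #"cat"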
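
As a minimal sketch of the dynamic route, the snippet below builds a pattern from a string that is only known at run time. The helper name word-pattern is hypothetical, and it assumes the word contains no regex metacharacters.

```clojure
;; Hypothetical helper: build a case-insensitive pattern for a word
;; that is only known at run time. Assumes `word` has no regex
;; metacharacters such as . or *.
(defn word-pattern [word]
  (re-pattern (str "(?i)" word)))

(re-find (word-pattern "cat") "A CAT sat") ;; => "CAT"
```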

What is the underlying implementation of regex in clj and cljs?

On the JVM, the literal produces a java.util.regex.Pattern

(type #"cat") ;; => java.util.regex.Pattern

On JS, the literal produces a RegExp

(type #"cat") ;; => #object[RegExp]
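
Because the JVM value is an ordinary java.util.regex.Pattern, Java interop works on it directly. The sketch below is JVM-only (it will not compile in cljs), and the name abc-ci is just an illustration.

```clojure
;; JVM only: java.util.regex.Pattern does not exist in cljs.
(import 'java.util.regex.Pattern)

;; Compile with an explicit flag instead of the embedded (?i) prefix.
(def abc-ci
  (Pattern/compile "abc" Pattern/CASE_INSENSITIVE))

(.pattern #"cat")             ;; => "cat"
(re-seq abc-ci "abc-ABC-AbC") ;; => ("abc" "ABC" "AbC")
```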

Some matching rules (modes)

i: case-insensitive

Should matching ignore case? i is short for insensitive.

;; => ("abc" "ABC" "AbC")
(re-seq #"(?i)abc" "abc-ABC-AbC" )

;; => ("abc")
(re-seq #"abc" "abc-ABC-AbC" )
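
One place the i flag tends to be handy is validating free-form input. The following is a small sketch, with a hypothetical predicate yes?, that accepts "y" or "yes" in any letter case.

```clojure
;; Hypothetical predicate: true when the whole answer is "y" or "yes",
;; ignoring case. (?:...) is a non-capturing group, so re-matches
;; returns a plain string rather than a vector.
(defn yes? [answer]
  (some? (re-matches #"(?i)y(?:es)?" answer)))

(yes? "YES") ;; => true
(yes? "y")   ;; => true
(yes? "yep") ;; => false
```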

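m: multi-line mode

Should line breaks be treated as special characters? m is short for multi-line: ^ and $ also match at line breaks. s is short for single-line: . also matches line breaks.

;;;;;; multi-line mode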
;; => nil
(re-seq #"^a.*b$" "ab\nab")

;; => ("ab" "ab")
(re-seq #"(?m)^a.*b$" "ab\nab")

;; => ("ab" "ab")
(re-seq #"(?m)^a.*b$" "ab\n\rab")

s: single-line mode

;;;;;; single-line mode
;; => ("ab\n\rab")  ;; line breaks are treated as ordinary characters too
(re-seq #"(?s)^a.*b$" "ab\n\rab")
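
To illustrate multi-line mode on something closer to real input, here is a sketch that picks the key=value lines out of a small block of text. The sample string config-text is made up for the example.

```clojure
;; Made-up sample input: two key=value lines around a comment line.
(def config-text "a=1\n# comment\nb=2")

;; Without (?m), ^ and $ only anchor to the whole input.
(re-seq #"^\w+=.*$" config-text)     ;; => nil

;; With (?m), ^ and $ anchor to every line.
(re-seq #"(?m)^\w+=.*$" config-text) ;; => ("a=1" "b=2")
```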

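+? *? non-greedy (lazy) mode

Should greediness be limited?

;; When extending to the right, * and + are greedy by default: they keep going until the last position that still satisfies the pattern.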
;; => ("abcdxyzc")
(re-seq #"a.*c" "abcdxyzc")

;; => ("abcdxyzc")
(re-seq #"a.+c" "abcdxyzc")

;; Adding a question mark (?) after a greedy quantifier makes it stop at the first possible match.
;; => ("abc")
(re-seq #"a.+?c" "abcdxyzc")

;; => ("abc")
(re-seq #"a.*?c" "abcdxyzc")
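
A typical case where the difference matters is delimiter-enclosed text. The sketch below, using a made-up html string, contrasts the two quantifiers on angle-bracket tags.

```clojure
;; Made-up sample input.
(def html "<b>bold</b>")

;; Greedy: .+ runs to the last > in the string.
(re-seq #"<.+>" html)  ;; => ("<b>bold</b>")

;; Lazy: .+? stops at the first > it can.
(re-seq #"<.+?>" html) ;; => ("<b>" "</b>")
```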

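Use cases

Matching

(Use a sieve as wide as the whole string.) For this, use re-matches.

;; The result is one of the following:
(re-matches #"abc" "zzzabcxxx") ;;=> nil because the whole string does not match.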

;;=> "abc" A full match; the match is returned.
(re-matches #"abc" "abc")

;;=> "abcxyz" A full match using a wildcard; the match is returned.
(re-matches #"abc.*" "abcxyz")

;;=> ["abcxyz" "xyz"] A full match using a wildcard, plus a group, which gives a destructuring-like result.
(re-matches #"abc(.*)" "abcxyz")

;; Together with (group), the result can also be destructured

;; prints: Kevin Li
(let [[_ first-name last-name] (re-matches #"(\w+)\s(\w+)" "Kevin Li")]
  (if first-name ;; successful match
      (println first-name last-name)
      (println "Unparsable name")))
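
The same re-matches plus destructuring idea can be wrapped in a function that returns data instead of printing. This is only a sketch: parse-date is a hypothetical name, and the parts are left as strings to keep it portable between clj and cljs.

```clojure
;; Hypothetical helper: split an ISO-style date into its parts.
;; Returns nil when the whole string does not match.
(defn parse-date [s]
  (when-let [[_ year month day] (re-matches #"(\d{4})-(\d{2})-(\d{2})" s)]
    {:year year :month month :day day}))

(parse-date "2020-06-12")    ;; => {:year "2020", :month "06", :day "12"}
(parse-date "June 12, 2020") ;; => nil
```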

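Finding

(Use a small sieve to find the parts that match locally.) For this, use re-find and re-seq.

;; To find only the first local match, use re-find
;; The result is one of the following:

;=> nil Nothing got through the sieve.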
(re-find #"cat" "the best of best")

;=> "best" Stops at the first match.
(re-find #"best" "the best of best")

;=> "best of best" re-find with a wildcard can be puzzling: the greedy .* runs to the last t, which may not be what you wanted.
(re-find #"b.*t" "the best of best")

;=> "best" Limiting the greediness is probably what you meant.
(re-find #"b.*?t" "the best of best")

;=> ["best" "be"] Groups still work, so the result can be destructured.
(re-find #"(be)st" "the best of best")

;; To find every local match, use re-seq
;; The result is one of the following:

;=> nil
(re-seq #"cat" "the best of best")

;=> ("best" "best") Does not stop until every match has been found.
(re-seq #"best" "the best of best")

;=> (["best" "be"] ["best" "be"])
(re-seq #"(be)st" "the best of best")
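
Since re-seq with groups yields one vector per match, the matches can be destructured in bulk. A minimal sketch, with a hypothetical parse-pairs helper and a made-up input format:

```clojure
;; Hypothetical helper: collect every key=value occurrence into a map.
;; Each match is [whole key value]; the whole match is ignored.
(defn parse-pairs [s]
  (into {}
        (map (fn [[_ k v]] [k v]))
        (re-seq #"(\w+)=(\w+)" s)))

(parse-pairs "a=1;b=2") ;; => {"a" "1", "b" "2"}
(parse-pairs "nothing") ;; => {}
```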

Replacing

To replace every occurrence, use clojure.string/replace

;=> "the beast of beast" Both occurrences of be are replaced.
(clojure.string/replace "the best of best" #"be" "bea")

;=> "theee beeest of beeest"
(clojure.string/replace "the best of best" #"(e)" "$1$1$1")

;=> "the bank of bank" Side note: each . matches any single character, so ... stands for three of them.
(clojure.string/replace "the best of best" #"b..." "bank")

;=> "the beauty of beauty" replace can also take a function, combined with destructuring.
(clojure.string/replace "the best of best" #"(be.)(.)"
                        (fn [[_ a b]]
                          (str "beau"
                               (str b "y"))))
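
When the pattern has no groups, the replacement function receives the matched string itself rather than a vector. A small sketch of that case, with a hypothetical capitalize-words helper (the require form assumes a JVM REPL; in cljs the alias would go in the ns form):

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: upper-case the first letter of every word.
;; \b\w has no groups, so str/upper-case is called with a string.
(defn capitalize-words [s]
  (str/replace s #"\b\w" str/upper-case))

(capitalize-words "the best of best") ;; => "The Best Of Best"
```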

To replace only the first occurrence, use clojure.string/replace-first

(defn id-replaced-by-new-initial
  "By convention, the leading initial of an id is usually the company_id, e.g. Woo7, 邦8, 城9."
  [id old-initial new-initial]
  (if (clojure.string/starts-with? id old-initial)
      ;; replace-first leaves any later 7/8/9 untouched; re-pattern builds the
      ;; regex at run time, so old-initial is interpreted as a regex.
      (clojure.string/replace-first id (re-pattern old-initial) new-initial)
      id))

;=> "9008"
(id-replaced-by-new-initial "8008" "8" "9")
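
One caveat with building the pattern from old-initial is that regex metacharacters in it are interpreted. As a hedged alternative sketch, clojure.string/replace-first also accepts a plain string as the match, which is taken literally. The name replace-prefix is hypothetical, and the require form assumes a JVM REPL.

```clojure
(require '[clojure.string :as str])

;; Hypothetical variant: a string match is taken literally,
;; so no regex is involved and nothing needs escaping.
(defn replace-prefix [id old-prefix new-prefix]
  (if (str/starts-with? id old-prefix)
    (str/replace-first id old-prefix new-prefix)
    id))

(replace-prefix "8008" "8" "9")     ;; => "9008"
(replace-prefix "a.b.c" "a." "x-")  ;; => "x-b.c"

;; For contrast, as a regex the dot matches any character:
(str/replace-first "a.b.c" #"." "-") ;; => "-.b.c"
```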

Splitting

Split on whitespace

;=> ["This" "is" "a" "string" "that" "I" "am" "splitting."]
(clojure.string/split "This is a string    that I am splitting." #"\s+")
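
Two further variations on split, sketched under the assumption of a JVM REPL for the require form: a delimiter with optional surrounding whitespace, and the optional limit argument that caps the number of pieces. The helper names are hypothetical.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: split on commas, swallowing spaces around them.
(defn split-csv [line]
  (str/split line #"\s*,\s*"))

;; Hypothetical helper: split into at most two pieces at the first =.
(defn split-key-value [line]
  (str/split line #"=" 2))

(split-csv "a, b,c ,d")     ;; => ["a" "b" "c" "d"]
(split-key-value "key=a=b") ;; => ["key" "a=b"]
```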

Tags: regex
