Raw strings

Six months ago, while I was reading the C source of the Emacs reader, I tried to implement raw strings in Emacs. This post was supposed to be published earlier, but I had a lot of work in between, I’m still not very comfortable writing in English, and I had a hosting problem. Anyway, here it is.

A raw string is a special syntax for string literals in which the content is taken literally (in particular the \ character), i.e. nothing can be escaped or interpolated. Several programming languages support them, for example:

Python: r"aa\naa"    r"""aa\n"aa"""
Perl:   'aa\naa'     q{aa\n'aa}
C++11:  R"(aa\naa)"  R"foo(aa\n)aa)foo"

Raw strings are very useful for regexes. Every time you need to match a character that also happens to be a meta-character (like + or \), you have to escape it. And since the regex is written in a string literal, you have to escape the escape character as well, because the regex syntax and the string syntax both use \ for escaping. This process is painful and error-prone. Search for "backslash hell" or "backslashitis" for some examples.
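
To make the double escaping concrete, here is a small sketch in Python (chosen only because it already has raw strings; it is not Emacs Lisp). The pattern and the variable names are mine, picked for illustration: the regex matches a literal backslash followed by one or more + characters.

```python
import re

# The same regex written twice: once as a regular string literal,
# once as a raw string.
escaped = "\\\\\\++"  # string-level escaping stacked on regex-level escaping
raw = r"\\\++"        # only the regex-level escaping is left

# Both literals produce the same 5-character pattern: \\\++
print(escaped == raw)  # True

# The pattern matches a backslash followed by two plus signs.
print(re.fullmatch(raw, "\\++") is not None)  # True
```

The raw version is exactly what the regex engine sees, so there is only one level of escaping to keep in your head.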

Back to Emacs. I wrote a working proof of concept in the form of two patches to the reader function:

  • Triple-quoted strings (à la Python) (diff)
  • Custom-delimiter strings (à la Perl/sed) (diff)

The code is not very clean and may be buggy, since most of it comes from the code for the regular string syntax, but it works:

# Python
$ ./emacs -Q -batch --eval '(message #r"""ha"\nha""")'

# Perl
$ ./emacs -Q -batch --eval '(message #r,ha"\nha,)'
$ ./emacs -Q -batch --eval '(message #r~ha"\nha~)'

Although the reader works, some minor parts of Emacs break in the presence of raw strings (sexp navigation, font-locking, C-x C-e, …). These parts of the environment need to be made aware of the new syntax, which shouldn’t be too hard to fix.

At this point I posted my results to the emacs-devel mailing list, which led to an interesting discussion. There was no clear consensus, but I think most people realized that raw strings are not a satisfying solution to the regex problem. Some would rather have a way to write custom syntax readers in Lisp, which is nice but hard to implement. Others said you’re better off using rx.

rx is a macro that lets you write readable regexes in the form of s-expressions:

(rx (+ "abc") "foo" (group (or "zob" "foo")))
=> "\\(?:abc\\)+foo\\(\\(?:foo\\|zob\\)\\)"

I personally think raw strings have their uses outside of regexes and would be a nice addition to the Emacs Lisp language. As for regexes, I now write mine with rx all the time. I just wish there was a built-in way to use rx in the interactive search/replace functions. I will work on this eventually if nobody has done it already.

That’s all for today.

