An Improved Liberal, Accurate Regex Pattern for Matching URLs

silentbicycle · on July 28, 2010

This is the point at which it's worth learning actual parsing tools, rather than just winging it with REs. REs are fine for tokenizing, but cannot handle recursion, and quickly become clumsy for patterns made of distinct sub-elements.

Once you sink deeper into that Turing tarpit, you end up with monstrosities like this (http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html), an RE for matching valid email addresses.
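To make the recursion point concrete, here is a minimal Python sketch (not from the article; the function name `balanced` is made up for illustration) of the kind of check a classic regular expression cannot express: arbitrarily nested parentheses, handled by a function that calls itself.

```python
def balanced(s):
    """Return True if the parentheses in s are properly nested."""

    def parse(i):
        # Consume characters from index i until a ')' or the end of input,
        # and return the index where consumption stopped.
        while i < len(s) and s[i] != ")":
            if s[i] == "(":
                i = parse(i + 1)  # recurse into the nested group
                if i >= len(s) or s[i] != ")":
                    raise ValueError("unclosed '('")
                i += 1  # step over the matching ')'
            else:
                i += 1
        return i

    try:
        # The input is balanced only if the top-level parse consumes it all;
        # stopping early means a stray ')' was found.
        return parse(0) == len(s)
    except ValueError:
        return False
```

The nesting depth is tracked by the call stack, which is exactly the memory a finite automaton lacks.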

jacobolus · on July 28, 2010

Except the whole point of this thing, as clearly explained in the article, is for everyone with access to a regexp implementation to be able to reuse the same few-line regexp, as a drop-in replacement that works better than the shitty regexps they currently use to recognize URLs. (Examples of currently bad matchers that might benefit from this code: Hacker News’s, Gmail’s.)

When they get a tiny bit clumsy, but before they get so clumsy that adding a bunch of parsing machinery is really worth the trouble, regexps are still the best solution.

* * *

Markdown, on the other hand, John Gruber’s more famous project, would be much improved (especially for people interested in extending it) by having its specification written in terms of a real grammar.

silentbicycle · on July 28, 2010

The article mentions that doing it better would require lots of nonstandard extensions, but not that it's struggling in the first place because it's the wrong tool for the job. If more developers realized the limitations of using just REs for these tasks, languages would be better integrated with actual parsing tools.

I think that tools like LPEG (http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html) are a step in the right direction - it's a small Lua library* that provides PEG-based parsing. It's very powerful, but also scales down to use as a more efficient, more expressive PCRE replacement. There are various trade-offs in using PEGs rather than LALR(1), LL(1), etc., but the integration with the rest of the language is very good. It feels like using a better form of REs, rather than "adding a bunch of parsing machinery".

* Though nothing about its design is Lua-specific. There's a paper which explains its underlying mechanism, and it's just a small C library (2258 loc for v. 0.9) - porting it wouldn't be that difficult.

I agree with you about an actual markdown grammar, though it's useful enough that I have a hard time complaining. (I particularly like Discount (http://www.pell.portland.or.us/~orc/Code/discount/).)

_delirium · on July 28, 2010

I'm not aware of any parsing tools that make it particularly easy to scan large amounts of arbitrary text for substrings matching particular criteria. It's probably possible to do using lex/yacc, but it's certainly not straightforward, and I wouldn't be surprised if the matching were much slower than a good regexp engine. Parsers seem to be aimed mainly at the case of highly-structured inputs where the stuff you want to find is the entirety of the input, as with source code. Not so easy to use for the case where you have potentially gigabytes of noise, with some signal hiding in it that you want to recognize.
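As a rough illustration of the scanning case, here is a minimal Python sketch that walks an iterable of lines and yields every hit of a deliberately crude pattern. The names `URL_START` and `scan` are hypothetical, and the pattern is far simpler than the one in the article; the point is only that the regexp engine skips over the noise for you.

```python
import re

# Assumption: a crude stand-in pattern, not the regex from the article.
URL_START = re.compile(r"https?://\S+")


def scan(lines):
    """Yield (line_number, matched_text) for every hit in an iterable of lines."""
    for lineno, line in enumerate(lines, start=1):
        # finditer searches anywhere in the line, so unstructured text
        # around the matches needs no grammar of its own.
        for m in URL_START.finditer(line):
            yield lineno, m.group(0)
```

Because `scan` is a generator over lines, it can be fed an open file object and never needs the whole input in memory.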

bruceboughton · on July 28, 2010

Agreed. I tried implementing a parser for an island grammar (Django template syntax) using lex/yacc-type tools. It was hard to find any examples of such using these tools. In the end, I wrote it by hand (I'm sure there were better ways to go about this).

pornel · on July 28, 2010

For those wishing to use heavy machinery I recommend Adium's library:

http://trac.adium.im/wiki/AutoHyperlinksFramework

It's tweaked to properly support lots of edge cases.

nerfhammer · on July 28, 2010

Honest question: why hasn't anyone come up with a system for "context-free" expressions?

silentbicycle · on July 28, 2010

There are plenty of tools for dealing with context-free grammars. I'm not sure what distinction you're making by referring to context-free expressions.

I like ocamllex + ocamlyacc, or LPEG for lighter stuff. Flex and yacc are ok, but C is really not my first choice for string-heavy stuff.

nerfhammer · on July 30, 2010

"context-free expression" = something that is to context-free grammars what a regular expression is to regular grammars

silentbicycle · on July 30, 2010

I'm still not 100% sure, but I think LPEG (http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html) fits the bill. Ignore for now that it's a Lua library - there's nothing Lua-specific about its underlying design.

There's a great paper (http://www.inf.puc-rio.br/~roberto/docs/peg.pdf) about it. It's a PEG-based string pattern-matching engine. It can build parse trees out of strings or execute code on portions of them just as easily as it can validate their structure, though in the documentation that's a bit of an afterthought. It's also very fast, and integrates as nicely with Lua as regular expressions do in Perl. It's also just a smallish C library that could be (and maybe already has been) ported to Python / Ruby / whatever.

njharman · on July 28, 2010

Why bother matching balanced parens? Just match everything up to the first non-encoded space.

These are valid urls, no?

  example.com/(
  example.com/dkjflkj)sdkfj(/.
  example.com/.
  example.com//////:
  example.com/anycharacters_in_any_order_as_long_as_certain_ones_are_encoded

There's no way to tell whether trailing punctuation is part of a valid URL or not. You can assume trailing punctuation is not, and chop it off, which should be correct 99.999% of the time. Similarly with surrounding braces, brackets, and parens: if you see one at the start, assume the one at the end is not part of the URL.
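A minimal Python sketch of that heuristic, under stated assumptions: a candidate is any run of non-whitespace that starts with http://, https://, or www., and the names `CANDIDATE`, `TRAILING_PUNCT`, `CLOSERS`, and `find_urls` are all made up for illustration.

```python
import re

# Assumption: what counts as a URL candidate is a simplification,
# not the pattern from the article.
CANDIDATE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

TRAILING_PUNCT = ".,:;!?'\""
CLOSERS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def find_urls(text):
    """Return candidate URLs with trailing punctuation and wrappers chopped off."""
    urls = []
    for m in CANDIDATE.finditer(text):
        url = m.group(0).rstrip(TRAILING_PUNCT)
        # If an opening bracket sits right before the match, assume the
        # matching closer at the end is not part of the URL.
        opener = text[m.start() - 1] if m.start() > 0 else ""
        closer = CLOSERS.get(opener)
        if closer and url.endswith(closer):
            url = url[:-1].rstrip(TRAILING_PUNCT)
        urls.append(url)
    return urls
```

For example, `find_urls("See (http://example.com/foo).")` returns `["http://example.com/foo"]`. The cost of the shortcut is that a URL which really does end in punctuation gets truncated too.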

drivebyacct2 · on July 28, 2010

Ironically, that solution doesn't work for lots of Wikipedia links that end in a paren.
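One hedged way around that, sketched in Python: only drop a trailing ")" while the URL has more closing than opening parens. The function name and the example URLs are made up for illustration, and this is a simplification rather than what the article's regex does.

```python
def trim_unbalanced_parens(url):
    """Strip trailing ')' characters that have no matching '(' inside the URL."""
    # Keep chopping only while closers outnumber openers, so a balanced
    # pair such as "_(bar)" at the end of the URL survives.
    while url.endswith(")") and url.count(")") > url.count("("):
        url = url[:-1]
    return url
```

With this, `http://example.com/wiki/Foo_(bar)` is left alone, while `http://example.com/foo)` loses its stray closer.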

_delirium · on July 28, 2010

There's actually a lot of Wikipedia articles that end with punctuation, which I get bitten by at HN relatively frequently. For example, there are a bunch of Supreme Court cases involving companies that end with an "L.L.C.", "Inc." or "Co.", like: http://en.wikipedia.org/wiki/Riegel_v._Medtronic,_Inc.

santry · on July 28, 2010

What's the purpose of matching "www.", "www1.", "www2." … "www999."? For purposes of this regex, wouldn't the domain matching that follows be sufficient?

mturmon · on July 28, 2010

referring to

http://daringfireball.net/misc/2010/07/url-matching-regex-te...

you can see he wants to match things like www.example.com, but not filename.txt

So "www." can also introduce a URL.
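A simplified Python sketch of that idea, not the article's actual regex: accept a token if it starts with a scheme followed by "://" or with "www" plus up to three digits and a dot. The names `STARTS_URL` and `looks_like_url` are hypothetical.

```python
import re

# Assumption: two ways a URL may begin, either "scheme://" or "www",
# optionally followed by up to three digits, then a dot.
STARTS_URL = re.compile(r"\b(?:[a-z][\w-]+://|www\d{0,3}\.)\S+", re.IGNORECASE)


def looks_like_url(token):
    """Return True if the token begins the way a URL is assumed to begin."""
    return STARTS_URL.match(token) is not None
```

Here `looks_like_url("www.example.com")` and `looks_like_url("www2.example.com")` are true, while `looks_like_url("filename.txt")` is false, because a bare name-dot-extension token has neither a scheme nor a "www." prefix to anchor on.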

eli · on July 28, 2010

But what about my .museum domains?