12

I am trying to convert all URLs in a textarea input ($_POST['content']) to links.

    $content = preg_replace(
        '!(\s|^)((https?://)+[a-z0-9_./?=&-]+)!i',
        ' <a href="$2" target="_blank">$2</a> ',
        nl2br($_POST['content'])." "
    );
    $content = preg_replace(
        '!(\s|^)((www\.)+[a-z0-9_./?=&-]+)!i',
        '<a target="_blank" href="http://$2" target="_blank">$2</a> ',
        $content." "
    );

Target link formats: www.hello.com or http(s)://(www).hello.com

But this seems to break any iframe, image or similar.

What is the right regex (or regexes) that will ignore URLs inside HTML tags?

Note: I know I need two expressions; one to detect no protocol links (like www.hello.com, so I need to prepend it) and another one to detect urls with protocol (so no need to prepend).

3
  • Can you give an example that is broken? Due to the (\s|^), this will only match if there is a space or the beginning of the string in front of the URL. But within iframe and img you should have a " there, shouldn't you? Commented Sep 25, 2012 at 20:28
  • Will the posted content always contain HTML? Commented Sep 27, 2012 at 13:25
  • @Jack always should support it. Commented Sep 27, 2012 at 13:36

4 Answers

19
+50

Your code as it is should not be much of a problem within iframes and so on, because in there you usually have a " in front of your URL and not a space, as your pattern requires.

However, here is a different solution. It might not work 100% if you have a single < or > within HTML comments or something similar (and I do not know whether this is a problem for you or not). But in any other case, it should serve you well. It uses a negative lookahead to make sure that there is no closing > before the next opening < (because that would mean you are inside a tag).

    $content = preg_replace(
        '$(\s|^)(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i',
        ' <a href="$2" target="_blank">$2</a> ',
        $content." "
    );
    $content = preg_replace(
        '$(\s|^)(www\.[a-z0-9_./?=&-]+)(?![^<>]*>)$i',
        '<a href="http://$2" target="_blank">$2</a> ',
        $content." "
    );

In case you are not familiar with this technique, here is a bit more elaboration.

    (?!      # starts the lookahead assertion; now your pattern will only match if this subpattern does not match
      [^<>]  # any character that is neither < nor >; the > is not strictly necessary but might help for optimization
      *      # arbitrarily many of those characters (but in a row; so not a single < or > in between)
      >      # the closing >
    )        # ends the lookahead subpattern

Note that I changed the regex delimiters, because I am now using ! within the regex.
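To see the lookahead at work, here is a minimal check on a made-up sample string (the sample text and variable names are assumptions for illustration). The URL in the running text should be linked, while the one inside the src attribute should be left alone.

```php
<?php
// Hypothetical sample: one URL in plain text, one inside an <img> tag.
$content = 'See http://example.com/page and <img src="http://example.com/a.png">';

// The negative lookahead rejects a match whenever a '>' follows before
// any '<', which is the case for a URL sitting inside a tag.
$linked = preg_replace(
    '$(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i',
    '<a href="$1" target="_blank">$1</a>',
    $content
);

echo $linked, "\n";
// See <a href="http://example.com/page" target="_blank">http://example.com/page</a> and <img src="http://example.com/a.png">
```

Note that shortening the URL by backtracking does not help the engine sneak a match into the tag: from every position inside the attribute, the next angle bracket is still the closing >.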

Unless you need the first subpattern (\s|^) for the URLs outside of tags as well, you can now remove that, too (and decrease the capture variables in the replacement).

    $content = preg_replace(
        '$(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i',
        ' <a href="$1" target="_blank">$1</a> ',
        $content." "
    );
    $content = preg_replace(
        '$(www\.[a-z0-9_./?=&-]+)(?![^<>]*>)$i',
        '<a href="http://$1" target="_blank">$1</a> ',
        $content." "
    );

And lastly... do you intend not to replace URLs that contain anchors at the end? E.g. www.hello.com/index.html#section1? If you missed this by accident, add the # to your allowed URL characters:

    $content = preg_replace(
        '$(https?://[a-z0-9_./?=&#-]+)(?![^<>]*>)$i',
        ' <a href="$1" target="_blank">$1</a> ',
        $content." "
    );
    $content = preg_replace(
        '$(www\.[a-z0-9_./?=&#-]+)(?![^<>]*>)$i',
        '<a href="http://$1" target="_blank">$1</a> ',
        $content." "
    );

EDIT: Also, what about + and %? There are also a few other characters that are allowed to appear in a URL without being encoded. See this. END OF EDIT
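As a rough illustration of why the character class matters, the following sketch compares the original class with one extended by #, + and %. The sample URL is made up, and ~ is used as the delimiter here only for readability; which characters you actually want to allow is your call.

```php
<?php
// Hypothetical sample URL containing '+', '%' and '#'.
$text = 'Go to http://example.com/search?q=a+b%20c#top now';

// Original character class: the match stops at the first '+'.
preg_match('~https?://[a-z0-9_./?=&-]+(?![^<>]*>)~i', $text, $narrow);

// Extended class: '#', '+' and '%' are allowed as well.
// Inside a character class none of them needs escaping.
preg_match('~https?://[a-z0-9_./?=&#+%-]+(?![^<>]*>)~i', $text, $wide);

echo $narrow[0], "\n"; // http://example.com/search?q=a
echo $wide[0], "\n";   // http://example.com/search?q=a+b%20c#top
```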

I think this should do the trick for you. However, if you could provide an example that shows working and broken URLs (with the code you have), we could actually provide solutions that are tested to work for all of your cases.

One final thought. The proper solution would be to use a DOM parser. Then you could simply apply the regex you already have only to text nodes. However, your concern for the HTML structure is very restricted, and that makes your problem regular again (as long as you do not have unmatched '<' or '>' in HTML comments or JavaScript or CSS on the page). If you do have those special cases, you should really look into a DOM parser. None of the solutions presented here (so far) will be safe in that case.
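For reference, a DOM-based version could look roughly like the sketch below. It is only an outline under some assumptions: linkifyTextNodes() and the wrapper id are hypothetical names, the URL pattern is deliberately loose, and character-encoding handling is left out (so treat non-ASCII input with care).

```php
<?php
// linkifyTextNodes() is a hypothetical helper name used only for this sketch.
function linkifyTextNodes(string $html): string
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy HTML
    $doc = new DOMDocument();
    // Wrap the fragment in a known <div> so it can be pulled back out later.
    // Assumption: input is ASCII, or the encoding is handled elsewhere.
    $doc->loadHTML('<div id="linkify-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="linkify-root"]')->item(0);
    // Only text nodes that are not already inside a link, script or style.
    $textNodes = $xpath->query(
        './/text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]',
        $root
    );

    foreach (iterator_to_array($textNodes) as $node) {
        // Odd-indexed parts are the captured URLs, even-indexed parts the text around them.
        $parts = preg_split(
            '~((?:https?://|www\.)[^\s<>"]+)~i',
            $node->nodeValue,
            -1,
            PREG_SPLIT_DELIM_CAPTURE
        );
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                // Prepend a protocol for bare www. links.
                $href = stripos($part, 'www.') === 0 ? 'http://' . $part : $part;
                $link = $doc->createElement('a');
                $link->setAttribute('href', $href);
                $link->setAttribute('target', '_blank');
                $link->appendChild($doc->createTextNode($part));
                $fragment->appendChild($link);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper, not the added <html>/<body>.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo linkifyTextNodes('See http://example.com/page and <img src="http://example.com/a.png">'), "\n";
```

Because only text nodes are ever rewritten, attribute values such as src and href are never candidates for replacement, no matter what stray < or > characters appear in comments or scripts.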


4 Comments

This is exactly what I needed. Thank you! Is it ok to add the + and % to the string like that, or do they need a /
@betaman I suppose you meant a backslash? If you put them inside the character class, then they don't need to be escaped, no. Outside of a character class, the + has to be escaped, but the % does not.
After trying many solutions this one is the one that did it. As I want to keep existing HTML intact and replace only links in the text. And I learned a bit more on regular expression. Thanks m.buettner!
This seems to work. An example of this code can be found on sandbox.onlinephpfunctions.com/code/…
18
  1. In my opinion, a URL is everything that starts with https?:// and ends with whitespace or the end of the line (a vertical space, the so-called newline).
  2. Because of the first point, images, links etc. will not be replaced, because they all start with " or > (except if a link such as <a href=" http..."> starts with a space, but that is invalid HTML).
  3. The m modifier makes ^ and $ match at the start and end of every line, not just of the whole string (so that the matching described in the first point works).
  4. The function nl2br() should be called after the replacement (because of links that start at the beginning of a line).
  5. Space before and after are added only if space originally exists in the $content (see $1 and $3 in the second parameter of the preg_replace() function).
  6. This solution supports domain names with special characters, like www.moški.si.

Input:

INPUT

Code:

    <?php
    $content = preg_replace(
        '~(\s|^)(https?://.+?)(\s|$)~im',
        '$1<a href="$2" target="_blank">$2</a>$3',
        $content
    );
    $content = preg_replace(
        '~(\s|^)(www\..+?)(\s|$)~im',
        '$1<a href="http://$2" target="_blank">$2</a>$3',
        $content
    );
    $content = nl2br($content);

Output:

Output

Edit:

Example of links without https?:// prefixes + example of single preg_replace() call (patterns & replacements are array):

    $content = preg_replace(
        array(
            '~(\s|^)(www\..+?)(\s|$)~im',
            '~(\s|^)(https?://)(.+?)(\s|$)~im',
        ),
        array(
            '$1http://$2$3',
            '$1<a href="$2$3" target="_blank">$3</a>$4',
        ),
        $content
    );
    $content = nl2br($content);
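One thing to watch with the trailing (\s|$) group: it consumes the whitespace after a URL, so a second URL separated from the first by a single space has nothing left for (\s|^) to match. The sketch below shows the effect on a made-up input. The lookahead variant is an assumption added for illustration, not part of the answer's code.

```php
<?php
// Hypothetical input: two URLs separated by a single space.
$content = 'http://a.com http://b.com';

// Original form: the trailing (\s|$) consumes the space between the URLs,
// so the second URL has no leading whitespace left to match against.
$consuming = preg_replace(
    '~(\s|^)(https?://.+?)(\s|$)~im',
    '$1<a href="$2" target="_blank">$2</a>$3',
    $content
);

// Variant: only assert the boundary with a lookahead, consuming nothing.
$lookahead = preg_replace(
    '~(\s|^)(https?://.+?)(?=\s|$)~im',
    '$1<a href="$2" target="_blank">$2</a>',
    $content
);

echo substr_count($consuming, '<a '), "\n"; // 1
echo substr_count($lookahead, '<a '), "\n"; // 2
```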


7 Comments

The more downvotes, the smaller your chances of getting that bounty if it gets auto-allocated.
I don't care about the bounty! I care about knowledge. If my answer is incorrect, I would like to know WHY. Is that too much to ask from downvoters?
I just told you what reason downvoters may have had to downvote.
If that is true, I can write only this: OMG and LOL! If this is really the reason, I will never again reply to questions with bounty.
@glavic I upvoted both your answer and m-buettner's, but note that he answered this correctly before you. I tested both answers and they both work, albeit yours looks like a smaller (better) regex and doesn't include the restrictive a-z0-9 portion, since domain names can now contain many more characters and be in different languages.
4

This has been done hundreds of times before. On this page, both m-buettner's and glavić's answers work fine, although I like glavić's shorter expression.

Here's a good php resource to do it: http://code.iamcal.com/php/lib_autolink/

Repeats on Stackoverflow:

Decent in-depth article: http://buildinternet.com/2010/05/how-to-automatically-linkify-text-with-php-regular-expressions/

1 Comment

THIS ANSWER REALLY HELPED! I used "autolink" and it worked very well, especially since autolink did NOT replace a link inside the src attribute of an image.
3

Let me suggest something less straightforward: split the input text into HTML and non-HTML parts, then process the non-HTML parts with your regexp and combine the text back into one piece. Something like:

    <?php
    $chunks = preg_split('/(<.*>)/Ums', $_POST['content'], -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $result = '';
    foreach ($chunks as $chunk) {
        if (substr($chunk, 0, 1) != '<') {
            /* do your processing on $chunk */
        }
        $result .= $chunk;
    }
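Filled in, the processing step might look like the sketch below. This is only one possible way to do it: linkifyOutsideTags() is a hypothetical name, the URL character class is borrowed from the question, and the m modifier is dropped from the split pattern because the pattern contains no anchors.

```php
<?php
// linkifyOutsideTags() is a hypothetical name used only for this sketch.
function linkifyOutsideTags(string $html): string
{
    // Split into tags (kept thanks to DELIM_CAPTURE) and the text between them.
    $chunks = preg_split(
        '/(<.*>)/Us',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $result = '';
    foreach ($chunks as $chunk) {
        if ($chunk[0] !== '<') {
            // One pass for both link formats, so a "www." inside an already
            // matched "http://www..." URL is not linked a second time.
            $chunk = preg_replace_callback(
                '~\b(?:https?://|www\.)[a-z0-9_./?=&#-]+~i',
                function (array $m): string {
                    $url  = $m[0];
                    // Prepend a protocol only for bare www. links.
                    $href = stripos($url, 'www.') === 0 ? 'http://' . $url : $url;
                    return '<a href="' . $href . '" target="_blank">' . $url . '</a>';
                },
                $chunk
            );
        }
        $result .= $chunk;
    }
    return $result;
}

echo linkifyOutsideTags('Visit www.hello.com or <img src="http://hello.com/a.png"> http://hello.com/x'), "\n";
```

Like the lookahead approach, this still treats anything between < and > as a tag, so stray angle brackets in comments or scripts can confuse it.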

Some additional advice:

  1. Try to save the source text and do the transformation when displaying it. This will allow you to improve or fix your rendering code if you find a new problem or idea in the future.
  2. (https?://)+ shouldn't be in parentheses and you don't need the +, because it also matches "https://https://some.com". Just use https?://[a-z0-9_./?=&-]+
  3. The same goes for (www.)+ :)
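A quick check of the second point, using anchored versions of the two patterns so that only a full-string match counts (the anchors and variable names are added here purely for the demonstration):

```php
<?php
$text = 'https://https://some.com';

// Quantified group: the protocol may repeat, so the whole string matches.
$withGroup = preg_match('!^(https?://)+[a-z0-9_./?=&-]+$!i', $text);

// Plain protocol: the second "https:" contains a ':' that the character
// class does not allow, so the whole string is rejected.
$withoutGroup = preg_match('!^https?://[a-z0-9_./?=&-]+$!i', $text);

echo $withGroup, "\n";    // 1
echo $withoutGroup, "\n"; // 0
```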

1 Comment

The m pattern modifier is useless when there is no ^ or $ (anchor metacharacters) in the pattern.
