5

I want to find URLs in a string that are not already inside a link.

My Current Code:

$text = "http://www.google.com is a great website. Visit <a href='http://www.google.com' >http://google.com</a>";
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
if (preg_match($reg_exUrl, $text, $url)) {
    $links = preg_replace($reg_exUrl, '<a href="'.$url[0].'" rel="nofollow">'.$url[0].'</a>', $_page['content']['external_links']);
}

The problem is that it wraps every URL again, including the ones already inside a link. This is what it returns:

<a href="http://www.google.com" rel="nofollow">http://www.google.com</a> is a great website. Visit <a href='<a href="http://www.google.com" rel="nofollow">http://www.google.com</a>' ><a href="http://www.google.com" rel="nofollow">http://www.google.com</a></a> 
3
  • If you only want to process links that are not already inside an <a> tag, you could probably edit your regex to check that the link isn't surrounded by single or double quotes. Commented May 13, 2015 at 10:55
  • It isn't that simple, because a URL can appear in several places in an HTML document (in an href attribute, in a src attribute, in a DTD, or in JavaScript code). A better approach is to extract the text nodes of your document that are not children of a link node (or a script/style node) and make the replacement there. Commented May 13, 2015 at 11:33
  • Use preg_replace_callback Commented Dec 23, 2020 at 14:19

2 Answers

0

I made the assumption here that a URL you want to match would be followed by either whitespace, punctuation, or be at the end of a line. Of course if there is something like <a href="site">http://url </a> then it won't work quite as well. If you expect to encounter that, then first replace all \s+</a> with </a>
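
A minimal sketch of that cleanup step, assuming the only issue is whitespace sitting directly before a closing </a>. The sample string and variable names are made up for illustration:

```php
<?php
// Hypothetical input: the link text ends with a space before </a>.
$html = '<a href="site">http://example.com </a> and more text';

// Drop any run of whitespace directly before a closing </a> tag.
// "~" is the delimiter so "/" needs no escaping; "i" also covers </A>.
$cleaned = preg_replace('~\s+</a>~i', '</a>', $html);

echo $cleaned, PHP_EOL;
```

After this step the URL inside the link is followed by "<" rather than whitespace, so a pattern that requires whitespace, punctuation, or end of line after the URL no longer picks it up.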

$text = "http://www.google.com is a great website. Visit <a href='http://www.google.com' >http://google.com</a>, and so is ftp://ftp.theweb.com";
$reg_exUrl = "/((http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3})([\s.,;\?\!]|$)/";
if (preg_match_all($reg_exUrl, $text, $matches)) {
    foreach ($matches[0] as $i => $match) {
        $text = str_replace(
            $match,
            '<a href="'.$matches[1][$i].'" rel="nofollow">'.$matches[1][$i].'</a>'.$matches[3][$i],
            $text
        );
    }
}

Output:

http://www.google.com is a great website. Visit http://www.google.com' >http://google.com, and so is ftp://ftp.theweb.com


3 Comments

This doesn't add nofollow to the URL that is already an HTML link.
@GillesMisslin I suppose I could take a new look at this. What text did you try it with?
This is the wrong way to go about it. preg_replace_callback is the right function to use here.
0

The proper way to do it is to parse your HTML with DOMDocument and then iterate recursively over the child nodes. Skip nodes whose tagName is "a". For the remaining text nodes, which are not part of an <a> node, replace the text node with an <a> node and put the text node's value in it.

Finally, use saveHTML to get the HTML string back.

Manual page on loading HTML: https://www.php.net/manual/en/domdocument.loadhtml.php

Stack Overflow ticket about iterating over childNodes: Loop over DOMDocument
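
The DOMDocument approach could be sketched roughly as follows. This is an illustration, not a definitive implementation: the function name linkifyHtml, the wrapper id "linkify-root", and the simple https?://\S+ URL pattern are all assumptions. It uses an XPath query instead of a hand-written recursive loop to pick the text nodes. Input is assumed to be ASCII; loadHTML may need an explicit encoding hint for other text.

```php
<?php
// Sketch only: linkifyHtml() and the "linkify-root" wrapper id are hypothetical names.
function linkifyHtml(string $html): string
{
    $previous = libxml_use_internal_errors(true); // collect parser warnings instead of emitting them
    $doc = new DOMDocument();
    // Wrap the fragment so its nodes can be found and serialized again afterwards.
    $doc->loadHTML('<div id="linkify-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="linkify-root"]')->item(0);

    // Text nodes that are not inside <a>, <script> or <style>.
    $textNodes = $xpath->query(
        './/text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]',
        $root
    );

    foreach (iterator_to_array($textNodes) as $textNode) {
        // With PREG_SPLIT_DELIM_CAPTURE the URLs end up at the odd indexes.
        $parts = preg_split('~(https?://\S+)~i', $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                // A URL: wrap it in a new <a> element.
                $link = $doc->createElement('a');
                $link->setAttribute('href', $part);
                $link->setAttribute('rel', 'nofollow');
                $link->appendChild($doc->createTextNode($part));
                $fragment->appendChild($link);
            } else {
                // Plain text between URLs stays a text node.
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize only the wrapper's children, not the <html>/<body> added by the parser.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$input = "http://www.google.com is a great website. Visit <a href='http://www.google.com' >http://google.com</a>";
echo linkifyHtml($input), PHP_EOL;
```

Because the replacement happens on text nodes only, URLs in href or src attributes are never touched. The serializer may normalize the original markup, for example the quoting of attributes.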


Here is another quick version for your specific case:

<?php
$input = "http://www.google.com is a great website. Visit <a href='http://www.google.com' >http://google.com</a>";
$output = preg_replace_callback("/(^|[^\"'>])(https?:\/\/[^ \n\r]+)/s", function ($in) {
    $url = $in[2];
    // $in[1] is the character matched before the URL; re-emit it so it is not lost.
    return $in[1] . "<a rel=\"nofollow\" href=\"$url\">$url</a>";
}, $input);

As you can see, this uses a regex trick to look for http/https links that do not appear to be part of a tag. Note that it will not work for cases like <b>https://google.com</b>. If you need a more advanced solution, either go with DOMDocument or check the text before each occurrence of the https? marker.
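
To make the limitation concrete, here is a small sketch built on the same idea. The $linkify helper and the sample strings are assumptions for illustration, and the callback writes the captured preceding character back into the result:

```php
<?php
// A URL counts only if it starts the string or the character before it
// is not a quote or ">".
$pattern = '~(^|[^"\'>])(https?://[^ \n\r]+)~';

$linkify = function (string $text) use ($pattern): string {
    return preg_replace_callback($pattern, function (array $m): string {
        // $m[1] is the character before the URL; emit it again so it is kept.
        return $m[1] . '<a rel="nofollow" href="' . $m[2] . '">' . $m[2] . '</a>';
    }, $text);
};

$plain = $linkify('See http://example.com today');
$bold  = $linkify('<b>https://google.com</b>');

echo $plain, PHP_EOL; // the URL is wrapped in a link
echo $bold, PHP_EOL;  // unchanged: the URL directly follows ">"
```

The second call returns its input untouched, because the only thing the pattern knows about context is the single character in front of the URL.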


And here is an even better version, as suggested by @mickmackusa:

<?php
$input = "http://www.google.com is a great website. Visit <a href='http://www.google.com' >http://google.com</a>";
$output = preg_replace_callback("~<a[^>]+>.*?</a>(*SKIP)(*FAIL)|(https?:\/\/\S+)~i", function ($match) {
    return sprintf("<a rel=\"nofollow\" href=\"%s\">%s</a>", $match[1], $match[1]);
}, $input);

It uses the more advanced (*SKIP)(*FAIL) regular expression syntax and a slightly more elegant sprintf call to build the replacement value.
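
To see what (*SKIP)(*FAIL) does, it can help to list the matches without replacing anything. A minimal sketch using the question's sample string (the variable names are arbitrary):

```php
<?php
$input = "http://www.google.com is a great website. Visit "
       . "<a href='http://www.google.com' >http://google.com</a>";

// Left alternative: consume a whole <a ...>...</a> element, then (*SKIP)(*FAIL)
// rejects that match and resumes the search after the consumed text.
// Right alternative: a bare URL anywhere else.
$pattern = '~<a[^>]+>.*?</a>(*SKIP)(*FAIL)|https?://\S+~i';

preg_match_all($pattern, $input, $matches);
print_r($matches[0]);
```

Only the first, unlinked URL is reported. Both URLs inside the existing anchor are consumed by the left alternative and skipped.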

7 Comments

Using a pattern delimiter other than / will prevent needing to escape literal forward slashes in your pattern. The s pattern modifier is pointless because you don't have . anywhere in your pattern. [^ \n\r] can be replaced with \S (and will include tabs in the excluded character list). Using sprintf() will help to clean up your return data and allow you to use more modern arrow function syntax. Case-insensitivity might be a reasonable feature for general use. I would probably use (*SKIP)(*FAIL) to explicitly exclude existing tags.
sounds good, go ahead add another answer with your suggested improvements!
I was going to, then realized that I already did. stackoverflow.com/a/53346522/2943403
(*SKIP)(*FAIL) is a really cool idea. I noticed that you used it on single-line chunks with one occurrence. Does it also work properly with multiline input and multiple occurrences?
It sure does. 3v4l.org/DQK75
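
A sketch that combines the suggestions from these comments (non-slash delimiter, \S, sprintf, an arrow function, case-insensitivity) and runs them on multiline input with several URLs. The sample text is invented, and arrow functions require PHP 7.4 or later:

```php
<?php
$input = <<<HTML
First line: http://example.com/one
Second line: <a href="http://example.com/two">http://example.com/two</a>
Third line: https://example.org/three and HTTP://EXAMPLE.NET/four
HTML;

$output = preg_replace_callback(
    // Existing <a> elements are skipped; every other URL is matched.
    '~<a[^>]+>.*?</a>(*SKIP)(*FAIL)|https?://\S+~i',
    // %1\$s reuses the first argument for both the href and the link text.
    fn(array $m): string => sprintf('<a rel="nofollow" href="%1$s">%1$s</a>', $m[0]),
    $input
);

echo $output, PHP_EOL;
```

One caveat: without the s modifier, .*? does not cross line breaks, so an <a> element whose text spans several lines would not be skipped by this pattern.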