我想创建一个页面,其中所有驻留在我网站上的图像都列有标题和替代表示.
我已经给我写了一个程序来查找和加载所有HTML文件,但现在我被困在如何提取src
,title
并alt
从这个HTML:
src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />
我想这应该用一些正则表达式完成,但由于标签的顺序可能会有所不同,而且我需要所有这些,我真的不知道如何以优雅的方式解析它(我可以通过char方式,但这很痛苦).
$url="http://example.com"; $html = file_get_contents($url); $doc = new DOMDocument(); @$doc->loadHTML($html); $tags = $doc->getElementsByTagName('img'); foreach ($tags as $tag) { echo $tag->getAttribute('src'); }
使用regexp来解决这类问题是一个坏主意,可能导致代码难以维护和不可靠.更好地使用HTML解析器.
在这种情况下,最好将流程分为两部分:
得到所有的img标签
提取元数据
我将假设您的doc不是xHTML严格的,因此您不能使用XML解析器.EG与此网页源代码:
/* preg_match_all match the regexp in all the $html string and output everything as an array in $result. "i" option is used to make it case insensitive */ preg_match_all('/]+>/i',$html, $result); print_r($result); Array ( [0] => Array ( [0] => [1] => [2] => [3] => [...] ) )
然后我们用循环获取所有img标记属性:
$img = array(); foreach( $result as $img_tag) { preg_match_all('/(alt|title|src)=("[^"]*")/i',$img_tag, $img[$img_tag]); } print_r($img); Array ( [] => Array ( [0] => Array ( [0] => src="/Content/Img/stackoverflow-logo-250.png" [1] => alt="logo link to homepage" ) [1] => Array ( [0] => src [1] => alt ) [2] => Array ( [0] => "/Content/Img/stackoverflow-logo-250.png" [1] => "logo link to homepage" ) ) [] => Array ( [0] => Array ( [0] => src="/content/img/vote-arrow-up.png" [1] => alt="vote up" [2] => title="This was helpful (click again to undo)" ) [1] => Array ( [0] => src [1] => alt [2] => title ) [2] => Array ( [0] => "/content/img/vote-arrow-up.png" [1] => "vote up" [2] => "This was helpful (click again to undo)" ) ) [] => Array ( [0] => Array ( [0] => src="/content/img/vote-arrow-down.png" [1] => alt="vote down" [2] => title="This was not helpful (click again to undo)" ) [1] => Array ( [0] => src [1] => alt [2] => title ) [2] => Array ( [0] => "/content/img/vote-arrow-down.png" [1] => "vote down" [2] => "This was not helpful (click again to undo)" ) ) [