Not Substring Regular Expressions

I’m trying to devise a regular expression that will find all or most img tags that don’t have alt attributes. <img[^>]*/> will find all the img elements (or at least most of them). And I can easily find those that do contain an alt attribute. However, I’m stumped when it comes to finding those that do not contain the substring alt. Any ideas?

Note that this expression does not have to be perfect. I can live with some false positives and negatives. This is just meant to do a quick first pass over documents that will later be validated so any cases I miss will later be found, and nothing will be changed or replaced without human inspection first. This is a pure search, not a search and replace.

It feels like I need some sort of not operator in regular expressions. What am I missing?
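One way to approximate a "not" without any special operator is to work in two passes: collect every img tag with the expression above, then drop the ones that also pass an alt test. The following is a minimal sketch assuming Python's `re` module; the helper name `imgs_missing_alt` and the `\salt\s*=` test are illustrative choices, not anything from the post.

```python
import re

# First pass: the expression from the post (XHTML-style self-closing tags).
IMG_TAG = re.compile(r"<img[^>]*/>", re.IGNORECASE)
# Second pass: a rough test for an alt attribute inside a single tag.
HAS_ALT = re.compile(r"\salt\s*=", re.IGNORECASE)

def imgs_missing_alt(html):
    """Return img tags that do not appear to carry an alt attribute."""
    return [tag for tag in IMG_TAG.findall(html) if not HAS_ALT.search(tag)]

sample = '<p><img src="a.png" alt="A"/><img src="b.png"/></p>'
print(imgs_missing_alt(sample))
```

Because the negation happens in ordinary code, each expression stays simple, at the cost of needing a scripting language rather than a plain editor search.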

7 Responses to “Not Substring Regular Expressions”

  1. Xan Gregg Says:

    This works pretty well for me (in BBEdit), ignoring case:

    <img([^>a]|a[^l]|al[^t]|".*"|'.*')*?>

    It allows
    any character not a ‘>’ or ‘a’
    any ‘a’ not followed by an ‘l’
    any ‘al’ not followed by a ‘t’
    anything in quotes

    So it disallows ‘alt’ outside of quotes. It will give false results for any other tags containing ‘alt’.

    You might also want to search for: alt="".
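To see how this alternation behaves, here is a small sketch that runs the same expression under Python's `re` module (an assumption; the comment used BBEdit, whose engine may differ in details). The helper name `flagged` and the three sample tags are made up for illustration.

```python
import re

# The expression from the comment, with straight quotes, matched case-insensitively.
NO_ALT = re.compile(r"""<img([^>a]|a[^l]|al[^t]|".*"|'.*')*?>""", re.IGNORECASE)

def flagged(tag):
    """True if the expression reports the tag as lacking alt."""
    return NO_ALT.search(tag) is not None

for tag in ('<img src="a.png">',
            '<img alt="x" src="a.png">',
            '<img src="a.png" alt="x">'):
    print(flagged(tag), tag)
```

In this sketch the first tag is reported and the second is not, as intended. The third is reported as well, because the `".*"` alternative can stretch from the opening quote of one attribute to the closing quote of a later one and swallow the alt in between. That is the kind of false positive the original post says it can tolerate.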

  2. Minute Bol Says:

    yeah that’s a good one. here’s what i came up with:
    (?!<img[^>]*\salt[^>]*>$)<img[^>]*>
    seems to work.

  3. Minute Bol Says:

    hey rusty – no ‘preview’ button? let’s try that again:
    (?!<img[^>]*\salt[^=>]*=[^>]*>)<img[^>]*>
    matches the thing and then makes sure there was no ‘ alt=’ in the match, or something to that effect
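A quick way to try this expression is to run it over a few sample tags. The sketch below assumes Python's `re` module, which accepts the same `(?!...)` negative look-ahead syntax; the function name `find_missing_alt` and the sample markup are invented for illustration.

```python
import re

# Negative look-ahead: refuse to start a match where an img tag with " alt...=" begins.
MISSING_ALT = re.compile(r"(?!<img[^>]*\salt[^=>]*=[^>]*>)<img[^>]*>", re.IGNORECASE)

def find_missing_alt(html):
    """Return the img tags that the expression reports as lacking alt."""
    return MISSING_ALT.findall(html)

html = """
<img src="one.png">
<img src="two.png" alt="Two">
<img alt="" src="three.png">
<img src="four.png" />
"""
print(find_missing_alt(html))
```

Since `[^>]*` cannot cross a `>`, the look-ahead only inspects the tag that starts at the current position, so an alt in a later tag does not hide an earlier tag that lacks one.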

  4. Ed Davies Says:

    What am I missing? XPath?

    Sorry, couldn’t help it.
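For comparison, the parser-based idea behind an XPath query such as `//img[not(@alt)]` can be sketched without regular expressions at all. The example below is not XPath itself; it swaps in Python's standard `html.parser` module, and the class name `ImgWithoutAlt` and function name `imgs_without_alt` are hypothetical.

```python
from html.parser import HTMLParser

class ImgWithoutAlt(HTMLParser):
    """Collect the attributes of every img element that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            self.missing.append(attributes)

def imgs_without_alt(html):
    parser = ImgWithoutAlt()
    parser.feed(html)
    parser.close()
    return parser.missing

print(imgs_without_alt('<p><img src="a.png" alt="x"><img src="b.png" /></p>'))
```

A parser sees attributes as attributes, so an "alt" that merely appears inside a file name or another attribute value is not mistaken for the real thing.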

  5. Peter Says:

    Something like this should be ok

    while( $html =~/
    (
    <img #an img tag
    (?! #not followed by
    [^>]* #0 or more non >
    \s+alt\s*= #and an alt attribute
    )
    [^>]* #instead followed by 0 or more non >
    > #and a tag close
    )/xg)
    {
    print $1."\n\n";
    }

  6. Peter Says:

    Something like this should be ok

    (let’s see if a pre tag can help me)

    while( $html =~/
    (
    <img #an img tag
    (?! #not followed by
    [^>]* #0 or more non >
    \s+alt\s*= #and an alt attribute
    )
    [^>]* #instead followed by 0 or more non >
    > #and a tag close
    )/xg)
    {
    print $1."\n\n";
    }
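The same loop can be sketched in Python, whose `re.VERBOSE` flag plays the role of Perl's `/x` modifier. This is only an approximation of the Perl above: it assumes the expression opens with `<img` followed by a negative look-ahead, and the function name `tags_without_alt` is hypothetical.

```python
import re

PATTERN = re.compile(r"""
    (
      <img            # an img tag
      (?!             # not followed by
        [^>]*         #   0 or more non >
        \s+alt\s*=    #   and an alt attribute
      )
      [^>]*           # instead followed by 0 or more non >
      >               # and a tag close
    )
    """, re.VERBOSE | re.IGNORECASE)

def tags_without_alt(html):
    """Return each captured img tag, like printing $1 in the Perl loop."""
    return [m.group(1) for m in PATTERN.finditer(html)]

for tag in tags_without_alt('<img src="a.png"><img src="b.png" alt="b">'):
    print(tag + "\n")
```

Here the look-ahead sits after `<img` rather than in front of the whole expression, so it only has to scan the remainder of the current tag for an alt attribute.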

  7. Neil Greenwood Says:

    I think Minute Bol’s second suggestion above uses Java RegEx’s negative look-behind assertion. My copy of “Java in a Nutshell (4th ed)” says that the pattern “must be a…fixed number of characters”, so I don’t think his suggestion will work exactly. There’s no such restriction on the negative look-ahead assertion, but I can’t think how to re-write the regex that way.

    HTH. Hwyl,
    Neil.
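The difference between the two kinds of assertion is easy to check by experiment. The sketch below uses Python's `re` module as a stand-in for Java (an assumption; the two engines differ in exactly what they allow inside a look-behind), and the helper name `compiles` is made up. Note that the `(?!...)` syntax in comment 3 is the negative look-ahead form; the look-behind form is written `(?<!...)`.

```python
import re

# The variable-width body used inside the assertion in comment 3.
BODY = r"<img[^>]*\salt[^=>]*=[^>]*>"

def compiles(pattern):
    """Report whether the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Negative look-ahead with a variable-width body: accepted.
print(compiles(r"(?!" + BODY + r")<img[^>]*>"))
# Negative look-behind with the same body: rejected, since Python requires fixed width.
print(compiles(r"(?<!" + BODY + r")<img[^>]*>"))
```

So the fixed-width restriction bites only if the assertion is written as a look-behind; the look-ahead version compiles and runs as is.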
