11-22-2003, 06:37 AM | #1 (permalink) |
Darth Papa
Location: Yonder
|
So ya want an intermediate Perl lesson do ya?
Greetings, camel-heads!
This thread is for people who have the basics of Perl under their belt. I'm going to assume that you've got an understanding of variable types, control structures, built-in function calls, and subroutines. If not, then THIS THREAD is the one for you.
Assuming you've got all that under your belt, here's what I intend to cover in this thread (in no particular order):
- Regular Expressions
- Variable References (unless Juan does that in the other thread)
- Nested data structures
- Applications for Perl (CGI, Data Munging, Database Interfaces)
- Using Object-Oriented Perl Modules
We'll save the creation of OO modules for the "so you want an ADVANCED perl lesson do ya?" thread.
Juanvaldez and I may be trading some of these topics off. Any other JAPHs who want to contribute, just PM him or me and say what topic you're interested in taking.
I intend to start with Regular Expressions, probably a couple-three little lessons on them.
Last edited by ratbastid; 11-22-2003 at 07:10 AM.. |
11-22-2003, 06:59 AM | #2 (permalink) |
Darth Papa
Location: Yonder
|
Regular Expressions part I
A Regular Expression (hereafter "regex") is a pattern of characters defined by a specific format or "sub-language". The regex sub-language is simultaneously extremely terse and extremely expressive.
Regular expressions are a concept that's not unique to Perl, by the way. Perl supports a fairly broad extension of standard regex syntax. In this lesson I'll be starting with the basics and going to advanced topics in later lessons. I don't promise I'll stay away from Perl-specific stuff here.
To the uninitiated (and even the initiated!), they can be completely mind-boggling. I've got employees who have been completely proficient in Perl for years who still come to me with their eyes crossed about some regular expression or other. I believe it's the regex sub-language that gives Perl the reputation of being about as readable as modem line noise.
The point of a regular expression is to test whether the pattern described in the expression is found (or "matched") in a given string. I'll say that another way. You hand the regex engine a string and a pattern, and you ask, "Is this pattern found in that string?"
The basic structures of a regex are the "=~" and "m//" operators. The "=~" is an equals-like operator that tells the Perl engine to set up for a match. The two slashes in "m//" are delimiters on the expression itself--in fact, the "m" is optional in real live Perl scripts when the delimiters are slashes. A regular expression match returns true or false--either it matched the given string, or it didn't.
Here's an example of a regular expression in action:
Code:
my $matchtext = 'The rain in Spain falls mainly on the plain';
if ($matchtext =~ /ain/) {
    print "Found a match!\n"; # we would expect to see this
}
Notice that I left the "m" off the m// operator. Totally legal; in fact, I recommend it for readability purposes. I'll be doing that in all my examples.
Also notice what's between the slashes: "ain". That means, match the LITERAL STRING "ain". Which the regex engine did ONCE AND ONLY ONCE: it found it in the word "rain", the first place "ain" shows up in "The rain in Spain falls mainly on the plain", and stopped there. Regular expressions match left-to-right. If the expression had been /lain/, it would have found it only in the word "plain" at the end of the string.
These are the basics of how to spell a regular expression. Next lesson we'll get into wildcard globbing, alternation, and backreferences.
Last edited by ratbastid; 11-22-2003 at 07:07 AM.. |
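If you want to see the leftmost-match behavior for yourself, here's a minimal sketch (not part of the lesson above) that asks Perl where the match started. It leans on the built-in @- array, whose first element holds the start offset of the last successful match. The helper name match_offset is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the offset where the pattern first matches,
# or undef if it doesn't match at all.
sub match_offset {
    my ($text, $regex) = @_;
    return $text =~ $regex ? $-[0] : undef;
}

my $matchtext = 'The rain in Spain falls mainly on the plain';

my $ain_at  = match_offset($matchtext, qr/ain/);   # lands inside "rain"
my $lain_at = match_offset($matchtext, qr/lain/);  # lands inside "plain"

print "/ain/ first matched at offset $ain_at\n";
print "/lain/ first matched at offset $lain_at\n";
```

The offset for /ain/ points into "rain" even though "Spain", "mainly" and "plain" would all qualify too; the engine is done as soon as the first match succeeds.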
11-22-2003, 08:30 AM | #3 (permalink) |
Darth Papa
Location: Yonder
|
Regular Expressions part II
So our example from the last lesson wasn't all that interesting. How hard is it to find a static string in another static string? Booooring!
I'm going to introduce a template now that I'll use to demonstrate regular expressions. It looks like this:
Code:
my @teststrings = ('string1', 'string2');
foreach my $string (@teststrings) {
    if ($string =~ /regex/) {
        print "$string matched!\n";
    } else { # optional else clause
        print "$string didn't match!\n";
    }
}
Oooookay! So what if you have a list of names, and you want to find everyone in that list with your same first name? Just for fun, we'll say your name is Stu. Your list looks like this:
Dave White
Adrian Wapcaplet
Stu Peterson
Sally Smith
Michael Jones
Stu Watkins
Micheline Mann
Stu Pott
Here's the code I'd write to pick out the "Stu"s:
Code:
my @names = ('Dave White','Adrian Wapcaplet','Stu Peterson',
             'Sally Smith','Michael Jones','Stu Watkins','Micheline Mann','Stu Pott');
foreach my $name (@names) {
    if ($name =~ /Stu .*/) {
        print "$name matched!\n";
    }
}
<pre>Stu Peterson matched!
Stu Watkins matched!
Stu Pott matched!</pre>
So let's look closer at the regex I wrote there. What I said was "/Stu .*/". There are two special characters in here. The dot (spelled ".") means "any character". (Strictly speaking, it means "any character except a newline, unless the regex is modified with /s", but for now let's ignore that.) So . means "any character". And * means "zero or more of the preceding thing". So together they're "zero or more of any character".
So all strung together, this regular expression says, in the given string, match the literal characters S, t, u, and a space, followed by zero or more of any character. Hence the Stus show up as matched.
Now... What if I want to make sure I only get Stus with last names? ANYTHING that contains "Stu " (notice the space) will be matched by that regex, even if there's nothing following that. We'd match zero characters at the end of "Stu ", and the regex would be perfectly happy with that.
So fine, instead of "*" (zero or more), we want to use "+", meaning ONE or more. The string "Stu " is NOT matched by the regular expression /Stu .+/. That'll match Stu Meatt and Stu Pidd, but it won't match Stuart Little or plain old Stu.
Be sure you're clear on the above before reading further in this lesson.
Okay, next thing. Let's say we want to grab the last names off of there, so we've got those in a variable to use all by themselves. Maybe later we'll want to alphabetize the Stus by last name or something. Here's how we do that. Surround any portion of a regular expression with (parentheses), and you'll capture its matched value in a magic variable called $1. The second set of parens will be stored in $2, the third in $3, etc. So this code:
Code:
my @names = ('Dave White','Adrian Wapcaplet','Stu Peterson',
             'Sally Smith','Michael Jones','Stu Watkins','Micheline Mann','Stu Pott');
foreach my $name (@names) {
    if ($name =~ /Stu (.+)/) {
        print "$name has the last name $1!\n";
    }
}
<pre>Stu Peterson has the last name Peterson!
Stu Watkins has the last name Watkins!
Stu Pott has the last name Pott!</pre>
Just be sure to assign $1 into another variable or push it onto an array or something--the next successful match will overwrite it.
We covered a lot of ground so far, so I'm going to leave some time for questions before my next tutorial. Fire away, people!
Last edited by ratbastid; 11-22-2003 at 08:35 AM.. |
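For the "alphabetize the Stus by last name" idea, here's one way it could go. This is a sketch rather than anything from the lesson itself: it copies $1 into an array right after each match (so a later match can't clobber it) and then sorts. The variable names @last_names and @sorted are made up for illustration.

```perl
use strict;
use warnings;

my @names = ('Dave White', 'Adrian Wapcaplet', 'Stu Peterson',
             'Sally Smith', 'Michael Jones', 'Stu Watkins',
             'Micheline Mann', 'Stu Pott');

my @last_names;
foreach my $name (@names) {
    if ($name =~ /Stu (.+)/) {
        # Save the capture immediately; $1 only reflects the latest successful match.
        push @last_names, $1;
    }
}

# Plain sort compares as strings, which is what we want for names.
my @sorted = sort @last_names;
print "$_\n" for @sorted;
```

Only read $1 inside the if block, as above. If the match fails, $1 still holds whatever the previous successful match left in it.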
11-22-2003, 04:12 PM | #5 (permalink) |
Psycho
Location: In transit
|
Also, here's a little tidbit that makes practicing or refining regular expressions very easy using Perl's command line options.
From your command line type:
# perl -ne 'print if /foo/'
Then you can see if the regex works the way you expect by typing in possible matches, and it will echo each one back to you. If a line doesn't match, it won't echo it. Then just ctrl-c out of it and modify the regex as necessary. For refining my substitutions I use:
# perl -pi -e 's/foo/oof/'
Never forget the power of using Perl as a command line tool as well as a programming language! I am amazed at its flexibility all the damn time. For example, if you need to do the same substitution on a few hundred files under a directory you can try something like this:
# perl -pi -e 's/foo/oof/g' *.txt
Kick ass! Run 'perldoc perlrun' for even more possibilities. If someone had shown me those tricks when I was struggling with regexes it would have made my life much much easier.
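In case the -n and -p switches look like magic, here's a rough sketch of what they amount to, written as a plain script. Per perlrun, -n wraps your code in a loop over the input lines, and -p does the same but prints every line after your code has run. The @lines array is a made-up stand-in for what you'd type at the terminal, and the results are collected in arrays purely so they're easy to inspect.

```perl
use strict;
use warnings;

# Stand-in for lines you would otherwise type in (an assumption for this sketch).
my @lines = ("foo fighters\n", "bar none\n", "tofoo\n");

# Roughly what -n does with 'print if /foo/':
# loop over the input and run the code once per line.
my @printed;
for (@lines) {
    push @printed, $_ if /foo/;
}

# Roughly what -p does with 's/foo/oof/':
# same loop, but every line comes out the other end, changed or not.
my @rewritten;
for my $line (@lines) {
    (my $copy = $line) =~ s/foo/oof/;
    push @rewritten, $copy;
}

print "Lines that matched:\n", @printed;
print "Lines after substitution:\n", @rewritten;
```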
__________________
Remember, wherever you go... there you are. |
11-25-2003, 07:22 AM | #8 (permalink) |
Darth Papa
Location: Yonder
|
Regex Metacharacters
Just a quickie today that will set us up for additional Regex Twitchiness in just a bit.
There are certain characters that have special meaning in the syntax of a regular expression. These are: <pre>{}[]()^$.|*+?\</pre>
I'm not going to go over the meaning of each of those now, but I will talk about escaping those within a regular expression. Consider: <pre>"2+2=5" =~ /2+2/; </pre>
This expression doesn't match. I said above that "+" means "one or more of". So it's looking for one or more 2's followed by a 2. No such match in the given string.
So what do you do when you mean a literal + ? You escape that plus-sign with a backslash: <pre>"2+2=5" =~ /2\+2/;</pre>
THAT expression matches.
<pre>"The interval is [0,1)." =~ /[0,1)./;</pre>
That's a syntax error. ("unmatched [ in regex", to be specific)
<pre>"The interval is [0,1)." =~ /\[0,1\)\./;</pre>
That one's perfectly legal AND a match.
Sometimes you have \ or / characters in the string you're trying to match, and you have to escape those too. This can lead to LTS (Leaning Toothpick Syndrome), which is one of the things that can make regexes hard on the eyes.
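Here's a runnable sketch of the escaping examples, in case you'd like to poke at them. The variable names are made up for illustration. The last two lines go a step beyond the post above: Perl's built-in quotemeta function adds the backslashes for you, which is handy when the text to match literally comes from somewhere else.

```perl
use strict;
use warnings;

my $sum = '2+2=5';

# Unescaped: "+" means "one or more 2's", so this wants "22" and fails.
my $unescaped_hit = $sum =~ /2+2/ ? 1 : 0;

# Escaped: "\+" is a literal plus sign, so this matches.
my $escaped_hit = $sum =~ /2\+2/ ? 1 : 0;

my $interval = 'The interval is [0,1).';

# Every metacharacter in the pattern gets its own backslash.
my $interval_hit = $interval =~ /\[0,1\)\./ ? 1 : 0;

# quotemeta does the backslashing for us (not covered in the post above).
my $quoted     = quotemeta('2+2');
my $quoted_hit = $sum =~ /$quoted/ ? 1 : 0;

print "unescaped: $unescaped_hit, escaped: $escaped_hit\n";
print "interval: $interval_hit, quotemeta gave '$quoted': $quoted_hit\n";
```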
11-25-2003, 12:17 PM | #9 (permalink) |
Banned
Location: 'bout 2 feet from my iMac
|
aah, so THAT'S what it's called... I've been having some LTS run-ins lately w/ Lex scripts... here's my favorite so far:
(\{)([^\}])*(\})
(that's supposed to match a {, followed by ANYTHING that isn't a }, followed by a }. It's to detect Pascal comments.) the parens may be extraneous, i dunno, but damn it, I'm a coder, extra parens never hurt no one! keep it goin' RB, how do we deal w/ LTS? are there ways around it? |
11-25-2003, 12:28 PM | #10 (permalink) |
Darth Papa
Location: Yonder
|
LTS Avoidance Mechanism #1: Alternate Delimiters
Why yes, my dear cheerios, there IS an easy way around LTS (Leaning Toothpick Syndrome).
Remember our basic regex operator, the "m//" operator? And how we can safely drop the "m" from it? Well, here's the magic trick: just about ANY punctuation character after a lone "m", Perl will take to be the delimiter for a regular expression. So the first line I wrote above:<pre>"2+2=5" =~ /2+2/;</pre>
could have been written:<pre>"2+2=5" =~ m|2+2|;</pre>
or:<pre>"2+2=5" =~ m"2+2";</pre>
You get the picture.
Here's the reason you'd do that. You only have to escape the forward slash "/" character when you're using that as the delimiter. So if I wanted to know if a filename was in a particular directory, I could match like this:<pre>$filename =~ /\/usr\/local\/bin/;</pre>
Or I could just swap delimiters and not have to escape those slashes:<pre>$filename =~ m*/usr/local/bin*;</pre>
Ya dig? Just use a delimiter that won't show up in your pattern, and your escaping troubles get much easier to deal with.
This is helpful when you're trying to split pipe-delimited text data. It makes my eyes cross to say:<pre>my @values = split(/\|/, $line);</pre>
That "/\|/" construction is just painful. m'\|' is much better to me. (The pipe still needs its backslash either way, since | is a metacharacter; the win is just getting the slashes out of the picture.) |
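A quick sketch to show the alternate delimiters really are interchangeable. The filename and the pipe-delimited line are invented sample data, and m{...} is just one more delimiter choice that Perl accepts alongside the ones shown above.

```perl
use strict;
use warnings;

# Made-up sample data for illustration.
my $filename = '/usr/local/bin/perl';
my $line     = 'alice|42|yonder';

# Same pattern three ways: slashes (with toothpicks), stars, and braces.
my $with_slashes = $filename =~ /\/usr\/local\/bin/ ? 1 : 0;
my $with_stars   = $filename =~ m*/usr/local/bin*   ? 1 : 0;
my $with_braces  = $filename =~ m{/usr/local/bin}   ? 1 : 0;

# Splitting on a literal pipe; the backslash stays because | means alternation.
my @values = split(m'\|', $line);

print "slashes: $with_slashes, stars: $with_stars, braces: $with_braces\n";
print "fields: @values\n";
```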
12-06-2003, 06:20 PM | #12 (permalink) |
Darth Papa
Location: Yonder
|
Greediness and multiples
By default, a regular expression matches "first and most". In other words, regular expressions are greedy. Observe:
Code:
#!/usr/bin/perl
my $string = 'ab abc abcd abcde';
$string =~ /(a.*c)/;
print $1;
That .* (meaning zero or more of any character) will match greedily--it matches the largest string it can, and still have the expression match. Here it grabs "ab abc abcd abc": everything from the first "a" to the very last "c".
Can you control this? Obviously, the potential for runaway matching exists, if you lose your head coding your expression.
Two things to do. One is to use the {x} operator instead of the splat (*) to specify a specific number of matches. Observe:
Code:
#!/usr/bin/perl
my $string = 'ab abc abcd abcde';
$string =~ /(a.{4}c)/;
print $1;
The {4} in that expression says we want exactly 4 of those "any character" characters, so this one captures just "ab abc". You can follow ANY character with a * or a {x}, by the way. The regex /ab{4}c/ will match the string "abbbbc". Say you're searching for porn. Easy. Just run the regex /X{3}/.
Here's the next great thing about the {x} operator--it can take two arguments, a minimum and a maximum. It's spelled {x,y} that way.
Code:
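#!/usr/bin/perl
my $string = 'ab abc abcd abcde';
$string =~ /(a.{6,8}c)/;
print $1;
The {6,8} says "find me between six and eight anythings". Here that captures "ab abc abc".
You can leave the maximum out to say "no limit", too. "{4,}" means "at least four". To say "zero, one or two", spell out the zero: "{0,2}". A bare "{,2}" is only treated as a quantifier by Perl 5.34 and later; older Perls match it as literal text. |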
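Here are the greedy and counted versions side by side, as a sketch you can run. The helper name first_capture is made up for illustration; it just returns $1 from a match, or undef when there's no match.

```perl
use strict;
use warnings;

# Hypothetical helper: run the pattern and hand back the first capture group.
sub first_capture {
    my ($text, $regex) = @_;
    return $text =~ $regex ? $1 : undef;
}

my $string = 'ab abc abcd abcde';

# Greedy: runs from the first "a" all the way to the last "c".
my $greedy = first_capture($string, qr/(a.*c)/);

# Exactly four "anythings" between the a and the c.
my $exact = first_capture($string, qr/(a.{4}c)/);

# Between six and eight "anythings"; still greedy, so it tries eight first.
my $ranged = first_capture($string, qr/(a.{6,8}c)/);

# Zero, one or two b's after the a, spelled with an explicit zero.
my $at_most_two = first_capture('abbbbc', qr/(ab{0,2})/);

print "greedy:      '$greedy'\n";
print "exact:       '$exact'\n";
print "ranged:      '$ranged'\n";
print "at most two: '$at_most_two'\n";
```

Even the counted forms are greedy inside their limits: {6,8} takes eight if it can and only backs off when the rest of the pattern fails to match.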
12-07-2003, 01:06 AM | #13 (permalink) |
Quadrature Amplitude Modulator
Location: Denver
|
I'm a python dude. Used to be a perl dude. If you haven't tried python I recommend it.
I still use perl -pi -e "s...g" <filelist> all the frigging time though. Python has no quick-n-easy analog.
__________________
"There are finer fish in the sea than have ever been caught." -- Irish proverb |