Tilted Forum Project Discussion Community  

11-22-2003, 06:37 AM   #1
ratbastid (Darth Papa)
Location: Yonder
So ya want an intermediate Perl lesson do ya?

Greetings, camel-heads!

This thread is for people who have the basics of Perl under their belt. I'm going to assume that you've got an understanding of variable types, control structures, built-in function calls, and subroutines. If not, then THIS THREAD is the one for you.

Assuming you've got all that under your belt, here's what I intend to cover in this thread (in no particular order):

- Regular Expressions
- Variable References (unless Juan does that in the other thread)
- Nested data structures
- Applications for Perl (CGI, Data Munging, Database Interfaces)
- Using Object-Oriented Perl Modules

We'll save the creation of OO modules for the "so you want an ADVANCED perl lesson do ya?" thread.

juanvaldes and I may be trading some of these topics off. Any other JAPHs who want to contribute, just PM him or me and say what topic you're interested in taking.

I intend to start with Regular Expressions, probably a couple-three little lessons on them.

Last edited by ratbastid; 11-22-2003 at 07:10 AM..

11-22-2003, 06:59 AM   #2
ratbastid (Darth Papa)
Location: Yonder
Regular Expressions part I

A Regular Expression (hereafter "regex") is a pattern of characters defined by a specific format or "sub-language". The regex sub-language is simultaneously extremely terse and extremely expressive.

Regular expressions are a concept that's not unique to Perl, by the way. Perl supports a fairly broad extension of standard regex syntax. In this lesson I'll be starting with the basics and going to advanced topics in later lessons. I don't promise I'll stay away from Perl-specific stuff here.

To the uninitiated (and even the initiated!), they can be completely mind-boggling. I've got employees who have been completely proficient in Perl for years who still come to me with their eyes crossed about some regular expression or other. I believe it's the regex sub-language that gives Perl the reputation of being about as readable as modem line noise.

The point of a regular expression is to test whether the pattern described in the expression is found (or "matched") in a given string. I'll say that another way. You hand the regex engine a string and a pattern, and you ask, "Is this pattern found in that string?"

The basic building blocks of a regex match are the "=~" and "m//" operators. The "=~" is an equals-like operator that tells the Perl engine to set up for a match. The two slashes in "m//" are delimiters on the expression itself--in fact, the "m" is optional in real live Perl scripts. A match returns true or false--either the pattern matched the given string, or it didn't.

Here's an example of a regular expression in action:

Code:
my $matchtext = 'The rain in Spain falls mainly on the plain';

if ($matchtext =~ /ain/) {
  print "Found a match!\n"; #we would expect to see this
}
Couple things to notice here. First, notice the equals-like operator that cues Perl to know we're talking about a regular expression, "=~". The mnemonic I use for that is "approximating". There's a "not found in" operator too, spelled !~, which simply inverts the return of the test.

Second, notice that I left the "m" off the m// operator. Totally legal, in fact I recommend it for readability purposes. I'll be doing that in all my examples.

Third, notice what's between the slashes. "ain". That means, match the LITERAL STRING "ain". Which the regex engine did ONCE AND ONLY ONCE: it found it in "rain", the first place it shows up in "The rain in Spain falls mainly on the plain", and stopped there. Regular expressions match left-to-right. If the expression had been /lain/, it would have found it only in the word "plain" at the end of the string.
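
If you want to check the "first match wins" point yourself, here's a minimal sketch. The variable names are made up for illustration, and it leans on Perl's built-in @- array (not covered in this lesson), whose first element holds the offset where the last successful match started.

```perl
use strict;
use warnings;

my $matchtext = 'The rain in Spain falls mainly on the plain';

# $-[0] is the offset where the last successful match began.
my ($ain_at, $lain_at);
$ain_at  = $-[0] if $matchtext =~ /ain/;    # first "ain" is inside "rain"
$lain_at = $-[0] if $matchtext =~ /lain/;   # only "plain" contains "lain"

# !~ is true when the pattern is NOT found.
my $no_snow = ($matchtext !~ /snow/) ? 1 : 0;

print "ain at $ain_at, lain at $lain_at, no snow: $no_snow\n";
```

Offsets count from zero, so the "ain" in "rain" starts at offset 5 even though "Spain", "mainly", and "plain" would all have matched too.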

These are the basics of how to spell a regular expression. Next lesson we'll get into wildcard globbing, alternation, and backreferences.

Last edited by ratbastid; 11-22-2003 at 07:07 AM..

11-22-2003, 08:30 AM   #3
ratbastid (Darth Papa)
Location: Yonder
Regular Expressions part II

So our example from the last lesson wasn't all that interesting. How hard is it to find a static string in another static string? Booooring!

I'm going to introduce now a syntax that I'll use to demonstrate regular expressions, which looks like this:

Code:
my @teststrings = ('string1', 'string2');
foreach my $string (@teststrings) {
  if ($string =~ /regex/) {
    print "$string matched!\n";
  } else {  # optional else clause
    print "$string didn't match!\n";
  }
}
I'll embed in there comments about which we'd expect to match, given the values of string1, string2, and regex.

Oooookay! So what if you have a list of names, and you want to find everyone in that list with the same first name as you? Just for fun, we'll say your name is Stu.

Your list looks like this:

Dave White
Adrian Wapcaplet
Stu Peterson
Sally Smith
Michael Jones
Stu Watkins
Micheline Mann
Stu Pott

Here's the code I'd write to pick out the "Stu"s:

Code:
my @names = ('Dave White','Adrian Wapcaplet','Stu Peterson',
  'Sally Smith','Michael Jones','Stu Watkins','Micheline Mann','Stu Pott');

foreach my $name (@names) {
  if ($name =~ /Stu .*/) {
    print "$name matched!\n";
  }
}
The output we'd expect from this is:

Stu Peterson matched!
Stu Watkins matched!
Stu Pott matched!

So let's look closer at the regex I wrote there. What I said was "/Stu .*/". There are two special characters in here. The dot (spelled ".") means "any character". (Strictly speaking, it means "any character except a newline, unless the regex is modified with /s", but for now let's ignore that.)

So . means "any character". And * means "zero or more of the preceding thing". So together they're "zero or more of any character". So all strung together, this regular expression says, in the given string, match the literal characters S, t, u, and a space, followed by zero or more of any character. Hence the Stus show up as matched.

Now... What if I want to make sure I only get Stus with last names? ANYTHING that contains "Stu " (notice the space) will be matched by that regex, even if there's nothing following the space. We'd match zero characters at the end of "Stu ", and the regex would be perfectly happy with that.

So fine, instead of "*" (zero or more), we want to use "+", meaning ONE or more. The string "Stu " is NOT matched by the regular expression /Stu .+/. That'll match Stu Meatt and Stu Pidd, but it won't match Stuart Little or plain old Stu.
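
Here's a small sketch of the difference between the two quantifiers. The list of test strings is invented for illustration (note the bare "Stu " with a trailing space), and grep is just a convenient way to collect whatever matches.

```perl
use strict;
use warnings;

# Hypothetical test strings, including a bare "Stu " with a trailing space.
my @candidates = ('Stu Meatt', 'Stu Pidd', 'Stu ', 'Stu', 'Stuart Little');

my @star = grep { /Stu .*/ } @candidates;   # zero or more after "Stu "
my @plus = grep { /Stu .+/ } @candidates;   # at least one after "Stu "

print "star: ", join(', ', map { "'$_'" } @star), "\n";
print "plus: ", join(', ', map { "'$_'" } @plus), "\n";
```

The only string the two patterns disagree on is 'Stu ': the star version accepts it, the plus version doesn't.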

Be sure you're clear on the above before reading further in this lesson.

Okay, next thing. Let's say we want to grab the last names off of there, so we've got those in a variable to use all by themselves. Maybe later we'll want to alphabetize the Stus by last name or something. Here's how we do that.

Surround any portion of a regular expression with (parentheses), and you'll capture its matched value in a magic variable called $1. The second set of parens will be stored in $2, the third in $3, etc.

So this code:
Code:
my @names = ('Dave White','Adrian Wapcaplet','Stu Peterson',
  'Sally Smith','Michael Jones','Stu Watkins','Micheline Mann','Stu Pott');

foreach my $name (@names) {
  if ($name =~ /Stu (.+)/) {
    print "$name has the last name $1!\n";
  }
}
will print out:

Stu Peterson has the last name Peterson!
Stu Watkins has the last name Watkins!
Stu Pott has the last name Pott!

Just be sure to assign $1 into another variable or push it onto an array or something--the next successful match will overwrite it.
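
One way to hang on to the captures, sketched under the assumption that you want the Stus' last names sorted. The shortened name list and the @last_names array are my own illustration. It uses the fact that a match in list context returns its captured groups, so the value lands in a named variable without touching $1 at all.

```perl
use strict;
use warnings;

my @names = ('Dave White', 'Stu Peterson', 'Sally Smith', 'Stu Watkins', 'Stu Pott');

my @last_names;
foreach my $name (@names) {
    # In list context a match returns its captures, so this copies
    # what would have been $1 straight into $last.
    if (my ($last) = $name =~ /Stu (.+)/) {
        push @last_names, $last;
    }
}

my @sorted = sort @last_names;
print join(', ', @sorted), "\n";
```

Saying `my $last = $1;` inside the if block would work just as well; the point is only that the value gets copied before another match comes along.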

We covered a lot of ground so far, so I'm going to leave some time for questions before my next tutorial. Fire away, people!

Last edited by ratbastid; 11-22-2003 at 08:35 AM..

11-22-2003, 10:31 AM   #4
seretogis (Huggles, sir?)
Location: Seattle
Bless you, sir.

11-22-2003, 04:12 PM   #5
sprocket (Psycho)
Location: In transit
Also, here's a little tidbit that makes practicing or refining regular expressions very easy using Perl's command line options.

From your command line type:

# perl -ne 'print if /foo/'

Then you can see if the regex works the way you expect by typing in possible matches and it will echo it back to you. If it doesn't match it won't echo it. Then just ctrl-c out of it and modify the regex as necessary.

For refining my substitutions I use:

# perl -pe 's/foo/oof/'

Never forget the power of using perl as a command line tool as well as a programming language! I am amazed at its flexibility all the damn time. For example if you need to do the same substitution on a few hundred files under a directory you can try something like this:

# perl -pi -e 's/foo/oof/g' *.txt

Kick ass! Run 'perldoc perlrun' for even more possibilities.

If someone had shown me those tricks when I was struggling with regexes it would have made my life much much easier.

11-22-2003, 04:13 PM   #6
juanvaldes (Banned)
Location: shittown, CA
nice tip sprocket

11-23-2003, 08:17 AM   #7
ratbastid (Darth Papa)
Location: Yonder
Quote:
Originally posted by sprocket
# perl -ne 'print if /foo/'
AWESOME! I love Perl!

11-25-2003, 07:22 AM   #8
ratbastid (Darth Papa)
Location: Yonder
Regex Metacharacters

Just a quickie today that will set us up for additional Regex Twitchiness in just a bit.

There are certain characters that have special meaning in the syntax of a regular expression. These are:

{}[]()^$.|*+?\

I'm not going to go over the meaning of each of those now, but I will talk about escaping those within a regular expression. Consider:

"2+2=5" =~ /2+2/;

This expression doesn't match. I said above that "+" means "one or more of". So it's looking for one or more 2's followed by a 2, which is to say at least two 2's in a row. No such match in the given string.

So what do you do when you mean a literal + ? You escape that plus-sign with a backslash:

"2+2=5" =~ /2\+2/;

THAT expression matches.

"The interval is [0,1)." =~ /[0,1)./;

That's a syntax error. ("unmatched [ in regex", to be specific)

"The interval is [0,1)." =~ /\[0,1\)\./;

That one's perfectly legal AND a match.

Sometimes you have \ or / characters in the string you're trying to match, and you have to escape those too. This can lead to LTS (Leaning Toothpick Syndrome), which is one of the things that can make regexes hard on the eyes.
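
The escaping rules above can be tried out with a sketch like this one. The variable names are invented, and the quotemeta() function and \Q...\E escape at the bottom are not part of this lesson: they're standard Perl features I'm adding as one possible way to let Perl do the backslashing for you.

```perl
use strict;
use warnings;

my $sum = '2+2=5';

my $unescaped = ($sum =~ /2+2/)  ? 1 : 0;   # "+" is a quantifier here
my $escaped   = ($sum =~ /2\+2/) ? 1 : 0;   # "\+" is a literal plus sign

# quotemeta() backslashes every non-word character for you; \Q...\E
# does the same thing inside the pattern itself.
my $literal = '[0,1).';
my $pattern = quotemeta($literal);
my $quoted  = ('The interval is [0,1).' =~ /$pattern/)      ? 1 : 0;
my $inline  = ('The interval is [0,1).' =~ /\Q$literal\E/) ? 1 : 0;

print "unescaped=$unescaped escaped=$escaped quoted=$quoted inline=$inline\n";
```

Hand-escaping is fine for a pattern you type yourself; quotemeta earns its keep when the literal text comes from a variable you don't control.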

11-25-2003, 12:17 PM   #9
cheerios (Banned)
Location: 'bout 2 feet from my iMac
aah, so THAT'S what it's called... I've been having some LTS run-ins lately w/ Lex scripts... here's my favorite so far:
(\{)([^\}])*(\})
(that's supposed to match a {, followed by ANYTHING that isn't a }, followed by a }. It's to detect Pascal comments.) The parens may be extraneous, i dunno, but damn it, I'm a coder, extra parens never hurt no one!

keep it goin' RB, how do we deal w/ LTS? are there ways around it?

11-25-2003, 12:28 PM   #10
ratbastid (Darth Papa)
Location: Yonder
LTS Avoidance Mechanism #1: Alternate Delimiters

Why yes, my dear cheerios, there IS an easy way around LTS (Leaning Toothpick Syndrome).

Remember our basic regex operator, the "m//" operator? And how we can safely drop the "m" from it?

Well, here's the magic trick: just about ANY punctuation character after a lone "m", Perl will take to be the delimiter for a regular expression.

So the first line I wrote above:

"2+2=5" =~ /2+2/;

could have been written:

"2+2=5" =~ m|2+2|;

or:

"2+2=5" =~ m"2+2";

You get the picture.

Here's the reason you'd do that. You only have to escape the forward-slash ("/") character when you're using that as the delimiter. So if I wanted to know if a filename was in a particular directory, I could match like this:

$filename =~ /\/usr\/local\/bin/;

Or I could just swap delimiters and not have to escape those slashes:

$filename =~ m*/usr/local/bin*;

Ya dig? Just use a delimiter that won't show up in your pattern, and your escaping troubles get much easier to deal with.

This is helpful when you're trying to split pipe-delimited text data. It makes my eyes cross to say:

my @values = split(/\|/, $line);

That "/\|/" construction is just painful. m'\|' is much better to me.
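
A quick sketch to confirm that swapping delimiters changes the spelling and not the meaning. The path and the pipe-delimited record are hypothetical, and the brace-delimited m{...} form is my own pick of delimiter (Perl also accepts paired brackets), not one used above.

```perl
use strict;
use warnings;

# Hypothetical path and pipe-delimited record for illustration.
my $filename = '/usr/local/bin/perl';
my $line     = 'Stu|Pott|Yonder';

# Same pattern, two spellings: escaped slashes vs. brace delimiters.
my $toothpicks = ($filename =~ /\/usr\/local\/bin/) ? 1 : 0;
my $braces     = ($filename =~ m{/usr/local/bin})   ? 1 : 0;

# The pipe is still a metacharacter, so it stays escaped whatever
# delimiter you pick.
my @values = split m'\|', $line;

print "toothpicks=$toothpicks braces=$braces values=@values\n";
```

Note what the delimiter swap does and doesn't buy you: it saves escaping the delimiter character itself, but real metacharacters like | still need their backslash.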

11-25-2003, 12:30 PM   #11
ratbastid (Darth Papa)
Location: Yonder
Quote:
Originally posted by cheerios

(\{)([^\}])*(\})
Yeah, the parens are extraneous. Unless you're saving values, of course. Though why you'd want to save the opening and closing squiggle I don't quite know.
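
For the curious, here's roughly what that pattern looks like carried over to Perl. This is a sketch under assumptions: the original was written for Lex, not Perl, and the line of Pascal source is made up. It shows the pattern with no parens at all, then with a single pair around the part you might actually want to save.

```perl
use strict;
use warnings;

# Hypothetical line of Pascal source with a brace comment in it.
my $pascal = 'x := 1; { set x } y := 2;';

# No capturing parens: just "{", any run of non-"}" characters, "}".
my $has_comment = ($pascal =~ /\{[^}]*\}/) ? 1 : 0;

# One pair of parens around the middle saves only the comment text.
my ($comment) = $pascal =~ /\{([^}]*)\}/;

print "has_comment=$has_comment comment='$comment'\n";
```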

12-06-2003, 06:20 PM   #12
ratbastid (Darth Papa)
Location: Yonder
Greediness and multiples

By default, a regular expression matches "first and most". In other words, regular expressions are greedy. Observe:

Code:
#!/usr/bin/perl

my $string = 'ab abc abcd abcde';
$string =~ /(a.*c)/;
print $1;
We would expect the above code to print out:

ab abc abcd abc

That .* (meaning zero or more of any character) will match greedily--it matches the largest string it can while still letting the expression match.

Can you control this? Obviously, the potential for runaway matching exists, if you lose your head coding your expression.

Two things to do. One is to use the {x} operator instead of the splat (*) to specify a specific number of matches.

Obserruve:

Code:
#!/usr/bin/perl

my $string = 'ab abc abcd abcde';
$string =~ /(a.{4}c)/;
print $1;
We'd expect that code to print:

ab abc

The {4} in that expression says we want exactly 4 of those "any character" characters. You can follow ANY character with a * or a {x}, by the way. The regex /ab{4}c/ will match the string "abbbbc".

Say you're searching for porn. Easy. Just run the regex /X{3}/.

Here's the next great thing about the {x} operator--it can take two arguments, a minimum and a maximum. It's spelled {x,y} that way.

Code:
#!/usr/bin/perl

my $string = 'ab abc abcd abcde';
$string =~ /(a.{6,8}c)/;
print $1;
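That should print:

ab abc abc

The {6,8} says "find me between six and eight anythings", and greediness means it takes eight if it can.

You can leave the second of those operands out to say "no limit", too. "{4,}" means "at least four". Leaving out the first is a different story: "{,2}" isn't a quantifier at all in most Perls (it only became one in Perl 5.34), so spell "zero, one or two" as "{0,2}".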
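
Here are the quantifiers from this post side by side, as a sketch you can run. The variable names are mine. One assumption to flag loudly: the post promises "two things" and only gets to the first, so the non-greedy .*? on the second line is my guess at the second, not something stated above. It's a standard Perl quantifier that matches as little as it can.

```perl
use strict;
use warnings;

my $string = 'ab abc abcd abcde';

my ($greedy) = $string =~ /(a.*c)/;       # as much as possible
my ($lazy)   = $string =~ /(a.*?c)/;      # as little as possible (my addition)
my ($exact)  = $string =~ /(a.{4}c)/;     # exactly four anythings
my ($range)  = $string =~ /(a.{6,8}c)/;   # six to eight, preferring eight
my ($upto)   = $string =~ /(a.{0,2}c)/;   # zero to two anythings

print "greedy: $greedy\nlazy: $lazy\nexact: $exact\nrange: $range\nupto: $upto\n";
```

The last one is worth a second look: nothing starting at the first "a" fits within two characters of a "c", so the engine moves along and settles on the "abc" in the second word.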

12-07-2003, 01:06 AM   #13
oberon (Quadrature Amplitude Modulator)
Location: Denver
I'm a python dude. Used to be a perl dude. If you haven't tried python I recommend it.

I still use perl -pi -e "s...g" <filelist> all the frigging time though. Python has no quick-n-easy analog.

12-07-2003, 05:25 AM   #14
ratbastid (Darth Papa)
Location: Yonder
Python is on my List of Languages to Get to Know Better. Ruby too.

All times are GMT -8.