Pattern Matching With Regexes

SnoopFrogg · (This post was last modified: May-12-2019, 02:19 PM by SnoopFrogg.)

I'm currently reading "Automate the Boring Stuff with Python" and I have a quick question about an example program the author includes at the end of the chapter. The code is as follows:

#! Python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard
import pyperclip, re

# Create phone number regex
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?              # Area code
(\s|-|\.)?                      # Separator
(\d{3})                         # First 3 digits
(\s|-|\.)                       # Separator
(\d{4})                         # Last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))?  # Extension
)''', re.VERBOSE)

# Create email regex
emailRegex = re.compile(r'''(
[a-zA-Z0-9._%+-]+          # Username
@                          # @ symbol
[a-zA-Z0-9.-]+             # Domain name
(\.[a-zA-Z]{2,4})          # Dot-something
)''', re.VERBOSE)

# Find matches in clipboard text
text = str(pyperclip.paste())
matches = [] # Store the matches found

for groups in phoneRegex.findall(text):
    # phoneNum contains a string built from groups 1, 3, 5 and 8 of the matched text
    # These groups are the area code, first three digits, last four digits, and extension
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
# Append index 0 of each tuple, which holds the entire email match
for groups in emailRegex.findall(text):
    matches.append(groups[0])
# Copy results to the clipboard
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')

I get that line 25 creates an empty list to store the matches found, but what confuses me is lines 30-33. How do you know which part of the string belongs to which group? I'm asking because I'm working on a similar problem where I need to find website URLs that begin with http:// or https://. I also have the code for that program if you'd like to see what I have so far. Thanks in advance!

MvGulik · (This post was last modified: May-12-2019, 04:49 PM by MvGulik.)

When it comes to REs, it's generally useful to take a peek at the full RE output, preferably nicely formatted.
Suggested change to your code to allow for that:

## edit_1) add json to import
import pyperclip, re, json

## edit_2) add this for an easy way to generate pre-formatted output.
def jsp(obj):
	return json.dumps(obj, sort_keys=True, indent='\t', default=lambda o: repr(o))

## (skipping some code)

matches = [] # Store the matches found

## edit_3) print/debug full RE results.
if(1): ## to disable this debug part set (1) to (0).
	print( 'phoneRegex: ', jsp( phoneRegex.findall(text) ) )
	print( 'emailRegex: ', jsp( emailRegex.findall(text) ) )

for groups in phoneRegex.findall(text):
## (rest of code)

PS: Including some sample input data with questions is a good way to increase the chance of a reply.
(Suggested code was not tested ... due to lack of sample input data)
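A minimal, self-contained sketch of the same debugging idea, for anyone who wants to try it without the clipboard. The sample text and the two-group pattern are made up for illustration and are not the book's regex; the point is only to show what the pretty-printed findall output looks like.

```python
import json
import re


def jsp(obj):
    # Dump any object as tab-indented JSON; repr() is the fallback
    # for values that json cannot serialize on its own.
    return json.dumps(obj, sort_keys=True, indent='\t', default=repr)


# Hypothetical sample text standing in for the clipboard contents.
sample_text = 'Call 555-1234 or 555-9876.'

# Simplified two-group pattern (an assumption, not the book's regex).
simple_regex = re.compile(r'(\d{3})-(\d{4})')

# findall returns one tuple per match, one tuple item per group.
results = simple_regex.findall(sample_text)
print('simple_regex:', jsp(results))
```

Each match shows up as its own list in the output (JSON has no tuples), so the position of every group is easy to read off.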

---

Forgot: Good RE resource site: www.regular-expressions.info

snippsat · May-12-2019, 05:26 PM

(May-12-2019, 02:19 PM)SnoopFrogg Wrote: I get that line 25 creates an empty list to store the matches found, but what confuses me is lines 30-33. How do you know which part of the string belongs to which group?

re.findall returns every group of each match, packed into a tuple.
So index 0 holds the first group, index 1 the second group, etc.
Example, first using re.search, where you can call single groups:

>>> import re 
>>> 
>>> text = '11223355'
>>> r = re.search(r'(\d{2})(\d+)', text)
>>> r.group(1)
'11'
>>> r.group(2)
'223355'

Now re.findall.

>>> import re 
>>> 
>>> text = '11223355'
>>> r = re.findall(r'(\d{2})(\d+)', text)
>>> r
[('11', '223355')]

So, as mentioned, index 0 will be '11' (you can call that group 1).
A loop and join would be:

>>> import re 
>>> 
>>> text = '11223355'
>>> r = re.findall(r'(\d{2})(\d+)', text)
>>> for group in r:
...     ''.join([group[0], group[1]])
...     
'11223355'
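Applied to the phone regex from the first post, the same rule can be checked directly. This is only a sketch: the input string is a made-up number chosen so that every group takes part in the match, and the variable names are mine. Groups are numbered by their opening parenthesis, left to right, so the outer parenthesis that wraps the whole pattern is group 1 and lands at tuple index 0.

```python
import re

# The phone regex from the original post, unchanged.
phone_regex = re.compile(r'''(
(\d{3}|\(\d{3}\))?              # Area code
(\s|-|\.)?                      # Separator
(\d{3})                         # First 3 digits
(\s|-|\.)                       # Separator
(\d{4})                         # Last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))?  # Extension
)''', re.VERBOSE)

# Hypothetical input, not taken from the book's sample data.
sample = '(415) 555-1234 x99'

# findall returns a list of tuples; take the tuple for the only match.
groups = phone_regex.findall(sample)[0]

# Tuple index i holds regex group i + 1.
for index, value in enumerate(groups):
    print(index, repr(value))
```

Index 0 is the whole number, 1 the area code, 3 the first three digits, 5 the last four digits and 8 the extension digits. Indexes 2 and 4 are the separators, 6 is the whole extension and 7 is the word "x" or "ext". Groups that did not take part in a match come back as empty strings, which is why the program tests groups[8] != ''.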

SnoopFrogg · (This post was last modified: May-12-2019, 09:23 PM by SnoopFrogg.)

(May-12-2019, 04:49 PM)MvGulik Wrote: PS: Including some sample input data with questions is a good way to increase the chance of a reply.
(Suggested code was not tested ... due to lack of sample input data)

Sorry about not including any input data... the input data given is as follows:

Contact Us

No Starch Press, Inc.
245 8th Street
San Francisco, CA 94103 USA
Phone: 800.420.7240 or +1 415.863.9900 (9 a.m. to 5 p.m., M-F, PST)
Fax: +1 415.863.9950

Reach Us by Email

General inquiries: [email protected]
Media requests: [email protected]
Academic requests: [email protected] (Please see this page for academic review requests)
Help with your order: [email protected]

Reach Us on Social Media
Twitter
Facebook
Instagram
Pinterest

Also, I figured out that the first for loop detects phone numbers in more than one format, which is why the program appends each phone number in a single format. phoneNum stores the string built from groups 1, 3, 5 and 8 of the matched text.
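To see that normalization on its own, here is a small sketch that runs the same loop on the phone line from the sample input, without pyperclip. The function name normalized_numbers is something I made up for this example.

```python
import re

# The phone regex from the original program, unchanged.
phone_regex = re.compile(r'''(
(\d{3}|\(\d{3}\))?              # Area code
(\s|-|\.)?                      # Separator
(\d{3})                         # First 3 digits
(\s|-|\.)                       # Separator
(\d{4})                         # Last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))?  # Extension
)''', re.VERBOSE)


def normalized_numbers(text):
    """Return every phone number found in text as 'AAA-BBB-CCCC[ xEXT]'."""
    results = []
    for groups in phone_regex.findall(text):
        # Indexes 1, 3 and 5: area code, first three digits, last four digits.
        number = '-'.join([groups[1], groups[3], groups[5]])
        # Index 8 is the extension digits, or '' when there is no extension.
        if groups[8] != '':
            number += ' x' + groups[8]
        results.append(number)
    return results


line = 'Phone: 800.420.7240 or +1 415.863.9900 (9 a.m. to 5 p.m., M-F, PST)'
numbers = normalized_numbers(line)
print(numbers)  # ['800-420-7240', '415-863-9900']
```

One thing to watch: the area code is optional in the regex, so a number without one leaves groups[1] empty and the joined string starts with a hyphen.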

Quote: re.findall returns every group of each match, packed into a tuple.
So index 0 holds the first group, index 1 the second group, etc.

Thanks for clearing this up for me. I realized the first for loop finds multiple phone numbers with different formats and the program appends the number into a single, standard form.
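For the URL problem I mentioned in my first post, the same group indexing should carry over. This is only a rough sketch and not code from the book: the pattern, the variable names and the sample text are my own assumptions, and \S+ will also swallow any punctuation that directly follows a URL.

```python
import re

# Assumed pattern: group 1 is the scheme, group 2 is everything after
# '://' up to the next whitespace character.
url_regex = re.compile(r'(https?)://(\S+)')

# Hypothetical sample text using placeholder domains.
sample = 'Visit https://example.com/docs or http://example.org today'

urls = []
for groups in url_regex.findall(sample):
    # Index 0 is the scheme and index 1 is the rest of the URL.
    urls.append(groups[0] + '://' + groups[1])

print(urls)  # ['https://example.com/docs', 'http://example.org']
```

Since there is no outer parenthesis around the whole pattern here, index 0 is the scheme and not the full match, so the URL has to be joined back together from its parts.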
