Regular Expressions in Python by Example #1

Posted on October 7, 2020

Categories: Computer Science Programming Python

Reading Time: 8 minutes

I have wanted to write a post about regular expressions for a long time, and after a few hours of work it finally sees the light of day. In this post, I cover RegEx in Python through examples. I plan to write at least two posts on this topic. First, we will focus on basic problems like finding an email address 🙂.

RegEx. What is it?

Writing a regular expression is easy. A regular expression (regex) is a sequence of characters that defines a search pattern. We use it to find a substring, or a set of substrings, in a larger string 🧐.

Below you can see a few examples of regexes:

regex1 = "Poland"
regex2 = "someemail@gmail.com"
regex3 = "^Error:.*\\.$"
regex4 = "[0-9]{3}-[0-9]{3}-[0-9]{3}"

As you can see in the listing above, a regex can be an ordinary string as well as a string containing characters that carry a special meaning.
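To get a feel for what these four patterns match, here is a minimal sketch that runs each of them through re.search. The sample strings are made up for this illustration, and the patterns are written as raw strings (the r prefix), which is an assumption of mine rather than something the listing above uses.

```python
import re

# The four patterns from the listing above, written as raw strings so that
# backslashes reach the regex engine unchanged.
patterns = {
    "regex1": r"Poland",
    "regex2": r"someemail@gmail.com",
    "regex3": r"^Error:.*\.$",
    "regex4": r"[0-9]{3}-[0-9]{3}-[0-9]{3}",
}

# Hypothetical sample strings, made up for this illustration.
samples = {
    "regex1": "I live in Poland.",
    "regex2": "Write to someemail@gmail.com today.",
    "regex3": "Error: file not found.",
    "regex4": "Call 123-456-789 now.",
}

matches = {}
for name, pattern in patterns.items():
    found = re.search(pattern, samples[name])
    # re.search returns None when the pattern is not found.
    matches[name] = found.group(0) if found else None
    print(name, "->", matches[name])
```

Note that regex3 only matches when the whole string starts with "Error:" and ends with a dot, because ^ and $ anchor the pattern to the start and the end of the string.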

As I mentioned in the introduction, we will explore regex through a problem-solving approach 🤓. I could of course write a basic definition of every special character here, but I think it is more interesting to get familiar with them by solving some real-life problems.

Let's start 👇.

Finding an Email in Text Using RegEx

This is an image of some old mailbox.
Image 1. Some Old Mailbox.

To start working with regular expressions we will use Python's built-in re module.

Let's import it 👇:

import re

Let us define our problem.

We would like to find an email address in some text, whether it is for an internet crawler that searches for emails on the internet and writes them to a file, or just a form of field validation. We would like to check whether the given text contains one or more valid emails 👌.

Basic Email Checking

At the start we will create a few strings with text containing a few valid and invalid emails:

text1 = "Lorem sit amet, someemail@gmail.com consectetur adipiscing elit."

text2 = "Suspendisse firstemail@domain.com ipsum, vel fermentum metus consectetur sed. Ante, in vulputate second_email@interia.pl felis placerat sed."

text3 = "Donec quis mateusz_123@hotmail.com tristique nibh, ac pretium sem. Curabitur suscipit sodales porta. Fusce eu congue odio kate_watson@o2."

text4 = "Curabitur @nathaniel.com viverra ornare nulla, ut tempus sapien aliquam placerat. Etiam mi purus, mollis ultrices libero vitae, john-123-oldgmail.com hendrerit commodo nunc."

Now we would like to find the valid email addresses in those texts. We will start with text1 and try to find the someemail@gmail.com address.

First we have to construct our regex 👇:

email_regex = "[a-zA-Z0-9_-]+@[a-zA-Z0-9]+\\.[a-zA-Z]{2,5}"

Breaking RegEx Into Pieces

We can break an email into 4 separate parts. The local-part is everything before the @ char. Then we have the @ char itself. Next comes the sub-domain part, for example gmail (strictly speaking, this is the domain name). And finally, we have the domain part, like .com or .net (the top-level domain).

We know that an email can start with any lowercase or uppercase letter or a digit, and can contain additional characters like “_” or “-”.

So the first part, the local-part, consists of one or more characters, each of which is a lowercase letter a to z, an uppercase letter A to Z, a digit 0 to 9, “_” or “-”.

We define that with the first part of our RegEx 😎:

"[a-zA-Z0-9_-]+"

As we can see, the [] characters define a set of characters (a character class), which matches any single character from that set. If we put [A-Z] we will look for one uppercase letter of the alphabet. The same goes for digits or lowercase letters.

After [] we have the + sign. It instructs the regex engine to look for one or more occurrences of the preceding element. In this case, we tell it to look for one or more characters that are A to Z, a to z, 0 to 9, “_” or “-”.
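Here is a small sketch of the difference the + sign makes. The input strings are made up for this illustration.

```python
import re

local_part = r"[a-zA-Z0-9_-]+"

# [A-Z] on its own matches exactly one uppercase letter at a time.
single_upper = re.findall(r"[A-Z]", "Hello World")
print(single_upper)

# Adding + matches whole runs of allowed characters instead.
runs = re.findall(local_part, "john_doe-77 @ home!")
print(runs)
```

The space, the @ and the exclamation mark are not in the set, so they split the text into separate matches.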

The second part is the @ char. This doesn't require explanation in my opinion.

Next we have 👇:

"[a-zA-Z0-9]+\\."

This is the same as previously (only without “_” and “-”), plus a dot at the end. In every email address the sub-domain part is followed by a dot.

We use a double backslash to say that we treat the dot not as a special character but as a literal dot. In a regular Python string, "\\" stands for a single backslash, so the regex engine receives \. (a raw string, r"\.", gives the same result). Without the backslash, the dot would be interpreted as a special character that represents any character 🧐.
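The escaping described above can be checked with a short sketch. The test string is made up for this illustration.

```python
import re

text = "gmail.com gmailXcom"

escaped_plain = "gmail\\.com"  # regular string: "\\" becomes one backslash
escaped_raw = r"gmail\.com"    # raw string: the backslash is kept as typed
unescaped = "gmail.com"        # "." matches any character except a newline

# Both spellings produce exactly the same pattern text.
same_pattern = escaped_plain == escaped_raw
print(same_pattern)

literal_dot = re.findall(escaped_raw, text)
any_char = re.findall(unescaped, text)
print(literal_dot)
print(any_char)
```

Only the unescaped version also accepts "gmailXcom", which is why the dot has to be escaped in our email regex.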

[a-zA-Z]{2,5}

The last part, the domain, has an additional {} quantifier, which specifies how many times the preceding element must occur. In this case, we are looking for 2 to 5 lowercase or uppercase letters.

So “net” or “com”, each having 3 lowercase letters, will pass our regex.

Here is the whole code:
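A quick sketch of how the {2,5} quantifier behaves. I use re.fullmatch here so that the whole candidate has to fit the pattern. The candidate strings are chosen only for this illustration.

```python
import re

domain = r"[a-zA-Z]{2,5}"

# fullmatch succeeds only if the entire string fits the pattern.
results = {}
for candidate in ["pl", "com", "museum", "x"]:
    results[candidate] = re.fullmatch(domain, candidate) is not None
    print(candidate, results[candidate])
```

Keep in mind that real top-level domains can be longer than 5 letters (for example .museum), so the upper limit of 5 is a simplification.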

import re

# Adjacent string literals are joined by Python, so each text below is one list element.
texts = ["Lorem sit amet, someemail@gmail.com consectetur adipiscing elit.",
         "Suspendisse firstemail@domain.com ipsum, vel fermentum "
         "metus consectetur sed. Ante, in vulputate second_email@interia.pl felis placerat sed.",
         "Donec quis mateusz_123@hotmail.com tristique nibh, "
         "ac pretium sem. Curabitur suscipit sodales porta. Fusce eu congue odio kate_watson@o2.",
         "Curabitur @nathaniel.com viverra ornare nulla, ut tempus "
         "sapien aliquam placerat. Etiam mi purus, mollis ultrices libero vitae, "
         "john-123-oldgmail.com hendrerit commodo nunc."]

email_regex = "[a-zA-Z0-9_-]+@[a-zA-Z0-9]+\\.[a-zA-Z]{2,5}"

for text in texts:
    result = re.findall(email_regex, text)
    print(result)

And here are the results of our code 👇:

Python console showing email regex results.
Image 2. Email RegEx Results.
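The code above finds emails inside a longer text. For the field validation case mentioned earlier, we would rather check that the whole input is an email. Here is a minimal sketch of that idea. The helper name is_valid_email is hypothetical, and the regex is the same simplified one, so this is not a complete email validator.

```python
import re

email_regex = r"[a-zA-Z0-9_-]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,5}"


def is_valid_email(value):
    """Return True only if the whole string fits our simplified email regex."""
    # is_valid_email is a hypothetical helper written for this illustration.
    return re.fullmatch(email_regex, value) is not None


results = {}
for value in ["someemail@gmail.com", "kate_watson@o2.", "hello someemail@gmail.com"]:
    results[value] = is_valid_email(value)
    print(repr(value), results[value])
```

re.findall would still find the address inside the third string, while re.fullmatch rejects it because of the extra text.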

Finally …

So this is it. As you see, writing regular expressions is simple. I would advise you to apply the divide and conquer rule here.

Break your regular expression into parts and write them one by one, step by step.

As you see, a regular expression for basic email recognition is easy 😊. It gets more complex if we allow special characters in the address, but not much more.
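One way to apply this in code is to keep every part in its own variable and join them at the end. This is only a sketch: the variable names and the sample text are my own choice for this illustration.

```python
import re

# Each part of the email regex is written and named separately.
local_part = r"[a-zA-Z0-9_-]+"
sub_domain = r"[a-zA-Z0-9]+"
domain = r"[a-zA-Z]{2,5}"

# The parts are then joined with the literal @ and the escaped dot.
email_regex = local_part + "@" + sub_domain + r"\." + domain

text = "Write to first_email@domain.com or @broken.com, not kate_watson@o2."
found = re.findall(email_regex, text)
print(found)
```

Each part can be tested on its own before it is joined with the others, which makes mistakes much easier to spot.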

Now, let us jump to our second example of regex usage, phone number recognition.

Finding a Phone Number on Website Using RegEx

This is an image of some old phone.
Image 3. Some Old Phone.

In this example, I will try to explain how to harvest a phone number from a website, first using web scraping and then using regular expressions.

In this example we will use a website I once created for a Scottish company that rents out apartments 😎.

This is quite an old version of their website. I created it in 2014, as far as I remember. If you would like to check out their current website here you have a link.

The Plan …

So the plan is very simple 😉.

First, we will import the requests and re libraries. We will download the index.php web page from our training website using the requests lib.

Then we will use the re library to find our phone number in the downloaded source code of the website.

Work in Progress …

import requests
import re

Once our libraries are imported, we can use the requests lib to fetch our website into a variable and print the whole source code of this website to our console.

url = "https://mstem.net/projects/ta/index.php"
thistle = requests.get(url)
print(thistle.text)

Here is, partly, what we are looking for:

Python console output of web scraping script and regex usage outcomes.
Image 4. Source Code of Downloaded Website.

Now we have to come up with a regular expression that will find this telephone number for us.

This regular expression will be a bit more complex than the previous one 👇:

# Raw string (the r prefix) so that the backslashes reach the regex engine unchanged.
tel_regex = r"\+[0-9]{2}(\(([0-9]{1})\))? *[0-9]{3,4}[- ]*[0-9]{3,4}([- ]*[0-9]{3,4})*"

Some of its parts are already familiar, like the [] character set. We also have the * char, which is a special char representing zero or more occurrences of the preceding element.

We also have the () chars, which are special characters for grouping part of a regular expression, so that a sign such as ? or * can apply to the whole group.

Divide and Conquer

\+[0-9]{2}(\(([0-9]{1})\))?

So first we are looking for a “+” char (escaped with a backslash, because + is a special character) and then for exactly 2 digits. This represents, for example, the +48 prefix.

Then we enclose \(([0-9]{1})\), that is a literal “(”, exactly one digit and a literal “)”, in () signs. This groups these elements together, and because the group is followed by a “?” sign, the whole group, for example (0), may occur 0 or 1 times.

The “?” sign says that the preceding element may occur only 0 or 1 times.
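To see the optional group in action, here is a short sketch that tries the first part of the regex on a few made-up prefixes. re.match checks only the beginning of each string.

```python
import re

prefix = r"\+[0-9]{2}(\(([0-9]{1})\))?"

# Made-up inputs for this illustration.
results = {}
for sample in ["+48", "+44(0)", "+44(", "44(0)"]:
    found = re.match(prefix, sample)
    results[sample] = found.group(0) if found else None
    print(sample, "->", results[sample])
```

For "+44(" the optional group cannot match, so only "+44" is returned. "44(0)" gives no match at all because the required "+" is missing.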

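" *[0-9]{3,4}[- ]*[0-9]{3,4}([- ]*[0-9]{3,4})*"

The last part should be very familiar to you. After any number of spaces, we are looking for two groups of 3 or 4 digits, optionally separated by “-” or space chars, followed by zero or more further groups of the same kind.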

So 333-333-333 would do the job as well as 555-5555 in our example.
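We can verify this part of the regex on its own with a small sketch. The sample numbers other than the two mentioned above are made up for this illustration.

```python
import re

digits = r" *[0-9]{3,4}[- ]*[0-9]{3,4}([- ]*[0-9]{3,4})*"

# fullmatch succeeds only if the entire string fits the pattern.
results = {}
for sample in ["333-333-333", "555-5555", "131 555 0123", "12-34"]:
    results[sample] = re.fullmatch(digits, sample) is not None
    print(sample, results[sample])
```

The last sample fails because every group needs at least 3 digits.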

And That's It

Here is the full code:

import re
import requests

url = "https://mstem.net/projects/ta/index.php"
thistle = requests.get(url)
print(thistle.text)
thistle_text = thistle.text

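# Raw string (the r prefix) so that the backslashes reach the regex engine unchanged.
tel_regex = r"\+[0-9]{2}(\(([0-9]{1})\))? *[0-9]{3,4}[- ]*[0-9]{3,4}([- ]*[0-9]{3,4})*"

# re.search returns a match object, or None when nothing matches.
result = re.search(tel_regex, thistle_text)

if result is not None:
    print(result.group(0))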

And here is a screenshot from the console with the results ☺️:

Python console with results of finding telephone number from a website.
Image 5. We Finally Have Found Searched Telephone Number.
print(result.group(0))

We use result.group(0) because the re.search method returns a match object (or None if nothing matches), and we have to extract the matched text from it.

This website is very helpful for creating and testing regular expressions. You should check it out 👉.
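Here is a minimal offline sketch of working with that object. The HTML fragment and the phone number in it are made up for this illustration and only stand in for the downloaded page.

```python
import re

tel_regex = r"\+[0-9]{2}(\(([0-9]{1})\))? *[0-9]{3,4}[- ]*[0-9]{3,4}([- ]*[0-9]{3,4})*"

# Hypothetical HTML fragment with a made-up number.
html = '<p class="phone">Call us: +44(0) 131 555 0123</p>'

result = re.search(tel_regex, html)
if result is not None:
    phone = result.group(0)  # group(0) is the whole matched text
    print(phone)
else:
    phone = None
    print("No phone number found")

# When nothing matches, re.search returns None instead of a match object.
missing = re.search(tel_regex, "<p>No contact details here</p>")
print(missing)
```

Checking for None first avoids an AttributeError on pages that do not contain a phone number.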

Long Journey Always Has An End

I hope that after this article regular expressions will be more familiar to you.

I am not an expert in Python, nor in regular expressions, but by writing this article I extended my knowledge and learned a lot. Hopefully, I am also helping to share this awesome knowledge with other people 🙂.

Just imagine how cool the use of these skills could be. You could build, for example, a web scraper which automatically collects companies' telephone numbers from a defined set of URL addresses and automatically saves them in your contacts list 💪.

You could build a whole business around this idea, as for example Kamil Kwapisz did with his company ScrapeUp.

At the End …

If you like this article I would appreciate it if you could leave a comment, click the like button, or share it on social media.

I don't collect any revenue from this blog, but when I see that my work is useful for anyone it gives me additional motivation to keep it up.

Thank you for staying with me so long! I really appreciate it ☺️ .

Have a wonderful day.

A goodbye image for farewell reader.
Image 6. Thank You For Reading And Till Next Time.

Cheers 👋.

