How to extract a div tag and its contents by id using Beautiful Soup?

Beautiful Soup is a widely used Python library for web scraping. It parses a given webpage into a tree and gives you an easier way to navigate and search it.

When you inspect a webpage and skim through its HTML code, you’ll see that div is one of the most frequently used tags. These tags divide the content of the webpage into sections. If you’re scraping a blogging site, it becomes extremely important to learn how to extract the contents of div tags.

In today’s article, let’s learn different ways of extracting a div tag and its contents by id using the Beautiful Soup library.

To extract the data based on the id attribute of a div tag, we have to first identify these tags. We can do that in one of the following ways.

  • find()
  • find_all()
  • select_one()
  • select()
  • SoupStrainer class

Once we’ve located the required tags, we can fetch their text content using tag.text.
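As a rough illustration of what "the tag and its contents" can mean, the sketch below compares the text of a tag with its inner and outer markup. The one-line HTML string is a made-up stand-in for a real page, and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration.
html = '<div class="sister" id="link2"><b>Lacie</b> and friends</div>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find(id="link2")

text_only = tag.text                           # text inside the tag, markup removed
text_clean = tag.get_text(" ", strip=True)     # strip each string, join with a space
inner_html = tag.decode_contents()             # contents of the tag, markup included
outer_html = str(tag)                          # the tag itself plus its contents

print(text_only)
print(text_clean)
print(inner_html)
print(outer_html)
```

Use tag.text when you only need the words, and str(tag) when you need the div element itself together with everything inside it.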

Consider the sample HTML document below, saved as demo.html:

<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<div class="sister" id="link1">Elsie</div>,
<div class="sister" id="link2">Lacie</div> and
<div class="brother" id="link3">Till</div>;
and they lived at the bottom of a well.</p>

<p class="story">That's the end of the story</p>
</body>
</html>

Method 1: Use the find_all() method

We can use the find_all() method to find all the div tags whose id attribute equals required_id, as shown below:

find_all("tagname", id="required_id")

or

find_all(id="required_id")

Example: In the program below, let’s extract the text of every element whose id attribute has the value link2.

from bs4 import BeautifulSoup

with open("demo.html") as f:
    soup = BeautifulSoup(f, 'html.parser')
    # find_all() returns a list of every element with a matching id
    for ele in soup.find_all(id="link2"):
        print(ele.text)

Output:

Lacie
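The two call forms above can be tried without a file on disk. The sketch below is a minimal, self-contained variant: the inline HTML string is an assumption made for illustration, and the regular-expression filter is an extra that find_all() accepts for attribute values, not something the example above depends on.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical inline document so the example runs without demo.html.
html = """
<div class="sister" id="link1">Elsie</div>
<div class="sister" id="link2">Lacie</div>
<span id="link3">Till</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Restrict the match to <div> tags with exactly this id.
exact = [tag.text for tag in soup.find_all("div", id="link2")]

# A compiled pattern matches any id starting with "link";
# the tag name still limits the result to <div> elements.
by_pattern = [tag["id"] for tag in soup.find_all("div", id=re.compile(r"^link"))]

print(exact)
print(by_pattern)
```

Passing the tag name is useful when the same id could appear on other kinds of elements and you only care about div tags.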

Alternatively, you can use the find() method, which returns only the first element that matches the criteria, or None if nothing matches. Look at the code snippet below for more details.

from bs4 import BeautifulSoup

with open("demo.html") as f:
    soup = BeautifulSoup(f, 'html.parser')
    ele=soup.find(id="link2")
    print(ele.text)

Output:

Lacie
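Because find() returns None when no element matches, calling .text on the result directly raises an AttributeError for a missing id. A small guard avoids that. The helper name text_by_id and its default parameter are hypothetical, introduced only for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document so the example runs without demo.html.
html = '<div class="sister" id="link2">Lacie</div>'
soup = BeautifulSoup(html, "html.parser")


def text_by_id(soup, element_id, default=None):
    """Return the text of the first element with this id, or `default`."""
    tag = soup.find(id=element_id)
    return tag.text if tag is not None else default


found = text_by_id(soup, "link2")
fallback = text_by_id(soup, "missing", default="n/a")

print(found)
print(fallback)
```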

 

Method 2: Use the select() method

BeautifulSoup has a select() method that takes a CSS selector and returns every element in the parsed document that matches it. To select by id, use one of the forms below.

select("tagname#required_id")

or

select("#required_id")

Example: Let’s extract the text of every element whose id attribute has the value link2 using select() in the code snippet below.

from bs4 import BeautifulSoup

with open("demo.html") as f:
    soup = BeautifulSoup(f, 'html.parser')
    # "#link2" is the CSS selector for id="link2"
    for ele in soup.select("#link2"):
        print(ele.text)
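Since select() accepts ordinary CSS selectors, the id can be combined with a tag name, a class, or an attribute prefix. The sketch below shows a few such selectors on an inline HTML string that is assumed here purely for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document so the example runs without demo.html.
html = """
<div class="sister" id="link1">Elsie</div>
<div class="sister" id="link2">Lacie</div>
<div class="brother" id="link3">Till</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name plus id.
by_tag_and_id = [tag.text for tag in soup.select("div#link2")]

# Tag name, class, and id together.
by_class_and_id = [tag.text for tag in soup.select("div.sister#link2")]

# Every <div> whose id starts with "link".
by_prefix = [tag["id"] for tag in soup.select('div[id^="link"]')]

print(by_tag_and_id)
print(by_class_and_id)
print(by_prefix)
```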

Alternatively, you can use the select_one() method, which returns only the first element that matches the selector.

from bs4 import BeautifulSoup

with open("demo.html") as f:
    soup = BeautifulSoup(f, 'html.parser')
    ele=soup.select_one("#link2")
    print(ele.text)
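A short sketch of how select_one() behaves may help here: it hands back the same Tag object that find() would, it can be called on a tag to search only inside that tag, and it returns None when nothing matches. The section wrapper and the ids used below are assumptions made for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document so the example runs without demo.html.
html = """
<section id="story">
  <div class="sister" id="link1">Elsie</div>
  <div class="sister" id="link2">Lacie</div>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("div#link2")

# find() and select_one() return the very same node of the parse tree.
same_as_find = first is soup.find("div", id="link2")

# Calling select_one() on a tag searches only that tag's descendants.
story = soup.select_one("#story")
scoped = story.select_one("#link2")

# No match gives None rather than an empty list.
missing = soup.select_one("#link9")

print(first.text, same_as_find, scoped.text, missing)
```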

 

Method 3: Use the SoupStrainer class

We can also use the SoupStrainer class to parse only the elements whose id attribute has the required value. To use the SoupStrainer class, we first have to import it with the command below:

from bs4 import SoupStrainer

Now, let’s use the SoupStrainer class to extract the text of the element whose id attribute has the value “link2”.

from bs4 import BeautifulSoup,SoupStrainer

with open("demo.html") as f: 
    soup=BeautifulSoup(f,'html.parser',parse_only=SoupStrainer(id="link2"))
    for id_ele in soup:
        print(id_ele.text)

Output:

Lacie
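To see what SoupStrainer actually leaves in the tree, the sketch below parses an inline HTML string (assumed here for illustration) and lists the tags that survive. It passes the tag name as well as the id, so only div elements are kept. Note that parse_only works with the html.parser and lxml parsers; the html5lib parser ignores it.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical inline document so the example runs without demo.html.
html = """
<p class="title"><b>The Dormouse's story</b></p>
<div class="sister" id="link1">Elsie</div>
<div class="sister" id="link2">Lacie</div>
"""

# Keep only <div> tags whose id is "link2"; everything else is skipped
# while the document is being parsed.
only_link2 = SoupStrainer("div", id="link2")
soup = BeautifulSoup(html, "html.parser", parse_only=only_link2)

kept_tags = [tag.name for tag in soup.find_all(True)]
texts = [tag.text for tag in soup.find_all("div")]

print(kept_tags)
print(texts)
```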

 

Time taken to execute using the above methods

Now that we’ve seen three different methods to extract the contents of a div tag by its id, let’s check which one is faster.

from bs4 import BeautifulSoup,SoupStrainer
from time import perf_counter_ns

# Read the file once. A file object is exhausted after the first parse,
# so reusing it would hand the later parsers an empty document.
with open("demo.html") as f:
    html = f.read()

# Using the SoupStrainer class
start = perf_counter_ns()
soup = BeautifulSoup(html, 'html.parser', parse_only=SoupStrainer(id="link2"))
for id_ele in soup:
    val = id_ele.text
end = perf_counter_ns()
print("Time taken to execute with SoupStrainer - %8dns" % (end - start))

# Using find_all()
start = perf_counter_ns()
soup_1 = BeautifulSoup(html, 'html.parser')
for ele in soup_1.find_all(id="link2"):
    val = ele.text
end = perf_counter_ns()
print("Time taken to execute with find_all() - %8dns" % (end - start))

# Using select()
start = perf_counter_ns()
soup_2 = BeautifulSoup(html, 'html.parser')
for ele in soup_2.select("#link2"):
    val = ele.text
end = perf_counter_ns()
print("Time taken to execute with select() - %8dns" % (end - start))

Output:

Time taken to execute with SoupStrainer - 454000ns
Time taken to execute with find_all() - 63900ns
Time taken to execute with select() - 178100ns

Timings for such a small file fluctuate from run to run and from machine to machine, so read these figures as indicative rather than as a ranking. Note that using SoupStrainer doesn’t really save you parsing time. But it saves a lot of memory, and it makes searching the document much faster.
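A single perf_counter_ns() measurement on a tiny file is noisy, so repeating each approach gives a steadier comparison. The sketch below is one possible way to do that with the standard timeit module. The generated 200-div document, the function names, and the repeat count are all assumptions chosen for illustration, and no particular result is implied.

```python
from timeit import timeit

from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical test document: 200 <div> tags with ids link0 ... link199.
html = "".join(
    f'<div class="item" id="link{i}">Item {i}</div>' for i in range(200)
)


def parse_with_strainer():
    # Only the matching element is ever added to the tree.
    soup = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer(id="link2"))
    return [tag.text for tag in soup.find_all(id="link2")]


def parse_then_find_all():
    soup = BeautifulSoup(html, "html.parser")
    return [tag.text for tag in soup.find_all(id="link2")]


def parse_then_select():
    soup = BeautifulSoup(html, "html.parser")
    return [tag.text for tag in soup.select("#link2")]


RUNS = 10  # arbitrary repeat count; raise it for steadier averages
for func in (parse_with_strainer, parse_then_find_all, parse_then_select):
    seconds = timeit(func, number=RUNS) / RUNS
    print(f"{func.__name__:<22} {seconds * 1e3:8.3f} ms per run")
```

Each function includes the parse step, so the comparison covers the whole "parse and extract" path rather than the search alone.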

Conclusion

That brings us to the end of this article. In this article, we have seen different ways of extracting the div element and its contents by id using the Beautiful Soup library. We’ve also compared the time taken by each of the methods. Do comment and let us know if this helped you.

Thanks for reading.

Anusha Pai is a Software Engineer with long experience in the IT industry and a passion for writing. She has a keen interest in writing Python Errorfixes, Solutions, and Tutorials.
