Sometimes you need a lot of data that nobody has collected before, so you have to collect it yourself. In that case, crawling may be what you need. Pages on the internet are built with HTML, which means every piece of content on a page sits inside a tag, so once you know the tag, you can extract the data.

The pink text inside <> is called a tag.

Basically, the idea of crawling is to write a program that pulls data out of an HTML page. You do that by specifying which tag you want to take from the page. Let me show you how to crawl Samehadaku. Please don't use this for anything harmful; I hope you will use it for a good purpose.
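Before we get to the real site, here is a minimal sketch of the "know the tag, take the data" idea. It uses only Python's built-in html.parser module instead of the packages installed in the steps below, and the sample page, the TitleCollector class, and the collect_titles function are all made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample page, made up for this illustration.
SAMPLE_HTML = """
<html><body>
  <h3 class="post-title">First Post</h3>
  <p>Some description.</p>
  <h3 class="post-title">Second Post</h3>
  <h3 class="other">Not a post title</h3>
</body></html>
"""


class TitleCollector(HTMLParser):
    """Collects the text found inside <h3 class="post-title"> tags."""

    def __init__(self):
        super().__init__()
        self.inside_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs from the opening tag.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "post-title" in classes:
            self.inside_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self.inside_title = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside a matching tag.
        if self.inside_title:
            self.titles[-1] += data


def collect_titles(html):
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


print(collect_titles(SAMPLE_HTML))  # ['First Post', 'Second Post']
```

The third <h3> is skipped because its class does not match, which is exactly why checking the tag and class in the browser first matters.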

1. Go to the website that you want to crawl and check how its tags are structured. Press F12 in Mozilla Firefox or Ctrl+Shift+I in Chrome to inspect the elements.

2. In this example I will take the titles from page 1 of Update Anime. The picture below shows that each title is placed inside an <h3> tag with class="post-title".

Inspect Element

3. These are the packages we need.

import requests
from bs4 import BeautifulSoup

If you don't have these packages, you can install them by typing the following in the command prompt:
> pip install requests
> pip install beautifulsoup4
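If you are not sure whether the installation worked, a quick check like the one below may help. It is only a sketch: missing_packages is a helper name made up for this post. Note that the package installed as beautifulsoup4 is imported under the name bs4.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# "requests" and "bs4" are the import names used in this tutorial.
missing = missing_packages(["requests", "bs4"])
if missing:
    print("Still need to install:", ", ".join(missing))
else:
    print("Both packages are available.")
```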

4. To keep it simple, let's put the core of our program inside a start function.

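def start(url):
    content_list = []  # collected post titles
    source_code = requests.get(url).text  # raw HTML of the page
    soup = BeautifulSoup(source_code, 'html.parser')
    for post_text in soup.findAll('h3', {'class': 'post-title'}):
        content_list.append(post_text.string)
    # Assumption: print the titles so that running the program shows them.
    print(content_list)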

def start(url): defines our function, which takes the URL of the website as its input parameter.
content_list = [] creates a list where we store the post titles we collect later.
source_code = requests.get(url).text fetches the HTML of the page at url as text, using the requests package.
soup = BeautifulSoup(source_code, 'html.parser') parses source_code. Because source_code is plain text, it is hard to search it directly for the contents of a tag, so we parse it first.
soup.findAll('h3', {'class': 'post-title'}) finds every <h3> tag whose class is post-title, and the for loop visits them one by one.
content_list.append(post_text.string) adds the text of the post title to content_list.
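If you want to try the parsing part without hitting a real website, you can feed BeautifulSoup an HTML string directly. The sketch below does that with made-up markup (the real page may be structured differently), and extract_titles is a name invented for this example. It uses find_all, the current spelling of findAll, and get_text() instead of .string, because .string is None whenever a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page may differ.
SAMPLE_HTML = """
<h3 class="post-title"><a href="#">Episode 1</a></h3>
<h3 class="post-title"><a href="#">Episode 2</a> <span>New</span></h3>
"""


def extract_titles(source_code):
    """Return the text of every <h3 class="post-title"> in the given HTML."""
    soup = BeautifulSoup(source_code, "html.parser")
    titles = []
    for post_text in soup.find_all("h3", {"class": "post-title"}):
        # get_text() joins the text of all nested tags, whereas .string
        # is None when the tag has more than one child.
        titles.append(post_text.get_text(" ", strip=True))
    return titles


print(extract_titles(SAMPLE_HTML))  # ['Episode 1', 'Episode 2 New']
```

With .string, the second title in this sample would come back as None, so get_text() may be the safer choice when a title contains extra tags.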

5. Call the start function with the website's URL as the parameter. Run the program and you will get this output:

Output of the Program

Pretty interesting, isn't it? With this simple knowledge you can build bigger things, such as a news crawler, or even a program that tells you whether your new anime has been released, haha. Let me know what you think.
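For the "has my new anime been released" idea, one possible approach is to compare the titles from the latest crawl with the titles you saw last time. This is only a rough sketch: find_new_titles and the sample titles are made up, and in practice the two lists would come from your crawler.

```python
def find_new_titles(previous_titles, current_titles):
    """Return titles in current_titles that were not seen before, keeping order."""
    seen = set(previous_titles)
    return [title for title in current_titles if title not in seen]


# Hypothetical titles, made up for this illustration.
yesterday = ["Anime A Episode 1", "Anime B Episode 5"]
today = ["Anime A Episode 2", "Anime A Episode 1", "Anime B Episode 5"]

print(find_new_titles(yesterday, today))  # ['Anime A Episode 2']
```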



