June 5, 2020

Website link scraper in Python

By Mohit Agrawal

It crawls all the links present on a website. A website link scraper can also be a handy tool for pentesting.

Overview

Hello coders 👋! In this article, I am going to cover an interesting and frequently requested topic: a website link scraper. Using this scraper script, you will be able to collect all the links present on a website. Let's start with the coding part.

Introduction

Python is a great language for this kind of task. I am using the following libraries to achieve the goal.

  • BeautifulSoup
  • requests
  • pandas
  • csv (comes pre-installed with Python)
  • os (this module also comes pre-installed with Python itself 🕺🏻)

Please install the third-party libraries from the list above before starting the project. If you face any technical problem while installing them, please let me know in the comment section.
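If you have pip set up, one way to install the three third-party packages (plus lxml, since the script passes the 'lxml' parser to BeautifulSoup) is a command along these lines; the exact invocation may differ on your system (for example pip3 or python -m pip):

pip install beautifulsoup4 requests pandas lxml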

Implementation

I am using Visual Studio Code here as my editor; you are free to use any editor, it doesn't matter. Create a new Python file and name it Scraper.py 🕵🏻. We will start by designing a function whose job is to crawl a URL and write the links it finds to a CSV file. The name of the function will be crawlHomeURL(url).

#It crawls the home URL and writes every link it finds to the csv file
def crawlHomeURL(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'lxml')
    for link in soup.find_all('a', href=True):
        found_link = link['href']
        #Skip empty href values
        if not (found_link and found_link.strip()):
            continue
        #Skip in-page anchors like #section
        if found_link.startswith('#'):
            continue
        #Relative links are joined with the base URL; absolute links are written as-is
        if found_link.startswith('/') or found_link.startswith('?'):
            writeDataToCSV(url + found_link)
        else:
            writeDataToCSV(found_link)
    removeDuplicates(file_name)

The above function is the heart of this scraper, so let me explain it in detail. The function takes a URL as a parameter. First, I use the requests library to fetch the HTML of that page. Then I pass the response text to the BeautifulSoup library so that I can find all the <a> tags, because <a> tags are the ones that contain the website links, like this:

<a href="https://www.warmodroid.xyz">This is how 'a' tags contain links.</a>
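To make that concrete, here is a minimal standalone sketch of how BeautifulSoup pulls the href values out of a page. It uses urljoin from the standard library's urllib.parse (not used in the scraper above) so that relative links such as /about are resolved against the base URL:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # resolves relative links against a base URL

base_url = 'https://warmodroid.xyz'  # example URL from this article
soup = BeautifulSoup(requests.get(base_url).text, 'lxml')

for a in soup.find_all('a', href=True):   # only <a> tags that actually carry an href
    print(urljoin(base_url, a['href']))   # every link printed as an absolute URL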

I hope this crawl function is clear to you; now we need to code the writeDataToCSV(url) and removeDuplicates(file_name) functions.

#Write the link to the csv file.
#If the file does not exist then it creates one with the field names.
def writeDataToCSV(link):
    if not os.path.isfile(file_name):
        with open(file_name, mode='w', newline='') as csv_file:
            fieldNames = ['siteName', 'siteLink']
            writer = csv.DictWriter(csv_file, fieldnames=fieldNames)
            # write the header (field names)
            writer.writeheader()
            # write the first data row
            writer.writerow({'siteName': site_name, 'siteLink': link})
    else:
        with open(file_name, mode='a+', newline='') as write_obj:
            fieldNames = ['siteName', 'siteLink']
            dict_writer = csv.DictWriter(write_obj, fieldnames=fieldNames)
            # append the dictionary as a row in the csv
            dict_writer.writerow({'siteName': site_name, 'siteLink': link})

The writeDataToCSV() function writes the URL to the CSV file. Here I am using the os library to check whether the file already exists or a new one needs to be created. Then I am using the csv library to write the URL into the CSV file.
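As a side note, the same behaviour can be written a bit more compactly by always opening the file in append mode and writing the header only when the file is new. This is just a sketch that assumes the same file_name and site_name globals as above:

def writeDataToCSV(link):
    # Check for the file before opening it, because open() in append mode creates it
    is_new_file = not os.path.isfile(file_name)
    with open(file_name, mode='a', newline='') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=['siteName', 'siteLink'])
        if is_new_file:
            writer.writeheader()  # header is written only once
        writer.writerow({'siteName': site_name, 'siteLink': link})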

#It removes all the duplicate links from csv file
def removeDuplicates(file_name):
    data = pd.read_csv(file_name)
    data.drop_duplicates().to_csv(file_name, index=False)

Sometimes I was getting the same URL many times, so I am using the removeDuplicates() function to remove duplicate entries. The drop_duplicates() method from the pandas library is the easiest way to do this.
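To get a feel for what drop_duplicates() does on its own, here is a tiny self-contained sketch (the rows below are made up for illustration):

import pandas as pd

# A small made-up frame with one duplicated row
data = pd.DataFrame({
    'siteName': ['Warmodroid', 'Warmodroid', 'Warmodroid'],
    'siteLink': ['https://warmodroid.xyz/about',
                 'https://warmodroid.xyz/about',
                 'https://warmodroid.xyz/contact'],
})

print(data.drop_duplicates())  # the duplicated /about row is kept only once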

Now, to run the code written so far, you can do something like this.

url = 'https://warmodroid.xyz'
file_name = 'link_list.csv'
site_name = 'Warmodroid'


crawlHomeURL(url)

Set the url that has to be crawled, the file_name, and the site_name, then run the Python code. You will see something like this.

Please use your own website address to run this script. It's illegal to run a scraper on a site without the site owner's permission 😧.

[Screenshot: output of the scraper run, showing the generated list of links]

What do you think looking at the number of links 🤔? Do I have only 32 links on my website? No, right?

Hmm… something is still missing. Yes, this scraper only crawls the links available on the home page, not on the whole website. So let's fix this issue. To overcome this limitation I have taken a recursive approach.

#It returns all the links present in the csv file
def readDataFromCSV(file_name):
    with open(file_name, mode='r') as csv_file:
        links = []
        csv_reader = csv.reader(csv_file)
        for line in csv_reader:
            links.append(line[1])
        return links

#It crawls every URL that was collected from the home page, so the links found on those
#sub-pages get written to the csv file as well. The crawl stops once every URL collected
#from the home page has been visited.
def crawlSubURL():
    links = readDataFromCSV(file_name)
    for link in links:
        #Skip the csv header row
        if link != 'siteLink':
            crawlHomeURL(link)

I have written one more function, crawlSubURL(). It crawls all the links found on the home page, so the links discovered on those pages are collected as well and the final list is much more complete. Try it and let me know in the comment section.
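If you want the crawl to keep going deeper than the pages linked from the home page, a common pattern (not part of the script above, just a sketch) is to keep a visited set and a queue, so every newly discovered internal link is crawled exactly once:

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawlWholeSite(start_url):
    #Breadth-first crawl restricted to the start URL's domain
    domain = urlparse(start_url).netloc
    visited, queue = set(), deque([start_url])
    while queue:
        page = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        try:
            soup = BeautifulSoup(requests.get(page, timeout=10).text, 'lxml')
        except requests.RequestException:
            continue  # skip pages that fail to load
        for a in soup.find_all('a', href=True):
            link = urljoin(page, a['href'])
            if urlparse(link).netloc == domain and link not in visited:
                queue.append(link)
    return visited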

Complete source code

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
import os

#It returns all the links present in the csv file
def readDataFromCSV(file_name):
    with open(file_name, mode='r') as csv_file:
        links = []
        csv_reader = csv.reader(csv_file)
        for line in csv_reader:
            links.append(line[1])
        return links

#It removes all the duplicate links from csv file
def removeDuplicates(file_name):
    data = pd.read_csv(file_name)
    data.drop_duplicates().to_csv(file_name, index=False)

#Write the link to the csv file.
#If the file does not exist then it creates one with the field names.
def writeDataToCSV(link):
    if not os.path.isfile(file_name):
        with open(file_name, mode='w', newline='') as csv_file:
            fieldNames = ['siteName', 'siteLink']
            writer = csv.DictWriter(csv_file, fieldnames=fieldNames)
            # write the header (field names)
            writer.writeheader()
            # write the first data row
            writer.writerow({'siteName': site_name, 'siteLink': link})
    else:
        with open(file_name, mode='a+', newline='') as write_obj:
            fieldNames = ['siteName', 'siteLink']
            dict_writer = csv.DictWriter(write_obj, fieldnames=fieldNames)
            # append the dictionary as a row in the csv
            dict_writer.writerow({'siteName': site_name, 'siteLink': link})

#It crawls every URL that was collected from the home page, so the links found on those
#sub-pages get written to the csv file as well. The crawl stops once every URL collected
#from the home page has been visited.
def crawlSubURL():
    links = readDataFromCSV(file_name)
    for link in links:
        #Skip the csv header row
        if link != 'siteLink':
            crawlHomeURL(link)

#It crawls the home URL and writes every link it finds to the csv file
def crawlHomeURL(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'lxml')
    for link in soup.find_all('a', href=True):
        found_link = link['href']
        #Skip empty href values
        if not (found_link and found_link.strip()):
            continue
        #Skip in-page anchors like #section
        if found_link.startswith('#'):
            continue
        #Relative links are joined with the base URL; absolute links are written as-is
        if found_link.startswith('/') or found_link.startswith('?'):
            writeDataToCSV(url + found_link)
        else:
            writeDataToCSV(found_link)
    removeDuplicates(file_name)

url = 'https://warmodroid.xyz'
file_name = 'link_list.csv'
site_name = 'Warmodroid'


crawlHomeURL(url)
crawlSubURL()



Subscribe on YouTube: more tutorials like this

I hope this blog post is useful for you; do let me know your opinion in the comment section below.
I will be happy to see your comments down there 👍.
Thanks for reading!

FAQ

  1. What is the use of the 'os' library in Python?

    The os module in Python provides functions to interact with the operating system. This module comes pre-installed with Python itself. For example, if you want to print your current working directory from Python code, you can write something like print(os.getcwd()).

  2. What is BeautifulSoup?

    Beautiful Soup is a Python library designed to scrape useful data out of HTML or XML. Suppose you need to find all the links on a particular web page; you can use this library, and it will do this job for you in a single line of code.
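    For instance, assuming the variable html already holds the page source, that single line could look like this:

from bs4 import BeautifulSoup

html = '<a href="https://www.warmodroid.xyz">home</a> <a href="/about">about</a>'
links = [a['href'] for a in BeautifulSoup(html, 'lxml').find_all('a', href=True)]
print(links)  # ['https://www.warmodroid.xyz', '/about']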

  3. What is pandas in Python?

    pandas is a third-party library designed for data manipulation and analysis. Suppose you have a huge CSV file and you want to filter out only some useful information from it; pandas will make your life easier.
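    As a small illustration (the column names match the link_list.csv produced by the scraper above; the filter condition is just an example):

import pandas as pd

df = pd.read_csv('link_list.csv')  # the file produced by the scraper
blog_links = df[df['siteLink'].str.contains('/blog', na=False)]  # keep only links containing /blog
print(blog_links)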
