Web Crawler (Scraper)

Python Web Scraper

Crawling and Scraping the United States Census Bureau

How Does The Program Extract Web Links From The HTML Code?

The Python program that I created extracts the unique absolute web links from the HTML code of the “Current Estimates” page by first importing the requests library to download the page and the BeautifulSoup4 (BS4) library to parse the HTML. Using BS4, the program parses the HTML by searching for tags, along with their children, that contain specific elements with the desired attributes, such as URL web links. Once the HTML is examined, the tags containing an ‘href’ attribute are extracted via the soup object, producing a list of every URL within the HTML of the “Current Estimates” page. Next, the program removes the duplicates from those results and repopulates a new list that is cleaned and free of redundant URLs. Finally, the extracted URIs are written to a CSV file via script. It is in this way that my Python program successfully extracts the unique web links from the HTML code.

What Criteria Were Used to Determine if a Link Was a Locator to Another HTML Page?

To determine whether a link is a locator to another HTML page, I visually examined the records and their format in their entirety. This was achieved by manually inspecting the source code using the “View Source” functionality within the browser. Through this process, I was able to find patterns in the raw HTML by analyzing the tags and class attributes within each record, specifically the tags containing the URL. The criterion used to determine whether a link is a locator to another HTML page was to scrape the website for the elements marked with the ‘a’ anchor tag, since anchor elements carry the hyperlinks. The specific code block that was utilized to perform this task was:
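The original listing is embedded as an image in the repository, so the snippet below is a minimal sketch of the fetch-and-select step described here and in the next paragraph; the variable names follow the surrounding text.

```python
import requests
from bs4 import BeautifulSoup

# Grab the "Current Estimates" page and keep the response in 'r'
r = requests.get('https://www.census.gov/programs-surveys/popest.html')

# Parse the HTML and select every element with the 'a' anchor tag
soup = BeautifulSoup(r.text, 'html.parser')
search_results = soup.find_all('a')
```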
The requests module is used on line 4 to grab the webpage with the script r = requests.get('https://www.census.gov/programs-surveys/popest.html'), saving the response in an object called ‘r’. This page was used because its ‘a’ tags contain links to the unique URIs on the Census webpage. Once the response was received, the HTML was retrieved with the script html = r.text. From there, the HTML code could be examined and used to print the number of links before processing with the following script, which counts 686 links before processing:
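Continuing the sketch above, the HTML is pulled out of the response and the pre-processing link count printed:

```python
html = r.text               # the raw HTML returned by the request
print(len(search_results))  # reported as 686 links before processing
```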
The Beautiful Soup library is used to scrape the URLs from the webpage and store them in a variable named “search_results”. The significance of variables in this script is that they hold the value of objects while allowing that value to change. The HTML is parsed using the script located at lines 9 and 10 of the Python scraper file:
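A sketch of what lines 9 and 10 would look like (the parser choice is an assumption):

```python
soup = BeautifulSoup(html, 'html.parser')   # line 9: parse the raw HTML
search_results = soup.find_all('a')         # line 10: collect every <a> tag
```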
A set variable, which I named “search_results”, was used to store the unique URLs extracted from the Census website. A set was used in this Python script because a set cannot contain duplicate members, so each URL is stored exactly once no matter how many times it appears on the page. The code specifically used to execute this function is:
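Since the original block is shown as an image in the repository, the following is a reconstruction of the set-building logic from the description in the next paragraph; the name unique_urls and the exact branch order are assumptions.

```python
base = 'https://www.census.gov'          # assumed site root for relative links

unique_urls = set()                      # a set stores each URL only once
for tag in search_results:
    href = tag.get('href')
    if not href or href.startswith('#'):
        continue                         # skip missing hrefs and page anchors
    if href.startswith('/'):
        href = base + href               # prepend the root to relative paths
    if href.startswith('http'):
        unique_urls.add(href)            # keep only absolute web links
```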
This code creates a set that contains all of the unique ‘hrefs’ from the Census website. It does this by parsing the page and returning only URLs in absolute form: anchors with no ‘href’, and ‘hrefs’ that begin with ‘#’, are removed, while ‘hrefs’ that begin with ‘http’ or ‘/’ (or end with ‘.gov’) are handled so that each is stored as an absolute URI. Altogether, this block of code removes any unwanted string characters from the ‘hrefs’ or appends the characters needed. Before parsing was done, 686 links were scraped from the website. Once parsing was done, a total of 308 ‘hrefs’ were returned in a list as absolute paths and exported into a CSV output named “MyExport”. The script used to export the output to the CSV file was the following:
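A sketch of that export step; the with-open pattern is an assumption, while the csv module and the file handle ‘f’ are confirmed by the description below.

```python
import csv

# Write each unique link to MyExport.csv as one comma-delimited row
with open('MyExport.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    for url in sorted(unique_urls):
        writer.writerow([url])
```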
This portion of the script, located on line 33, exports the output of web links scraped from the United States Census website into a CSV file by using the csv library/module. The open file was saved into a variable named ‘f’, and the unique web links were then written out as comma-delimited values.

How Does The Program Ensure Relative Links Are Saved as Absolute URIs In The Output File?

The program ensures that the relative links are saved as absolute URIs in the output file by creating a set that contains all of the unique ‘hrefs’ from the Census website and converting them into an absolute format. This is achieved by parsing the page and returning only URLs in absolute form: anchors with no ‘href’, and ‘hrefs’ that begin with ‘#’, are removed, while relative ‘hrefs’ that begin with ‘/’ have the site root prepended so they become absolute URIs. Altogether, this block of code removes any unwanted string characters from the ‘hrefs’ or appends the characters needed. Before parsing was done, 686 links were scraped from the website. Once parsing was done, a total of 308 unique ‘hrefs’ were returned in a list as absolute paths by using the above methods and exported into a CSV output named “MyExport”. The code used to specifically perform this is:
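As a sketch of the conversion itself, the helper below shows how a site-relative path would be prefixed into an absolute URI; to_absolute is an illustrative name, not a function from the original script.

```python
base = 'https://www.census.gov'   # assumed site root

def to_absolute(href):
    """Prefix site-relative hrefs so only absolute URIs reach the CSV."""
    return base + href if href.startswith('/') else href

print(to_absolute('/programs-surveys/popest.html'))
# -> https://www.census.gov/programs-surveys/popest.html
```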

How Does The Program Ensure There Are No Duplicated Links In The Output File?

The program that I created ensures that there are no duplicated links in the output file by removing all of the duplicates from the soup.find_all results and repopulating a new variable with the cleaned data. Only links that are not duplicated are then written into the CSV file as URIs. I verified that this worked because running the len(search_results) command returned a total of 662 results, while running the len(my_list) command afterwards returned only 661 results, which tells me that a duplicate was removed in the process. The code that specifically executes this action is listed below:
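A minimal reconstruction of that deduplication step, using the variable names search_results and my_list from the text; the counts in the comments are the ones reported above.

```python
search_results = soup.find_all('a')           # raw anchor tags
hrefs = [tag.get('href') for tag in search_results if tag.get('href')]
my_list = list(set(hrefs))                    # the set drops duplicates

print(len(search_results))  # 662 results before deduplication
print(len(my_list))         # 661 results after the duplicate is removed
```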
The Python code I wrote to extract all the unique web links from the HTML code of the “Current Estimates” web link that point to other HTML pages is:
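The full listing lives in the repository as an image; assembled from the pieces above, a complete runnable sketch would look like this (names and branch details remain assumptions):

```python
import csv

import requests
from bs4 import BeautifulSoup

BASE = 'https://www.census.gov'                               # assumed site root
PAGE = 'https://www.census.gov/programs-surveys/popest.html'  # "Current Estimates"

# Fetch the page and keep the response in 'r'
r = requests.get(PAGE)
html = r.text

# Parse the HTML and collect every <a> anchor element
soup = BeautifulSoup(html, 'html.parser')
search_results = soup.find_all('a')
print(len(search_results))            # link count before processing

# Build a set of unique, absolute web links
unique_urls = set()
for tag in search_results:
    href = tag.get('href')
    if not href or href.startswith('#'):
        continue                      # skip missing hrefs and page anchors
    if href.startswith('/'):
        href = BASE + href            # make site-relative links absolute
    if href.startswith('http'):
        unique_urls.add(href)         # keep only absolute web links

# Export the unique links into MyExport.csv
with open('MyExport.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    for url in sorted(unique_urls):
        writer.writerow([url])

print(len(unique_urls))               # unique link count after processing
```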
I used the len(search_results) script to count the number of URLs extracted by the web scraper before processing, which counted 686 links; after processing, a total of 308 links were written to the CSV output.
Author: Cod3bot

Owner/CEO