How to Scrape Food Data Around Jakarta from Zomato Using Selenium
Wondering what kinds of food and drink The Big Durian has to offer? Let’s extract it using Selenium!
Jakarta has now entered the 8th month of the COVID-19 pandemic, and from the way things stand right now, the situation is not getting any better. The government has imposed on-and-off social restrictions within the city.
People are advised to stay at home and work from home, and non-essential businesses are advised to close temporarily, including the restaurants and dessert parlors you love!
When social restrictions are on, eating out stops completely. We can buy the ingredients and cook them ourselves. That’s the ‘to cook’ alternative.
With the ‘not to cook’ alternative, you can buy takeaway food or order food online. As Jakartans trying hard not to contribute to new cases in the city, we sometimes order food online, thanks to the growth of food-delivery apps like GrabFood and GoFood, which are similar to Uber Eats.
Let’s go through a Python script that uses the Selenium library to automate scraping thousands of restaurant pages.
What Data Will We Scrape?
As there is a lot of data available on Zomato, we need to list the data we require for our research. After examining the Zomato page, we decided to scrape:
- Restaurant’s Name
- Type
- Area
- Ratings
- Reviews
- Average Pricing for 2
- Address
- Additional Facilities or Info
- Latitude & Longitude
Basic Preparations
As we’re going to use Selenium, the first step is making sure that the Selenium library is installed. As Selenium automates actions in a web browser, we also need an actual browser installed on the computer and the browser driver that controls it.
We will use Google Chrome, for which you need to download the matching browser driver (chromedriver).
from selenium import webdriver

# Set Windows path where WebDriver is located -> to be used for Selenium
chromepath = r'C:\Users\Downloads\chromedriver_win32\chromedriver.exe'
Examining the Search Pages
List of Web Addresses of All the Restaurants in Jakarta
Selenium is an extremely handy library for finding HTML elements by different means, such as class name, id, tag name, XPath, link text, or CSS selector. In this blog, we will cover the problems we ran into while scraping the Zomato pages and explain how to deal with them.
Let’s look at a search page. If we pin the location to Jakarta, there are currently around 1002 pages, with about 15 restaurants on every page. That means Jakarta has about 15,000 restaurants! Hurrah, that’s unbelievable!
For now, all we need to extract from every search page is the web address of each restaurant, so that we can open each one individually later. Think about opening over 1000 pages manually; it would certainly be exhausting (and tedious, to be frank). Selenium comes to the rescue!
Before we write any Python code, we need to understand the difference between Selenium’s two “find element” tools:
- Find Element: finds a single element, located by specifying a particular HTML element locator.
- Find Elements: finds a list of elements, located by specifying an HTML element locator that all the elements you need have in common.
Here, as we need to extract all the web addresses on every search page, we will use the find elements tool. We need to inspect the page’s HTML and find a locator that all the web addresses on the search page have in common.
After inspecting the web addresses on the search page, we concluded that the locator they have in common is the class name.
Now we can get a list of Selenium Web Elements by writing code like this:
url_elt = driver.find_elements_by_class_name("result-title")
Looks easy enough, but our target is the web address itself. We need extra code that loops through the list and reads the href attribute of each Web Element. (Note that the find_element_by_* and find_elements_by_* helpers used in this post belong to Selenium 3; Selenium 4.3 and later removed them in favour of calls such as driver.find_elements(By.CLASS_NAME, "result-title").)
With that code, we can create the list of web addresses. Let’s combine it with code that loops through all 1002 Zomato search pages.
# Set Webdriver
driver = webdriver.Chrome(chromepath)
out_lst = []

# Loop Through Search Pages that we wanted
for i in range(1, 1003):
    driver.get('https://www.zomato.com/jakarta/restoran?page={}'.format(i))
    url_elt = driver.find_elements_by_class_name("result-title")

    # Loop Through Lists of Web Elements
    for j in url_elt:
        url = j.get_attribute("href")
        out_lst.append(url)

driver.close()
That is the basic code for scraping the web addresses. We can improve it further by printing the progress of the scraping, like this:
# Set Webdriver
driver = webdriver.Chrome(chromepath)
out_lst = []

# Loop Through Search Pages that we wanted
for i in range(1, 1003):
    print('Opening Search Pages ' + str(i))
    driver.get('https://www.zomato.com/jakarta/restoran?page={}'.format(i))
    print('Accessing Webpage OK \n')
    url_elt = driver.find_elements_by_class_name("result-title")

    # Loop Through Lists of Web Elements
    for j in url_elt:
        url = j.get_attribute("href")
        out_lst.append(url)

driver.close()
With the addition of two simple print statements, we get a useful notification every time we successfully loop through a search page.
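The inner loop above only relies on each element having a get_attribute method, so it can be pulled out into a small helper and tried without opening a browser. The following is a minimal sketch under that assumption: collect_hrefs and FakeElement are names made up for illustration (FakeElement merely stands in for a Selenium Web Element), and skipping empty links is an extra safety step that the original loop does not have.

```python
def collect_hrefs(elements):
    """Return the non-empty href attribute of every element, in order."""
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        if href:  # skip elements that carry no usable link
            hrefs.append(href)
    return hrefs


class FakeElement:
    """Hypothetical stand-in for a Selenium Web Element, for this demo only."""

    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        # Mimic Selenium: return None when the attribute is missing
        return self._href if name == "href" else None


demo = [
    FakeElement("https://example.com/a"),
    FakeElement(None),
    FakeElement("https://example.com/b"),
]
print(collect_hrefs(demo))
```

In the real script, the list returned by the find elements call would be passed in instead of demo.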
Don’t forget to convert the list to a pandas DataFrame so that we can arrange the data neatly.
import pandas as pd

out_df = pd.DataFrame(out_lst, columns=['Website'])
List of Web Addresses of All the Restaurants That Offer Delivery in Jakarta
After that, we do exactly the same thing to get the list of restaurants that offer delivery in Jakarta. The only differences are the search page’s URL and the total number of search pages we need to go through.
# Set Webdriver
driver = webdriver.Chrome(chromepath)
out_lst_dlv = []

# Loop Through - Search Pages that we wanted
for i in range(1, 224):
    print('Opening Search Pages ' + str(i))
    driver.get('https://www.zomato.com/jakarta/delivery?page={}'.format(i))
    print('Accessing Webpage OK \n')
    url_elt_dlv = driver.find_elements_by_class_name("result-title")

    # Loop Through Lists of Web Elements
    for j in url_elt_dlv:
        url = j.get_attribute("href")
        out_lst_dlv.append(url)

driver.close()

# Convert List to DataFrame
out_dlv_df = pd.DataFrame(out_lst_dlv, columns=['Website'])
Removing Duplicate Web Addresses
We now have 14886 restaurants in Jakarta, 3306 of which provide delivery services. Before going deeper into the web scraping, we need to make sure we have no duplicate entries, as the search results sometimes contain repeated entries.
We can easily find them with a simple pandas method, duplicated.
# Observe whether we have duplicate websites or not
out_df[out_df.duplicated(['Website'], keep='first')]
With the code above, we see the list of web addresses that are duplicated within the DataFrame, except for the first occurrence of each duplicated address.
We need to create new DataFrames without duplicated values: one for all restaurants and one for the restaurants that offer delivery.
# Make A New DataFrame - without duplicated values
out_df_nd = out_df[~out_df.duplicated(['Website'], keep='first')]
out_dlv_df_nd = out_dlv_df[~out_dlv_df.duplicated(['Website'], keep='first')]
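To make the keep='first' behaviour concrete, here is a small plain-Python sketch of the same idea: it reports every repeated occurrence after the first and then keeps only the first occurrence of each address. The function names and example URLs are made up for illustration; in the scraper itself, the pandas calls above (or the equivalent drop_duplicates method) do this work.

```python
def find_duplicates(urls):
    """Return every repeated occurrence of a URL, excluding its first appearance."""
    seen = set()
    repeats = []
    for url in urls:
        if url in seen:
            repeats.append(url)
        else:
            seen.add(url)
    return repeats


def drop_duplicate_urls(urls):
    """Keep the first occurrence of each URL and preserve the original order."""
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(urls))


# Made-up addresses, standing in for the scraped 'Website' column
urls = [
    "https://example.com/r/a",
    "https://example.com/r/b",
    "https://example.com/r/a",
    "https://example.com/r/c",
    "https://example.com/r/b",
]
print(find_duplicates(urls))
print(drop_duplicate_urls(urls))
```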
Review Individual Restaurant Pages
We now have the distinct web addresses of ~14500 restaurants in Jakarta. Using them, we can go through every web address and extract the required information. Let’s write more code!
Restaurant Name
Inspecting a restaurant page shows that the Restaurant’s Name is wrapped in an h1 tag in the HTML code. Previously, we found elements in the HTML by class name; now we will find one by tag name.
name_anchor = driver.find_element_by_tag_name('h1')
Don’t forget that find element returns a Selenium Web Element, so we need to go one step further to get the required data. In this case, we can read its text with the following code.
name = name_anchor.text
Let’s go through the complete code for scraping the Restaurant’s Name:
# Initialize Empty Lists that we will use to store the scraping data results
rest_name = []

driver = webdriver.Chrome(chromepath)

# Scrape the data by looping through entries in DataFrame
for url in out_df_nd['Website']:
    driver.get(url)
    name_anchor = driver.find_element_by_tag_name('h1')
    name = name_anchor.text
    rest_name.append(name)

driver.close()
After scraping restaurant pages for some minutes or hours, your program may stop with an error message because the h1 on a few pages could not be scraped. On Zomato, there are many pages like this.
When the browser hits such a page, the program stops because it cannot find the h1. We can easily deal with this using an exception class from the Selenium library: NoSuchElementException. Combined with a try-except statement, it lets the program carry on when the web element we want is missing. First, we have to import it at the start of the code.
from selenium.common.exceptions import NoSuchElementException
Then we use a try-except statement to implement this logic in the program. If we don’t find an h1 element, we write “404 Error” as the Restaurant’s Name and move on to the next page.
Besides that, as we did before, let’s add a few print statements to show the progress of the scraping.
# Initialize Empty List that we will use to store the scraping data results
rest_name = []

driver = webdriver.Chrome(chromepath)

# Scrape the data by looping through entries in DataFrame
for url in out_df_nd['Website']:
    driver.get(url)
    print('Accessing Webpage OK')

    try:
        name_anchor = driver.find_element_by_tag_name('h1')
        name = name_anchor.text
        rest_name.append(name)
    except NoSuchElementException:
        # No h1 on this page: record a placeholder and carry on
        name = "404 Error"
        rest_name.append(name)

    print(f'Scraping Restaurant Name - {name} - OK')
    print('-------------------------------------------------------------------------------------------------------------------------------------------')

driver.close()
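Since the same try-except pattern comes back for several fields later on, it could be wrapped in one small helper. The sketch below is only an illustration: text_or_default, FakeElement and the sample restaurant name are made up, and LookupError stands in for Selenium’s NoSuchElementException so that the example runs without a browser.

```python
def text_or_default(find, default, missing=(LookupError,)):
    """Call find() and return the element's text, or default if the lookup fails."""
    try:
        return find().text
    except missing:
        return default


class FakeElement:
    """Hypothetical stand-in for a Selenium Web Element, for this demo only."""

    text = "Warung Contoh"  # made-up restaurant name


def found():
    return FakeElement()


def not_found():
    # Simulates a page where the element does not exist
    raise LookupError("no h1 on this page")


print(text_or_default(found, "404 Error"))
print(text_or_default(not_found, "404 Error"))
```

With Selenium, the call would presumably look like text_or_default(lambda: driver.find_element_by_tag_name('h1'), "404 Error", missing=(NoSuchElementException,)).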
Restaurant’s Type
Next, we want to extract the type of every restaurant in Jakarta.
Just like with the web addresses, we will use find elements, as there are multiple elements we want to extract. The question is: which locator should we use to get them? If we look closely, the tag name isn’t distinctive enough, whereas the class name (sc-jxGEy0) could differ on some restaurant pages, so we cannot use either of those two locators.
That’s where XPath proves very helpful. XPath stands for XML Path Language, and we can use it to locate the element we want to extract because the structure of Zomato’s restaurant pages is mostly the same.
How do we get it? Just right-click the required HTML code in the browser’s inspector and choose Copy -> XPath.
Now that we have the XPath, we paste it into our program and add the rest of the code. Note that, as with rest_name, we need to create an empty list called rest_type at the top.
#Restaurant Type
rest_type_list = []
rest_type_eltlist = driver.find_elements_by_xpath("""/html/body/div[1]/div[2]/main/div/section[3]/section/section[1]/section[1]/div/a""")

# Collect the text of every type link found on the page
for rest_type_anchor in rest_type_eltlist:
    rest_type_text = rest_type_anchor.text
    rest_type_list.append(rest_type_text)

rest_type.append(rest_type_list)
# Print the whole list, so this also works when no type was found
print(f'Scraping Restaurant Type - {name} - {rest_type_list} - OK')
Restaurant’s Area and Address
After that, we need to extract the Restaurant Area and Address. This is easier, as we only need one element for each, and as before, we scrape them through XPath.
#Restaurant Area
rest_area_anchor = driver.find_element_by_xpath("""/html/body/div[1]/div[2]/main/div/section[3]/section/section[1]/section[1]/a""")
rest_area_text = rest_area_anchor.text
rest_area.append(rest_area_text)
print(f'Scraping Restaurant Area - {name} - {rest_area_text} - OK')

#Restaurant Address
rest_address_anchor = driver.find_element_by_xpath("""/html/body/div[1]/div[2]/main/div/section[4]/section/article/section/p""")
rest_address_text = rest_address_anchor.text
rest_address.append(rest_address_text)
print(f'Scraping Restaurant Address - {rest_address_text} - OK')
Restaurant Ratings and Reviews
After completing Name, Address, and Area, let’s move on to slightly trickier data: Ratings and Reviews. As you scrape on, you’ll notice that not all restaurants have this data. Just as with the restaurant’s name, we’ll use NoSuchElementException.
#Restaurant Rating
try:
    rest_rating_anchor = driver.find_element_by_xpath("""/html/body/div[1]/div[2]/main/div/section[3]/section/section[2]/section/div[1]/p""")
    rest_rating_text = rest_rating_anchor.text
except NoSuchElementException:
    rest_rating_text = "Not Rated Yet"

rest_rating.append(rest_rating_text)
print(f'Scraping Restaurant Rating - {name} - {rest_rating_text} - OK')

#Restaurant Review
try:
    rest_review_anchor = driver.find_element_by_xpath("""/html/body/div[1]/div[2]/main/div/section[3]/section/section[2]/section/div[2]/p""")
    rest_review_text = rest_review_anchor.text
except NoSuchElementException:
    rest_review_text = "Not Reviewed Yet"

rest_review.append(rest_review_text)
print(f'Scraping Restaurant Review Counts - {name} - {rest_review_text} - OK')
Restaurant’s Average Pricing for 2
Until now, the XPath technique has been quite useful for extracting the required data. For this data, however, we need to extend “find element by XPath” with some if-else statements, because its location differs between restaurant pages.
On some pages, the “Average Cost” is located below “Popular Dishes” and “People Say That Place is Well-Known For.” The majority of Zomato pages don’t have these two pieces of information, which is why the location of the Average Cost is different on the pages that do have them.
We will use string slicing to check whether the text we extract begins with “Rp” or “No”. If it begins with neither of these two strings, we try another XPath.
It’s time to implement this logic in our code.
#Restaurant Price for 2
try:
    price_for_2_anchor = driver.find_element_by_xpath("""/html/body/div[1]/div[2]/main/div/section[4]/section/section/article[1]/section[2]/p[1]""")
    price_for_2_text = price_for_2_anchor.text
except NoSuchElementException:
    price_for_2_text = "No Price Data Found"

if (price_for_2_text[0:2] == 'Rp') or (price_for_2_text[0:2] == 'No'):
    price_for_2.append(price_for_2_text)
else:
    # First paragraph is something else: try the second one
    price_for_2_anchor = driver.find_element_by_xpath("""/html/body/div[1]/div[2]/main/div/section[4]/section/section/article[1]/section[2]/p[2]""")
    price_for_2_text = price_for_2_anchor.text

    if (price_for_2_text[0:2] == 'Rp') or (price_for_2_text[0:2] == 'No'):
        price_for_2.append(price_for_2_text)
    else:
        # Otherwise take the third paragraph as it is
        price_for_2_anchor = driver.find_element_by_xpath("""/html/body/div[1]/div[2]/main/div/section[4]/section/section/article[1]/section[2]/p[3]""")
        price_for_2_text = price_for_2_anchor.text
        price_for_2.append(price_for_2_text)

print(f'Scraping Restaurant Price for Two - {name} - {price_for_2_text} - OK')
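The “Rp”/“No” check can also be written as a single selection over the candidate paragraph texts, which is easy to try out on its own. The sketch below is a variation rather than the exact logic above: pick_price is a made-up name, the sample texts are invented, and instead of accepting the third paragraph unconditionally it falls back to “No Price Data Found” when no candidate matches.

```python
def pick_price(candidates, fallback="No Price Data Found"):
    """Return the first text that starts with 'Rp' or 'No', else the fallback."""
    for text in candidates:
        if text[:2] in ("Rp", "No"):
            return text
    return fallback


# Invented paragraph texts, standing in for p[1], p[2] and p[3]
page_with_extras = [
    "Popular Dishes",
    "People Say That Place is Well-Known For",
    "Rp200.000 for two people (approx.)",
]
page_without_extras = ["Rp150.000 for two people (approx.)"]

print(pick_price(page_with_extras))
print(pick_price(page_without_extras))
print(pick_price([]))
```

In the scraper, the candidate list could be filled with the texts of the paragraphs that actually exist on the page, so that a missing p[2] or p[3] no longer stops the program.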
Restaurant’s Additional Details
We’ve extracted single web elements through different locators, multiple web elements sharing the same class name, and multiple web elements under one XPath. Now, we will learn to extract another unique thing!
Here, we want to extract all the text data contained in one box on the page (the blue box). If we look at the HTML code, the data is scattered across several elements.
To extract it efficiently, we get the element for the blue box using XPath and then, from that result, get the individual text elements by tag name p. Let’s see the code for this logic.
#Restaurant Additional Information
addt_info_list = []
addt_info_bigelt = driver.find_element_by_xpath("""/html/body/div[1]/div[2]/main/div/section[4]/section/section/article[1]/section[2]/div[3]""")
# Search for p tags inside the box element only, not the whole page
addt_info_eltlist = addt_info_bigelt.find_elements_by_tag_name('p')

for addt_info_anchor in addt_info_eltlist:
    addt_info_text = addt_info_anchor.text
    addt_info_list.append(addt_info_text)

rest_info.append(addt_info_list)
# Print the whole list, so this also works when the box has no p elements
print(f'Scraping Restaurant Additional Info - {name} - {addt_info_list} - OK')
Restaurant’s Latitudes & Longitudes
This data is no less tricky than the previous ones. To understand why, we need to look at the map link on a restaurant page.
The latitude and longitude of every restaurant are embedded in the web address of its map link. We don’t need to open the map’s web address to extract them.
We can just read the web address from the web element and use the string functions available in Python!
#Restaurant Latitude and Longitude
map_url = driver.find_element_by_xpath("""/html/body/div[1]/div[2]/main/div/section[4]/section/article/section/div[2]/a""").get_attribute("href")
# Slice the coordinates out of the end of the map's web address
lat = map_url[-28:-15]
long = map_url[-14:-1]
rest_lat.append(lat)
rest_long.append(long)
print(f'Scraping Restaurant Latitude-Longitude - {name} - {lat} - {long} - OK')
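Fixed slice positions only work while every map address has exactly the same length. As a hedged alternative, the sketch below pulls the two numbers out with a regular expression. It assumes that the address contains the coordinates as “latitude,longitude” in decimal form; the example address is made up to match that assumption, and extract_lat_long is a name chosen for illustration. Unlike the slicing above, it returns floats rather than strings.

```python
import re

# Two signed decimal numbers separated by a comma, e.g. "-6.2263880000,106.8093650000"
COORD_PATTERN = re.compile(r"(-?\d+\.\d+),(-?\d+\.\d+)")


def extract_lat_long(map_url):
    """Return (latitude, longitude) as floats, or None if no pair is found."""
    match = COORD_PATTERN.search(map_url)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Hypothetical map address, assumed to resemble the scraped one
example_url = "https://www.google.com/maps/dir/?api=1&destination=-6.2263880000,106.8093650000"
print(extract_lat_long(example_url))
print(extract_lat_long("https://example.com/no-coordinates"))
```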
One More Side Note
Now that we have written code for every piece of data we want to extract, we can put it all together to scrape all of this data from every page.
However, before doing that, we import one more important library, which we will use to pause the program’s execution.
Depending on the internet connection speed and the type or amount of data we want to extract, we may need to wait for the browser to fully load a page.
import time

# To delay the execution of next coding line
time.sleep(8)
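A fixed pause always costs the full 8 seconds, even when the page is ready sooner. One possible refinement is to poll a condition and continue as soon as it holds. The sketch below shows that idea in plain Python; wait_until and page_ready are made-up names, and the condition is simulated rather than tied to a real browser. Selenium also ships its own explicit-wait helper, WebDriverWait, which follows the same principle.

```python
import time


def wait_until(condition, timeout=8.0, interval=0.5):
    """Poll condition() until it is truthy; raise TimeoutError after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Simulated page that becomes "ready" on the third check
calls = {"count": 0}


def page_ready():
    calls["count"] += 1
    return calls["count"] >= 3


wait_until(page_ready, timeout=2.0, interval=0.01)
print(f"Ready after {calls['count']} checks")
```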
Scraping All the Restaurants Data
Combining all the code above gives us the full scraper. Thanks to the print statements, we can clearly see the progress of the scraping while it runs (and also spot any errors along the way!).
Don’t forget that the results of the scraping are a collection of lists. We need to combine them to create a compact and tidy dataset.
rdf = pd.DataFrame({"Restaurant Name" : rest_name[:],
                    "Restaurant Type" : rest_type[:],
                    "Restaurant Area" : rest_area[:],
                    "Restaurant Rating" : rest_rating[:],
                    "Restaurant Review" : rest_review[:],
                    "Price for 2" : price_for_2[:],
                    "Restaurant Address" : rest_address[:],
                    "Additional Info" : rest_info[:],
                    "Latitude" : rest_lat[:],
                    "Longitude" : rest_long[:]})
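pandas raises a ValueError when the lists passed to the DataFrame differ in length, which can happen if one field was skipped on some page. A quick check beforehand shows which list is short. The sketch below is an illustration only: check_equal_lengths is a made-up helper and the sample values are invented.

```python
def check_equal_lengths(columns):
    """Return the common length of all lists; raise ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Invented sample data: the Latitude list is one entry short
columns = {
    "Restaurant Name": ["A", "B"],
    "Restaurant Area": ["Area 1", "Area 2"],
    "Latitude": ["-6.2"],
}

try:
    check_equal_lengths(columns)
except ValueError as error:
    print(error)
```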
You can find the complete code for scraping the individual restaurant pages here.
Conclusion
Well, this spreadsheet is the final result! We now have the complete Jakarta food-service data that we need from Zomato. In Part 2, we will talk about the next step: completing the data with reverse geocoding!
You’ve seen how to extract restaurant data from Zomato. Along the way, you’ve learned how to use different methods to handle different HTML structures.
Furthermore, please remember that Zomato can change its HTML structure at any time, so you may need to adapt the code when that happens. Every website is different, which is why it is important to adapt the method and the code to the site you are scraping.
We hope you stay healthy during this pandemic. Happy scraping!
https://www.foodspark.io/how-to-scrape-data-foods-around-jakarta-using-selenium-zomato.php