Part 3: How to Perform an EDA on Yelp Extracted Data?
This is the third in a series of articles that use BeautifulSoup to scrape Yelp restaurant reviews and then apply Machine Learning to extract insights from the data. In this article, you will reuse the scraping code to collect all the reviews in a list. The script is as follows:
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup
from textblob import TextBlob

# Arguments used to scrape the website: restaurant slug, paginated link, page count
rest_dict = [
    {
        "name": "the-cortez-raleigh",
        "link": "https://www.yelp.com/biz/the-cortez-raleigh?osq=Restaurants&start=",
        "pages": 3,
    },
    {
        "name": "rosewater-kitchen-and-bar-raleigh",
        "link": "https://www.yelp.com/biz/rosewater-kitchen-and-bar-raleigh?osq=Restaurants&start=",
        "pages": 3,
    },
]


# Scraping function
def scrape(rest_list):
    all_comment_list = list()
    for rest in rest_list:
        comment_list = list()
        # Start at 0 so the first page (offset 0) is included: offsets 0, 10, 20, ...
        for pag in range(rest['pages']):
            try:
                time.sleep(5)  # pause between requests
                # URL = "https://www.yelp.com/biz/the-cortez-raleigh?osq=Restaurants&start="+str(pag*10)+"&sort_by=rating_asc"
                URL = rest['link'] + str(pag * 10)
                print(rest['name'], 'downloading page ', pag * 10)
                page = requests.get(URL)
                # Next step: parsing
                soup = BeautifulSoup(page.content, 'lxml')
                for comm in soup.find("yelp-react-root").find_all("p", {"class": "comment__373c0__Nsutg css-n6i4z7"}):
                    comment_list.append(comm.find("span").decode_contents())
                    print(comm.find("span").decode_contents())
            except Exception:
                print("could not work properly!")
        all_comment_list.append([comment_list, rest['name']])
    return all_comment_list


# Store all reviews in a list
reviews = scrape(rest_dict)
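The `start=` value appended to each link works as a review offset. As a minimal sketch, assuming 10 reviews per page (the multiplier used in the script), the URLs for a restaurant can be listed as follows. The helper name `page_urls` and its `page_size` parameter are hypothetical and not part of the original script.

```python
def page_urls(link, pages, page_size=10):
    """Return one URL per results page, assuming `page_size` reviews per page."""
    # range(pages) starts at 0, so the first page (offset 0) is included
    return [link + str(page * page_size) for page in range(pages)]


link = "https://www.yelp.com/biz/the-cortez-raleigh?osq=Restaurants&start="
urls = page_urls(link, 3)
for url in urls:
    print(url)
```

Note that a loop starting at 1 instead of 0 would begin at offset 10 and skip the first page of reviews.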
The script stores the output of the function in a variable named reviews. Printing the variable shows a nested list.
The nested list's structure follows this pattern:
[[[review1, review2], restaurant1], [[review1, review2], restaurant2]]
This list will now be converted into a DataFrame using pandas.
Converting the Data into a DataFrame
Now that you have a list of the reviews and their respective restaurants, you will need to build a DataFrame to hold all of the information.
df = pd.DataFrame(reviews)
Coercing this nested list into a DataFrame directly gives one column full of lists and another column with a single restaurant name per row. To restructure the data correctly, we will use the explode function, which creates a separate row for each element of the lists in the column it is applied to, in this example column 0.
df = df.explode(0)
The dataset is now appropriately structured: each review has a restaurant associated with it.
After explode, every row keeps the index of the list it came from, so the rows are labeled only 0 and 1. The only thing left to do is reset the index.
df = df.reset_index(drop=True)
df[0:10]
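The three steps above can be traced on a small hand-made list. This is only a sketch: `toy_reviews` and its contents are invented stand-ins for the scraper output, shaped like the nested list described earlier.

```python
import pandas as pd

# Hand-made stand-in for the scraper output: [[reviews], restaurant] pairs
toy_reviews = [
    [["Great tacos", "Slow service"], "restaurant-a"],
    [["Lovely patio"], "restaurant-b"],
]

toy_df = pd.DataFrame(toy_reviews)      # 2 rows: column 0 holds lists, column 1 holds names
toy_df = toy_df.explode(0)              # one row per review; index labels repeat (0, 0, 1)
toy_df = toy_df.reset_index(drop=True)  # index labels become 0, 1, 2
print(toy_df)
```

Each review ends up on its own row next to the restaurant it belongs to, which is the shape the later steps rely on.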
Performing Sentiment Analysis to Classify Reviews
The star ratings that reviewers assigned to each review on the website are complicated to extract. Instead, you will use sentiment analysis to fill in the missing information: the NLP model's inferences will take the place of each review's star rating. Keep in mind that this is an experiment, and that the result depends on the model we employ, which is not always precise.
We will use TextBlob, a simple library that ships with a ready-made sentiment analyzer for the task. Because it has to be applied to every review, we will first develop a function that returns the estimated sentiment of a paragraph in a range of -1 to 1.
def perform_sentiment(x):
    # TextBlob's sentiment is a (polarity, subjectivity) pair; only polarity is returned
    testimonial = TextBlob(x)
    return testimonial.sentiment.polarity
After developing the function, we will use the pandas apply method to add a new column to our dataset that holds the results of the analysis. The sort_values method will then be used to sort all of the reviews, starting with the most negative ones.
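The article does not show the code for this step, so the following is a minimal sketch of what it could look like. The column name `sentiment` is an assumption, and `toy_sentiment` is a hypothetical word-counting stand-in used only so the snippet runs without TextBlob; in the actual pipeline you would pass `perform_sentiment` to `apply` instead.

```python
import pandas as pd

# Hypothetical stand-in scorer so the snippet runs without TextBlob;
# in the article's pipeline, pass perform_sentiment instead.
POSITIVE = {"great", "lovely"}
NEGATIVE = {"slow", "cold"}


def toy_sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return score / max(len(words), 1)


toy_df = pd.DataFrame({
    0: ["Great tacos", "Slow service", "Lovely patio but cold food"],
    1: ["restaurant-a", "restaurant-a", "restaurant-b"],
})

# New column with one score per review ("sentiment" is an assumed column name)
toy_df["sentiment"] = toy_df[0].apply(toy_sentiment)
# Ascending sort puts the most negative reviews first
toy_df = toy_df.sort_values("sentiment").reset_index(drop=True)
print(toy_df)
```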
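The final dataset then holds each review, its restaurant, and its estimated sentiment, ordered from the most negative review to the most positive.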
Extracting Word Frequency
To continue with the experiment, we will now extract the most frequently used words in a subset of the dataset. However, there is a problem. Although certain words have the same root, such as "eating" and "ate," the algorithm will not automatically place them in the same category, because to the computer they are simply different strings. As a solution to this difficulty, we will employ lemmatization, an NLP pre-processing approach.
Lemmatization reduces each word to its base form (its lemma), removing inflectional variation so that the data can be normalized. Lemmatizers are models that must be pre-trained before they can be used. To import a lemmatizer, we will use the spaCy library.
!pip install spacy
spaCy is an open-source NLP library that includes a lemmatizer and many pre-trained models. The function below lemmatizes the words in a given text and returns the frequency of each term (the number of times it has been used). The results are arranged in descending order of frequency, so the words that appear most often in a set of reviews come first.
def top_frequent(text, num_words):
    # Frequency of the most common words
    import spacy
    from collections import Counter

    # The "en" shortcut was removed in spaCy 3; load the small English model by name
    # (install it first with: python -m spacy download en_core_web_sm)
    nlp = spacy.load("en_core_web_sm")

    # Lemmatization: replace every token with its lemma
    doc = nlp(text)
    token_list = [token.lemma_ for token in doc]
    lemmatized = ' '.join(token_list)

    # Remove stopwords and punctuation
    doc = nlp(lemmatized)
    words = [token.text for token in doc if not token.is_stop and not token.is_punct]

    # Count the remaining words and keep the num_words most frequent
    word_freq = Counter(words)
    common_words = word_freq.most_common(num_words)
    return common_words
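The logic of the function can be illustrated without spaCy. In the sketch below, `LEMMAS` and `STOP_WORDS` are tiny hand-written tables invented for the example (spaCy derives both from its language model), and `toy_top_frequent` is a hypothetical name. It only shows how mapping words to a shared lemma changes the counts.

```python
from collections import Counter

# Hypothetical hand-written lemma table and stop-word list, for illustration only
LEMMAS = {"eating": "eat", "ate": "eat", "eats": "eat", "drinks": "drink"}
STOP_WORDS = {"we", "the", "and", "i", "was", "a"}


def toy_top_frequent(text, num_words):
    # Crude tokenization: lowercase, drop commas and periods, split on whitespace
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    # Map each token to its lemma when the table knows it
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
    # Drop stop words, then count what is left
    words = [w for w in lemmas if w not in STOP_WORDS]
    return Counter(words).most_common(num_words)


sample = "We ate the tacos. I was eating tacos and drinks, and she eats."
print(toy_top_frequent(sample, 2))  # [('eat', 3), ('tacos', 2)]
```

Without the lemma table, "ate", "eating", and "eats" would each be counted once and none of them would reach the top of the list.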
We will extract the most common words from the most negatively scored reviews rather than from the complete list of reviews. The data has already been sorted to place the most negative reviews first, so all that remains is to build a single string that contains those reviews. To convert the list of reviews into a string, we will use the join function.
text = ' '.join(list(df[0].values[0:20]))
top_frequent(text, 100)

[('great', 22), ('<', 21), ('come', 16), ('order', 16), ('place', 14), ('little', 10), ('try', 10), ('nice', 10), ('food', 10), ('restaurant', 10), ('menu', 10), ('day', 10), ('butter', 9), ('drink', 9), ('dinner', 8), ...
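The '<' entry in the output is probably residue of HTML tags (such as line breaks) in the review text, since decode_contents() returns the inner markup of each element as a string. As a hedged sketch, the tags could be stripped before counting. The helper `strip_tags` is hypothetical, and a regular expression is only a rough approximation of HTML parsing; calling BeautifulSoup's get_text() during scraping would be an alternative.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <br/> or <span class="x">
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(html_text):
    """Replace tag-like fragments with a space and collapse repeated whitespace."""
    without_tags = TAG_PATTERN.sub(" ", html_text)
    return " ".join(without_tags.split())


raw = "Great food!<br/><br/>The patio was lovely."
print(strip_tags(raw))  # Great food! The patio was lovely.
```

Applied to every review before the join step, a cleanup like this should keep markup characters out of the frequency list.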
If you are looking to perform an EDA on Yelp data, you can contact Foodspark today!
Know more : https://www.foodspark.io/part-3-how-to-perform-an-eda-on-yelp-extracted-data.php