Farin • 3 years ago

Hi again.

I'm getting this error I don't understand:

Traceback (most recent call last):
  File "C:\Users\Far\Desktop\Coding\scraping_imdb_episodes.py", line 5, in <module>
    community_episodes = pd.DataFrame(community_episodes, columns = ['season', 'episode_number', 'title', 'airdate', 'rating', 'total_votes', 'desc'])
NameError: name 'community_episodes' is not defined

https://uploads.disquscdn.c...

How can I fix it? Thanks.
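
For anyone hitting the same traceback: a NameError like this usually means a name is read before the line that assigns it has run. A minimal sketch, assuming a made-up list called episode_rows with one placeholder row (neither is from the tutorial):

```python
# episode_rows is a hypothetical stand-in for the scraped list.
try:
    table = list(episode_rows)   # read before anything has assigned it
except NameError as exc:
    message = str(exc)
    print("Failed:", message)

# Define the name first, fill it, and only then use it.
episode_rows = []
episode_rows.append([1, 1, "Example title"])
table = list(episode_rows)
print(table)
```

In the script from the traceback, that would mean the empty list and the scraping loop have to come before the pd.DataFrame(...) line.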

Farin • 3 years ago

Hi again. I finally got it right. That was my first Python script! The order of the statements wasn't right.
Now fixed:
# ADD REQUESTS, BEAUTIFULSOUP AND PANDAS LIBRARIES
import requests
from bs4 import BeautifulSoup
import pandas as pd


# Initializing the list that the loop will populate
community_episodes = []

#GET THE DATA
# For every season in the series-- range depends on the show
for sn in range(1,4):
    # Request from the server the content of the web page by using get(), and store the server’s response in the variable response
    response = requests.get('https://www.imdb.com/title/tt1439629/episodes?season=' + str(sn))

    # Parse the content of the request with BeautifulSoup
    page_html = BeautifulSoup(response.text, 'html.parser')

    # Select all the episode containers from the season's page
    episode_containers = page_html.find_all('div', class_ = 'info')

    # For each episode in each season
    for episodes in episode_containers:
        # Get the info of each episode on the page
        season = sn
        episode_number = episodes.meta['content']
        title = episodes.a['title']
        airdate = episodes.find('div', class_='airdate').text.strip()
        rating = episodes.find('span', class_='ipl-rating-star__rating').text
        total_votes = episodes.find('span', class_='ipl-rating-star__total-votes').text
        desc = episodes.find('div', class_='item_description').text.strip()
        # Compiling the episode info
        episode_data = [season, episode_number, title, airdate, rating, total_votes, desc]

        # Append the episode info to the complete dataset
        community_episodes.append(episode_data)


#FRAME DATA
# Making the dataframe (+ add "import pandas as pd" 3rd line on top)
community_episodes = pd.DataFrame(community_episodes, columns = ['season', 'episode_number', 'title', 'airdate', 'rating', 'total_votes', 'desc'])

community_episodes.head()

#CLEAN DATA
# Data Cleaning: Converting the total votes count to numeric
# First, we create a function that uses replace() to remove the ',', '(' and ')' characters from total_votes so that we can make it numeric.
def remove_str(votes):
    for r in ((',',''), ('(',''),(')','')):
        votes = votes.replace(*r)

    return votes


#VOTES FORMATTING
#Now we apply the function, taking out the strings, then change the type to int using astype()
community_episodes['total_votes'] = community_episodes.total_votes.apply(remove_str).astype(int)

community_episodes.head()

# RATING AS NUMBER FORMAT
# Making rating numeric instead of a string
community_episodes['rating'] = community_episodes.rating.astype(float)

# AIRDATE AS DATE FORMAT
# Converting the airdate from string to datetime
community_episodes['airdate'] = pd.to_datetime(community_episodes.airdate)

community_episodes.info()

community_episodes.head()

#CREATE A FILE
# Save as CSV File
community_episodes.to_csv('Community_Episodes_IMDb_Ratings.csv',index=False)

https://uploads.disquscdn.c...

Many thanks again!
Be well!
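
The vote-cleaning step can be checked on its own, without requesting anything from IMDb. A small sketch, assuming a few made-up vote strings in the "(3,187)" style that the script strips (the sample values are illustrative, not scraped data):

```python
def remove_str(votes):
    # Strip the thousands separator and the surrounding parentheses.
    for old, new in ((",", ""), ("(", ""), (")", "")):
        votes = votes.replace(old, new)
    return votes

# Hypothetical sample strings, not real vote counts.
samples = ["(3,187)", "(12)", "1,024"]
cleaned = [int(remove_str(s)) for s in samples]
print(cleaned)
```

If int() raises a ValueError here, the string still contains a character that the replace pairs do not cover.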

Farin • 3 years ago

This also helped me grasp the flow:
https://www.youtube.com/wat...

Best Fortune!

Renato Lacerda • 4 years ago

Thanks for the tutorial.
I was trying to do the exact same thing a few days ago, but gave up because my Python is pretty bad and I've never used bs4.
Your tutorial was simple enough that I could just copy and paste what I needed while still understanding the code.

Muhammad Aminullah • 4 years ago

thanks

Adrian Fletcher • 4 years ago

Edit: This is fixed in the final code, but the step-by-step tutorial still says div.

Thank you for your tutorial! One note: I found that the IMDb rating is in a span element, not a div element. I double-checked against your Community example and saw that the rating was in a span element there too. It worked once I made that change.
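
To illustrate the span/div point, here is a small sketch against a hand-written HTML fragment. The markup is an assumption modeled on the class names used in the script posted in this thread, not a copy of IMDb's page, and the numbers are placeholders:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the rating lives in a <span>, wrapped by a <div>.
html = """
<div class="info">
  <div class="ipl-rating-star">
    <span class="ipl-rating-star__rating">7.8</span>
    <span class="ipl-rating-star__total-votes">(3,187)</span>
  </div>
</div>
"""

container = BeautifulSoup(html, "html.parser").find("div", class_="info")

# Searching for a <div> with the rating class finds nothing ...
as_div = container.find("div", class_="ipl-rating-star__rating")
# ... while searching for a <span> with the same class finds the element.
as_span = container.find("span", class_="ipl-rating-star__rating")

print(as_div)        # None, so calling .text on it would raise AttributeError
print(as_span.text)
```

When find() matches nothing it returns None, which is why using the wrong tag name tends to surface as an AttributeError on .text rather than as a wrong value.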