Web Scraping Books on Goodreads Using Python

Books have always been a valuable source of knowledge, entertainment, and culture. With the rise of online bookstores, digital libraries, and review platforms, accessing book-related information has become easier than ever. However, manually gathering details like book titles, authors, prices, ratings, and reviews from multiple websites can be time-consuming.
In this project, we’re diving into Goodreads, one of the largest online platforms for readers and book recommendations, and using web scraping to collect valuable book data. From bestselling titles to user reviews, we’ll extract and analyze a wide range of insights.
Why Goodreads?
Goodreads isn’t just a place for book reviews; it’s a goldmine for book lovers, offering an abundance of valuable data that goes beyond book titles. With millions of users contributing reviews, ratings, and recommendations, it’s the perfect platform for tapping into real-time reading trends and preferences. It’s packed with rich data about:
📖 Books, Authors, and Summaries
⭐ Ratings & Reviews
📚 Genres & Recommendations
Outline of the article
- Retrieve the webpage using requests.
- Parse the HTML content with BeautifulSoup.
- Extract the book title, author, and relevant metadata.
- Compile the data and save it into a CSV file.
- Analyze and visualize the collected data.
Setting up the Scraping Environment
Tools needed for this project:
- Python: Ensure Python is installed on your system.
- Libraries: Install the required libraries, requests and BeautifulSoup. Requests is a Python library used to send HTTP requests and retrieve webpage content, such as HTML or JSON. BeautifulSoup is a library that parses HTML or XML content, allowing you to navigate the page and extract specific elements (like titles, links, or images).
- Power BI: For data transformation and visualization.
pip install requests
pip install beautifulsoup4
Then import the installed libraries, along with the standard-library csv and json modules used later when saving the results.
import csv
import json
import requests
from bs4 import BeautifulSoup
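As a quick smoke test that both installations work, the snippet below (using example.com as a neutral placeholder URL, not part of this project) fetches a page and prints its title:
test_page = requests.get("https://example.com").content
test_soup = BeautifulSoup(test_page, "html.parser")
print(test_soup.title.text)  # should print "Example Domain"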
Retrieve the webpage using requests
The page used covers the top 100 highest-rated books on Goodreads with at least 10,000 ratings, spanning various genres.
The first step in web scraping is to download the HTML content of the webpage. This is done by sending an HTTP request to the website’s URL. To retrieve the content from the Goodreads page, we send a GET request using the requests library.
page = requests.get(
    "https://www.goodreads.com/list/show/153860.Goodreads_Top_100_Highest_Rated_Books_on_Goodreads_with_at_least_10_000_Ratings"
).content
Parse the HTML content with BeautifulSoup
The HTML source code is parsed using BeautifulSoup. This allows for navigating the page’s structure and extracting the necessary data. The content retrieved in the previous step is used to create a BeautifulSoup object to process the HTML.
soup = BeautifulSoup(page, "html.parser")
Extract the book title, author, and relevant metadata
Before extracting the book title, author, and relevant metadata, it’s essential to inspect the HTML structure of the webpage. This helps identify the correct tags and classes that contain the desired data.

Once the relevant tags are identified, BeautifulSoup’s methods can be used to extract the required information. Inspecting the page shows that the <div> tag with the "left" class contains the necessary details. To extract specific parts of the HTML, the find() or find_all() methods can be used; specifying the tag’s class helps narrow down the results.
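As a quick illustration of the difference between the two methods (run against the soup object created earlier; the "left" class is specific to this page):
first_row = soup.find("tr")                        # first matching tag, or None if absent
all_rows = soup.find_all("tr")                     # list of every matching tag
left_cells = soup.find_all("div", class_="left")   # narrow the search by class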
The loop below follows the link for each book in the list and prints the extracted details, starting with the first book in the list, Words of Radiance.
books = []
failed_books = []
cards = soup.find("table", class_="tableList js-dataTooltip").find_all("tr")
for card in cards:
    book_link = None
    try:
        book_link = card.find("a", itemprop="url")["href"]
        print(book_link)
        # Fetch and parse the individual book page
        single_book_page = requests.get(f"https://www.goodreads.com{book_link}").content
        book_soup = BeautifulSoup(single_book_page, "html.parser")
        book_publication = book_soup.find("p", {"data-testid": "publicationInfo"})
        book_publication = None if book_publication is None else book_publication.text.strip()
        book_genre = book_soup.find("span", class_="BookPageMetadataSection__genreButton").text.strip()
        # Most of the metadata lives in an embedded JSON-LD script tag
        book_info = book_soup.find("script", {"type": "application/ld+json"})
        book_info = None if book_info is None else book_info.text.strip()
        book_info_json = json.loads(book_info)
        book_title = book_info_json.get("name")
        book_image = book_info_json.get("image")
        book_format = book_info_json.get("bookFormat")
        book_num_of_pages = book_info_json.get("numberOfPages")
        book_in_language = book_info_json.get("inLanguage")
        book_isbn = book_info_json.get("isbn")
        book_awards = book_info_json.get("awards")
        book_author = book_info_json.get("author")
        book_aggregate_rating = book_info_json.get("aggregateRating") or {}
        book_rating_value = book_aggregate_rating.get("ratingValue")
        book_rating_count = book_aggregate_rating.get("ratingCount")
        book_review_count = book_aggregate_rating.get("reviewCount")
        record = {
            "book_title": book_title,
            "book_image": book_image,
            "book_format": book_format,
            "book_num_of_pages": book_num_of_pages,
            "book_in_language": book_in_language,
            "book_isbn": book_isbn,
            "book_awards": book_awards,
            "book_author": book_author,
            "book_publication": book_publication,
            "book_genre": book_genre,
            "book_rating_value": book_rating_value,
            "book_rating_count": book_rating_count,
            "book_review_count": book_review_count,
        }
        books.append(record)
        print(record)
    except Exception as e:
        print(e)
        # Record the failure so it can be saved to a JSON file later
        failed_books.append({"book_link": book_link, "error": str(e)})
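One optional refinement when looping over a hundred detail pages: pausing briefly between requests keeps the scraper from overwhelming the server. A minimal sketch of how the fetch line inside the loop could be adapted (the one-second delay is an arbitrary choice):
import time

time.sleep(1)  # wait roughly a second before each detail-page request
single_book_page = requests.get(f"https://www.goodreads.com{book_link}").content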
Compile the data and save it into a CSV file
After extracting the data, it is structured and saved into a CSV file using Python’s csv.DictWriter(), ensuring all book details are properly formatted. Additionally, any failed extractions are stored in a JSON file using json.dump(), allowing for further review and debugging.
keys = books[0].keys()
with open("books.csv", "w", newline="", encoding="UTF-8") as f:
    writer = csv.DictWriter(f, keys)
    writer.writeheader()
    writer.writerows(books)
print("Done!")

with open("failed_json.json", "w") as f:
    # failed_books holds the entries that raised an error during scraping
    json.dump(failed_books, f)
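As an optional sanity check (not part of the original script), the CSV can be read back to confirm the row count and spot-check a field:
with open("books.csv", newline="", encoding="UTF-8") as f:
    reader = csv.DictReader(f)
    rows = list(reader)
print(f"Wrote {len(rows)} books; first title: {rows[0]['book_title']}")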

Analyze and Visualize the Collected Data
The extracted data is then imported into a visualization tool for analysis. In this project, Power BI was used to create visual representations of the data.
Data cleaning: After importing the data into Power BI, data cleaning and transformation are performed using Power Query to refine and structure the dataset. Once cleaned, the data is used for visualization to uncover meaningful insights.

Data visualization
Power BI was used to present the information through charts and graphs.

Notable takeaways:
- “Heaven’s Official Blessing” by Mo Xiang Tong Xiu, a fantasy book, received the most ratings, with a total of 12,047 ratings and 1,599 reviews.
- The highest-rated genre was fantasy.
- English was the most preferred language.
- Hardcover was the most popular book format.
- The number of reviews declined over time, with a peak in 2023 followed by a drop in 2024.