10 Scraping; Parallel Processing


Yerevan, photo link, author: Armen Harutunian

TODO

Open In Colab (ToDo)

Song reference - ToDo

📌 Description

📚 Full Material

📺 Videos

🏡 Homework

📚 The Material

🌐 HTML Basics - Understanding Web Structure

Before diving into web scraping, it’s essential to understand how web pages are built. HTML (HyperText Markup Language) defines the structure of every web page.

What is HTML?

HTML uses tags to define elements. Tags are enclosed in angle brackets < > and usually come in pairs:

<tagname>Content goes here</tagname>

Basic HTML Document Structure:

<!DOCTYPE html>
<html>
<head>
    <title>Page Title</title>
</head>
<body>
    <h1>Main Heading</h1>
    <p>This is a paragraph.</p>
</body>
</html>

Common HTML Tags for Scraping:

Document Structure:

  • <html> - Root element
  • <head> - Contains metadata
  • <title> - Page title
  • <body> - Visible page content

Text Content:

  • <h1>, <h2>, <h3> - Headers (most important to least)
  • <p> - Paragraphs
  • <span> - Inline text container

Containers:

  • <div> - Block-level container (most common)
  • <section> - Semantic section
  • <article> - Independent content

Data Tables:

  • <table>, <tr>, <td>, <th> - Tables, rows, cells, headers

HTML Attributes - The Key to Scraping

Attributes provide additional information about elements and are crucial for web scraping:

<div id="content" class="main-section">
<a href="https://example.com" target="_blank">Link</a>
<img src="image.jpg" alt="Description">
<div data-price="29.99" data-category="electronics">Product</div>

Most Important Attributes for Scraping:

  • id - Unique identifier (use with # in CSS selectors)
  • class - CSS class name(s) (use with . in CSS selectors)
  • href - Link destination
  • src - Source for images/scripts
  • data-* - Custom data attributes (very common on modern websites)

Why Attributes Matter:

  • They help us target specific elements
  • They often contain valuable data
  • They make our scrapers more precise
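As a quick preview of how attributes are used in practice, here is a minimal Beautiful Soup sketch over the snippet shown above (the library itself is introduced in detail below):

```python
from bs4 import BeautifulSoup

html = '''
<div id="content" class="main-section">
    <a href="https://example.com" target="_blank">Link</a>
    <img src="image.jpg" alt="Description">
    <div data-price="29.99" data-category="electronics">Product</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Attribute access works like a dictionary on a tag
link = soup.find('a')
print(link['href'])                         # https://example.com

# Find by the presence of a custom data-* attribute
product = soup.find('div', attrs={'data-price': True})
print(product['data-price'])                # 29.99
print(product.get('data-missing', 'n/a'))   # .get() avoids KeyError for absent attributes
```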

🎯 CSS Selectors - Your Scraping Toolkit

CSS selectors are THE MOST IMPORTANT concept in web scraping. They tell your scraper exactly which elements to extract.

Basic Selectors:

1. Element Selector:

p          /* Selects all <p> elements */
div        /* Selects all <div> elements */
h1         /* Selects all <h1> elements */

2. Class Selector (starts with .):

.classname     /* Selects elements with class="classname" */
.post-title    /* Selects elements with class="post-title" */
.btn-primary   /* Selects elements with class="btn-primary" */

3. ID Selector (starts with #):

#idname        /* Selects element with id="idname" */
#main-content  /* Selects element with id="main-content" */
#header        /* Selects element with id="header" */

4. Attribute Selector:

[href]                    /* Elements with href attribute */
[class="post"]           /* Elements with class="post" */
[data-price="29.99"]     /* Elements with data-price="29.99" */

Advanced CSS Selectors:

Combination Selectors:

div p              /* All <p> inside <div> (descendant) */
div > p            /* Direct <p> children of <div> */
h1 + p             /* First <p> immediately after <h1> */
.post .title       /* Elements with class "title" inside elements with class "post" */

Multiple Classes:

.post.featured     /* Elements with BOTH classes "post" AND "featured" */
.btn.btn-primary   /* Elements with BOTH classes "btn" AND "btn-primary" */

Pseudo-selectors:

p:first-child      /* A <p> that is the first child of its parent */
p:last-child       /* A <p> that is the last child of its parent */
p:nth-child(2)     /* A <p> that is the second child of its parent */
a:contains("Next") /* Links containing text "Next" (non-standard CSS; see note below) */
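Note that `:contains()` is not part of standard CSS; Beautiful Soup's selector engine (soupsieve) provides `:-soup-contains()` for matching elements by their text instead. A small sketch:

```python
from bs4 import BeautifulSoup

html = '<div><a href="/page/2/">Next</a><a href="/page/0/">Prev</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# :-soup-contains() matches elements whose text contains the given string
next_links = soup.select('a:-soup-contains("Next")')
print(next_links[0]['href'])  # /page/2/
```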

Complex Examples:

div.post-content p.highlight    /* <p> with class "highlight" inside <div> with class "post-content" */
#main-content .sidebar a[href]  /* Links inside sidebar inside main content */
table tr:nth-child(odd)         /* Odd rows in a table */

🥄 Beautiful Soup - Your HTML Parsing Companion

Beautiful Soup is perfect for beginners and handles most scraping tasks effectively. It makes parsing HTML as easy as navigating a family tree!

Why Beautiful Soup?

  • Easy to learn: Intuitive syntax
  • Powerful: Handles messy HTML gracefully
  • Flexible: Multiple ways to find elements
  • Robust: Handles encoding issues automatically

Core Concepts:

  1. Parsing: Convert HTML text into a navigable object
  2. Searching: Find specific elements using tags, attributes, or CSS selectors
  3. Extracting: Get text, attributes, or sub-elements
  4. Navigating: Move between parent, children, and sibling elements
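The four steps above can be sketched in a few lines (the HTML and tag names here are illustrative):

```python
from bs4 import BeautifulSoup

html = '<div class="post"><h2>Title</h2><p>First</p><p>Second</p></div>'

# 1. Parsing: HTML text -> navigable soup object
soup = BeautifulSoup(html, 'html.parser')

# 2. Searching: find elements by tag, attribute, or CSS selector
post = soup.select_one('div.post')

# 3. Extracting: pull out text and attributes
title = post.h2.text          # 'Title'
classes = post.get('class')   # ['post']

# 4. Navigating: move between related elements
first_p = post.h2.find_next_sibling('p')
print(title, classes, first_p.text)
```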

Sample HTML

from bs4 import BeautifulSoup

# Sample HTML for practicing CSS selectors
practice_html = """
<html>
<body>
    <div id="header" class="top-section">
        <h1 class="main-title">Welcome to Our Store</h1>
        <nav class="navigation">
            <a href="/home">Home</a>
            <a href="/products">Products</a>
            <a href="/contact">Contact</a>
        </nav>
    </div>
    
    <div id="main-content">
        <div class="product featured" data-price="199.99">
            <h2 class="product-title">iPhone 15</h2>
            <p class="description">Latest smartphone with amazing features</p>
            <span class="price">$199.99</span>
        </div>
        
        <div class="product" data-price="89.99">
            <h2 class="product-title">Headphones</h2>
            <p class="description">High-quality wireless headphones</p>
            <span class="price">$89.99</span>
        </div>
    </div>
</body>
</html>
"""

soup = BeautifulSoup(practice_html, 'html.parser')

print(soup.prettify())
<html>
 <body>
  <div class="top-section" id="header">
   <h1 class="main-title">
    Welcome to Our Store
   </h1>
   <nav class="navigation">
    <a href="/home">
     Home
    </a>
    <a href="/products">
     Products
    </a>
    <a href="/contact">
     Contact
    </a>
   </nav>
  </div>
  <div id="main-content">
   <div class="product featured" data-price="199.99">
    <h2 class="product-title">
     iPhone 15
    </h2>
    <p class="description">
     Latest smartphone with amazing features
    </p>
    <span class="price">
     $199.99
    </span>
   </div>
   <div class="product" data-price="89.99">
    <h2 class="product-title">
     Headphones
    </h2>
    <p class="description">
     High-quality wireless headphones
    </p>
    <span class="price">
     $89.99
    </span>
   </div>
  </div>
 </body>
</html>
soup.select('h2')

soup.h2
<h2 class="product-title">iPhone 15</h2>
soup.select('.product') # . - class
[<div class="product featured" data-price="199.99">
 <h2 class="product-title">iPhone 15</h2>
 <p class="description">Latest smartphone with amazing features</p>
 <span class="price">$199.99</span>
 </div>,
 <div class="product" data-price="89.99">
 <h2 class="product-title">Headphones</h2>
 <p class="description">High-quality wireless headphones</p>
 <span class="price">$89.99</span>
 </div>]
soup.select('#main-content') # # - id
[<div id="main-content">
 <div class="product featured" data-price="199.99">
 <h2 class="product-title">iPhone 15</h2>
 <p class="description">Latest smartphone with amazing features</p>
 <span class="price">$199.99</span>
 </div>
 <div class="product" data-price="89.99">
 <h2 class="product-title">Headphones</h2>
 <p class="description">High-quality wireless headphones</p>
 <span class="price">$89.99</span>
 </div>
 </div>]

# 1. Basic selectors
print("\n1️⃣ Basic Selectors:")
print(f"All h2 elements: {len(soup.select('h2'))} found")
print(f"Elements with class 'product': {len(soup.select('.product'))} found")
print(f"Element with id 'header': {len(soup.select('#header'))} found")

1️⃣ Basic Selectors:
All h2 elements: 2 found
Elements with class 'product': 2 found
Element with id 'header': 1 found
# 2. Find specific content
print("2️⃣ Finding Specific Content:")
main_title = soup.select_one('h1.main-title')


print(main_title.text)
2️⃣ Finding Specific Content:
Welcome to Our Store
navigation_links = soup.select('nav.navigation a')
print(navigation_links)

for nav in navigation_links:
    print(nav.text)
    print(nav["href"]) # the href attribute holds the link destination
[<a href="/home">Home</a>, <a href="/products">Products</a>, <a href="/contact">Contact</a>]
Home
/home
Products
/products
Contact
/contact
# 3. Product information
print("\n3️⃣ Extract Product Information:")
products = soup.select('div.product')

print(products)
products[0]

3️⃣ Extract Product Information:
[<div class="product featured" data-price="199.99">
<h2 class="product-title">iPhone 15</h2>
<p class="description">Latest smartphone with amazing features</p>
<span class="price">$199.99</span>
</div>, <div class="product" data-price="89.99">
<h2 class="product-title">Headphones</h2>
<p class="description">High-quality wireless headphones</p>
<span class="price">$89.99</span>
</div>]
<div class="product featured" data-price="199.99">
<h2 class="product-title">iPhone 15</h2>
<p class="description">Latest smartphone with amazing features</p>
<span class="price">$199.99</span>
</div>

for product in products:
    title = product.select_one('.product-title').text
    price = product.select_one('.price').text
    is_featured = 'featured' in product.get('class', [])
    
    print(title)
    print(f"  Price: {price}")
    print(f"  Featured: {'✅' if is_featured else '❌'}")
iPhone 15
  Price: $199.99
  Featured: ✅
Headphones
  Price: $89.99
  Featured: ❌
# 4. Advanced selectors
print("\n4️⃣ Advanced Selectors:")
featured_product = soup.select_one('.product.featured .product-title')
if featured_product:
    print(f"Featured product: {featured_product.text}")

expensive_products = soup.select('[data-price]')
print(f"Products with price data: {len(expensive_products)}")

4️⃣ Advanced Selectors:
Featured product: iPhone 15
Products with price data: 2
# Beautiful Soup Complete Example - Part 2: Finding Elements
# Using the same soup object (the store HTML) from Part 1

print("🥄 Beautiful Soup Complete Example - Part 2")
print("=" * 50)

print("\n🔍 Different Ways to Find Elements:")

# Method 1: By tag name
print("\n1️⃣ Find by tag name:")
titles = soup.find_all('h2')
print(f"   Found {len(titles)} h2 elements:")
for i, title in enumerate(titles, 1):
    print(f"     {i}. {title.text}")

# Method 2: By class
print("\n2️⃣ Find by class:")
prices = soup.find_all('span', class_='price')
print(f"   Found {len(prices)} elements with class 'price'")

# Method 3: By ID
print("\n3️⃣ Find by ID:")
header = soup.find('div', id='header')
if header:
    print(f"   Header found: {header.h1.text}")

# Method 4: By multiple classes (exact class string match)
print("\n4️⃣ Find by multiple attributes:")
featured_product = soup.find('div', {'class': 'product featured'})
if featured_product:
    print(f"   Featured product: {featured_product.find('h2').text}")

# Method 5: By custom attributes
print("\n5️⃣ Find by custom attributes:")
cheap_product = soup.find('div', {'data-price': '89.99'})
if cheap_product:
    name = cheap_product.find('h2', class_='product-title').text
    print(f"   Product priced 89.99: {name}")

print("\n💡 Key Takeaway:")
print("   find() → First match or None")
print("   find_all() → List of all matches")
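This difference matters in practice: `find()` returns `None` when nothing matches (so attribute access raises `AttributeError`), while `find_all()` and `select()` return an empty list. A small sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="a">hello</p>', 'html.parser')

missing = soup.find('div')       # no <div> -> None
all_divs = soup.find_all('div')  # no <div> -> empty list

print(missing)    # None
print(all_divs)   # []

# Guard against None before accessing text or attributes:
text = missing.text if missing is not None else 'not found'
print(text)       # not found
```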

🌐 Real Website Scraping Example

# Step 1: Import required libraries for web scraping
import requests
from bs4 import BeautifulSoup

import time
import json

print("✅ Libraries imported successfully!")
url = "http://quotes.toscrape.com/"

response = requests.get(url)
response.raise_for_status()  # Raise exception for bad status codes

print(response.status_code)

soup = BeautifulSoup(response.text, 'html.parser')

print(soup.prettify())
200
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert-Einstein">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
       <a class="tag" href="/tag/change/page/1/">
        change
       </a>
       <a class="tag" href="/tag/deep-thoughts/page/1/">
        deep-thoughts
       </a>
       <a class="tag" href="/tag/thinking/page/1/">
        thinking
       </a>
       <a class="tag" href="/tag/world/page/1/">
        world
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “It is our choices, Harry, that show what we truly are, far more than our abilities.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        J.K. Rowling
       </small>
       <a href="/author/J-K-Rowling">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="abilities,choices" itemprop="keywords"/>
       <a class="tag" href="/tag/abilities/page/1/">
        abilities
       </a>
       <a class="tag" href="/tag/choices/page/1/">
        choices
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert-Einstein">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="inspirational,life,live,miracle,miracles" itemprop="keywords"/>
       <a class="tag" href="/tag/inspirational/page/1/">
        inspirational
       </a>
       <a class="tag" href="/tag/life/page/1/">
        life
       </a>
       <a class="tag" href="/tag/live/page/1/">
        live
       </a>
       <a class="tag" href="/tag/miracle/page/1/">
        miracle
       </a>
       <a class="tag" href="/tag/miracles/page/1/">
        miracles
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Jane Austen
       </small>
       <a href="/author/Jane-Austen">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="aliteracy,books,classic,humor" itemprop="keywords"/>
       <a class="tag" href="/tag/aliteracy/page/1/">
        aliteracy
       </a>
       <a class="tag" href="/tag/books/page/1/">
        books
       </a>
       <a class="tag" href="/tag/classic/page/1/">
        classic
       </a>
       <a class="tag" href="/tag/humor/page/1/">
        humor
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Marilyn Monroe
       </small>
       <a href="/author/Marilyn-Monroe">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="be-yourself,inspirational" itemprop="keywords"/>
       <a class="tag" href="/tag/be-yourself/page/1/">
        be-yourself
       </a>
       <a class="tag" href="/tag/inspirational/page/1/">
        inspirational
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “Try not to become a man of success. Rather become a man of value.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert-Einstein">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="adulthood,success,value" itemprop="keywords"/>
       <a class="tag" href="/tag/adulthood/page/1/">
        adulthood
       </a>
       <a class="tag" href="/tag/success/page/1/">
        success
       </a>
       <a class="tag" href="/tag/value/page/1/">
        value
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “It is better to be hated for what you are than to be loved for what you are not.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        André Gide
       </small>
       <a href="/author/Andre-Gide">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="life,love" itemprop="keywords"/>
       <a class="tag" href="/tag/life/page/1/">
        life
       </a>
       <a class="tag" href="/tag/love/page/1/">
        love
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “I have not failed. I've just found 10,000 ways that won't work.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Thomas A. Edison
       </small>
       <a href="/author/Thomas-A-Edison">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="edison,failure,inspirational,paraphrased" itemprop="keywords"/>
       <a class="tag" href="/tag/edison/page/1/">
        edison
       </a>
       <a class="tag" href="/tag/failure/page/1/">
        failure
       </a>
       <a class="tag" href="/tag/inspirational/page/1/">
        inspirational
       </a>
       <a class="tag" href="/tag/paraphrased/page/1/">
        paraphrased
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Eleanor Roosevelt
       </small>
       <a href="/author/Eleanor-Roosevelt">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="misattributed-eleanor-roosevelt" itemprop="keywords"/>
       <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">
        misattributed-eleanor-roosevelt
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “A day without sunshine is like, you know, night.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Steve Martin
       </small>
       <a href="/author/Steve-Martin">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="humor,obvious,simile" itemprop="keywords"/>
       <a class="tag" href="/tag/humor/page/1/">
        humor
       </a>
       <a class="tag" href="/tag/obvious/page/1/">
        obvious
       </a>
       <a class="tag" href="/tag/simile/page/1/">
        simile
       </a>
      </div>
     </div>
     <nav>
      <ul class="pager">
       <li class="next">
        <a href="/page/2/">
         Next
         <span aria-hidden="true">
          →
         </span>
        </a>
       </li>
      </ul>
     </nav>
    </div>
    <div class="col-md-4 tags-box">
     <h2>
      Top Ten tags
     </h2>
     <span class="tag-item">
      <a class="tag" href="/tag/love/" style="font-size: 28px">
       love
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/inspirational/" style="font-size: 26px">
       inspirational
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/life/" style="font-size: 26px">
       life
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/humor/" style="font-size: 24px">
       humor
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/books/" style="font-size: 22px">
       books
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/reading/" style="font-size: 14px">
       reading
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/friendship/" style="font-size: 10px">
       friendship
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/friends/" style="font-size: 8px">
       friends
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/truth/" style="font-size: 8px">
       truth
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/simile/" style="font-size: 6px">
       simile
      </a>
     </span>
    </div>
   </div>
  </div>
  <footer class="footer">
   <div class="container">
    <p class="text-muted">
     Quotes by:
     <a href="https://www.goodreads.com/quotes">
      GoodReads.com
     </a>
    </p>
    <p class="copyright">
     Made with
     <span class="zyte">
      ❤
     </span>
     by
     <a class="zyte" href="https://www.zyte.com">
      Zyte
     </a>
    </p>
   </div>
  </footer>
 </body>
</html>
soup.select("h1 a")[0].text
'Quotes to Scrape'
quotes = soup.find_all("div", class_="quote")
soup.select("div.quote")
[<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
 <a class="tag" href="/tag/change/page/1/">change</a>
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>
 <a class="tag" href="/tag/world/page/1/">world</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
 <span>by <small class="author" itemprop="author">J.K. Rowling</small>
 <a href="/author/J-K-Rowling">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="abilities,choices" itemprop="keywords"/>
 <a class="tag" href="/tag/abilities/page/1/">abilities</a>
 <a class="tag" href="/tag/choices/page/1/">choices</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="inspirational,life,live,miracle,miracles" itemprop="keywords"/>
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
 <a class="tag" href="/tag/life/page/1/">life</a>
 <a class="tag" href="/tag/live/page/1/">live</a>
 <a class="tag" href="/tag/miracle/page/1/">miracle</a>
 <a class="tag" href="/tag/miracles/page/1/">miracles</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>
 <span>by <small class="author" itemprop="author">Jane Austen</small>
 <a href="/author/Jane-Austen">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="aliteracy,books,classic,humor" itemprop="keywords"/>
 <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>
 <a class="tag" href="/tag/books/page/1/">books</a>
 <a class="tag" href="/tag/classic/page/1/">classic</a>
 <a class="tag" href="/tag/humor/page/1/">humor</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>
 <span>by <small class="author" itemprop="author">Marilyn Monroe</small>
 <a href="/author/Marilyn-Monroe">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="be-yourself,inspirational" itemprop="keywords"/>
 <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="adulthood,success,value" itemprop="keywords"/>
 <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>
 <a class="tag" href="/tag/success/page/1/">success</a>
 <a class="tag" href="/tag/value/page/1/">value</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>
 <span>by <small class="author" itemprop="author">André Gide</small>
 <a href="/author/Andre-Gide">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="life,love" itemprop="keywords"/>
 <a class="tag" href="/tag/life/page/1/">life</a>
 <a class="tag" href="/tag/love/page/1/">love</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>
 <span>by <small class="author" itemprop="author">Thomas A. Edison</small>
 <a href="/author/Thomas-A-Edison">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="edison,failure,inspirational,paraphrased" itemprop="keywords"/>
 <a class="tag" href="/tag/edison/page/1/">edison</a>
 <a class="tag" href="/tag/failure/page/1/">failure</a>
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
 <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span>
 <span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>
 <a href="/author/Eleanor-Roosevelt">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="misattributed-eleanor-roosevelt" itemprop="keywords"/>
 <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>
 <span>by <small class="author" itemprop="author">Steve Martin</small>
 <a href="/author/Steve-Martin">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="humor,obvious,simile" itemprop="keywords"/>
 <a class="tag" href="/tag/humor/page/1/">humor</a>
 <a class="tag" href="/tag/obvious/page/1/">obvious</a>
 <a class="tag" href="/tag/simile/page/1/">simile</a>
 </div>
 </div>]
len(quotes)
10
for quote in quotes:
    text = quote.select_one(".text").text
    author = quote.select_one(".author").text
    tags = [tag.text for tag in quote.select(".tag")]
    # fields extracted; printing or storing them is left as an exercise


quotes[0].select(".tag")[1].text
'deep-thoughts'
for i in range(10):
    print(f"Page {i + 1}")
    url = f"https://quotes.toscrape.com/page/{i + 1}/"  # pages are numbered from 1
    print(url)
    response = requests.get(url)
    response.raise_for_status()  # Raise exception for bad status codes

# or find next
def process_page(url):
    print("processing", url)
    response = requests.get(url)
    try:
        response.raise_for_status()  # Raise exception for bad status codes
    except requests.exceptions.HTTPError as e:
        print(f"Error fetching {url}: {e}")
        return
    soup = BeautifulSoup(response.text, 'html.parser')
    
    quotes = soup.select("div.quote")
    for quote in quotes:
        text = quote.select_one(".text").text
        author = quote.select_one(".author").text
        tags = [tag.text for tag in quote.select(".tag")]
        # print(f"Text: {text}, Author: {author}, Tags: {tags}")
        break
    
    next_page = soup.select_one(".next a")["href"]
    next_url = f"https://quotes.toscrape.com{next_page}"
    
    process_page(next_url)

process_page("https://quotes.toscrape.com/")
!pip install beautifulsoup4 requests 
ToDo  

1. Advanced Parsing
2. pip install 
3. robots.txt, sitemap.xml
4. waiting
5. joblib
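The ToDo items on waiting and joblib point at the "Parallel Processing" half of this lesson's title: page fetches are I/O-bound, so running them concurrently overlaps the waiting. A stdlib sketch of the idea, using `concurrent.futures` in place of joblib and a dummy `fetch` function in place of real requests (swap in `requests.get` for real scraping, and keep delays polite):

```python
import time
from concurrent.futures import ThreadPoolExecutor

PAGE_URLS = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 6)]

def fetch(url):
    """Stand-in for requests.get(url): simulates network latency."""
    time.sleep(0.1)  # simulated I/O wait (also where a polite delay would go)
    return f"<html>{url}</html>"

# Sequential: the 5 waits add up (~0.5s)
start = time.perf_counter()
sequential = [fetch(u) for u in PAGE_URLS]
seq_time = time.perf_counter() - start

# Threaded: the I/O waits overlap, so total time is roughly one wait
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    parallel = list(pool.map(fetch, PAGE_URLS))
par_time = time.perf_counter() - start

print(f"sequential: {seq_time:.2f}s, threaded: {par_time:.2f}s")
```

`pool.map` preserves input order, so the threaded results line up with the sequential ones.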

🎯 Advanced Beautiful Soup Techniques

1. Different Parsing Methods:

# Advanced Beautiful Soup techniques
from bs4 import BeautifulSoup
import re

sample_html = """
<div class="container">
    <div class="product" data-price="29.99" data-category="electronics">
        <h3>Smartphone</h3>
        <p class="description">Latest smartphone with amazing features</p>
        <span class="price">$29.99</span>
        <div class="reviews">
            <span class="rating">4.5</span>
            <span class="review-count">(150 reviews)</span>
        </div>
    </div>
    
    <div class="product" data-price="599.99" data-category="electronics">
        <h3>Laptop</h3>
        <p class="description">High-performance laptop for professionals</p>
        <span class="price">$599.99</span>
        <div class="reviews">
            <span class="rating">4.8</span>
            <span class="review-count">(89 reviews)</span>
        </div>
    </div>
    
    <article class="blog-post">
        <h2>Tech News</h2>
        <p>Latest technology trends and updates...</p>
        <time datetime="2025-01-15">January 15, 2025</time>
    </article>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

print("🔧 Advanced Beautiful Soup Techniques:")
print("=" * 50)

# 1. Find with attributes
print("\n1️⃣ Finding by attributes:")
expensive_products = soup.find_all('div', {'data-price': lambda x: x and float(x) > 100})

for product in expensive_products:
    name = product.h3.text
    price = product.get('data-price')
    print(f"   {name}: ${price}")
🔧 Advanced Beautiful Soup Techniques:
==================================================

1️⃣ Finding by attributes:
   Laptop: $599.99

# 2. Using regular expressions
print("\n2️⃣ Using regex patterns:")
price_spans = soup.find_all('span', string=re.compile(r'\$\d+\.\d+'))
# r'\$\d+\.\d+' - this regex matches dollar amounts like $29.99
for span in price_spans:
    print(f"   Found price: {span.text}")

2️⃣ Using regex patterns:
   Found price: $29.99
   Found price: $599.99
# 3. CSS selectors advanced
print("\n3️⃣ Advanced CSS selectors:")
# Products with rating above 4.6
high_rated = soup.select('div.product:has(.rating)')
for product in high_rated:
    name = product.h3.text
    rating = product.select_one('.rating').text
    if float(rating) > 4.6:
        print(f"   High-rated: {name} ({rating}⭐)")

3️⃣ Advanced CSS selectors:
   High-rated: Laptop (4.8⭐)
# 4. Parent and sibling navigation
print("\n4️⃣ Navigation between elements:")
rating_element = soup.find('span', class_='rating')
if rating_element:
    # Get parent
    reviews_div = rating_element.parent
    print(f"   Parent element: {reviews_div.name} with class '{reviews_div.get('class', [])}'")
    
    # Get sibling
    review_count = rating_element.find_next_sibling('span')
    print(f"   Review count: {review_count.text}")
    
    print(review_count.find_next_sibling('span'))
    

4️⃣ Navigation between elements:
   Parent element: div with class '['reviews']'
   Review count: (150 reviews)
None

🚗 Selenium - For Dynamic and JavaScript-Heavy Websites

🤔 When Do You Need Selenium?

Beautiful Soup + Requests works great for static HTML, but many modern websites use JavaScript to load content dynamically. This is where Selenium comes in.

Signs You Need Selenium:

  • Content loads after the page loads (AJAX)
  • You need to click buttons or fill forms
  • The data you want appears only after user interaction
  • The website is a Single Page Application (SPA)
  • You see “Loading…” messages or spinners

What Selenium Does:

  • Controls a real browser (Chrome, Firefox, Safari)
  • Executes JavaScript like a human user
  • Waits for content to load dynamically
  • Simulates user actions (clicks, typing, scrolling)

⚡ Selenium vs Beautiful Soup Comparison:

| Feature | Beautiful Soup | Selenium |
|---|---|---|
| Speed | ⚡ Very fast | 🐌 Slower (launches browser) |
| JavaScript | ❌ No support | ✅ Full support |
| User Interaction | ❌ Cannot click/type | ✅ Can simulate user actions |
| Memory Usage | 💚 Low | 🔴 High (browser overhead) |
| Complexity | 💚 Simple | 🟡 More complex setup |
| Best For | Static websites | Dynamic/interactive websites |

🛠️ Selenium Installation & Setup

Step 1: Install Selenium

pip install selenium webdriver-manager

Step 2: Understanding WebDrivers

Selenium needs a WebDriver to control browsers:

  • ChromeDriver - For Google Chrome
  • GeckoDriver - For Firefox
  • EdgeDriver - For Microsoft Edge

Good News: webdriver-manager automatically downloads the correct driver!

Step 3: Basic Setup Options

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Option 1: Visible browser (for development/debugging)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Option 2: Headless browser (for production)
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

Step 4: Common Chrome Options

options = webdriver.ChromeOptions()
options.add_argument("--headless")          # Run without GUI
options.add_argument("--no-sandbox")        # Required for some environments
options.add_argument("--disable-dev-shm-usage")  # Overcome limited resource problems
options.add_argument("--window-size=1920,1080")  # Set window size
options.add_argument("--user-agent=Custom User Agent")  # Custom user agent
# Install Selenium and WebDriver
!pip install selenium webdriver-manager
# Selenium Basic Example - Part 1: Setup and Navigation
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options

def selenium_basic_demo():
    """Demonstrate Selenium basic usage"""
    
    print("🚗 Selenium Basic Demo - Part 1: Setup")
    print("=" * 45)
    
    # Setup Chrome options for demo
    options = Options()
    options.add_argument("--headless")  # Run without GUI for demo
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    
    try:
        print("\n📋 Step 1: Initialize WebDriver")
        # This automatically downloads ChromeDriver if needed
        service = Service(ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service, options=options)
        print("   ✅ Chrome WebDriver initialized successfully")
        
        print("\n🌐 Step 2: Navigate to website")
        url = "https://quotes.toscrape.com/js/"  # JavaScript version
        driver.get(url)
        print(f"   📍 Navigated to: {url}")
        print(f"   📄 Page title: {driver.title}")
        
        print("\n⏳ Step 3: Wait for content to load")
        # Wait up to 10 seconds for quotes to appear
        wait = WebDriverWait(driver, 10)
        quotes = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "quote")))
        print(f"   ✅ Found {len(quotes)} quotes after waiting for JavaScript")
        
        print(f"\n📊 Page Information:")
        print(f"   Current URL: {driver.current_url}")
        print(f"   Page source length: {len(driver.page_source):,} characters")
        
        return driver, quotes
        
    except Exception as e:
        print(f"❌ Error in Selenium demo: {e}")
        return None, []

# Note: This demo shows setup - actual scraping in next cell
print("💡 Note: This example shows Selenium setup and navigation.")
print("💡 For full functionality, Chrome browser and ChromeDriver are required.")
print("💡 In Colab/Jupyter environments, additional setup might be needed.")

# Uncomment the line below to run the demo (if Chrome is available)
# driver, quotes = selenium_basic_demo()
# Selenium Basic Example - Part 2: Data Extraction and Interaction
# This continues from Part 1

def selenium_scraping_demo():
    """Demonstrate Selenium data extraction and interaction"""
    
    print("🚗 Selenium Basic Demo - Part 2: Data Extraction")
    print("=" * 50)
    
    print("\n📝 Common Selenium Element Location Methods:")
    print("   By.CLASS_NAME    → find_element(By.CLASS_NAME, 'quote')")
    print("   By.ID            → find_element(By.ID, 'main-content')")
    print("   By.TAG_NAME      → find_element(By.TAG_NAME, 'h1')")
    print("   By.CSS_SELECTOR  → find_element(By.CSS_SELECTOR, '.quote .text')")
    print("   By.XPATH         → find_element(By.XPATH, '//div[@class=\"quote\"]')")
    
    # Simulated data extraction (would work with real driver)
    simulated_quotes = [
        {
            'text': '"The world as we have created it is a process of our thinking."',
            'author': 'Albert Einstein',
            'tags': ['change', 'deep-thoughts', 'thinking', 'world']
        },
        {
            'text': '"It is our choices, Harry, that show what we truly are."',
            'author': 'J.K. Rowling',
            'tags': ['abilities', 'choices']
        },
        {
            'text': '"There are only two ways to live your life."',
            'author': 'Albert Einstein',
            'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']
        }
    ]
    
    print(f"\n🔍 Extracting Data with Selenium:")
    print("   (Simulated - shows the process)")
    
    for i, quote_data in enumerate(simulated_quotes, 1):
        print(f"\n💬 Quote {i}:")
        print(f"   Text: {quote_data['text']}")
        print(f"   Author: {quote_data['author']}")
        print(f"   Tags: {', '.join(quote_data['tags'])}")
    
    print(f"\n🎯 Real Selenium Code Pattern:")
    selenium_code = '''
# Real Selenium extraction code:
quotes = driver.find_elements(By.CLASS_NAME, "quote")

for quote in quotes:
    text = quote.find_element(By.CLASS_NAME, "text").text
    author = quote.find_element(By.CLASS_NAME, "author").text
    tags = [tag.text for tag in quote.find_elements(By.CLASS_NAME, "tag")]
    
    quote_data = {
        'text': text,
        'author': author,
        'tags': tags
    }
'''
    
    print(selenium_code)
    
    print(f"\n🖱️ Selenium Interaction Examples:")
    interaction_code = '''
# Click elements
button = driver.find_element(By.ID, "load-more-btn")
button.click()

# Fill forms
search_box = driver.find_element(By.NAME, "search")
search_box.send_keys("python")
search_box.submit()

# Scroll page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Wait for specific conditions
wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID, "submit-btn")))
'''
    
    print(interaction_code)
    
    print(f"\n⚠️ Important Selenium Concepts:")
    print("   🕐 Explicit Waits: Wait for specific conditions")
    print("   🕑 Implicit Waits: Global wait time for all elements")
    print("   🎭 Headless Mode: Run without visible browser")
    print("   🔒 Always Close: driver.quit() to free resources")

# Run the demo
selenium_scraping_demo()

🚀 Parallel Web Scraping & Multiprocessing

When scraping large amounts of data, performance becomes crucial. Python’s multiprocessing and libraries like joblib allow us to speed up scraping by processing multiple URLs simultaneously.

🧠 Why Use Parallel Processing?

Sequential Processing:

  • Scrapes one URL at a time
  • Total time = (number of URLs) × (average time per URL)
  • CPU cores remain underutilized

Parallel Processing:

  • Scrapes multiple URLs simultaneously
  • Total time ≈ (number of URLs ÷ number of workers) × (average time per URL)
  • Better resource utilization

⚠️ Important: Always respect websites’ rate limits and robots.txt when using parallel processing!

🔧 Basic Multiprocessing Concepts

Before applying multiprocessing to web scraping, let’s understand the basics with simple examples.
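As a warm-up, here is the standard library's Pool API on a toy task. This sketch uses the thread-based `multiprocessing.pool.ThreadPool`, which shares the same interface as `multiprocessing.Pool` but avoids process-spawning quirks inside notebooks; `slow_square` and the 0.1-second delay are made up for illustration.

```python
# Minimal sketch of the Pool API (thread-based variant, same interface
# as multiprocessing.Pool but notebook-friendly).
from multiprocessing.pool import ThreadPool
import time

def slow_square(x):
    """Simulate a slow task (e.g. one network request)."""
    time.sleep(0.1)
    return x * x

numbers = list(range(8))

start = time.time()
with ThreadPool(4) as pool:            # 4 workers share the 8 tasks
    results = pool.map(slow_square, numbers)
elapsed = time.time() - start

print(results)   # [0, 1, 4, 9, 16, 25, 36, 49] - map preserves input order
print(f"{elapsed:.2f}s with 4 workers vs ~{0.1 * len(numbers):.2f}s sequentially")
```

Swapping `ThreadPool` for `multiprocessing.Pool` gives real separate processes, which is what you want for CPU-bound work rather than I/O-bound waiting.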

📦 Introduction to Joblib

joblib is a powerful library that makes parallel computing easy and efficient. It’s particularly great for:

  • CPU-bound tasks
  • Machine learning workloads
  • Data processing pipelines

Key advantages:

  • Simple API: Parallel(n_jobs=-1)(delayed(function)(args) for args in data)
  • Automatic memory optimization
  • Built-in progress tracking
  • Works well with NumPy arrays

# Install joblib if not already installed
!pip install joblib

from joblib import Parallel, delayed
import time

def process_data(x):
    """Simulate data processing"""
    time.sleep(1)
    return x ** 3 + 2 * x ** 2 + x + 1
data = list(range(1, 51)) 

print("\n🐌 Sequential Processing:")
start_time = time.time()
sequential_results = [process_data(x) for x in data]
sequential_time = time.time() - start_time
print(f"   Time taken: {sequential_time:.2f} seconds")

🐌 Sequential Processing:
   Time taken: 3.12 seconds
print("\n⚡ Joblib Parallel (all cores):")  # n_jobs=-1 -> use all CPU cores
start_time = time.time()
parallel_results = Parallel(n_jobs=-1, verbose=1)(delayed(process_data)(x) for x in data)
parallel_time = time.time() - start_time

⚡ Joblib Parallel (all cores):
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    5.0s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    7.0s finished
print("\n⚡ Joblib Parallel (3 workers):")  # fewer workers for comparison
start_time = time.time()
parallel_results = Parallel(n_jobs=3, verbose=1)(delayed(process_data)(x) for x in data)
parallel_time = time.time() - start_time

⚡ Joblib Parallel (3 workers):
[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  50 out of  50 | elapsed:   18.2s finished
parallel_results
[5,
 19,
 49,
 101,
 181,
 295,
 449,
 649,
 901,
 1211,
 1585,
 2029,
 2549,
 3151,
 3841,
 4625,
 5509,
 6499,
 7601,
 8821,
 10165,
 11639,
 13249,
 15001,
 16901,
 18955,
 21169,
 23549,
 26101,
 28831,
 31745,
 34849,
 38149,
 41651,
 45361,
 49285,
 53429,
 57799,
 62401,
 67241,
 72325,
 77659,
 83249,
 89101,
 95221,
 101615,
 108289,
 115249,
 122501,
 130051]
print(f"   Time taken: {parallel_time:.2f} seconds")
print(f"   Speedup: {sequential_time/parallel_time:.2f}x faster")
   Time taken: 0.50 seconds
   Speedup: 6.21x faster
print("\n📊 Joblib with Progress Tracking:")
start_time = time.time()
parallel_results_verbose = Parallel(n_jobs=4, verbose=1)(
    delayed(process_data)(x) for x in data
)
verbose_time = time.time() - start_time
print(f"   Time taken: {verbose_time:.2f} seconds")

📊 Joblib with Progress Tracking:
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
   Time taken: 1.48 seconds
[Parallel(n_jobs=4)]: Done  50 out of  50 | elapsed:    1.4s finished

🌐 Parallel Web Scraping Examples

Now let’s apply these concepts to web scraping. We’ll compare sequential vs parallel approaches for scraping multiple URLs.
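The comparison can be sketched end to end without touching the network. The `fake_fetch` function below is a stand-in that simulates a 0.2-second page download; in real code you would replace it with a `requests.get` call. Threads are used because fetching is I/O-bound, so they parallelize well despite the GIL.

```python
# Sequential vs parallel "scraping", simulated offline with time.sleep.
# fake_fetch is a placeholder for a real requests.get call.
from concurrent.futures import ThreadPoolExecutor
import time

urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 6)]

def fake_fetch(url):
    """Stand-in for a real HTTP request: pretend each page takes 0.2s."""
    time.sleep(0.2)
    return f"<html>content of {url}</html>"

# Sequential: one URL at a time
start = time.time()
pages_seq = [fake_fetch(u) for u in urls]
seq_time = time.time() - start

# Parallel: all five URLs in flight at once
start = time.time()
with ThreadPoolExecutor(max_workers=5) as pool:
    pages_par = list(pool.map(fake_fetch, urls))
par_time = time.time() - start

print(f"Sequential: {seq_time:.2f}s, Parallel: {par_time:.2f}s")
```

With 5 workers and 5 URLs the parallel version takes roughly one request's worth of time instead of five.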

🛡️ Advanced Parallel Scraping with Rate Limiting

When scraping real websites, we need to be more careful about rate limiting, error handling, and respecting server resources.

https://www.ysu.am/robots.txt
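Before hitting a site in parallel, you can check its robots.txt rules with the standard library's `urllib.robotparser`. The rules below are a made-up example so the snippet runs offline; in practice you would point the parser at the site's real file with `rp.set_url(...)` followed by `rp.read()`.

```python
# Checking robots.txt rules with the standard library before scraping.
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real site serves these at /robots.txt
robots_txt = """
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/products/"))    # True - allowed
print(rp.can_fetch("*", "https://example.com/admin/users"))  # False - disallowed
print(rp.crawl_delay("*"))  # 2 - wait at least 2 seconds between requests
```

A polite parallel scraper checks `can_fetch` for every URL and honors `crawl_delay` when the site declares one.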

⚡ Performance Optimization & Best Practices

🎯 Choosing the Right Approach

| Method | Best For | Pros | Cons |
|---|---|---|---|
| Sequential | Small datasets, strict rate limits | Simple, predictable | Slow for large datasets |
| Threading | I/O-bound tasks, many small requests | Good for network-bound tasks | GIL limitations in Python |
| Multiprocessing | CPU-intensive parsing | True parallelism | Higher memory usage |
| Joblib | Balanced approach, data science tasks | Easy to use, optimized | Extra dependency |

🛡️ Rate Limiting Strategies

# 1. Fixed delay between requests
time.sleep(1)

# 2. Random delay (more human-like)
time.sleep(random.uniform(0.5, 2.0))

# 3. Exponential backoff on errors
wait_time = (2 ** attempt) + random.uniform(0, 1)

# 4. Domain-specific rate limiting
# Different limits for different websites
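Strategy #3 above can be wrapped into a reusable retry loop. In this sketch, `flaky_request` is a made-up stand-in that fails twice before succeeding, so the backoff behavior is visible offline; in real code it would be a `requests.get` call that raises on network errors.

```python
# Exponential backoff retry loop (strategy #3), demonstrated with a
# simulated flaky request so it runs without network access.
import random
import time

def flaky_request(url, _fail_times=[2]):
    """Simulated request: fails twice, then succeeds.
    (The mutable default is just a cheap counter for the demo.)"""
    if _fail_times[0] > 0:
        _fail_times[0] -= 1
        raise ConnectionError("temporary failure")
    return f"response from {url}"

def fetch_with_backoff(url, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return flaky_request(url)
        except ConnectionError as e:
            # Wait 2^attempt seconds plus jitter before retrying
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {wait_time:.1f}s")
            time.sleep(wait_time)
    raise RuntimeError(f"Giving up on {url} after {max_attempts} attempts")

result = fetch_with_backoff("https://example.com/data")
print(result)
```

The random jitter prevents many parallel workers from retrying at exactly the same moment and hammering the server in lockstep.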

🕸️ Scrapy Framework - Industrial-Strength Web Scraping

Scrapy is not just a library - it’s a complete framework for building web scrapers. Think of it as the difference between a hammer (Beautiful Soup) and a complete construction toolkit (Scrapy).

🤔 When to Choose Scrapy vs Beautiful Soup?

| Use Case | Beautiful Soup | Scrapy |
|---|---|---|
| Simple, one-time scraping | ✅ Perfect | ❌ Overkill |
| Large-scale projects | ❌ Limited | ✅ Excellent |
| Multiple websites | ❌ Manual work | ✅ Built-in support |
| Following links automatically | ❌ Manual coding | ✅ Built-in |
| Data export (CSV, JSON) | ❌ Manual coding | ✅ Built-in |
| Handling cookies/sessions | ❌ Manual coding | ✅ Automatic |
| Concurrent requests | ❌ Manual threading | ✅ Built-in |
| Respecting robots.txt | ❌ Manual checking | ✅ Automatic |

🏗️ Scrapy Architecture

Scrapy follows a component-based architecture:

  1. Engine - Controls data flow between components
  2. Scheduler - Manages which URLs to scrape next
  3. Downloader - Fetches web pages
  4. Spiders - Your custom logic for extracting data
  5. Item Pipeline - Processes extracted data
  6. Middlewares - Hooks for customizing requests/responses

🚀 Getting Started with Scrapy

Installation:

pip install scrapy

Creating a Scrapy Project:

# Create new project
scrapy startproject myproject

# Project structure created:
myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # project's Python module
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # directory for spiders
            __init__.py

Key Files Explained:

  • spiders/ - Where you write your scraping logic
  • items.py - Define what data you want to extract
  • pipelines.py - Process the extracted data
  • settings.py - Configure how Scrapy behaves

🕷️ Understanding Scrapy Spiders

A Spider is a class that defines how to scrape a website. Every spider must:

  1. Have a unique name
  2. Define starting URLs
  3. Implement a parse method

Basic Spider Structure:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'                    # Unique identifier
    allowed_domains = ['quotes.toscrape.com']  # Optional: restrict domains
    start_urls = ['http://quotes.toscrape.com/']  # Starting URLs
    
    def parse(self, response):
        # This method is called for each start_url
        # Extract data and/or follow links
        pass

The response Object:

  • response.css() - Use CSS selectors
  • response.xpath() - Use XPath selectors
  • response.url - Current URL
  • response.status - HTTP status code
  • response.follow() - Follow links
# Complete Scrapy Spider Example (Simulated)
# Note: This is how a Scrapy spider looks - normally it runs in Scrapy framework

class QuotesSpider:
    """
    Example Scrapy Spider for quotes.toscrape.com
    This shows the structure and logic of a real Scrapy spider
    """
    
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    
    def parse(self, response):
        """
        Main parsing method - called for each response
        
        Args:
            response: Scrapy response object with methods:
                - response.css('selector') - CSS selectors
                - response.xpath('xpath') - XPath selectors  
                - response.follow(link) - Follow links
        """
        
        # Extract all quotes on the current page
        quotes = response.css('div.quote')
        
        for quote in quotes:
            # Extract individual fields using CSS selectors
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        
        # Follow the "Next" page link automatically
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            # This tells Scrapy to follow the link and call parse() again
            yield response.follow(next_page, self.parse)

# Let's simulate what Scrapy does behind the scenes
print("🕷️ Scrapy Spider Analysis:")
print("=" * 35)

print("\n1️⃣ Spider Attributes:")
spider = QuotesSpider()
print(f"   Name: {spider.name}")
print(f"   Allowed domains: {spider.allowed_domains}")
print(f"   Starting URLs: {spider.start_urls}")

print("\n2️⃣ How Scrapy Works:")
print("   Step 1: Scrapy sends requests to start_urls")
print("   Step 2: Calls parse() method with each response")
print("   Step 3: Spider yields data items and/or new requests")
print("   Step 4: Scrapy schedules new requests and processes items")
print("   Step 5: Repeats until no more requests")

print("\n3️⃣ Key Scrapy Concepts:")
print("   📥 yield items → Data extraction")
print("   📤 yield requests → Following links")
print("   🔄 response.follow() → Automatic link following")
print("   🎯 CSS/XPath selectors → Element selection")

print("\n4️⃣ Scrapy Selectors:")
print("   .get() → Get first match (like select_one)")
print("   .getall() → Get all matches (like select)")
print("   ::text → Extract text content")
print("   ::attr(name) → Extract attribute value")

print("\n5️⃣ Running the Spider:")
print("   Command: scrapy crawl quotes -o quotes.json")
print("   Output: Saves all extracted data to quotes.json")
print("   Automatically: Handles requests, follows links, exports data")
🕷️ Scrapy Spider Analysis:
===================================

1️⃣ Spider Attributes:
   Name: quotes
   Allowed domains: ['quotes.toscrape.com']
   Starting URLs: ['http://quotes.toscrape.com/']

2️⃣ How Scrapy Works:
   Step 1: Scrapy sends requests to start_urls
   Step 2: Calls parse() method with each response
   Step 3: Spider yields data items and/or new requests
   Step 4: Scrapy schedules new requests and processes items
   Step 5: Repeats until no more requests

3️⃣ Key Scrapy Concepts:
   📥 yield items → Data extraction
   📤 yield requests → Following links
   🔄 response.follow() → Automatic link following
   🎯 CSS/XPath selectors → Element selection

4️⃣ Scrapy Selectors:
   .get() → Get first match (like select_one)
   .getall() → Get all matches (like select)
   ::text → Extract text content
   ::attr(name) → Extract attribute value

5️⃣ Running the Spider:
   Command: scrapy crawl quotes -o quotes.json
   Output: Saves all extracted data to quotes.json
   Automatically: Handles requests, follows links, exports data

📚 Resources & Documentation - Organized by Library

🥄 Beautiful Soup Resources

Official Documentation:

Video Tutorials:

Articles & Tutorials:

🌐 Requests Library Resources

Official Documentation:

Video Tutorials:

Articles:

🕸️ Scrapy Framework Resources

Official Documentation:

Video Tutorials:

Books & Courses:

  • “Learning Scrapy” by Dimitris Kouzis-Loukas - Packt
  • “Web Scraping with Python and Scrapy” - Udemy courses
  • Scrapy GitHub Examples - Official examples

🚗 Selenium Resources

Official Documentation:

Video Tutorials:

Articles & Guides:

⚡ Parallel Processing & Advanced Topics

Joblib Resources:

Multiprocessing & Threading:

Performance & Optimization:

🛠️ General Web Scraping Resources

Practice Websites:

Alternative Libraries:

Data Processing:

📖 Books & Comprehensive Courses

Online Courses:

🎯 Learning Path Recommendations

Beginner (1-2 weeks):

  1. HTML/CSS basics
  2. Beautiful Soup fundamentals
  3. Simple scraping projects
  4. Practice websites

Intermediate (2-4 weeks):

  1. Selenium for dynamic content
  2. Error handling and robustness
  3. Data processing with pandas
  4. Multiple page scraping

Advanced (4+ weeks):

  1. Scrapy framework
  2. Parallel processing
  3. Large-scale projects
  4. Production deployment

🎲 00
