10 Scraping; Parallel Processing


Yerevan, photo link, author: Armen Harutunian

TODO

Open In Colab (ToDo)

Song reference - ToDo

📌 Description

📚 Full Material

📺 Videos

🏡 Homework

📚 The Material

🌐 HTML Basics - Understanding Web Structure

Before diving into web scraping, it’s essential to understand how web pages are built. HTML (HyperText Markup Language) defines the structure of every web page.

What is HTML?

HTML uses tags to define elements. Tags are enclosed in angle brackets < > and usually come in pairs:

<tagname>Content goes here</tagname>

Basic HTML Document Structure:

<!DOCTYPE html>
<html>
<head>
    <title>Page Title</title>
</head>
<body>
    <h1>Main Heading</h1>
    <p>This is a paragraph.</p>
</body>
</html>

Common HTML Tags for Scraping:

Document Structure:

  • <html> - Root element
  • <head> - Contains metadata
  • <title> - Page title
  • <body> - Visible page content

Text Content:

  • <h1>, <h2>, <h3> - Headers (most important to least)
  • <p> - Paragraphs
  • <span> - Inline text container

Containers:

  • <div> - Block-level container (most common)
  • <section> - Semantic section
  • <article> - Independent content

Data Tables:

  • <table>, <tr>, <td>, <th> - Tables, rows, cells, headers

HTML Attributes - The Key to Scraping

Attributes provide additional information about elements and are crucial for web scraping:

<div id="content" class="main-section">
<a href="https://example.com" target="_blank">Link</a>
<img src="image.jpg" alt="Description">
<div data-price="29.99" data-category="electronics">Product</div>

Most Important Attributes for Scraping:

  • id - Unique identifier (use with # in CSS selectors)
  • class - CSS class name(s) (use with . in CSS selectors)
  • href - Link destination
  • src - Source for images/scripts
  • data-* - Custom data attributes (very common on modern websites)

Why Attributes Matter:

  • They help us target specific elements
  • They often contain valuable data
  • They make our scrapers more precise
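As a quick preview of how attributes are used in practice, here is a minimal Beautiful Soup sketch over the snippet shown above (the library itself is introduced in detail below):

```python
from bs4 import BeautifulSoup

html = '''
<div id="content" class="main-section">
    <a href="https://example.com" target="_blank">Link</a>
    <img src="image.jpg" alt="Description">
    <div data-price="29.99" data-category="electronics">Product</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Attribute access works like a dictionary on a tag
link = soup.find('a')
print(link['href'])                         # https://example.com

# Find by the presence of a custom data-* attribute
product = soup.find('div', attrs={'data-price': True})
print(product['data-price'])                # 29.99
print(product.get('data-missing', 'n/a'))   # .get() avoids KeyError for absent attributes
```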

🎯 CSS Selectors - Your Scraping Toolkit

CSS selectors are THE MOST IMPORTANT concept in web scraping. They tell your scraper exactly which elements to extract.

Basic Selectors:

1. Element Selector:

p          /* Selects all <p> elements */
div        /* Selects all <div> elements */
h1         /* Selects all <h1> elements */

2. Class Selector (starts with .):

.classname     /* Selects elements with class="classname" */
.post-title    /* Selects elements with class="post-title" */
.btn-primary   /* Selects elements with class="btn-primary" */

3. ID Selector (starts with #):

#idname        /* Selects element with id="idname" */
#main-content  /* Selects element with id="main-content" */
#header        /* Selects element with id="header" */

4. Attribute Selector:

[href]                    /* Elements with href attribute */
[class="post"]           /* Elements with class="post" */
[data-price="29.99"]     /* Elements with data-price="29.99" */

Advanced CSS Selectors:

Combination Selectors:

div p              /* All <p> inside <div> (descendant) */
div > p            /* Direct <p> children of <div> */
h1 + p             /* First <p> immediately after <h1> */
.post .title       /* Elements with class "title" inside elements with class "post" */

Multiple Classes:

.post.featured     /* Elements with BOTH classes "post" AND "featured" */
.btn.btn-primary   /* Elements with BOTH classes "btn" AND "btn-primary" */

Pseudo-selectors:

p:first-child      /* A <p> that is the first child of its parent */
p:last-child       /* A <p> that is the last child of its parent */
p:nth-child(2)     /* A <p> that is the second child of its parent */
a:contains("Next") /* Links containing text "Next" (non-standard CSS; see note below) */
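Note that `:contains()` is not part of standard CSS; Beautiful Soup's selector engine (soupsieve) provides `:-soup-contains()` for matching elements by their text instead. A small sketch:

```python
from bs4 import BeautifulSoup

html = '<div><a href="/page/2/">Next</a><a href="/page/0/">Prev</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# :-soup-contains() matches elements whose text contains the given string
next_links = soup.select('a:-soup-contains("Next")')
print(next_links[0]['href'])  # /page/2/
```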

Complex Examples:

div.post-content p.highlight    /* <p> with class "highlight" inside <div> with class "post-content" */
#main-content .sidebar a[href]  /* Links inside sidebar inside main content */
table tr:nth-child(odd)         /* Odd rows in a table */

🥄 Beautiful Soup - Your HTML Parsing Companion

Beautiful Soup is perfect for beginners and handles most scraping tasks effectively. It makes parsing HTML as easy as navigating a family tree!

Why Beautiful Soup?

  • Easy to learn: Intuitive syntax
  • Powerful: Handles messy HTML gracefully
  • Flexible: Multiple ways to find elements
  • Robust: Handles encoding issues automatically

Core Concepts:

  1. Parsing: Convert HTML text into a navigable object
  2. Searching: Find specific elements using tags, attributes, or CSS selectors
  3. Extracting: Get text, attributes, or sub-elements
  4. Navigating: Move between parent, children, and sibling elements
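The four steps above can be sketched in a few lines (the HTML and tag names here are illustrative):

```python
from bs4 import BeautifulSoup

html = '<div class="post"><h2>Title</h2><p>First</p><p>Second</p></div>'

# 1. Parsing: HTML text -> navigable soup object
soup = BeautifulSoup(html, 'html.parser')

# 2. Searching: find elements by tag, attribute, or CSS selector
post = soup.select_one('div.post')

# 3. Extracting: pull out text and attributes
title = post.h2.text          # 'Title'
classes = post.get('class')   # ['post']

# 4. Navigating: move between related elements
first_p = post.h2.find_next_sibling('p')
print(title, classes, first_p.text)
```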

Sample HTML

from bs4 import BeautifulSoup

# Sample HTML for practicing CSS selectors
practice_html = """
<html>
<body>
    <div id="header" class="top-section">
        <h1 class="main-title">Welcome to Our Store</h1>
        <nav class="navigation">
            <a href="/home">Home</a>
            <a href="/products">Products</a>
            <a href="/contact">Contact</a>
        </nav>
    </div>
    
    <div id="main-content">
        <div class="product featured" data-price="199.99">
            <h2 class="product-title">iPhone 15</h2>
            <p class="description">Latest smartphone with amazing features</p>
            <span class="price">$199.99</span>
        </div>
        
        <div class="product" data-price="89.99">
            <h2 class="product-title">Headphones</h2>
            <p class="description">High-quality wireless headphones</p>
            <span class="price">$89.99</span>
        </div>
    </div>
</body>
</html>
"""

soup = BeautifulSoup(practice_html, 'html.parser')

print(soup.prettify())
<html>
 <body>
  <div class="top-section" id="header">
   <h1 class="main-title">
    Welcome to Our Store
   </h1>
   <nav class="navigation">
    <a href="/home">
     Home
    </a>
    <a href="/products">
     Products
    </a>
    <a href="/contact">
     Contact
    </a>
   </nav>
  </div>
  <div id="main-content">
   <div class="product featured" data-price="199.99">
    <h2 class="product-title">
     iPhone 15
    </h2>
    <p class="description">
     Latest smartphone with amazing features
    </p>
    <span class="price">
     $199.99
    </span>
   </div>
   <div class="product" data-price="89.99">
    <h2 class="product-title">
     Headphones
    </h2>
    <p class="description">
     High-quality wireless headphones
    </p>
    <span class="price">
     $89.99
    </span>
   </div>
  </div>
 </body>
</html>
soup.select('h2')

soup.h2
<h2 class="product-title">iPhone 15</h2>
soup.select('.product') # . - class
[<div class="product featured" data-price="199.99">
 <h2 class="product-title">iPhone 15</h2>
 <p class="description">Latest smartphone with amazing features</p>
 <span class="price">$199.99</span>
 </div>,
 <div class="product" data-price="89.99">
 <h2 class="product-title">Headphones</h2>
 <p class="description">High-quality wireless headphones</p>
 <span class="price">$89.99</span>
 </div>]
soup.select('#main-content') # # - id
[<div id="main-content">
 <div class="product featured" data-price="199.99">
 <h2 class="product-title">iPhone 15</h2>
 <p class="description">Latest smartphone with amazing features</p>
 <span class="price">$199.99</span>
 </div>
 <div class="product" data-price="89.99">
 <h2 class="product-title">Headphones</h2>
 <p class="description">High-quality wireless headphones</p>
 <span class="price">$89.99</span>
 </div>
 </div>]

# 1. Basic selectors
print("\n1️⃣ Basic Selectors:")
print(f"All h2 elements: {len(soup.select('h2'))} found")
print(f"Elements with class 'product': {len(soup.select('.product'))} found")
print(f"Element with id 'header': {len(soup.select('#header'))} found")

1️⃣ Basic Selectors:
All h2 elements: 2 found
Elements with class 'product': 2 found
Element with id 'header': 1 found
# 2. Find specific content
print("2️⃣ Finding Specific Content:")
main_title = soup.select_one('h1.main-title')


print(main_title.text)
2️⃣ Finding Specific Content:
Welcome to Our Store
navigation_links = soup.select('nav.navigation a')
print(navigation_links)

for nav in navigation_links:
    print(nav.text)
    print(nav["href"]) # the href attribute holds the link destination
[<a href="/home">Home</a>, <a href="/products">Products</a>, <a href="/contact">Contact</a>]
Home
/home
Products
/products
Contact
/contact
# 3. Product information
print("\n3️⃣ Extract Product Information:")
products = soup.select('div.product')

print(products)
products[0]

3️⃣ Extract Product Information:
[<div class="product featured" data-price="199.99">
<h2 class="product-title">iPhone 15</h2>
<p class="description">Latest smartphone with amazing features</p>
<span class="price">$199.99</span>
</div>, <div class="product" data-price="89.99">
<h2 class="product-title">Headphones</h2>
<p class="description">High-quality wireless headphones</p>
<span class="price">$89.99</span>
</div>]
<div class="product featured" data-price="199.99">
<h2 class="product-title">iPhone 15</h2>
<p class="description">Latest smartphone with amazing features</p>
<span class="price">$199.99</span>
</div>

for product in products:
    title = product.select_one('.product-title').text
    price = product.select_one('.price').text
    is_featured = 'featured' in product.get('class', [])
    
    print(title)
    print(f"  Price: {price}")
    print(f"  Featured: {'✅' if is_featured else '❌'}")
iPhone 15
  Price: $199.99
  Featured: ✅
Headphones
  Price: $89.99
  Featured: ❌
# 4. Advanced selectors
print("\n4️⃣ Advanced Selectors:")
featured_product = soup.select_one('.product.featured .product-title')
if featured_product:
    print(f"Featured product: {featured_product.text}")

expensive_products = soup.select('[data-price]')
print(f"Products with price data: {len(expensive_products)}")

4️⃣ Advanced Selectors:
Featured product: iPhone 15
Products with price data: 2
# Beautiful Soup Complete Example - Part 2: Finding Elements
# Using the same soup object (the store HTML) from Part 1

print("🥄 Beautiful Soup Complete Example - Part 2")
print("=" * 50)

print("\n🔍 Different Ways to Find Elements:")

# Method 1: By tag name
print("\n1️⃣ Find by tag name:")
titles = soup.find_all('h2')
print(f"   Found {len(titles)} h2 elements:")
for i, title in enumerate(titles, 1):
    print(f"     {i}. {title.text}")

# Method 2: By class
print("\n2️⃣ Find by class:")
prices = soup.find_all('span', class_='price')
print(f"   Found {len(prices)} elements with class 'price'")

# Method 3: By ID
print("\n3️⃣ Find by ID:")
header = soup.find('div', id='header')
if header:
    print(f"   Header found: {header.h1.text}")

# Method 4: By multiple classes (exact class string match)
print("\n4️⃣ Find by multiple attributes:")
featured_product = soup.find('div', {'class': 'product featured'})
if featured_product:
    print(f"   Featured product: {featured_product.find('h2').text}")

# Method 5: By custom attributes
print("\n5️⃣ Find by custom attributes:")
cheap_product = soup.find('div', {'data-price': '89.99'})
if cheap_product:
    name = cheap_product.find('h2', class_='product-title').text
    print(f"   Product priced 89.99: {name}")

print("\n💡 Key Takeaway:")
print("   find() → First match or None")
print("   find_all() → List of all matches")
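This difference matters in practice: `find()` returns `None` when nothing matches (so attribute access raises `AttributeError`), while `find_all()` and `select()` return an empty list. A small sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="a">hello</p>', 'html.parser')

missing = soup.find('div')       # no <div> -> None
all_divs = soup.find_all('div')  # no <div> -> empty list

print(missing)    # None
print(all_divs)   # []

# Guard against None before accessing text or attributes:
text = missing.text if missing is not None else 'not found'
print(text)       # not found
```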

🌐 Real Website Scraping Example

# Step 1: Import required libraries for web scraping
import requests
from bs4 import BeautifulSoup

import time
import json

print("✅ Libraries imported successfully!")
url = "http://quotes.toscrape.com/"

response = requests.get(url)
response.raise_for_status()  # Raise exception for bad status codes

print(response.status_code)

soup = BeautifulSoup(response.text, 'html.parser')

print(soup.prettify())
200
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert-Einstein">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
       <a class="tag" href="/tag/change/page/1/">
        change
       </a>
       <a class="tag" href="/tag/deep-thoughts/page/1/">
        deep-thoughts
       </a>
       <a class="tag" href="/tag/thinking/page/1/">
        thinking
       </a>
       <a class="tag" href="/tag/world/page/1/">
        world
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “It is our choices, Harry, that show what we truly are, far more than our abilities.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        J.K. Rowling
       </small>
       <a href="/author/J-K-Rowling">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="abilities,choices" itemprop="keywords"/>
       <a class="tag" href="/tag/abilities/page/1/">
        abilities
       </a>
       <a class="tag" href="/tag/choices/page/1/">
        choices
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert-Einstein">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="inspirational,life,live,miracle,miracles" itemprop="keywords"/>
       <a class="tag" href="/tag/inspirational/page/1/">
        inspirational
       </a>
       <a class="tag" href="/tag/life/page/1/">
        life
       </a>
       <a class="tag" href="/tag/live/page/1/">
        live
       </a>
       <a class="tag" href="/tag/miracle/page/1/">
        miracle
       </a>
       <a class="tag" href="/tag/miracles/page/1/">
        miracles
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Jane Austen
       </small>
       <a href="/author/Jane-Austen">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="aliteracy,books,classic,humor" itemprop="keywords"/>
       <a class="tag" href="/tag/aliteracy/page/1/">
        aliteracy
       </a>
       <a class="tag" href="/tag/books/page/1/">
        books
       </a>
       <a class="tag" href="/tag/classic/page/1/">
        classic
       </a>
       <a class="tag" href="/tag/humor/page/1/">
        humor
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Marilyn Monroe
       </small>
       <a href="/author/Marilyn-Monroe">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="be-yourself,inspirational" itemprop="keywords"/>
       <a class="tag" href="/tag/be-yourself/page/1/">
        be-yourself
       </a>
       <a class="tag" href="/tag/inspirational/page/1/">
        inspirational
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “Try not to become a man of success. Rather become a man of value.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert-Einstein">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="adulthood,success,value" itemprop="keywords"/>
       <a class="tag" href="/tag/adulthood/page/1/">
        adulthood
       </a>
       <a class="tag" href="/tag/success/page/1/">
        success
       </a>
       <a class="tag" href="/tag/value/page/1/">
        value
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “It is better to be hated for what you are than to be loved for what you are not.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        André Gide
       </small>
       <a href="/author/Andre-Gide">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="life,love" itemprop="keywords"/>
       <a class="tag" href="/tag/life/page/1/">
        life
       </a>
       <a class="tag" href="/tag/love/page/1/">
        love
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “I have not failed. I've just found 10,000 ways that won't work.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Thomas A. Edison
       </small>
       <a href="/author/Thomas-A-Edison">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="edison,failure,inspirational,paraphrased" itemprop="keywords"/>
       <a class="tag" href="/tag/edison/page/1/">
        edison
       </a>
       <a class="tag" href="/tag/failure/page/1/">
        failure
       </a>
       <a class="tag" href="/tag/inspirational/page/1/">
        inspirational
       </a>
       <a class="tag" href="/tag/paraphrased/page/1/">
        paraphrased
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Eleanor Roosevelt
       </small>
       <a href="/author/Eleanor-Roosevelt">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="misattributed-eleanor-roosevelt" itemprop="keywords"/>
       <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">
        misattributed-eleanor-roosevelt
       </a>
      </div>
     </div>
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “A day without sunshine is like, you know, night.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Steve Martin
       </small>
       <a href="/author/Steve-Martin">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="humor,obvious,simile" itemprop="keywords"/>
       <a class="tag" href="/tag/humor/page/1/">
        humor
       </a>
       <a class="tag" href="/tag/obvious/page/1/">
        obvious
       </a>
       <a class="tag" href="/tag/simile/page/1/">
        simile
       </a>
      </div>
     </div>
     <nav>
      <ul class="pager">
       <li class="next">
        <a href="/page/2/">
         Next
         <span aria-hidden="true">
          →
         </span>
        </a>
       </li>
      </ul>
     </nav>
    </div>
    <div class="col-md-4 tags-box">
     <h2>
      Top Ten tags
     </h2>
     <span class="tag-item">
      <a class="tag" href="/tag/love/" style="font-size: 28px">
       love
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/inspirational/" style="font-size: 26px">
       inspirational
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/life/" style="font-size: 26px">
       life
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/humor/" style="font-size: 24px">
       humor
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/books/" style="font-size: 22px">
       books
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/reading/" style="font-size: 14px">
       reading
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/friendship/" style="font-size: 10px">
       friendship
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/friends/" style="font-size: 8px">
       friends
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/truth/" style="font-size: 8px">
       truth
      </a>
     </span>
     <span class="tag-item">
      <a class="tag" href="/tag/simile/" style="font-size: 6px">
       simile
      </a>
     </span>
    </div>
   </div>
  </div>
  <footer class="footer">
   <div class="container">
    <p class="text-muted">
     Quotes by:
     <a href="https://www.goodreads.com/quotes">
      GoodReads.com
     </a>
    </p>
    <p class="copyright">
     Made with
     <span class="zyte">
      ❤
     </span>
     by
     <a class="zyte" href="https://www.zyte.com">
      Zyte
     </a>
    </p>
   </div>
  </footer>
 </body>
</html>
soup.select("h1 a")[0].text
'Quotes to Scrape'
quotes = soup.find_all("div", class_="quote")
soup.select("div.quote")
[<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
 <a class="tag" href="/tag/change/page/1/">change</a>
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>
 <a class="tag" href="/tag/world/page/1/">world</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
 <span>by <small class="author" itemprop="author">J.K. Rowling</small>
 <a href="/author/J-K-Rowling">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="abilities,choices" itemprop="keywords"/>
 <a class="tag" href="/tag/abilities/page/1/">abilities</a>
 <a class="tag" href="/tag/choices/page/1/">choices</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="inspirational,life,live,miracle,miracles" itemprop="keywords"/>
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
 <a class="tag" href="/tag/life/page/1/">life</a>
 <a class="tag" href="/tag/live/page/1/">live</a>
 <a class="tag" href="/tag/miracle/page/1/">miracle</a>
 <a class="tag" href="/tag/miracles/page/1/">miracles</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>
 <span>by <small class="author" itemprop="author">Jane Austen</small>
 <a href="/author/Jane-Austen">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="aliteracy,books,classic,humor" itemprop="keywords"/>
 <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>
 <a class="tag" href="/tag/books/page/1/">books</a>
 <a class="tag" href="/tag/classic/page/1/">classic</a>
 <a class="tag" href="/tag/humor/page/1/">humor</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>
 <span>by <small class="author" itemprop="author">Marilyn Monroe</small>
 <a href="/author/Marilyn-Monroe">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="be-yourself,inspirational" itemprop="keywords"/>
 <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="adulthood,success,value" itemprop="keywords"/>
 <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>
 <a class="tag" href="/tag/success/page/1/">success</a>
 <a class="tag" href="/tag/value/page/1/">value</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>
 <span>by <small class="author" itemprop="author">André Gide</small>
 <a href="/author/Andre-Gide">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="life,love" itemprop="keywords"/>
 <a class="tag" href="/tag/life/page/1/">life</a>
 <a class="tag" href="/tag/love/page/1/">love</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>
 <span>by <small class="author" itemprop="author">Thomas A. Edison</small>
 <a href="/author/Thomas-A-Edison">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="edison,failure,inspirational,paraphrased" itemprop="keywords"/>
 <a class="tag" href="/tag/edison/page/1/">edison</a>
 <a class="tag" href="/tag/failure/page/1/">failure</a>
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
 <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span>
 <span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>
 <a href="/author/Eleanor-Roosevelt">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="misattributed-eleanor-roosevelt" itemprop="keywords"/>
 <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>
 <span>by <small class="author" itemprop="author">Steve Martin</small>
 <a href="/author/Steve-Martin">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="humor,obvious,simile" itemprop="keywords"/>
 <a class="tag" href="/tag/humor/page/1/">humor</a>
 <a class="tag" href="/tag/obvious/page/1/">obvious</a>
 <a class="tag" href="/tag/simile/page/1/">simile</a>
 </div>
 </div>]
len(quotes)
10
for quote in quotes:
    text = quote.select_one(".text").text
    author = quote.select_one(".author").text
    tags = [tag.text for tag in quote.select(".tag")]
    # fields extracted; printing or storing them is left as an exercise


quotes[0].select(".tag")[1].text
'deep-thoughts'
for i in range(10):
    print(f"Page {i + 1}")
    url = f"https://quotes.toscrape.com/page/{i + 1}/"  # pages are numbered from 1
    print(url)
    response = requests.get(url)
    response.raise_for_status()  # Raise exception for bad status codes

# or find next
def process_page(url):
    print("processing", url)
    response = requests.get(url)
    try:
        response.raise_for_status()  # Raise exception for bad status codes
    except requests.exceptions.HTTPError as e:
        print(f"Error fetching {url}: {e}")
        return
    soup = BeautifulSoup(response.text, 'html.parser')
    
    quotes = soup.select("div.quote")
    for quote in quotes:
        text = quote.select_one(".text").text
        author = quote.select_one(".author").text
        tags = [tag.text for tag in quote.select(".tag")]
        # print(f"Text: {text}, Author: {author}, Tags: {tags}")
        break
    
    next_page = soup.select_one(".next a")["href"]
    next_url = f"https://quotes.toscrape.com{next_page}"
    
    process_page(next_url)

process_page("https://quotes.toscrape.com/")
!pip install beautifulsoup4 requests 
ToDo  

1. Advanced Parsing
2. pip install 
3. robots.txt, sitemap.xml
4. waiting
5. joblib
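The ToDo items on waiting and joblib point at the "Parallel Processing" half of this lesson's title: page fetches are I/O-bound, so running them concurrently overlaps the waiting. A stdlib sketch of the idea, using `concurrent.futures` in place of joblib and a dummy `fetch` function in place of real requests (swap in `requests.get` for real scraping, and keep delays polite):

```python
import time
from concurrent.futures import ThreadPoolExecutor

PAGE_URLS = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 6)]

def fetch(url):
    """Stand-in for requests.get(url): simulates network latency."""
    time.sleep(0.1)  # simulated I/O wait (also where a polite delay would go)
    return f"<html>{url}</html>"

# Sequential: the 5 waits add up (~0.5s)
start = time.perf_counter()
sequential = [fetch(u) for u in PAGE_URLS]
seq_time = time.perf_counter() - start

# Threaded: the I/O waits overlap, so total time is roughly one wait
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    parallel = list(pool.map(fetch, PAGE_URLS))
par_time = time.perf_counter() - start

print(f"sequential: {seq_time:.2f}s, threaded: {par_time:.2f}s")
```

`pool.map` preserves input order, so the threaded results line up with the sequential ones.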

🎯 Advanced Beautiful Soup Techniques

1. Different Parsing Methods:

# Advanced Beautiful Soup techniques
from bs4 import BeautifulSoup
import re

sample_html = """
<div class="container">
    <div class="product" data-price="29.99" data-category="electronics">
        <h3>Smartphone</h3>
        <p class="description">Latest smartphone with amazing features</p>
        <span class="price">$29.99</span>
        <div class="reviews">
            <span class="rating">4.5</span>
            <span class="review-count">(150 reviews)</span>
        </div>
    </div>
    
    <div class="product" data-price="599.99" data-category="electronics">
        <h3>Laptop</h3>
        <p class="description">High-performance laptop for professionals</p>
        <span class="price">$599.99</span>
        <div class="reviews">
            <span class="rating">4.8</span>
            <span class="review-count">(89 reviews)</span>
        </div>
    </div>
    
    <article class="blog-post">
        <h2>Tech News</h2>
        <p>Latest technology trends and updates...</p>
        <time datetime="2025-01-15">January 15, 2025</time>
    </article>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

print("🔧 Advanced Beautiful Soup Techniques:")
print("=" * 50)

# 1. Find with attributes
print("\n1️⃣ Finding by attributes:")
expensive_products = soup.find_all('div', {'data-price': lambda x: x and float(x) > 100})

for product in expensive_products:
    name = product.h3.text
    price = product.get('data-price')
    print(f"   {name}: ${price}")
🔧 Advanced Beautiful Soup Techniques:
==================================================

1️⃣ Finding by attributes:
   Laptop: $599.99

# 2. Using regular expressions
print("\n2️⃣ Using regex patterns:")
price_spans = soup.find_all('span', string=re.compile(r'\$\d+\.\d+'))
# r'\$\d+\.\d+' - this regex matches dollar amounts like $29.99
for span in price_spans:
    print(f"   Found price: {span.text}")

2️⃣ Using regex patterns:
   Found price: $29.99
   Found price: $599.99
# 3. CSS selectors advanced
print("\n3️⃣ Advanced CSS selectors:")
# Products with rating above 4.6
high_rated = soup.select('div.product:has(.rating)')
for product in high_rated:
    name = product.h3.text
    rating = product.select_one('.rating').text
    if float(rating) > 4.6:
        print(f"   High-rated: {name} ({rating}⭐)")

3️⃣ Advanced CSS selectors:
   High-rated: Laptop (4.8⭐)
# 4. Parent and sibling navigation
print("\n4️⃣ Navigation between elements:")
rating_element = soup.find('span', class_='rating')
if rating_element:
    # Get parent
    reviews_div = rating_element.parent
    print(f"   Parent element: {reviews_div.name} with class '{reviews_div.get('class', [])}'")
    
    # Get sibling
    review_count = rating_element.find_next_sibling('span')
    print(f"   Review count: {review_count.text}")
    
    print(review_count.find_next_sibling('span'))
    

4️⃣ Navigation between elements:
   Parent element: div with class '['reviews']'
   Review count: (150 reviews)
None

🚗 Selenium - For Dynamic and JavaScript-Heavy Websites

🤔 When Do You Need Selenium?

Beautiful Soup + Requests works great for static HTML, but many modern websites use JavaScript to load content dynamically. This is where Selenium comes in.

Signs You Need Selenium:

  • Content loads after the page loads (AJAX)
  • You need to click buttons or fill forms
  • The data you want appears only after user interaction
  • The website is a Single Page Application (SPA)
  • You see “Loading…” messages or spinners

What Selenium Does:

  • Controls a real browser (Chrome, Firefox, Safari)
  • Executes JavaScript like a human user
  • Waits for content to load dynamically
  • Simulates user actions (clicks, typing, scrolling)

⚡ Selenium vs Beautiful Soup Comparison:

| Feature | Beautiful Soup | Selenium |
|---|---|---|
| Speed | ⚡ Very fast | 🐌 Slower (launches browser) |
| JavaScript | ❌ No support | ✅ Full support |
| User Interaction | ❌ Cannot click/type | ✅ Can simulate user actions |
| Memory Usage | 💚 Low | 🔴 High (browser overhead) |
| Complexity | 💚 Simple | 🟡 More complex setup |
| Best For | Static websites | Dynamic/interactive websites |

🛠️ Selenium Installation & Setup

Step 1: Install Selenium

pip install selenium webdriver-manager

Step 2: Understanding WebDrivers

Selenium needs a WebDriver to control browsers:

  • ChromeDriver - For Google Chrome
  • GeckoDriver - For Firefox
  • EdgeDriver - For Microsoft Edge

Good News: webdriver-manager automatically downloads the correct driver!

Step 3: Basic Setup Options

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Option 1: Visible browser (for development/debugging)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Option 2: Headless browser (for production)
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

Step 4: Common Chrome Options

options = webdriver.ChromeOptions()
options.add_argument("--headless")          # Run without GUI
options.add_argument("--no-sandbox")        # Required for some environments
options.add_argument("--disable-dev-shm-usage")  # Overcome limited resource problems
options.add_argument("--window-size=1920,1080")  # Set window size
options.add_argument("--user-agent=Custom User Agent")  # Custom user agent
# Install Selenium and WebDriver
!pip install selenium webdriver-manager
# Selenium Basic Example - Part 1: Setup and Navigation
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options

def selenium_basic_demo():
    """Demonstrate Selenium basic usage"""
    
    print("🚗 Selenium Basic Demo - Part 1: Setup")
    print("=" * 45)
    
    # Setup Chrome options for demo
    options = Options()
    options.add_argument("--headless")  # Run without GUI for demo
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    
    try:
        print("\n📋 Step 1: Initialize WebDriver")
        # This automatically downloads ChromeDriver if needed
        service = Service(ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service, options=options)
        print("   ✅ Chrome WebDriver initialized successfully")
        
        print("\n🌐 Step 2: Navigate to website")
        url = "https://quotes.toscrape.com/js/"  # JavaScript version
        driver.get(url)
        print(f"   📍 Navigated to: {url}")
        print(f"   📄 Page title: {driver.title}")
        
        print("\n⏳ Step 3: Wait for content to load")
        # Wait up to 10 seconds for quotes to appear
        wait = WebDriverWait(driver, 10)
        quotes = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "quote")))
        print(f"   ✅ Found {len(quotes)} quotes after waiting for JavaScript")
        
        print(f"\n📊 Page Information:")
        print(f"   Current URL: {driver.current_url}")
        print(f"   Page source length: {len(driver.page_source):,} characters")
        
        return driver, quotes
        
    except Exception as e:
        print(f"❌ Error in Selenium demo: {e}")
        return None, []

# Note: This demo shows setup - actual scraping in next cell
print("💡 Note: This example shows Selenium setup and navigation.")
print("💡 For full functionality, Chrome browser and ChromeDriver are required.")
print("💡 In Colab/Jupyter environments, additional setup might be needed.")

# Uncomment the line below to run the demo (if Chrome is available)
# driver, quotes = selenium_basic_demo()
# Selenium Basic Example - Part 2: Data Extraction and Interaction
# This continues from Part 1

def selenium_scraping_demo():
    """Demonstrate Selenium data extraction and interaction"""
    
    print("🚗 Selenium Basic Demo - Part 2: Data Extraction")
    print("=" * 50)
    
    print("\n📝 Common Selenium Element Location Methods:")
    print("   By.CLASS_NAME    → find_element(By.CLASS_NAME, 'quote')")
    print("   By.ID            → find_element(By.ID, 'main-content')")
    print("   By.TAG_NAME      → find_element(By.TAG_NAME, 'h1')")
    print("   By.CSS_SELECTOR  → find_element(By.CSS_SELECTOR, '.quote .text')")
    print("   By.XPATH         → find_element(By.XPATH, '//div[@class=\"quote\"]')")
    
    # Simulated data extraction (would work with real driver)
    simulated_quotes = [
        {
            'text': '"The world as we have created it is a process of our thinking."',
            'author': 'Albert Einstein',
            'tags': ['change', 'deep-thoughts', 'thinking', 'world']
        },
        {
            'text': '"It is our choices, Harry, that show what we truly are."',
            'author': 'J.K. Rowling',
            'tags': ['abilities', 'choices']
        },
        {
            'text': '"There are only two ways to live your life."',
            'author': 'Albert Einstein',
            'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']
        }
    ]
    
    print(f"\n🔍 Extracting Data with Selenium:")
    print("   (Simulated - shows the process)")
    
    for i, quote_data in enumerate(simulated_quotes, 1):
        print(f"\n💬 Quote {i}:")
        print(f"   Text: {quote_data['text']}")
        print(f"   Author: {quote_data['author']}")
        print(f"   Tags: {', '.join(quote_data['tags'])}")
    
    print(f"\n🎯 Real Selenium Code Pattern:")
    selenium_code = '''
# Real Selenium extraction code:
quotes = driver.find_elements(By.CLASS_NAME, "quote")

for quote in quotes:
    text = quote.find_element(By.CLASS_NAME, "text").text
    author = quote.find_element(By.CLASS_NAME, "author").text
    tags = [tag.text for tag in quote.find_elements(By.CLASS_NAME, "tag")]
    
    quote_data = {
        'text': text,
        'author': author,
        'tags': tags
    }
'''
    
    print(selenium_code)
    
    print(f"\n🖱️ Selenium Interaction Examples:")
    interaction_code = '''
# Click elements
button = driver.find_element(By.ID, "load-more-btn")
button.click()

# Fill forms
search_box = driver.find_element(By.NAME, "search")
search_box.send_keys("python")
search_box.submit()

# Scroll page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Wait for specific conditions
wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID, "submit-btn")))
'''
    
    print(interaction_code)
    
    print(f"\n⚠️ Important Selenium Concepts:")
    print("   🕐 Explicit Waits: Wait for specific conditions")
    print("   🕑 Implicit Waits: Global wait time for all elements")
    print("   🎭 Headless Mode: Run without visible browser")
    print("   🔒 Always Close: driver.quit() to free resources")

# Run the demo
selenium_scraping_demo()

🚀 Parallel Web Scraping & Multiprocessing

When scraping large amounts of data, performance becomes crucial. Python’s multiprocessing and libraries like joblib allow us to speed up scraping by processing multiple URLs simultaneously.

🧠 Why Use Parallel Processing?

Sequential Processing:

  • Scrapes one URL at a time
  • Total time = (number of URLs) × (average time per URL)
  • CPU cores remain underutilized

Parallel Processing:

  • Scrapes multiple URLs simultaneously
  • Total time ≈ (number of URLs ÷ number of workers) × (average time per URL)
  • Better resource utilization

⚠️ Important: Always respect websites’ rate limits and robots.txt when using parallel processing!

🔧 Basic Multiprocessing Concepts

Before applying multiprocessing to web scraping, let’s understand the basics with simple examples.
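As a warm-up, here is the standard library's Pool API on a toy task. This sketch uses the thread-based `multiprocessing.pool.ThreadPool`, which shares the same interface as `multiprocessing.Pool` but avoids process-spawning quirks inside notebooks; `slow_square` and the 0.1-second delay are made up for illustration.

```python
# Minimal sketch of the Pool API (thread-based variant, same interface
# as multiprocessing.Pool but notebook-friendly).
from multiprocessing.pool import ThreadPool
import time

def slow_square(x):
    """Simulate a slow task (e.g. one network request)."""
    time.sleep(0.1)
    return x * x

numbers = list(range(8))

start = time.time()
with ThreadPool(4) as pool:            # 4 workers share the 8 tasks
    results = pool.map(slow_square, numbers)
elapsed = time.time() - start

print(results)   # [0, 1, 4, 9, 16, 25, 36, 49] - map preserves input order
print(f"{elapsed:.2f}s with 4 workers vs ~{0.1 * len(numbers):.2f}s sequentially")
```

Swapping `ThreadPool` for `multiprocessing.Pool` gives real separate processes, which is what you want for CPU-bound work rather than I/O-bound waiting.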

📦 Introduction to Joblib

joblib is a powerful library that makes parallel computing easy and efficient. It’s particularly great for:

  • CPU-bound tasks
  • Machine learning workloads
  • Data processing pipelines

Key advantages:

  • Simple API: Parallel(n_jobs=-1)(delayed(function)(args) for args in data)
  • Automatic memory optimization
  • Built-in progress tracking
  • Works well with NumPy arrays

# Install joblib if not already installed
!pip install joblib

from joblib import Parallel, delayed
import time

def process_data(x):
    """Simulate data processing"""
    time.sleep(1)
    return x ** 3 + 2 * x ** 2 + x + 1
data = list(range(1, 51)) 

print("\n🐌 Sequential Processing:")
start_time = time.time()
sequential_results = [process_data(x) for x in data]
sequential_time = time.time() - start_time
print(f"   Time taken: {sequential_time:.2f} seconds")

🐌 Sequential Processing:
   Time taken: 3.12 seconds
print("\n⚡ Joblib Parallel (all cores):")  # n_jobs=-1 -> use all CPU cores
start_time = time.time()
parallel_results = Parallel(n_jobs=-1, verbose=1)(delayed(process_data)(x) for x in data)
parallel_time = time.time() - start_time

⚡ Joblib Parallel (all cores):
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    5.0s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    7.0s finished
print("\n⚡ Joblib Parallel (3 workers):")  # fewer workers for comparison
start_time = time.time()
parallel_results = Parallel(n_jobs=3, verbose=1)(delayed(process_data)(x) for x in data)
parallel_time = time.time() - start_time

⚡ Joblib Parallel (3 workers):
[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  50 out of  50 | elapsed:   18.2s finished
parallel_results
[5,
 19,
 49,
 101,
 181,
 295,
 449,
 649,
 901,
 1211,
 1585,
 2029,
 2549,
 3151,
 3841,
 4625,
 5509,
 6499,
 7601,
 8821,
 10165,
 11639,
 13249,
 15001,
 16901,
 18955,
 21169,
 23549,
 26101,
 28831,
 31745,
 34849,
 38149,
 41651,
 45361,
 49285,
 53429,
 57799,
 62401,
 67241,
 72325,
 77659,
 83249,
 89101,
 95221,
 101615,
 108289,
 115249,
 122501,
 130051]
print(f"   Time taken: {parallel_time:.2f} seconds")
print(f"   Speedup: {sequential_time/parallel_time:.2f}x faster")
   Time taken: 0.50 seconds
   Speedup: 6.21x faster
print("\n📊 Joblib with Progress Tracking:")
start_time = time.time()
parallel_results_verbose = Parallel(n_jobs=4, verbose=1)(
    delayed(process_data)(x) for x in data
)
verbose_time = time.time() - start_time
print(f"   Time taken: {verbose_time:.2f} seconds")

📊 Joblib with Progress Tracking:
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
   Time taken: 1.48 seconds
[Parallel(n_jobs=4)]: Done  50 out of  50 | elapsed:    1.4s finished

🌐 Parallel Web Scraping Examples

Now let’s apply these concepts to web scraping. We’ll compare sequential vs parallel approaches for scraping multiple URLs.
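The comparison can be sketched end to end without touching the network. The `fake_fetch` function below is a stand-in that simulates a 0.2-second page download; in real code you would replace it with a `requests.get` call. Threads are used because fetching is I/O-bound, so they parallelize well despite the GIL.

```python
# Sequential vs parallel "scraping", simulated offline with time.sleep.
# fake_fetch is a placeholder for a real requests.get call.
from concurrent.futures import ThreadPoolExecutor
import time

urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 6)]

def fake_fetch(url):
    """Stand-in for a real HTTP request: pretend each page takes 0.2s."""
    time.sleep(0.2)
    return f"<html>content of {url}</html>"

# Sequential: one URL at a time
start = time.time()
pages_seq = [fake_fetch(u) for u in urls]
seq_time = time.time() - start

# Parallel: all five URLs in flight at once
start = time.time()
with ThreadPoolExecutor(max_workers=5) as pool:
    pages_par = list(pool.map(fake_fetch, urls))
par_time = time.time() - start

print(f"Sequential: {seq_time:.2f}s, Parallel: {par_time:.2f}s")
```

With 5 workers and 5 URLs the parallel version takes roughly one request's worth of time instead of five.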

🛡️ Advanced Parallel Scraping with Rate Limiting

When scraping real websites, we need to be more careful about rate limiting, error handling, and respecting server resources.

https://www.ysu.am/robots.txt
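Before hitting a site in parallel, you can check its robots.txt rules with the standard library's `urllib.robotparser`. The rules below are a made-up example so the snippet runs offline; in practice you would point the parser at the site's real file with `rp.set_url(...)` followed by `rp.read()`.

```python
# Checking robots.txt rules with the standard library before scraping.
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real site serves these at /robots.txt
robots_txt = """
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/products/"))    # True - allowed
print(rp.can_fetch("*", "https://example.com/admin/users"))  # False - disallowed
print(rp.crawl_delay("*"))  # 2 - wait at least 2 seconds between requests
```

A polite parallel scraper checks `can_fetch` for every URL and honors `crawl_delay` when the site declares one.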

⚡ Performance Optimization & Best Practices

🎯 Choosing the Right Approach

| Method | Best For | Pros | Cons |
|---|---|---|---|
| Sequential | Small datasets, strict rate limits | Simple, predictable | Slow for large datasets |
| Threading | I/O-bound tasks, many small requests | Good for network-bound tasks | GIL limitations in Python |
| Multiprocessing | CPU-intensive parsing | True parallelism | Higher memory usage |
| Joblib | Balanced approach, data science tasks | Easy to use, optimized | Extra dependency |

🛡️ Rate Limiting Strategies

# 1. Fixed delay between requests
time.sleep(1)

# 2. Random delay (more human-like)
time.sleep(random.uniform(0.5, 2.0))

# 3. Exponential backoff on errors
wait_time = (2 ** attempt) + random.uniform(0, 1)

# 4. Domain-specific rate limiting
# Different limits for different websites
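Strategy #3 above can be wrapped into a reusable retry loop. In this sketch, `flaky_request` is a made-up stand-in that fails twice before succeeding, so the backoff behavior is visible offline; in real code it would be a `requests.get` call that raises on network errors.

```python
# Exponential backoff retry loop (strategy #3), demonstrated with a
# simulated flaky request so it runs without network access.
import random
import time

def flaky_request(url, _fail_times=[2]):
    """Simulated request: fails twice, then succeeds.
    (The mutable default is just a cheap counter for the demo.)"""
    if _fail_times[0] > 0:
        _fail_times[0] -= 1
        raise ConnectionError("temporary failure")
    return f"response from {url}"

def fetch_with_backoff(url, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return flaky_request(url)
        except ConnectionError as e:
            # Wait 2^attempt seconds plus jitter before retrying
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {wait_time:.1f}s")
            time.sleep(wait_time)
    raise RuntimeError(f"Giving up on {url} after {max_attempts} attempts")

result = fetch_with_backoff("https://example.com/data")
print(result)
```

The random jitter prevents many parallel workers from retrying at exactly the same moment and hammering the server in lockstep.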

🕸️ Scrapy Framework - Industrial-Strength Web Scraping

Scrapy is not just a library - it’s a complete framework for building web scrapers. Think of it as the difference between a hammer (Beautiful Soup) and a complete construction toolkit (Scrapy).

🤔 When to Choose Scrapy vs Beautiful Soup?

| Use Case | Beautiful Soup | Scrapy |
|---|---|---|
| Simple, one-time scraping | ✅ Perfect | ❌ Overkill |
| Large-scale projects | ❌ Limited | ✅ Excellent |
| Multiple websites | ❌ Manual work | ✅ Built-in support |
| Following links automatically | ❌ Manual coding | ✅ Built-in |
| Data export (CSV, JSON) | ❌ Manual coding | ✅ Built-in |
| Handling cookies/sessions | ❌ Manual coding | ✅ Automatic |
| Concurrent requests | ❌ Manual threading | ✅ Built-in |
| Respecting robots.txt | ❌ Manual checking | ✅ Automatic |

🏗️ Scrapy Architecture

Scrapy follows a component-based architecture:

  1. Engine - Controls data flow between components
  2. Scheduler - Manages which URLs to scrape next
  3. Downloader - Fetches web pages
  4. Spiders - Your custom logic for extracting data
  5. Item Pipeline - Processes extracted data
  6. Middlewares - Hooks for customizing requests/responses

🚀 Getting Started with Scrapy

Installation:

pip install scrapy

Creating a Scrapy Project:

# Create new project
scrapy startproject myproject

# Project structure created:
myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # project's Python module
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # directory for spiders
            __init__.py

Key Files Explained:

  • spiders/ - Where you write your scraping logic
  • items.py - Define what data you want to extract
  • pipelines.py - Process the extracted data
  • settings.py - Configure how Scrapy behaves

🕷️ Understanding Scrapy Spiders

A Spider is a class that defines how to scrape a website. Every spider must:

  1. Have a unique name
  2. Define starting URLs
  3. Implement a parse method

Basic Spider Structure:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'                    # Unique identifier
    allowed_domains = ['quotes.toscrape.com']  # Optional: restrict domains
    start_urls = ['http://quotes.toscrape.com/']  # Starting URLs
    
    def parse(self, response):
        # This method is called for each start_url
        # Extract data and/or follow links
        pass

The response Object:

  • response.css() - Use CSS selectors
  • response.xpath() - Use XPath selectors
  • response.url - Current URL
  • response.status - HTTP status code
  • response.follow() - Follow links
# Complete Scrapy Spider Example (Simulated)
# Note: This is how a Scrapy spider looks - normally it runs in Scrapy framework

class QuotesSpider:
    """
    Example Scrapy Spider for quotes.toscrape.com
    This shows the structure and logic of a real Scrapy spider
    """
    
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    
    def parse(self, response):
        """
        Main parsing method - called for each response
        
        Args:
            response: Scrapy response object with methods:
                - response.css('selector') - CSS selectors
                - response.xpath('xpath') - XPath selectors  
                - response.follow(link) - Follow links
        """
        
        # Extract all quotes on the current page
        quotes = response.css('div.quote')
        
        for quote in quotes:
            # Extract individual fields using CSS selectors
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        
        # Follow the "Next" page link automatically
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            # This tells Scrapy to follow the link and call parse() again
            yield response.follow(next_page, self.parse)

# Let's simulate what Scrapy does behind the scenes
print("🕷️ Scrapy Spider Analysis:")
print("=" * 35)

print("\n1️⃣ Spider Attributes:")
spider = QuotesSpider()
print(f"   Name: {spider.name}")
print(f"   Allowed domains: {spider.allowed_domains}")
print(f"   Starting URLs: {spider.start_urls}")

print("\n2️⃣ How Scrapy Works:")
print("   Step 1: Scrapy sends requests to start_urls")
print("   Step 2: Calls parse() method with each response")
print("   Step 3: Spider yields data items and/or new requests")
print("   Step 4: Scrapy schedules new requests and processes items")
print("   Step 5: Repeats until no more requests")

print("\n3️⃣ Key Scrapy Concepts:")
print("   📥 yield items → Data extraction")
print("   📤 yield requests → Following links")
print("   🔄 response.follow() → Automatic link following")
print("   🎯 CSS/XPath selectors → Element selection")

print("\n4️⃣ Scrapy Selectors:")
print("   .get() → Get first match (like select_one)")
print("   .getall() → Get all matches (like select)")
print("   ::text → Extract text content")
print("   ::attr(name) → Extract attribute value")

print("\n5️⃣ Running the Spider:")
print("   Command: scrapy crawl quotes -o quotes.json")
print("   Output: Saves all extracted data to quotes.json")
print("   Automatically: Handles requests, follows links, exports data")
🕷️ Scrapy Spider Analysis:
===================================

1️⃣ Spider Attributes:
   Name: quotes
   Allowed domains: ['quotes.toscrape.com']
   Starting URLs: ['http://quotes.toscrape.com/']

2️⃣ How Scrapy Works:
   Step 1: Scrapy sends requests to start_urls
   Step 2: Calls parse() method with each response
   Step 3: Spider yields data items and/or new requests
   Step 4: Scrapy schedules new requests and processes items
   Step 5: Repeats until no more requests

3️⃣ Key Scrapy Concepts:
   📥 yield items → Data extraction
   📤 yield requests → Following links
   🔄 response.follow() → Automatic link following
   🎯 CSS/XPath selectors → Element selection

4️⃣ Scrapy Selectors:
   .get() → Get first match (like select_one)
   .getall() → Get all matches (like select)
   ::text → Extract text content
   ::attr(name) → Extract attribute value

5️⃣ Running the Spider:
   Command: scrapy crawl quotes -o quotes.json
   Output: Saves all extracted data to quotes.json
   Automatically: Handles requests, follows links, exports data

📚 Resources & Documentation - Organized by Library

🥄 Beautiful Soup Resources

Official Documentation:

Video Tutorials:

Articles & Tutorials:

🌐 Requests Library Resources

Official Documentation:

Video Tutorials:

Articles:

🕸️ Scrapy Framework Resources

Official Documentation:

Video Tutorials:

Books & Courses:

  • “Learning Scrapy” by Dimitris Kouzis-Loukas - Packt
  • “Web Scraping with Python and Scrapy” - Udemy courses
  • Scrapy GitHub Examples - Official examples

🚗 Selenium Resources

Official Documentation:

Video Tutorials:

Articles & Guides:

⚡ Parallel Processing & Advanced Topics

Joblib Resources:

Multiprocessing & Threading:

Performance & Optimization:

🛠️ General Web Scraping Resources

Practice Websites:

Alternative Libraries:

Data Processing:

📖 Books & Comprehensive Courses

Online Courses:

🎯 Learning Path Recommendations

Beginner (1-2 weeks):

  1. HTML/CSS basics
  2. Beautiful Soup fundamentals
  3. Simple scraping projects
  4. Practice websites

Intermediate (2-4 weeks):

  1. Selenium for dynamic content
  2. Error handling and robustness
  3. Data processing with pandas
  4. Multiple page scraping

Advanced (4+ weeks):

  1. Scrapy framework
  2. Parallel processing
  3. Large-scale projects
  4. Production deployment

🎲 00
