Let's Build a Free Web Scraping Tool That Combines Proxies and AI for Data Analysis

by Victor Yakubu (@aviatorscode2), December 17th, 2024

Too Long; Didn't Read

Learn how to combine web scraping, proxies, and AI-powered language models to automate data extraction and gain actionable insights effortlessly.

While some websites are straightforward to scrape using just Selenium, Puppeteer, and the like, websites that implement advanced security measures such as CAPTCHAs and IP bans can prove difficult. To overcome these challenges, the scraper you will build in this article integrates a proxy tool that helps bypass these security measures, so you can scrape the vast majority of websites for free.


However, collecting the data is just one step; what you do with that data is equally, if not more, important. Often, this requires painstakingly sifting through large volumes of information manually. But what if you could automate this process? By leveraging a large language model (LLM), you can not only collect data but also query it to extract meaningful insights, saving time and effort.


In this guide, you’ll learn how to combine web scraping with AI to build a powerful tool for collecting and analyzing data at scale for free. Let’s dive in!

Prerequisites

Before you begin, ensure you have the following:

  1. Basic Python knowledge, as this project involves writing and understanding Python code.
  2. Python 3.9 or later installed on your system (recent releases of Streamlit and LangChain no longer support older versions). You can download it from python.org.

Installation and Setup

Follow these steps to set up your environment and prepare for building the AI-powered scraper.

1. Create a Virtual Environment

First, set up a virtual environment to manage your project’s dependencies. This will ensure you have an isolated space for all the required packages.


  1. Create a new project directory:

    Open your terminal (or Command Prompt/PowerShell on Windows) and create a new directory for your project:

    mkdir ai-website-scraper
    
    cd ai-website-scraper
    


  2. Create the virtual environment:

Run the following command to create the virtual environment:


  • On Windows:

    python -m venv venv
    
  • On macOS/Linux:

    python3 -m venv venv
    


This creates a venv folder that will store the virtual environment.


2. Activate the Virtual Environment

Activate the virtual environment to begin working within it:


  • On Windows:

    .\venv\Scripts\activate
    
  • On macOS/Linux:

    source venv/bin/activate
    


Your terminal prompt will change to show (venv), confirming you're now inside the virtual environment.
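
If you want to confirm from Python itself that the environment is active, one quick sketch is to compare sys.prefix with sys.base_prefix, which differ inside a venv. The helper name in_virtualenv is hypothetical and not part of the project:

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Inside a virtual environment:", in_virtualenv())
```

Run it with the same python command you will use for the project; if it prints False, activate the environment again before installing anything.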

3. Install Required Dependencies

Now, install the libraries your project needs. Create a requirements.txt file in your project directory and add the following dependencies:


streamlit
selenium
beautifulsoup4
langchain
langchain-ollama
lxml
html5lib


These packages are essential for scraping, data processing, and building the UI:

  • streamlit: Creates the interactive user interface.

  • selenium: Loads web pages in a real browser and retrieves their content.

  • beautifulsoup4: Parses and cleans the HTML.

  • langchain and langchain-ollama: Connect your script to models served by Ollama and handle prompt processing.

  • lxml and html5lib: Alternative HTML parsers that BeautifulSoup can use.


Install the dependencies by running the following command:

(Ensure that you are in the folder where the file is located before running the command.)


pip install -r requirements.txt
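
To check that the install worked, a small sketch like the one below can look up each module without importing it. The helper name missing_modules is hypothetical. Note that some import names differ from the pip package names (beautifulsoup4 is imported as bs4, and langchain-ollama as langchain_ollama):

```python
import importlib.util

# Import names, which differ from some of the pip package names
REQUIRED_MODULES = [
    "streamlit",
    "selenium",
    "bs4",
    "langchain",
    "langchain_ollama",
    "lxml",
    "html5lib",
]


def missing_modules(module_names):
    """Return the names that cannot be found in the current environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", missing or "none")
```

If anything is listed as missing, make sure the virtual environment is active and rerun the pip command.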


Building the UI with Streamlit

Streamlit makes it easy to create an interactive user interface (UI) for Python applications. In this section, you will build a simple, user-friendly interface where users can input a URL and display the scraped data.

1. Set Up the Streamlit Script

Create a file named ui.py in your project directory. This script will define the UI for your scraper. Use the code below to structure your application:

import streamlit as st
import pathlib
from main import scrape_website

# function to load css from the assets folder
def load_css(file_path):
    with open(file_path) as f:
        st.html(f"<style>{f.read()}</style>")

# Load the external CSS
css_path = pathlib.Path("assets/style.css")
if css_path.exists():
    load_css(css_path)

st.title("AI Scraper")

st.markdown(
    "Enter a website URL to scrape, clean the text content, and display the result in smaller chunks."
)

url = st.text_input(label="Website URL", label_visibility="collapsed", placeholder="Enter the URL of the website you want to scrape")

if st.button("Scrape", key="scrape_button"):
    st.write("scraping the website...")
    result = scrape_website(url)
    st.write("Scraping complete.")
    st.write(result)


  • The st.title and st.markdown functions set up the application title and provide instructions for users.
  • The st.text_input component lets users input the URL of the website they want to scrape.
  • Clicking the "Scrape" button triggers the scraping logic, displaying progress messages using st.write.
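
Because the text input accepts anything the user types, it can help to validate the URL before handing it to the scraper. The following is a minimal sketch using only the standard library. The helper name normalize_url and the choice to assume https when the scheme is missing are assumptions for illustration, not part of the original app:

```python
from urllib.parse import urlparse


def normalize_url(raw_url: str):
    """Return a cleaned http(s) URL, or None if the input is unusable."""
    candidate = raw_url.strip()
    if not candidate:
        return None
    # Assume https when the user omits the scheme, e.g. "example.com"
    if "://" not in candidate:
        candidate = "https://" + candidate
    parsed = urlparse(candidate)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    return candidate
```

In ui.py, you could call normalize_url(url) inside the button handler and show st.error when it returns None instead of starting a scrape.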


You can learn more about Streamlit components in the official documentation.

2. Add Custom Styles

To style your application, create an assets folder in your project directory and add a style.css file. Customize the Streamlit interface with CSS:

.stAppViewContainer {
    background-image: url("https://images.unsplash.com/photo-1732979887702-40baea1c1ff6?q=80&w=2832&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D");
    background-size: cover;
    color: black;
}

.stAppHeader {
    background-color: rgba(0, 0, 0, 0);
}

.st-ae {
    background-color: rgba(233, 235, 234, 0.895);
}

.st-emotion-cache-ysk9xe {
    color: black;
}

.st.info, .stAlert {
    background-color: black;
}
.st-key-scrape_button button {
    display: inline-block;
    padding: 10px 20px;
    font-size: 16px;
    color: #fff;
    background-color: #007bff;
    border: none;
    border-radius: 5px;
    cursor: pointer;
    animation: pulse 2s infinite;
}

.st-key-scrape_button button:hover {
    background-color: #0056b3;
    color: #fff;
}


3. Run the Streamlit app

In your project directory, run the following command:

streamlit run ui.py


This will launch a local server, and you should see a URL in the terminal, usually http://localhost:8501. Open this URL in your browser to interact with the web application.


Scraping a Website with Selenium

Next, write the code to extract the HTML content of any webpage using Selenium. However, for the code to work, you need a Chrome WebDriver.

Install ChromeDriver for Selenium

Selenium requires a WebDriver to interact with web pages. Here’s how to set it up:

  1. Download ChromeDriver:
    Visit the ChromeDriver website and download the version matching your Google Chrome browser.
  2. Add ChromeDriver to your project:
    After downloading ChromeDriver, extract the archive and copy the chromedriver executable (chromedriver.exe on Windows) into your project folder.

When this is done, create a new file called main.py and implement the code below:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Function to scrape HTML from a website
def scrape_website(website_url):
    # Path to WebDriver
    webdriver_path = "./chromedriver"  # Replace with your WebDriver path
    service = Service(webdriver_path)
    driver = webdriver.Chrome(service=service)

    try:
        # Open the website
        driver.get(website_url)

        # Wait for the page to fully load
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "body")))

        # Extract the HTML source
        html_content = driver.page_source
        return html_content

    finally:
        # Ensure the browser is closed after scraping
        driver.quit()


Save the file and run the app again. After you enter a URL and click Scrape, the full HTML of the page should appear in your Streamlit application.

Using a Proxy Provider to Bypass CAPTCHAs and IP Bans

While you can now retrieve the HTML of a website, the above code may not work for sites with advanced anti-scraping mechanisms such as CAPTCHA challenges or IP bans. For example, scraping a site like Indeed or Amazon using Selenium may result in a CAPTCHA page blocking access. This happens because the website detects that a bot is trying to access its content. If this behaviour persists, the site may eventually ban your IP address, preventing further access.
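
To notice when a scrape has hit a block page rather than real content, a rough heuristic is to search the returned HTML for typical block phrases. This is only a sketch: the marker list and the helper name looks_blocked are hypothetical, and real block pages vary from site to site.

```python
# Hypothetical marker phrases; adjust them to the sites you scrape
BLOCK_MARKERS = (
    "captcha",
    "are you a robot",
    "unusual traffic",
    "access denied",
)


def looks_blocked(html: str) -> bool:
    """Rough heuristic: True if the page contains a known block phrase."""
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)
```

A check like this can produce false positives, for example on pages that merely embed a CAPTCHA script on a login form, so treat a True result as a hint to inspect the page rather than proof of a ban.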




To fix this, integrate Bright Data’s Scraping Browser into your script. The scraping browser is a robust tool that leverages multiple proxy networks, including residential IPs, to bypass anti-scraping defenses. It handles unblocking pages by managing custom headers, browser fingerprinting, CAPTCHA solving, and more. This ensures that your scraping efforts remain undetected while accessing content seamlessly.

Setting up Bright Data’s Scraping Browser for free

  1. Sign up: go to Bright Data’s homepage and click on “Start Free Trial”. If you already have an account with Bright Data, you can just log in.

  2. After logging in, click on “Get Proxy Products”.


  3. Click on the “Add” button and select “Scraping Browser.”


  4. Next, you will be taken to the “Add zone” page, where you will be required to choose a name for your new scraping browser proxy zone. After that, click on “Add”.


  5. After this, your proxy zone credentials will be created. You will need these details in your script to bypass any anti-scraping mechanisms used on any website.


You can also check out Bright Data’s developer documentation for more details about the scraping browser.


In your main.py file, replace the existing code with the following. You will notice that it is cleaner and shorter than the previous version.


from selenium.webdriver import Remote, ChromeOptions
from selenium.webdriver.chromium.remote_connection import ChromiumRemoteConnection
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup


AUTH = '<username>:<password>'
SBR_WEBDRIVER = f'https://{AUTH}@brd.superproxy.io:9515'

# Function to scrape HTML from a website
def scrape_website(website_url):
    print("Connecting to Scraping Browser...")
    sbr_connection = ChromiumRemoteConnection(SBR_WEBDRIVER, "goog", "chrome")
    with Remote(sbr_connection, options=ChromeOptions()) as driver:
        driver.get(website_url)
        print("Waiting for captcha to solve...")
        solve_res = driver.execute(
            "executeCdpCommand",
            {
                "cmd": "Captcha.waitForSolve",
                "params": {"detectTimeout": 10000},
            },
        )
        print("Captcha solve status:", solve_res["value"]["status"])
        print("Navigated! Scraping page content...")
        html = driver.page_source
        return html


Replace <username> and <password> with your scraping browser username and password.
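
Hardcoding credentials makes them easy to leak if you share or commit the file. As an alternative, here is a minimal sketch that reads them from environment variables. The variable names BRIGHTDATA_USERNAME and BRIGHTDATA_PASSWORD and the helper build_webdriver_url are assumptions for illustration; the host and port are the ones used in SBR_WEBDRIVER above.

```python
import os


def build_webdriver_url(env=os.environ):
    """Build the Scraping Browser endpoint from environment variables."""
    # BRIGHTDATA_USERNAME / BRIGHTDATA_PASSWORD are assumed names for this sketch
    username = env.get("BRIGHTDATA_USERNAME")
    password = env.get("BRIGHTDATA_PASSWORD")
    if not username or not password:
        raise RuntimeError(
            "Set BRIGHTDATA_USERNAME and BRIGHTDATA_PASSWORD first"
        )
    # Same host and port as the SBR_WEBDRIVER constant in main.py
    return f"https://{username}:{password}@brd.superproxy.io:9515"
```

With this in place, SBR_WEBDRIVER could be assigned from build_webdriver_url() after exporting both variables in your terminal.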

Cleaning the DOM Content

After scraping the HTML content of a website, it’s often filled with unnecessary elements such as JavaScript, CSS styles, or unwanted tags that do not contribute to the core information you’re extracting. To make the data more structured and useful for further processing, you need to clean the DOM content by removing irrelevant elements and organizing the text.


This section explains how to clean the HTML content, extract meaningful text, and split it into smaller chunks for downstream processing. The cleaning process is essential for preparing data for tasks like natural language processing or content analysis.

Code Walkthrough for Cleaning DOM Content

Here’s the code that will be added to main.py to handle cleaning the DOM content:


from bs4 import BeautifulSoup

# Extract the body content from the HTML
def extract_body_content(html_content):
    soup = BeautifulSoup(html_content, "html.parser")
    body_content = soup.body
    if body_content:
        return str(body_content)
    return ""

# Clean the body content by removing scripts, styles, and other unwanted elements
def clean_body_content(body_content):
    soup = BeautifulSoup(body_content, "html.parser")

    # Remove <script> and <style> tags
    for script_or_style in soup(["script", "style"]):
        script_or_style.extract()

    # Extract cleaned text with each line separated by a newline
    cleaned_content = soup.get_text(separator="\n")
    cleaned_content = "\n".join(
        line.strip() for line in cleaned_content.splitlines() if line.strip()
    )

    return cleaned_content

# Split the cleaned content into smaller chunks for processing
def split_dom_content(dom_content, max_length=5000):
    return [
        dom_content[i : i + max_length] for i in range(0, len(dom_content), max_length)
    ]


What the Code Does

  1. Extracting the Body Content:
    • The extract_body_content function uses BeautifulSoup to parse the HTML and extract the <body> tag's content.
    • If a <body> tag exists, the function returns it as a string. Otherwise, it returns an empty string.
  2. Cleaning the Content:
    • The clean_body_content function processes the extracted content to remove unnecessary elements:
      • <script> and <style> tags are removed to eliminate JavaScript and CSS.
      • The function retrieves the plain text from the cleaned content.
      • It formats the text by stripping empty lines and extraneous spaces.
  3. Splitting the Content:
    • The split_dom_content function takes the cleaned content and splits it into smaller chunks with a default maximum length of 5,000 characters.
    • This is useful for processing large amounts of text in manageable pieces, especially when passing data to models with token or input size limits.
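
To see what the last two steps do to real text, here is a small worked example on a made-up string. It reuses the line-normalization expression from clean_body_content and the split_dom_content function as written above. The split_on_lines function is a hypothetical variant, not part of the project, that packs whole lines into each chunk so entries are less likely to be cut in half:

```python
def normalize_lines(text):
    """Strip each line and drop empty ones, as clean_body_content does."""
    return "\n".join(
        line.strip() for line in text.splitlines() if line.strip()
    )


def split_dom_content(dom_content, max_length=5000):
    """Fixed-size character chunks, identical to the function above."""
    return [
        dom_content[i : i + max_length]
        for i in range(0, len(dom_content), max_length)
    ]


def split_on_lines(dom_content, max_length=5000):
    """Hypothetical variant: pack whole lines into chunks of at most max_length."""
    chunks, current = [], ""
    for line in dom_content.splitlines():
        # Fall back to fixed-size slices for a single over-long line
        pieces = split_dom_content(line, max_length) or [""]
        for piece in pieces:
            candidate = f"{current}\n{piece}" if current else piece
            if len(candidate) <= max_length:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "  Senior Engineer \n\n   Remote  \n  Apply now  "
cleaned = normalize_lines(sample)
print(repr(cleaned))
# 'Senior Engineer\nRemote\nApply now'
print(split_dom_content(cleaned, max_length=12))
# ['Senior Engin', 'eer\nRemote\nA', 'pply now']
print(split_on_lines(cleaned, max_length=16))
# ['Senior Engineer', 'Remote\nApply now']
```

The fixed-size split is simpler and always fills each chunk, while the line-based variant keeps each line intact at the cost of slightly uneven chunk sizes.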


Save your changes and test your application. Once the UI passes the scraped HTML through these functions, the output is plain, readable text rather than raw markup.

Passing the DOM Content to Ollama

Once the DOM content is cleaned and prepared, the next step is to extract specific details from it using a large language model (LLM) run through Ollama and called from your script with LangChain. Ollama is a CLI tool used to download and run LLMs locally. Before using Ollama, complete the following installations:


  • If you haven’t already, download and install Ollama from the official website. On macOS, you can also install it with Homebrew:

    brew install ollama
    
  • Next, pull a model from the Ollama model library. Options include Phi-3, Mistral, and Gemma 2, and each has its own system requirements. This tutorial uses phi3 mainly because it's lightweight.

    ollama pull phi3
    


After installation, you can call on that model from your script using LangChain to provide meaningful insights from the data that will be sent to it.
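
Before wiring the model into the app, it can be useful to confirm that the Ollama server is running and that phi3 has been pulled. The sketch below queries Ollama's local HTTP API for the list of installed models. It assumes the default address http://localhost:11434, and the helper names list_local_models and has_model are hypothetical:

```python
import json
import urllib.request


def list_local_models(base_url="http://localhost:11434", timeout=3):
    """Return model names known to the local Ollama server, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Server not running, connection refused, timeout, or a non-JSON reply
        return None
    return [m.get("name", "") for m in payload.get("models", [])]


def has_model(model_names, wanted="phi3"):
    """True if any name matches `wanted`, ignoring a tag such as ':latest'."""
    return any(name.split(":")[0] == wanted for name in model_names)


if __name__ == "__main__":
    models = list_local_models()
    if models is None:
        print("Ollama does not appear to be running.")
    else:
        print("phi3 available:", has_model(models))
```

If the server is not reachable, start Ollama and run the check again; if phi3 is missing, rerun the pull command above.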


Here’s how to set up the functionality that passes the DOM content to the phi3 model.

Code walkthrough for llm.py

Create a file named llm.py in your project directory. The following code passes the DOM chunks to the model through Ollama and extracts the relevant details:

from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate

# Template to instruct Ollama for parsing
template = (
    "You are tasked with extracting specific information from the following text content: {dom_content}. "
    "Please follow these instructions carefully: \n\n"
    "1. **Extract Information:** Only extract the information that directly matches the provided description: {parse_description}. "
    "2. **No Extra Content:** Do not include any additional text, comments, or explanations in your response. "
    "3. **Empty Response:** If no information matches the description, return an empty string (''). "
    "4. **Direct Data Only:** Your output should contain only the data that is explicitly requested, with no other text."
)

# Initialize the Ollama model
model = OllamaLLM(model="phi3")

# Function to parse DOM chunks with Ollama
def parse_with_ollama(dom_chunks, parse_description):
    prompt = ChatPromptTemplate.from_template(template)
    chain = prompt | model

    parsed_results = []

    for i, chunk in enumerate(dom_chunks, start=1):
        if not chunk.strip():  # Skip empty chunks
            print(f"Skipping empty chunk at batch {i}")
            continue

        try:
            print(f"Processing chunk {i}: {chunk[:100]}...")  # Print a preview
            print(f"Parse description: {parse_description}")
            response = chain.invoke(
                {
                    "dom_content": chunk,
                    "parse_description": parse_description,
                }
            )
            print(f"Response for batch {i}: {response}")
            parsed_results.append(response)
        except Exception as e:
            print(f"Error parsing chunk {i}: {repr(e)}")
            parsed_results.append(f"Error: {repr(e)}")

    return "\n".join(parsed_results)

What the Code Does

  1. Instruction Template:
    • Provides precise guidance for Ollama on what information to extract.
    • Ensures the output is clean, concise, and relevant to the parsing description.
  2. Chunk Processing:
    • The parse_with_ollama function iterates through the DOM chunks, processing each with the LLM.
    • Skips empty chunks to optimize performance.
  3. Error Handling:
    • Handles errors gracefully, logs them, and continues processing remaining chunks.
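
Because the prompt asks the model to return an empty string when a chunk has no match, and failures are stored as "Error: ..." strings, the joined output can contain blank or noisy entries. One possible post-processing step, sketched here with a hypothetical helper named merge_responses, filters those out before joining:

```python
def merge_responses(responses):
    """Join per-chunk LLM responses, dropping empty and error entries."""
    kept = []
    for response in responses:
        text = response.strip()
        # The prompt asks for '' when nothing matches; some models echo
        # the quotes literally, so treat those as empty too
        if text in ("", "''", '""'):
            continue
        # parse_with_ollama records failures as "Error: ..." strings
        if text.startswith("Error:"):
            continue
        kept.append(text)
    return "\n".join(kept)
```

To use it, parse_with_ollama could return merge_responses(parsed_results) instead of joining the list directly. Keep in mind that this also drops any genuine answer that happens to start with "Error:", so adjust the filter if your data can contain such text.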

Updating the ui.py File

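Update ui.py so users can give the LLM parsing instructions and view the results. The parsing block reads the cleaned text from st.session_state.dom_content, so the Scrape handler must store it there first. Replace the earlier import from main and the Scrape button handler with the versions below, then add the parsing block:

from main import scrape_website, extract_body_content, clean_body_content, split_dom_content
from llm import parse_with_ollama

# One possible wiring: scrape, clean, and keep the text in session state
# so it is still available when Streamlit reruns the script
if st.button("Scrape", key="scrape_button"):
    st.info("Scraping the website...")
    html_content = scrape_website(url)
    body_content = extract_body_content(html_content)
    st.session_state.dom_content = clean_body_content(body_content)
    st.success("Scraping complete.")

if "dom_content" in st.session_state:
    parse_description = st.text_area("Enter a description to extract specific insights from your scraped data:")

    if st.button("Parse Content", key="parse_button"):
        if parse_description.strip() and st.session_state.get("dom_content"):
            st.info("Parsing the content...")
            dom_chunks = split_dom_content(st.session_state.dom_content)
            parsed_result = parse_with_ollama(dom_chunks, parse_description)
            st.text_area("Parsed Results", parsed_result, height=300)
        else:
            st.error("Please provide valid DOM content and a description to parse.")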

How It Works in the UI

  1. User Input:
    • The user provides a natural language description of the data to extract in a text area.
  2. Parsing Trigger:
    • When the Parse Content button is clicked, the cleaned DOM content is split into manageable chunks and passed to parse_with_ollama.
  3. Results Display:
    • The parsed results are displayed in a text area, allowing users to review the extracted information.


With this done, the scraper can now provide responses to your prompts based on the data scraped.


What's next?

The combination of web scraping and AI opens up exciting possibilities for data-driven insights. Beyond collecting and saving data, you can now leverage AI to speed up the process of gaining insight from the scraped data. This is useful for marketing and sales teams, data analysts, business owners, and many others.


You can find the complete code for the AI scraper here. Feel free to experiment with it and adapt it to your unique needs. Contributions are also welcome—if you have ideas for improvements, consider creating a pull request!


You can also take this further. Here are some ideas:

  • Experiment with Prompts: Tailor your prompts to extract specific insights or address unique project requirements.
  • User Interface
  • Integrate Other LLMs: Explore models from other providers, such as OpenAI or Google Gemini, to further optimize your data analysis.