get_html_content

The get_html_content function retrieves the HTML of a given URL using Selenium WebDriver. Before reading the page source, it either waits for a specific element to be present on the page or pauses for a fixed amount of time.

Parameters

  • url (str): The URL of the web page you want to scrape.

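  • element_name (str, optional): The locator value (for example, a class name or an ID) of the element to wait for before retrieving the page source. Defaults to None, in which case no element is awaited.

  • by (By, optional): The locator strategy used to find element_name, such as By.ID or By.CLASS_NAME. Defaults to By.CLASS_NAME.

  • time_wait (int, optional): The time in seconds to wait. If element_name is given, this is the maximum time to wait for the element; otherwise it is the fixed delay before the page source is read. Defaults to 10.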

Functionality

  1. Set Up Chrome Options:

    • Configures the Chrome driver to run in headless mode (no GUI).

  2. Initialize Chrome Driver:

    • Sets up the Chrome WebDriver with the specified options.

  3. Open the URL:

    • Navigates to the provided URL.

  4. Wait for Element or Time:

    • If element_name is provided, waits up to time_wait seconds for the element to be present, locating it with the specified method (by).

    • If no element is specified, waits for time_wait seconds.

  5. Retrieve Page Source:

    • Fetches the HTML content of the page.

  6. Close the Driver:

    • Closes the WebDriver to clean up resources.
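
The six steps above can be sketched roughly as follows. This is a hypothetical reconstruction, not the original implementation: the use of time.sleep for the fixed wait, driver.quit() for cleanup, and the try/finally block are assumptions. Selenium is imported inside the function, and by=None is resolved to By.CLASS_NAME, so that the sketch can be loaded on a machine without Selenium installed.

```python
import time
from typing import Optional


def get_html_content(url: str, element_name: Optional[str] = None,
                     by: Optional[str] = None, time_wait: int = 10) -> str:
    """Return the HTML of `url` after an element wait or a fixed delay."""
    # Lazy imports (an assumption of this sketch): Selenium is only needed
    # when the function is actually called.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    if by is None:
        by = By.CLASS_NAME  # documented default locator strategy

    # 1. Chrome options: headless mode, no visible browser window.
    options = Options()
    options.add_argument("--headless")

    # 2. Start Chrome with those options.
    driver = webdriver.Chrome(options=options)
    try:
        # 3. Navigate to the page.
        driver.get(url)

        # 4. Wait for the element (at most time_wait seconds), or pause
        #    for time_wait seconds when no element was requested.
        if element_name is not None:
            WebDriverWait(driver, time_wait).until(
                EC.presence_of_element_located((by, element_name))
            )
        else:
            time.sleep(time_wait)

        # 5. Read the rendered HTML.
        return driver.page_source
    finally:
        # 6. Release the browser even if the wait raised an exception.
        driver.quit()
```

With this structure, a wait that times out still closes the browser, because driver.quit() runs in the finally block before the exception propagates to the caller.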

Example Usage

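from selenium.webdriver.common.by import By

# Wait up to 15 seconds for the element with id="main-content" to appear.
html_content = get_html_content(
    url="https://example.com",
    element_name="main-content",
    by=By.ID,
    time_wait=15,
)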