get_html_content

The get_html_content function retrieves the HTML content of a given URL using Selenium WebDriver. Before reading the page source, it either waits for a specific element to be present on the page or waits for a specified amount of time.

Parameters

url (str): The URL of the web page you want to scrape.
element_name (str, optional): The locator value of the element to wait for before retrieving the page source (for example a class name or an ID, interpreted according to by). Defaults to None.
by (By, optional): The method used to locate the element. Defaults to By.CLASS_NAME.
time_wait (int, optional): The maximum time (in seconds) to wait for the element, or the time to wait when no element is specified. Defaults to 10.

Functionality

Set Up Chrome Options:
- Configures the Chrome driver to run in headless mode (no GUI).
Initialize Chrome Driver:
- Sets up the Chrome WebDriver with the specified options.
Open the URL:
- Navigates to the provided URL.
Wait for Element or Time:
- If element_name is provided, it waits until the element is located using the specified method (by).
- If no element is specified, it waits for the specified amount of time.
Retrieve Page Source:
- Fetches the HTML content of the page.
Close the Driver:
- Closes the WebDriver to clean up resources.
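
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the function's actual source: it assumes the timed wait is a plain time.sleep, that the element wait checks for presence in the DOM, and that the driver is shut down with quit() inside a try/finally. The by=None placeholder stands in for the documented default By.CLASS_NAME, and the Selenium imports are deferred into the function body only to keep the sketch loadable on its own.

```python
import time


def get_html_content(url, element_name=None, by=None, time_wait=10):
    """Return the rendered HTML of `url` (illustrative sketch only)."""
    # Deferred imports: an assumption of this sketch, not a requirement.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    if by is None:
        by = By.CLASS_NAME  # documented default locator strategy

    # Set up Chrome options: headless mode (no GUI).
    options = Options()
    options.add_argument("--headless")

    # Initialize the Chrome driver with those options.
    driver = webdriver.Chrome(options=options)
    try:
        # Open the URL.
        driver.get(url)

        if element_name:
            # Wait up to `time_wait` seconds for the element to be present.
            WebDriverWait(driver, time_wait).until(
                EC.presence_of_element_located((by, element_name))
            )
        else:
            # Assumption: with no element given, pause for a fixed time.
            time.sleep(time_wait)

        # Retrieve the page source.
        return driver.page_source
    finally:
        # Close the driver even if the wait above raised (e.g. on timeout).
        driver.quit()
```

In this sketch, an element that never appears causes WebDriverWait.until to raise a TimeoutException after time_wait seconds; the finally block still shuts the browser down. Whether the real function propagates or handles that exception is not stated here.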

Example Usage

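from selenium.webdriver.common.by import By

# Wait up to 15 seconds for the element with id="main-content", then return the page HTML.
html_content = get_html_content(
    url="https://example.com",
    element_name="main-content",
    by=By.ID,
    time_wait=15,
)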
