convert_html_to_markdown

Parameters

  • html_content (str): The HTML content that you want to convert into Markdown.

Functionality

The convert_html_to_markdown function performs the following tasks:

  1. Create HTML2Text Object: Instantiates an HTML2Text converter from the html2text library, which performs the HTML-to-Markdown conversion.

  2. Link Handling: Links are preserved by default. To omit them from the output, set the ignore_links property of the html2text object to True.

  3. Remove Image Data:

    • Removes standard HTML image tags (<img>) from the content.

    • Removes any base64-encoded image data (PNG format) embedded in the HTML content.

  4. Convert HTML to Markdown: Converts the cleaned HTML content into Markdown format using the html2text object.
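
The four steps above can be sketched as follows. This is a minimal illustration, not the function's actual source: the helper name strip_image_data, the name convert_html_to_markdown_sketch, the ignore_links keyword argument, and both regular expressions are assumptions made for this example. Only the html2text calls (HTML2Text, ignore_links, handle) come from the library itself.

```python
import re

# Hypothetical patterns; the original function's exact expressions are not shown.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
PNG_DATA_URI = re.compile(r"data:image/png;base64,[A-Za-z0-9+/=]+")


def strip_image_data(html_content: str) -> str:
    """Step 3: drop <img> tags, then any leftover base64 PNG data URIs."""
    without_tags = IMG_TAG.sub("", html_content)
    return PNG_DATA_URI.sub("", without_tags)


def convert_html_to_markdown_sketch(html_content: str, ignore_links: bool = False) -> str:
    """Steps 1, 2, and 4, assuming the third-party html2text package is installed."""
    # Imported here so strip_image_data stays usable without html2text.
    import html2text

    converter = html2text.HTML2Text()      # step 1: create the converter
    converter.ignore_links = ignore_links  # step 2: False keeps links in the output
    return converter.handle(strip_image_data(html_content))  # steps 3 and 4
```

Cleaning the HTML before calling handle keeps long base64 payloads out of the Markdown. The second pattern catches PNG data URIs that appear outside <img> tags, for example inside inline CSS.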

Example

html_content = """
<h1>Title</h1>
<p>This is a paragraph with an <a href="http://example.com">example link</a>.</p>
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAAC2...">
"""

markdown_content = convert_html_to_markdown(html_content)

print(markdown_content)

Output:

# Title

This is a paragraph with an [example link](http://example.com).

In this example, the function converts an HTML snippet into Markdown, stripping out image data and preserving text and links.