convert_html_to_markdown
Parameters
html_content (str): The HTML content to convert into Markdown.
Functionality
The convert_html_to_markdown function performs the following tasks:
Create HTML2Text Object: Utilizes the html2text library to convert HTML into Markdown.
Ignore Links: By default, links are not ignored in the conversion. This can be adjusted by setting the ignore_links property of the html2text object.
Remove Image Data:
    Removes standard HTML image tags (<img>) from the content.
    Removes any base64-encoded image data (PNG format) embedded in the HTML content.
Convert HTML to Markdown: Converts the cleaned HTML content into Markdown format using the html2text object.
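The steps above could be sketched roughly as follows. This is an illustration only, not the function's actual source: the regular expressions and the names strip_image_data and convert_html_to_markdown_sketch are assumptions made for this example. The html2text calls used here (HTML2Text(), the ignore_links attribute, and handle()) are part of that third-party package, which must be installed separately.

```python
import re

# Hypothetical patterns: the regexes used by the real function are not shown
# in this document, so these are assumptions for illustration.
IMG_TAG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
BASE64_PNG_RE = re.compile(r"data:image/png;base64,[A-Za-z0-9+/=]+")


def strip_image_data(html_content: str) -> str:
    """Drop <img> tags, then any leftover base64-encoded PNG data URIs."""
    without_tags = IMG_TAG_RE.sub("", html_content)
    return BASE64_PNG_RE.sub("", without_tags)


def convert_html_to_markdown_sketch(html_content: str) -> str:
    """Clean the HTML and hand it to html2text for conversion."""
    # Imported here so strip_image_data stays usable without the dependency.
    import html2text

    converter = html2text.HTML2Text()
    converter.ignore_links = False  # keep links, matching the described default
    return converter.handle(strip_image_data(html_content))
```

Setting converter.ignore_links = True in the sketch would drop link targets from the Markdown output. The second pattern matters for base64 data that appears outside an <img> tag, for instance inside an inline style attribute.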
Example
html_content = """
<h1>Title</h1>
<p>This is a paragraph with an <a href="http://example.com">example link</a>.</p>
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAAC2...">
"""
markdown_content = convert_html_to_markdown(html_content)
print(markdown_content)
Output:
In this example, the function converts an HTML snippet into Markdown, stripping out image data and preserving text and links.