Python Tutorial

Saturday, January 28, 2012

Python Beautiful Soup: Extract URLs from a Web Page


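# Python 2 script using Beautiful Soup 3 (the "BeautifulSoup" module).
from BeautifulSoup import BeautifulSoup, SoupStrainer
import urllib2

def get_url_content(site_url):
    """Return the body of site_url; on an HTTP error, return the error page body."""
    try:
        request = urllib2.Request(site_url)
        f = urllib2.urlopen(request)
        content = f.read()
        f.close()
    except urllib2.HTTPError as error:
        # HTTPError doubles as a response object, so its body can still be read.
        content = error.read()
    return content

response = get_url_content('http://www.sust.edu/')

# SoupStrainer('a') restricts parsing to anchor tags only.
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    # Skip anchors that have no href attribute.
    if link.has_key('href'):
        print link['href']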




Output:

The href value of every <a> tag on the page, printed one per line.
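
The script above is written for Python 2 (urllib2 and the print statement). As a rough sketch only, the same idea might look like this on Python 3 with the beautifulsoup4 package. The helper name extract_links is hypothetical and not part of the original post. Because href values are often relative, this version also resolves each one against the page URL with urljoin, which is an addition to the original behaviour.

```python
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def get_url_content(site_url):
    """Return the raw body of site_url, or the error page body on an HTTP error."""
    try:
        with urlopen(site_url) as response:
            return response.read()
    except HTTPError as error:
        # HTTPError is also a response object, so its body can be read.
        return error.read()


def extract_links(html, base_url):
    """Return every href found in html, resolved against base_url.

    Hypothetical helper written for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors that have no href attribute.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


# Example usage (performs a network request):
# page_url = "http://www.sust.edu/"
# for url in extract_links(get_url_content(page_url), page_url):
#     print(url)
```

Keeping the download and the parsing in separate functions means the link extraction can be tried on any HTML string without touching the network.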

Beautiful Soup Python: Install

Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
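
To make the navigating, searching, and modifying claims concrete, here is a minimal sketch assuming Python 3 and the beautifulsoup4 package with the built-in html.parser backend. The markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Deliberately invalid markup: the <p> and <b> tags are never closed.
broken_html = "<p>Hello <b>world<a href='/docs'>docs</a>"

soup = BeautifulSoup(broken_html, "html.parser")

# Navigating: attribute access returns the first matching tag.
first_link = soup.a

# Searching: find_all returns every matching tag.
all_links = soup.find_all("a")

# Modifying: tag attributes behave like dictionary entries.
first_link["href"] = "/manual"
```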

Install with pip:


pip install beautifulsoup4

Manual install steps:


- Download the library from here.
- Extract the archive.
- cd into the extracted directory from the command prompt.
- Run the command python setup.py install
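
After either installation route, a quick check along these lines can confirm that Python can find the package. This is a sketch assuming Python 3. The helper name is_installed is hypothetical. Note that the beautifulsoup4 package is imported under the module name bs4.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# beautifulsoup4 is imported under the name "bs4".
if is_installed("bs4"):
    print("Beautiful Soup 4 is installed")
else:
    print("Beautiful Soup 4 is not installed; try: pip install beautifulsoup4")
```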