How do I grab all the links within an element in HTML using Python?
First, please check the image below, which can better explain my question:
I am trying to take user input to select one of the links below "Course Search Term" (i.e. Winter 2015).
The HTML opened shows part of the code for the webpage. I would like to grab the href links in the element, which consists of the 5 term links I want. I am following the instructions from this website (www.gregreda.com/2013/03/03/web-scraping-101-with-python/), but it doesn't explain this part. Here is the code I have been trying.
from bs4 import BeautifulSoup
from urllib2 import urlopen

base_url = "http://classes.uoregon.edu/"

def get_category_links(section_url):
    html = urlopen(section_url).read()
    soup = BeautifulSoup(html, "lxml")
    pldefault = soup.find("td", "pldefault")
    ul_links = pldefault.find("ul")
    category_links = [base_url + ul.a["href"] for ul in ul_links.findAll("ul")]
    return category_links
Any help is appreciated! Thanks. Or if you want to see the website, it is classes.uoregon.edu/
I would keep it simple and locate all the links containing 2015 in the text and term in the href:
for link in soup.find_all("a",
                          href=lambda href: href and "term" in href,
                          text=lambda text: text and "2015" in text):
    print link["href"]
Prints:
/pls/prod/hwskdhnt.p_search?term=201402
/pls/prod/hwskdhnt.p_search?term=201403
/pls/prod/hwskdhnt.p_search?term=201404
/pls/prod/hwskdhnt.p_search?term=201406
/pls/prod/hwskdhnt.p_search?term=201407
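For anyone on Python 3, here is a minimal sketch of the same filter written as an explicit loop, limited to the td.pldefault cell mentioned in the question. The sample markup, the link labels and the term_links helper are all made up for illustration; the real page structure may well differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question; not copied from the real page.
SAMPLE_HTML = """
<td class="pldefault">
  <ul>
    <li><a href="/pls/prod/hwskdhnt.p_search?term=201402">Winter 2015</a></li>
    <li><a href="/pls/prod/hwskdhnt.p_search?term=201403">Spring 2015</a></li>
    <li><a href="/about">About 2015 changes</a></li>
    <li><a href="/pls/prod/hwskdhnt.p_search?term=201401">Fall 2014</a></li>
    <li><a name="top">2015 anchor without href</a></li>
  </ul>
</td>
"""


def term_links(html, year="2015"):
    """Return hrefs that contain 'term' and whose link text contains `year`."""
    soup = BeautifulSoup(html, "html.parser")

    # Search only inside td.pldefault; fall back to the whole document
    # if that cell is not present.
    container = soup.find("td", class_="pldefault")
    if container is None:
        container = soup

    links = []
    # href=True skips <a> tags that have no href attribute at all.
    for a in container.find_all("a", href=True):
        if "term" in a["href"] and year in a.get_text():
            links.append(a["href"])
    return links


for href in term_links(SAMPLE_HTML):
    print(href)
```

With the sample markup above this prints only the two 2015 term links; the "/about" link is dropped because its href has no "term", and the Fall 2014 link is dropped because its text has no "2015".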
If you want full URLs, use urlparse.urljoin() to join the links with a base URL:
from urlparse import urljoin

...

for link in soup.find_all("a",
                          href=lambda href: href and "term" in href,
                          text=lambda text: text and "2015" in text):
    print urljoin(url, link["href"])
This would print:
http://classes.uoregon.edu/pls/prod/hwskdhnt.p_search?term=201402
http://classes.uoregon.edu/pls/prod/hwskdhnt.p_search?term=201403
http://classes.uoregon.edu/pls/prod/hwskdhnt.p_search?term=201404
http://classes.uoregon.edu/pls/prod/hwskdhnt.p_search?term=201406
http://classes.uoregon.edu/pls/prod/hwskdhnt.p_search?term=201407
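In Python 3 the same function lives in urllib.parse. A small sketch, assuming the relative links have already been collected into a list (the relative_links name is only for illustration), also shows why urljoin is safer than gluing strings together with +:

```python
from urllib.parse import urljoin

BASE_URL = "http://classes.uoregon.edu/"

# Two of the relative links from the output above.
relative_links = [
    "/pls/prod/hwskdhnt.p_search?term=201402",
    "/pls/prod/hwskdhnt.p_search?term=201403",
]

# urljoin resolves each href against the base URL the way a browser would.
full_urls = [urljoin(BASE_URL, href) for href in relative_links]

# Plain concatenation keeps both slashes, giving ".edu//pls/...".
concatenated = BASE_URL + relative_links[0]

for u in full_urls:
    print(u)
print(concatenated)
```

Because the base URL ends with a slash and the hrefs start with one, concatenation as in the question's code produces a doubled slash, while urljoin yields the clean URL.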