html - How to parse code after it has been stripped of styles and elements in Python
This is a basic question regarding HTML parsing:
I am new to Python (coding, computer science, etc.) and am teaching myself to parse HTML. I have imported both the pattern and Beautiful Soup modules to parse with. I found this code on the internet to cut out all the formatting.
    import requests
    import json
    import urllib
    from lxml import etree
    from pattern import web
    from bs4 import BeautifulSoup

    url = "http://webrates.truefx.com/rates/connect.html?f=html"
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()  # rip it out

    # get the text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    print(text)
This produces the following output:
eur/usd14265522866931.056661.056751.056081.057911.05686usd/jpy1426552286419121.405121.409121.313121.448121.382gbp/usd14265522866821.482291.482361.481941.483471.48281eur/gbp14265522865290.712790.712900.712300.713460.71273usd/chf14265522866361.008041.008291.006551.008791.00682eur/jpy1426552286635128.284128.296128.203128.401128.280eur/chf14265522866551.065121.065441.063491.066281.06418usd/cad14265522864891.278211.278321.276831.278531.27746aud/usd14265522864960.762610.762690.761150.764690.76412gbp/jpy1426552286682179.957179.976179.854180.077179.988
Now the point is: how can I parse this further if I want the string 'usd/chf' or a particular point of data?
Is there an easier method to web scrape and parse with? Any suggestions would be great!
System specs: Windows 7 64-bit, IDE: IDLE, Python: 2.7.5
Thank you in advance, Rusty
Keep it simple. Find the cell by its text (usd/chf, for example) and get the following siblings:
    text = 'usd/chf'
    cell = soup.find('td', text=text)
    for td in cell.next_siblings:
        print td.text
Prints:
    1426561775912
    1.00
    768
    1.00
    782
    1.00655
    1.00879
    1.00682
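If all you have left is the flattened text from the question (no soup object any more), a regular expression can still pull a record back out. This is only a rough sketch, not the method used above: the sample string is copied from three of the records in the question's output, `parse_flat` is a made-up helper name, and it assumes the timestamp is always 13 digits and that JPY-quoted pairs carry 3 decimals while all others carry 5. Those widths are inferred from the sample, so the sibling lookup on the table cells is likely the more robust route.

```python
import re

# Sample copied from the output shown in the question (three of the ten records).
flat = (
    "eur/usd14265522866931.056661.056751.056081.057911.05686"
    "usd/jpy1426552286419121.405121.409121.313121.448121.382"
    "usd/chf14265522866361.008041.008291.006551.008791.00682"
)

# One record: pair name, 13-digit timestamp (assumed), then a run of digits and dots.
RECORD = re.compile(r"([a-z]{3}/[a-z]{3})(\d{13})([\d.]+)")


def parse_flat(text):
    """Return {pair: (timestamp, [price strings])} from the flattened text."""
    rates = {}
    for pair, stamp, raw in RECORD.findall(text):
        # Assumption: JPY-quoted pairs carry 3 decimals, all others 5.
        decimals = 3 if pair.endswith("jpy") else 5
        prices = re.findall(r"\d+\.\d{%d}" % decimals, raw)
        rates[pair] = (int(stamp), prices)
    return rates


rates = parse_flat(flat)
stamp, prices = rates["usd/chf"]
print(stamp)   # 1426552286636
print(prices)  # ['1.00804', '1.00829', '1.00655', '1.00879', '1.00682']
```

The prices are kept as strings so nothing is lost to float rounding; convert with `float()` where you need numbers.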