html - How to parse code after it has been stripped of styles and elements in Python

- August 15, 2011

This is a basic question regarding HTML parsing:

I am new to Python (coding, computer science, etc.) and am teaching myself to parse HTML. I have imported both the pattern and Beautiful Soup modules to parse with. I found this code on the internet to cut out all the formatting.

    import requests
    import json
    import urllib
    from lxml import etree
    from pattern import web
    from bs4 import BeautifulSoup

    url = "http://webrates.truefx.com/rates/connect.html?f=html"
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    print(text)

This produces the following output:

EUR/USD14265522866931.056661.056751.056081.057911.05686USD/JPY1426552286419121.405121.409121.313121.448121.382GBP/USD14265522866821.482291.482361.481941.483471.48281EUR/GBP14265522865290.712790.712900.712300.713460.71273USD/CHF14265522866361.008041.008291.006551.008791.00682EUR/JPY1426552286635128.284128.296128.203128.401128.280EUR/CHF14265522866551.065121.065441.063491.066281.06418USD/CAD14265522864891.278211.278321.276831.278531.27746AUD/USD14265522864960.762610.762690.761150.764690.76412GBP/JPY1426552286682179.957179.976179.854180.077179.988
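A likely reason the output comes out as one unbroken run: the cleanup steps only split on line breaks and on double spaces, and the text extracted from the table cells apparently contains neither. The sketch below isolates those three steps in a helper (the name `clean_text` is made up for illustration) and runs them on two short sample strings built from values in the output above. It is written for Python 3 and uses no third-party modules.

```python
def clean_text(text):
    """Apply the same three cleanup steps as the snippet in the question."""
    # strip leading and trailing whitespace from each line
    lines = (line.strip() for line in text.splitlines())
    # split each line further wherever two consecutive spaces occur
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop empty chunks and join the rest with newlines
    return "\n".join(chunk for chunk in chunks if chunk)


# Sample text that does contain line breaks and double spaces:
# each value ends up on its own line.
spaced = "  EUR/USD  1426552286693 \n\n USD/CHF  1426552286636 "
print(clean_text(spaced))

# Sample text with no whitespace between values:
# there is nothing to split on, so it passes through unchanged.
fused = "USD/CHF14265522866361.008041.00829"
print(clean_text(fused))
```

So the cleanup code is working as written; the separation between values has to come from the HTML structure itself rather than from the extracted text.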

Now the point is: how can I parse this further if I want just the string 'USD/CHF' or a particular point of data?

Is there an easier method to webscrape and parse with? Any suggestions would be great!

System specs: Windows 7 64-bit, IDE: IDLE, Python: 2.7.5

Thank you in advance, Rusty

Keep it simple. Find the cell by its text (USD/CHF, for example) and get the following siblings:

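    text = 'USD/CHF'
    # locate the <td> whose text is exactly the currency pair
    cell = soup.find('td', text=text)
    # the remaining cells of that row are its following siblings
    for td in cell.next_siblings:
        print(td.text)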

Prints:

    1426561775912
    1.00
    768
    1.00
    782
    1.00655
    1.00879
    1.00682
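If you want every pair at once rather than a single lookup, one option is to collect each table row into a dictionary keyed by its first cell. The following is only a rough sketch of that idea, and it swaps in Python 3's standard-library `html.parser` instead of Beautiful Soup so that it runs without extra installs. The names `RowCollector`, `rates_by_pair` and `SAMPLE_HTML` are made up for illustration, and the sample markup is an assumption about the page's shape (one `<tr>` per pair, one `<td>` per value), filled with the values printed above.

```python
from html.parser import HTMLParser

# Assumed sample markup: one table row shaped like the page in the question.
# The cell values are the ones printed by the answer above.
SAMPLE_HTML = (
    "<table><tr>"
    "<td>USD/CHF</td><td>1426561775912</td><td>1.00</td><td>768</td>"
    "<td>1.00</td><td>782</td><td>1.00655</td><td>1.00879</td><td>1.00682</td>"
    "</tr></table>"
)


class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row being read, or None
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        # only keep text that appears inside a table cell
        if self._in_cell:
            self._row[-1] += data.strip()


def rates_by_pair(html):
    """Return {first cell of row: list of the remaining cells}."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return {row[0]: row[1:] for row in parser.rows if row}


rates = rates_by_pair(SAMPLE_HTML)
print(rates["USD/CHF"])
```

With the real page you would pass the downloaded HTML to `rates_by_pair` instead of the sample string; the values stay as strings, so convert them with `float()` or `int()` where you need numbers.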
