python - Beautiful Soup filter function fails to find all rows of a table


I am trying to parse a large HTML document using the Python Beautiful Soup 4 library.

The page contains a very large table, structured like so:

<table summary='foo'>
    <tbody>
        <tr>
            a bunch of data
        </tr>
        <tr>
            more data
        </tr>
        .
        .
        .
        100s of <tr> tags later
    </tbody>
</table>

I have a function that evaluates whether a given tag in soup.descendants is of the kind I am looking for. This is necessary because the page is large (BeautifulSoup tells me the document contains around 4000 tags). It looks like this:

def isrow(tag):
    # Match <tr> tags whose grandparent is a <table> with a summary attribute.
    if tag.name == u'tr':
        if tag.parent.parent.name == u'table' and \
                tag.parent.parent.has_attr('summary'):
            return True
    return False

My problem is that when I iterate through soup.descendants, the function only returns True for the first 77 rows in the table, even though I know the <tr> tags continue on for hundreds of rows.

Is this a problem with my function, or is there something I don't understand about how BeautifulSoup generates its collection of descendants? I suspect it might be a Python or bs4 memory issue, but I don't know how to go about troubleshooting it.

This is still more of an educated guess, but I'll give it a try.

The way BeautifulSoup parses HTML depends heavily on the underlying parser. If you don't specify one explicitly, BeautifulSoup will choose one automatically based on an internal ranking:

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.

In your case, I'd try switching parsers and see what results you get:

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "lxml")  # needs lxml installed
soup = BeautifulSoup(data, "html5lib")  # needs html5lib installed
soup = BeautifulSoup(data, "html.parser")  # uses built-in html.parser
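
To judge what each parser gives you, it may help to know how many <tr> start tags the raw markup contains, independent of any tree a parser builds. The following is only a minimal sketch using the standard library's html.parser.HTMLParser rather than Beautiful Soup; the names TagCounter and count_start_tags and the sample markup are hypothetical and stand in for the real page.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags by name as they are seen, without building a tree."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes the tag name already converted to lower case.
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_start_tags(markup, name):
    """Return how many start tags called `name` appear in `markup`."""
    counter = TagCounter()
    counter.feed(markup)
    counter.close()
    return counter.counts.get(name, 0)


# Hypothetical sample standing in for the real document.
sample = """
<table summary='foo'>
  <tbody>
    <tr><td>a</td></tr>
    <tr><td>b</td></tr>
    <tr><td>c</td></tr>
  </tbody>
</table>
"""

print(count_start_tags(sample, "tr"))
```

If this count is in the hundreds for the real page while isrow only matches 77 rows, that would suggest the rows are present in the markup and the difference comes from how the chosen parser built the tree, not from a memory limit.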
