Python - Beautiful Soup filter function fails to find all rows of a table
I am trying to parse a large HTML document using the Python Beautiful Soup 4 library.
The page contains a very large table, structured like so:
    <table summary='foo'>
      <tbody>
        <tr> bunch of data </tr>
        <tr> more data </tr>
        . . . 100s of <tr> tags later
      </tbody>
    </table>
I have a function that evaluates whether a given tag in soup.descendants is of the kind I am looking for. This is necessary because the page is large (BeautifulSoup tells me the document contains around 4000 tags). It looks like so:
    def isrow(tag):
        # Match <tr> tags whose grandparent is a <table> with a summary attribute.
        if tag.name == u'tr':
            if tag.parent.parent.name == u'table' and \
               tag.parent.parent.has_attr('summary'):
                return True
My problem is that when I iterate through soup.descendants, the function only returns True for the first 77 rows in the table, when I know that the <tr> tags continue on for hundreds of rows.
Is this a problem with my function, or is there something I don't understand about how BeautifulSoup generates its collection of descendants? I suspect it might be a Python or bs4 memory issue, but I don't know how to go about troubleshooting it.
This is still more of an educated guess, but I'll give it a try.
The way BeautifulSoup parses HTML depends heavily on the underlying parser. If you don't specify one explicitly, BeautifulSoup will choose one automatically based on an internal ranking:
"If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser."
In your case, I'd try switching parsers and see what results you get:
    soup = BeautifulSoup(data, "lxml")         # needs lxml installed
    soup = BeautifulSoup(data, "html5lib")     # needs html5lib installed
    soup = BeautifulSoup(data, "html.parser")  # uses the built-in html.parser
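To make the comparison concrete, here is a minimal sketch that counts the matching rows under each parser. It is only an illustration: the `data` string is a tiny hypothetical stand-in for the real page, `is_row` and `count_rows` are names made up for this example, and `is_row` is a slightly more defensive variant of the question's filter rather than the original function. Parsers that are not installed are skipped by catching `FeatureNotFound`.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for the real page; replace with the actual HTML string.
data = """
<table summary="foo">
  <tbody>
    <tr><td>row 1</td></tr>
    <tr><td>row 2</td></tr>
    <tr><td>row 3</td></tr>
  </tbody>
</table>
"""


def is_row(tag):
    """Return True for <tr> tags whose grandparent is a <table> with a summary attribute."""
    if tag.name != "tr":
        return False
    # Guard against rows that sit too close to the document root.
    grandparent = tag.parent.parent if tag.parent is not None else None
    return (
        grandparent is not None
        and grandparent.name == "table"
        and grandparent.has_attr("summary")
    )


def count_rows(html, parser):
    """Parse html with the named parser and count the tags accepted by is_row."""
    soup = BeautifulSoup(html, parser)
    # find_all accepts a function and calls it on every tag in the document.
    return len(soup.find_all(is_row))


row_counts = {}
for parser in ("lxml", "html5lib", "html.parser"):
    try:
        row_counts[parser] = count_rows(data, parser)
    except FeatureNotFound:
        # The requested parser library is not installed; record that and move on.
        row_counts[parser] = None
    print(parser, row_counts[parser])
```

If the counts differ between parsers on the real document, that would suggest the markup is malformed somewhere around the point where the rows stop matching, and that each parser is repairing it differently.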