python - Heroku Sporadic High Response Time
This is a specific problem, so I will try to keep it brief:
We are running a Django app on Heroku across 3 servers:
- test (1 web, 1 celery dyno)
- training (1 web, 1 celery dyno)
- prod (2 web, 1 celery dyno).
We are using Gunicorn with gevent workers, 4 workers on each dyno.
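For context, here is a minimal sketch of roughly what that configuration looks like; the module path myapp.wsgi and the config file layout are illustrative assumptions, not our exact setup:

    # gunicorn.conf.py -- approximate worker setup (gevent, 4 workers per dyno)
    worker_class = "gevent"   # async workers backed by gevent greenlets
    workers = 4               # 4 workers on each dyno
    timeout = 30              # Gunicorn's default, matching Heroku's 30s router timeout

    # Procfile entry (hypothetical module name):
    # web: gunicorn myapp.wsgi:application --config gunicorn.conf.py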
We are experiencing sporadic high service times. Here is an example from Logentries:
    High Response Time: heroku router - - at=info method=GET path="/accounts/login/" dyno=web.1 connect=1ms service=6880ms status=200 bytes=3562
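To surface these alerts from a log drain, a rough sketch like the one below works; the regexes simply match the router line format above, and the 1000ms threshold is an arbitrary choice for illustration:

    # flag_slow.py -- scan Heroku router log lines and flag slow service times
    import re
    import sys

    SERVICE_RE = re.compile(r'service=(\d+)ms')
    PATH_RE = re.compile(r'path="([^"]*)"')
    THRESHOLD_MS = 1000  # arbitrary alert threshold for this sketch

    for line in sys.stdin:
        if 'heroku router' not in line:
            continue
        service = SERVICE_RE.search(line)
        if service and int(service.group(1)) > THRESHOLD_MS:
            path = PATH_RE.search(line)
            print(f'slow request: {path.group(1) if path else "?"} took {service.group(1)}ms')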
I have been googling this for weeks now. I am unable to reproduce the issue at will, but I experience these alerts 0 to 5 times a day. Notable points:
- Occurs on all 3 apps (all running similar code)
- Occurs on different pages, including simple ones such as the 404 page and /admin
- Occurs at random times
- Occurs with varying throughput. One of our instances only drives 3 users/day. It is not related to sleeping dynos, because we ping with New Relic and the issue can occur mid-session
- Unable to reproduce at will. I have personally experienced the issue only once: clicking a page that regularly executes in 500ms resulted in a 30-second delay and finally the app error screen from Heroku's 30s timeout
- High response times vary from 5000ms to 30000ms.
- New Relic does not point to a specific issue. Here are the past few slow transactions and their times:
  - RegexURLResolver.resolve: 4,270ms
  - SessionMiddleware.process_request: 2,750ms
  - Render login.html: 1,230ms
  - WSGIHandler: 1,390ms
- The above are simple calls and do not ordinarily take anywhere near that amount of time
What I have narrowed it down to:
- This article on Gunicorn and slow clients: I have seen this issue happen to slow clients, but we have a fiber connection at our office and it still occurs.
- Gevent and async workers not playing nicely: we've switched to Gunicorn sync workers and the problem still persists (a sketch of that fallback config follows this list).
- Gunicorn worker timeout: it's possible that workers are somehow being kept alive in a null state.
- Insufficient workers / dynos: there is no indication of CPU/memory/DB overutilization, and New Relic doesn't show any sign of DB latency.
- Noisy neighbors: among multiple emails with Heroku, a support rep mentioned that at least one of my long requests was due to a noisy neighbor, but I am not convinced that is the issue.
- Subdomain 301: the requests are coming through fine but getting stuck randomly in the application.
- Dynos restarting: if that were the case, many users would be affected. Also, I can see that our dynos have not restarted recently.
- Heroku routing / service issue: it is possible the Heroku service is less than advertised and this is simply a downside of using the service.
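For completeness, the sync-worker fallback mentioned above looked roughly like this; a sketch that assumes Gunicorn's defaults are otherwise kept, not our exact config:

    # gunicorn.conf.py -- sync-worker fallback used while ruling out gevent
    worker_class = "sync"  # default blocking workers, one request at a time per worker
    workers = 4            # same worker count as before
    timeout = 30           # a worker silent for longer than this is killed and restarted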
We have been having this issue for the past few months, but now that we are scaling it needs to be fixed. Any ideas would be much appreciated, as I have exhausted nearly every SO and Google link.
I have been in contact with the Heroku support team over the past 6 months. It has been a long period of narrowing down through trial and error, but we have identified the problem.
I eventually noticed that these high response times corresponded with a sudden memory swap, and even though I was paying for a Standard dyno (which does not idle), these memory swaps were taking place when my app had not received traffic recently. It was also clear from looking at the metrics charts that this was not a memory leak, because the memory would plateau off.
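One way to watch for this correlation yourself, assuming Heroku's log-runtime-metrics labs feature is enabled (heroku labs:enable log-runtime-metrics), is to flag any non-zero memory_swap samples in the log stream; a rough sketch:

    # watch_swap.py -- flag dyno memory-swap samples emitted by log-runtime-metrics
    # assumes lines like: source=web.1 ... sample#memory_rss=412.45MB sample#memory_swap=12.30MB
    import re
    import sys

    SWAP_RE = re.compile(r'sample#memory_swap=([\d.]+)MB')
    SOURCE_RE = re.compile(r'source=(\S+)')

    for line in sys.stdin:
        swap = SWAP_RE.search(line)
        if swap and float(swap.group(1)) > 0:
            source = SOURCE_RE.search(line)
            print(f'{source.group(1) if source else "?"} is swapping: {swap.group(1)}MB')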
After many discussions with the support team, they provided this explanation:
Essentially, what happens is that the backend runtimes end up with a combination of applications that use enough memory that the runtime has to swap. When that happens, a random set of dyno containers on the runtime are forced to swap arbitrarily by small amounts (note that "random" here is likely containers whose memory hasn't been accessed recently but is still resident in memory). At the same time, the apps that are using large amounts of memory end up swapping heavily, which causes more iowait on the runtime than normal.
We haven't changed how tightly we pack runtimes at all since this issue started becoming more apparent, so our current hypothesis is that the issue may be coming from customers moving from versions of Ruby prior to 2.1 to 2.1+. Ruby makes up a huge percentage of the applications running on our platform, and Ruby 2.1 made changes to its GC that trade memory usage for speed (essentially, it GCs less frequently to get speed gains). This results in a notable increase in memory usage for any application moving from older versions of Ruby. As such, the same number of Ruby apps that maintained a certain memory usage level before would now start requiring more memory.
That phenomenon, combined with misbehaving applications that abuse resources on the platform, hit a tipping point that got us to the situation we see now, where dynos that shouldn't be swapping are. We have a few avenues of attack we're looking into, but for now a lot of the above is still a bit speculative. We do know for sure that some of it is being caused by resource-abusive applications, and that's why moving to Performance-M or Performance-L dynos (which have dedicated backend runtimes) shouldn't exhibit the problem. The only memory usage on those dynos will be your application's. So, if there's swap, it'll be because your application is causing it.
I am confident this is the issue that I and others have been experiencing, as it is related to the architecture itself and not to any combination of language, framework, or configuration.
There doesn't seem to be a solution other than A) toughing it out and waiting it out, or B) switching to one of the dedicated instances.
I am aware of the crowd that says "this is why you should use AWS", but I find that the benefits Heroku offers outweigh the occasional high response times, and the pricing has gotten better over the years. If you are suffering from the same issue, the "best solution" will come down to your own choice. I will update this answer when I hear more.
Good luck!