Scraping Twitter with Python
About
Scraping publicly accessible data on Twitter is allowed in the United States, meaning that if you can see some data on Twitter without logging in, you are allowed to scrape it. That doesn't necessarily mean that Twitter will make it easy, or that scraping the website is the best way to source the data. Below is a practical guide to scraping Twitter with Python. For this example we'll search for mentions of URLs and domains on Twitter, but since this uses the generic search functionality you can replace the terms with whatever suits you.
Why Scrape?
Twitter does provide an API to their data, and I would definitely recommend using it (v2 at the time of writing) if possible to get the data you need. In some cases, though, the scope of your project may push you beyond the API's limits. The main issue with the Twitter API v2 recent search in my case is that it only returns tweets from the last 7 days (as per their migration documentation). For example, my product needs to scrape tweets for 100k search terms, and many of them may have data that is older than 7 days. In that case web scraping may be a better option, especially to seed initial data. I would personally recommend using the API for as much data as possible, especially if it fits your needs, as the scraping solution is potentially flaky due to the way tweets are loaded on the page with JavaScript.
Using the API
You'll need a Twitter account to sign up for a developer account, which is quite easy to do through the developer portal. Once you navigate to the developer portal and enter an app name, you'll be given your keys immediately. Copy all three keys down somewhere safe. You can then use sample code like that shown below, taken from https://github.com/twitterdev/Twitter-API-v2-sample-code/blob/0d1587147d3bf8338b5fea2b4e5bb56d37f5c2b6/Recent-Search/recent_search.py.
import requests
import os
import json

# To set your environment variables in your terminal run the following line:
# export 'BEARER_TOKEN'='<your_bearer_token>'
bearer_token = os.environ.get("BEARER_TOKEN")

search_url = "https://api.twitter.com/2/tweets/search/recent"

# Optional params: start_time,end_time,since_id,until_id,max_results,next_token,
# expansions,tweet.fields,media.fields,poll.fields,place.fields,user.fields
query_params = {'query': '(from:twitterdev -is:retweet) OR #twitterdev','tweet.fields': 'author_id'}


def bearer_oauth(r):
    """
    Method required by bearer token authentication.
    """
    r.headers["Authorization"] = f"Bearer {bearer_token}"
    r.headers["User-Agent"] = "v2RecentSearchPython"
    return r


def connect_to_endpoint(url, params):
    response = requests.get(url, auth=bearer_oauth, params=params)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()


def main():
    json_response = connect_to_endpoint(search_url, query_params)
    print(json.dumps(json_response, indent=4, sort_keys=True))


if __name__ == "__main__":
    main()
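The sample above fetches a single page of results. The optional params comment mentions next_token, which is how the recent search endpoint pages through results: each response carries a meta.next_token value while more pages remain. The following is a minimal sketch of that loop, not part of the official sample. The names iter_recent_tweets, fake_fetch and _FAKE_PAGES are hypothetical, and the fake fetcher stands in for the real endpoint so the sketch runs offline.

```python
from typing import Callable, Iterator, Optional


def iter_recent_tweets(
    fetch_page: Callable[[dict], dict],
    base_params: dict,
    max_pages: int = 5,
) -> Iterator[dict]:
    """Yield tweets page by page, following meta.next_token until it runs out."""
    next_token: Optional[str] = None
    for _ in range(max_pages):
        params = dict(base_params)  # copy so the caller's dict is not mutated
        if next_token:
            params["next_token"] = next_token
        page = fetch_page(params)
        yield from page.get("data", [])
        next_token = page.get("meta", {}).get("next_token")
        if not next_token:
            break  # no further pages


# Hypothetical stand-in for the real endpoint, keyed by the token requested.
_FAKE_PAGES = {
    None: {"data": [{"id": "1"}, {"id": "2"}], "meta": {"next_token": "abc"}},
    "abc": {"data": [{"id": "3"}], "meta": {}},
}


def fake_fetch(params: dict) -> dict:
    return _FAKE_PAGES[params.get("next_token")]


tweet_ids = [t["id"] for t in iter_recent_tweets(fake_fetch, {"query": "google.com"})]
print(tweet_ids)  # ['1', '2', '3']
```

With the sample code, fetch_page could be something like lambda p: connect_to_endpoint(search_url, p). The max_pages cap is there to keep a runaway query from burning through your rate limit.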
Using Scraping Bee
Why Use A Scraping Service
I've been using a scraping service for one of my other Plugin Factory Limited projects, Sose.app. A web scraping service abstracts away a lot of the hard problems of scaling up web scraping. Gone are the days when you could use requests and BeautifulSoup to get whatever you need; many sites are now quite difficult to scrape without running into captchas or being blocked. Scraping Bee manages proxy rotation and makes it easy to get the data you need without getting blocked during the scraping process.
Scraping Bee is one of the easiest services to use and is what I have been using for Sose.app. Their stealth proxies make it easy to scrape data that is otherwise unreachable to me, and their Google API makes it easy to scrape Google search data that is otherwise oddly structured and annoying to extract. If your core business is scraping, it may be worth investigating how to handle these problems on your own; for most people, especially while getting started, those kinds of problems aren't worth tackling.
Furthermore, since 2020 Twitter only allows clients with JavaScript enabled, so at the very least you'll need a headless browser. Scraping Bee manages this for you behind their API, but if you want to do this yourself I would recommend Puppeteer.
Using Scraping Bee with Twitter
Scraping Bee is a service, so this all assumes you've signed up and have an API key available. One thing I include in this script is retries with exponential backoff, a feature that is missing from Scraping Bee. Note that Scraping Bee does not charge you credits for 4XX or 5XX requests. Before you use this, install the Scraping Bee client with pip3 install scrapingbee and set the SCRAPING_BEE_API_KEY variable to your API key for Scraping Bee. You can also find this script on GitHub at https://gist.github.com/ameerkat/0b218d3552b6be47fa3bccdf43d2001b.
from scrapingbee import ScrapingBeeClient
import time
import logging
import json

SCRAPING_BEE_API_KEY = "YOUR_API_KEY"  # replace with your API key


class ScrappingBeeClientWrapper:
    """Wraps ScrapingBeeClient to add retries with exponential backoff."""

    def __init__(self, client, client_config):
        self.client = client
        self.client_config = client_config

    def get(self, url, params=None):
        max_retries = self.client_config["max_retries"]
        retry_delay = self.client_config["retry_delay_ms"] / 1000.0
        response = None  # stays None if every attempt raises
        for i in range(max_retries):
            try:
                response = self.client.get(url, params=params or {})
                if response.ok:
                    return response
            except Exception as e:
                logging.error("Woah! That request failed with:")
                logging.error(e)
            # Sleep before the next attempt, but not after the last one.
            if i != max_retries - 1:
                time.sleep(retry_delay)
                retry_delay *= self.client_config["retry_delay_growth_factor"]
        if response is None:
            raise RuntimeError(f"All {max_retries} attempts to fetch {url} raised an exception")
        # Retries exhausted: hand back the last (unsuccessful) response.
        return response


client = ScrappingBeeClientWrapper(ScrapingBeeClient(api_key=SCRAPING_BEE_API_KEY), {
    "max_retries": 5,
    "retry_delay_ms": 2000,
    "retry_delay_growth_factor": 2  # set to 1 to have delay be static
})

search_term = "google.com"
target_url = f"https://twitter.com/search?q={search_term}&src=typed_query&f=live"
tweet_response = client.get(target_url, params = {
    'render_js': 'True',
    'window_height': 4320,
    'wait': 5000,
    # The JS scenario here is quite tricky as the site only keeps the
    # last X tweets in the DOM. You have to capture the data, then
    # scroll then capture the next chunk almost tweet by tweet. Our
    # samples could actually be quite small though.
    # 'js_scenario': {
    #     "instructions": [
    #         # scroll and wait and scroll and wait if possible to load
    #         # latest tweets. Figuring out when to stop scrolling
    #         # can be a little tricky. We might want to use frequency
    #         # to estimate based on the sample we get.
    #     ]
    # },
    'extract_rules': {
        "tweets": {
            "selector": "article[data-testid='tweet']",
            "type": "list",
            "output": {
                "handle": "div[data-testid='User-Names'] a[tabindex='-1'] span",
                "permalink": {
                    "selector": "div[data-testid='User-Names'] a[dir='auto']",
                    "output": "@href"
                },
                "time": {
                    "selector": "div[data-testid='User-Names'] time",
                    "output": "@datetime"
                },
                "text": "div[data-testid='tweetText']",
                "replies": "div[data-testid='reply']",
                "retweets": "div[data-testid='retweet']",
                "likes": "div[data-testid='like']"
            }
        }
    }
})

if tweet_response.status_code != 200:
    print(f"Failed to get Twitter search page ({target_url}) with response code {tweet_response.status_code}")
    print(tweet_response.content)
    # Stop here: an error body will not contain the extracted "tweets" key.
    raise SystemExit(1)

json_result = json.loads(tweet_response.content)

# This is optional, if you think something is wrong with the code above for example and you aren't getting
# the same output as you expect, try running this.
if not json_result["tweets"]:
    print("Failed to find any tweets. Check screenshot to see if page loaded correctly.")
    screenshot_response = client.get(target_url, params = {
        'render_js': 'True',  # they've changed it to have some redirect
        'window_height': 4320,
        'timeout': 20000,
        'wait': 5000,
        'screenshot': True
    })
    if not screenshot_response.ok:
        logging.warning(f"Failed to get a screenshot of the target page {target_url}. {screenshot_response.content}")
    else:
        logging.warning("Writing screenshot to file.")
        target_file = "./twitter.png"
        try:
            with open(target_file, "wb") as f:
                f.write(screenshot_response.content)
            logging.warning(f"Wrote screenshot to file {target_file}")
        except Exception as e:
            logging.error(f"Failed to write screenshot due to exception {e}.")
else:
    # do something with the response
    print(json.dumps(json_result, indent=2))
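
As a side note on the retry settings in the script above, it can help to see how long the wrapper may end up sleeping in the worst case. This is a small standalone sketch of the wait schedule those settings produce; backoff_delays is a hypothetical helper and is not part of the script or the Scraping Bee client.

```python
def backoff_delays(max_retries: int, retry_delay_ms: int, growth_factor: float) -> list:
    """Seconds slept between attempts; no sleep follows the final attempt."""
    delay = retry_delay_ms / 1000.0
    delays = []
    for _ in range(max_retries - 1):
        delays.append(delay)
        delay *= growth_factor
    return delays


# Same values as the client_config passed to the wrapper.
delays = backoff_delays(max_retries=5, retry_delay_ms=2000, growth_factor=2)
print(delays)       # [2.0, 4.0, 8.0, 16.0]
print(sum(delays))  # 30.0
```

So with these settings a request that never succeeds spends about 30 seconds sleeping, on top of the time each attempt itself takes.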
Example output
{
"tweets": [
{
"handle": "@bizcommunityit",
"permalink": "",
"time": "2022-12-31T17:07:35.000Z",
"text": "Il diesel servito sopra i 2 euro: ecco tutti i rincari che scattano con il 2023 (compresi gas e accise)",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@abbassakr1970",
"permalink": "",
"time": "2022-12-31T17:07:31.000Z",
"text": "\u0634\u062c\u0631\u064a\u0627\u0646 \u0631\u0628\u0646\u0627 - Google Search",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@PundaChen",
"permalink": "",
"time": "2022-12-31T17:07:31.000Z",
"text": "",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@igorsushko",
"permalink": "",
"time": "2022-12-31T17:07:30.000Z",
"text": "You can download and share this video of #Reznikov addressing the Russian people with every Russian you know:",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@kondoucoffee",
"permalink": "",
"time": "2022-12-31T17:07:30.000Z",
"text": "\u3053\u3093\u3069\u3046\u30b3\u30fc\u30d2\u30fc(Google \u3067\u306e\u6295\u7a3f):",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@Tremendos30ok",
"permalink": "",
"time": "2022-12-31T17:07:30.000Z",
"text": "Escucha la mejor Radio, b\u00e1jate la AApp : #EnElAire estamos escuchando Tremendos Treinta Radio, las 24 hs la mejor m\u00fasica y la mejor programaci\u00f3n... Lo escuchas as\u00ed : AApp : https://play.google.com/store/apps/details?id=com.app.radiotremendostreinta\u2026",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@ayeyeyalode",
"permalink": "",
"time": "2022-12-31T17:07:27.000Z",
"text": "recomendo",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@miIkyney",
"permalink": "",
"time": "2022-12-31T17:07:26.000Z",
"text": "happy new year from me and @lakcyah new year new bph saatnya join projects http://bit.ly/OprecBE2023",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@xka4cia",
"permalink": "",
"time": "2022-12-31T17:07:24.000Z",
"text": "\u30d7\u30fc\u30c1\u30f3\u5927\u7d71\u9818 \u65b0\u5e74\u6f14\u8aac\u3067\u30a6\u30af\u30e9\u30a4\u30ca\u653b\u6483\u7d99\u7d9a\u3092\u5f37\u8abf(\u30c6\u30ec\u30d3\u671d\u65e5 ... - Yahoo!\u30cb\u30e5\u30fc\u30b9",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@rongibs90229001",
"permalink": "",
"time": "2022-12-31T17:07:23.000Z",
"text": "Still get ready, procrastinate no longer",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@rongibs90229001",
"permalink": "",
"time": "2022-12-31T17:07:22.000Z",
"text": "@sharadrahirril Zantac MDL2924 was dismissed, it will be appealed, this does not mean you can not file a claim",
"replies": "1",
"retweets": "",
"likes": ""
},
{
"handle": "@ArsyMaulana13",
"permalink": "",
"time": "2022-12-31T17:07:21.000Z",
"text": "Download this free cloud based mining app: https://play.google.com/store/apps/details?id=com.remint2.app\u2026 Use my referral code: Z1XR6GBH Get 10 free coins now!",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@jabbarshahid63",
"permalink": "",
"time": "2022-12-31T17:07:20.000Z",
"text": "https://images.app.goo.gl/sTJWT3r9BNvoGM5V7\u2026",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@AIIAmericanSupe",
"permalink": "",
"time": "2022-12-31T17:07:20.000Z",
"text": "HOMELANDER: WRATH OF THE EAGLE https://docs.google.com/document/d/1rfdNZZL6U8v5tq-opnPoQ3b_HMHuP10cCTQQ1cI-E08/edit?usp=drivesdk\u2026",
"replies": "",
"retweets": "1",
"likes": "1"
},
{
"handle": "@kenyatta_jomo",
"permalink": "",
"time": "2022-12-31T17:07:20.000Z",
"text": "",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@whereistheiss",
"permalink": "",
"time": "2022-12-31T17:07:18.000Z",
"text": "The International Space Station was passing over South Atlantic Ocean on Sat Dec 31 2022 17:07:16 GMT+0000",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@ce_Hazem",
"permalink": "",
"time": "2022-12-31T17:07:15.000Z",
"text": "\u0645\u063a\u0627\u0645\u0631\u0629 \u0627\u0644\u0639\u0642\u0644 \u0627\u0644\u0623\u0648\u0644\u0649 - \u062f\u0631\u0627\u0633\u0629 \u0641\u064a \u0627\u0644\u0623\u0633\u0637\u0648\u0631\u0629\n\u0641\u0631\u0627\u0633 \u0627\u0644\u0633\u0648\u0627\u062d",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@TradeInves",
"permalink": "",
"time": "2022-12-31T17:07:11.000Z",
"text": "Tawang kan memang daerah rendah, di sebrangnya saja ada Polder Tawang, mungkin perlu di keduk lebih dalam agar mampu tampung air lebih banyak.",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@Charlotte_de_R",
"permalink": "",
"time": "2022-12-31T17:07:06.000Z",
"text": "\u30c1\u30a7\u30fc\u30f3\u30ea\u30f3\u30af\u30a6\u30a3\u30fc\u30af\u30ea\u30fc\u30e9\u30a6\u30f3\u30c9\u30a2\u30c3\u30d7:\u30d6\u30eb\u30fc\u30d9\u30ea\u30fc\u3001\u30ab\u30b5\u30b5\u30ae\u3001\u30ac\u30ea\u30ec\u30aa\u3001\u30d2\u30e5\u30fc\u30ba\u30b4\u30fc\u30eb\u30c9 https://translate.google.com/translate?hl=&sl=auto&tl=ja&u=https://www.bsc.news/post/chainlink-weekly-roundup-blueberry-magpie-galileo-fuse-gold\u2026",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@ICHILLINFILES",
"permalink": "",
"time": "2022-12-31T17:07:03.000Z",
"text": "ICHILLIN' Draw and La Luna have been nominated for NuguPromoter's top song of 2022. Vote for the girls on the Google Form https://docs.google.com/forms/d/e/1FAIpQLSeFKgZg2WhjHgEwABC_kWsOUsLFDVjZoeBZs1D9PLUS3ULaSQ/viewform?usp=sf_link\u2026 #ICHILLIN #\uc544\uc774\uce60\ub9b0 @ichillin_km @I_m_chillin",
"replies": "",
"retweets": "2",
"likes": "3"
},
{
"handle": "@papercliff_api",
"permalink": "",
"time": "2022-12-31T17:07:03.000Z",
"text": "anniversary \u00b7 crowds \u00b7 fatah \u00b7 gaza \u00b7 palestinians https://news.google.com/search?q=anniversary+crowds+fatah+gaza+palestinians\u2026",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@harrisburgers",
"permalink": "",
"time": "2022-12-31T17:07:01.000Z",
"text": "#Trending: New partnership will help expand broadband service to more areas in Pennsylvania - WHP Harrisburg #CumberlandCounty #PA #Pennsylvania Read More Here:",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@WSshaaban",
"permalink": "",
"time": "2022-12-31T17:07:01.000Z",
"text": "\u062d\u0633\u0627\u0628 \u0627\u0644\u0634\u064a\u062e: @khald_abo \u0642\u0646\u0627\u0629 \u0627\u0644\u064a\u0648\u062a\u064a\u0648\u0628: https://youtube.com/@wsshaaban2094 \u0642\u0646\u0627\u0629 \u0627\u0644\u0648\u0627\u062a\u0633 \u0627\u0628: https://chat.whatsapp.com/JTr7E3O2uFI3BAmyA0X5AZ\u2026 \u0642\u0646\u0627\u0629 \u0627\u0644\u062a\u0644\u062c\u0631\u0627\u0645: https://t.me/haila_alwaled \u0645\u0624\u0644\u0641\u0627\u062a \u0627\u0644\u0634\u064a\u062e: https://drive.google.com/drive/mobile/folders/1-1jwBKZxjR3sXYzPianCtKYckKtbhAOl\u2026",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@2littlewings",
"permalink": "",
"time": "2022-12-31T17:06:58.000Z",
"text": "We are collecting #arttherapy quotations for @ArtTherapyDUC ! Please suggest your favorites! https://forms.gle/cTvTKpDa8j8MFGSp7\u2026",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@LucyFlorees",
"permalink": "",
"time": "2022-12-31T17:06:57.000Z",
"text": "https://google.com/search?q=fritz+meinecke+rechts&rlz=1C1VDKB_deDE1002DE1002&oq=fritz+meinecke+rechts&aqs=chrome..69i57.13864j0j7&sourceid=chrome&ie=UTF-8#fpstate=ive&vld=cid:56351e08,vid:Sk7e4_yXBuU\u2026 https://augengeradeaus.net/2022/01/bundeswehr-entliess-seit-2016-mehr-als-200-rechtsextremisten/\u2026 hier f\u00fcr dich, ist auch manchmal sehr schwer Hintergr\u00fcnde nachvollziehen zu k\u00f6nnen. Wer 2022 nach so vielen kritischen Aussagen als Waffenliebhaber und ehemaliger Soldat zu den b\u00f6hsen onkelz mit dem rechten arm gehoben singt",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@TeamYouTube",
"permalink": "",
"time": "2022-12-31T17:06:56.000Z",
"text": "Hallo! Du solltest auch eine E-Mail mit den genauen Informationen bekommen haben. Hier findest du alle genauen Informationen zum Shortsfund: https://goo.gle/3Q9Mb0z",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@idino_u",
"permalink": "",
"time": "2022-12-31T17:06:47.000Z",
"text": "\ua51b \ud835\ude77\ud835\ude70\ud835\ude7f\ud835\ude7f\ud835\ude88 \ud835\ude7d\ud835\ude74\ud835\ude86 \ud835\ude88\ud835\ude74\ud835\ude70\ud835\ude81 \ud835\udff8\ud835\udff6\ud835\udff8\ud835\udff9 \u0e02\u0e2d\u0e1a\u0e04\u0e38\u0e13\u0e2a\u0e33\u0e2b\u0e23\u0e31\u0e1a\u0e1b\u0e35\u0e17\u0e35\u0e48\u0e1c\u0e48\u0e32\u0e19\u0e21\u0e32\u0e19\u0e30\u0e04\u0e30 \u0e41\u0e25\u0e49\u0e27\u0e1b\u0e35\u0e19\u0e35\u0e49\u0e21\u0e32\u0e2a\u0e23\u0e49\u0e32\u0e07\u0e04\u0e27\u0e32\u0e21\u0e17\u0e23\u0e07\u0e08\u0e33\u0e14\u0e35\u0e46 \u0e23\u0e48\u0e27\u0e21\u0e01\u0e31\u0e19\u0e2d\u0e35\u0e01\u0e40\u0e22\u0e49\u0e2d\u0e30\u0e40\u0e22\u0e2d\u0e30\u0e40\u0e25\u0e22\u0e19\u0e49\u0e32\u0e32\u0e32 \u2014 photo booth frame Link : https://drive.google.com/drive/folders/11FuUpH1er_yi3llzX2RgzwNqVkbPoxkc\u2026 \u2661 personal use only #\u0e15\u0e49\u0e32\u0e2b\u0e4c\u0e2d\u0e39\u0e4b #oueiija #LetsStart2023WithDAOU #\u0e41\u0e08\u0e01png",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@RannySnice",
"permalink": "",
"time": "2022-12-31T17:06:46.000Z",
"text": "#HappyNewYear2023 everyone!This #wallpaper is a request from a personal friend and than by chance I came across a post about a #MiSTerFPGA core in active development for NARC by @pr4m0d! Full Res: https://drive.google.com/file/d/19GOodcjsaYqURPj6XtVlUETAlmX8bV0T/view?usp=share_link\u2026 @RetroDriven's Script https://github.com/RetroDriven/MiSTerWallpapers\u2026",
"replies": "",
"retweets": "1",
"likes": ""
},
{
"handle": "@srvfpd_sanramon",
"permalink": "",
"time": "2022-12-31T17:06:44.000Z",
"text": "Hazardous Condition, 2670 COREY PL, SAN RAMON (12/31/2022 9:05:36 AM)",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@srvfpd_xmed",
"permalink": "",
"time": "2022-12-31T17:06:43.000Z",
"text": "Hazardous Condition, 2670 COREY PL, SAN RAMON (12/31/2022 9:05:36 AM)",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@srvfpd",
"permalink": "",
"time": "2022-12-31T17:06:43.000Z",
"text": "Hazardous Condition, 2670 COREY PL, SAN RAMON (12/31/2022 9:05:36 AM)",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@talkwithjio",
"permalink": "",
"time": "2022-12-31T17:06:40.000Z",
"text": "start 22k/bulan, no vpn + no renew FULL GARANSI Payment : Spay, Dana, Gopay, QRIS Testi : http://tiny.cc/warungjio Wa : https://bit.ly/3FvCXIt",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@tandtmichael",
"permalink": "",
"time": "2022-12-31T17:06:39.000Z",
"text": "RUN into ATT @ShopTysons they have the new #iphone IN STOCK #Bestdeals #BestNetwork no wait. #SwitchToday and Save. AT&T Store (703) 834-2530",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@diclofenace",
"permalink": "",
"time": "2022-12-31T17:06:38.000Z",
"text": "@ASIGoI Found this on some rocks in chikhaldara, Amravati Maharashtra are those significant",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@MikaxTaki",
"permalink": "",
"time": "2022-12-31T17:06:38.000Z",
"text": "uploaded the performance here :) 1080p https://drive.google.com/drive/u/5/folders/1LLu39uMNcAZJBCobqSUNOTF6apcVYRtD\u2026",
"replies": "2",
"retweets": "3",
"likes": "6"
},
{
"handle": "@renneboog_eva",
"permalink": "",
"time": "2022-12-31T17:06:37.000Z",
"text": "https://google.com/search?q&tbm=isch&ictx=1&tbs=rimg:CU75RV-3hZpzIghO-UVft4WacyoSCU75RV-3hZpzEarWMWg8oeJb#imgrc=YWt6zWUa-PeHSM\u2026",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@Stock_Market_Pr",
"permalink": "",
"time": "2022-12-31T17:06:37.000Z",
"text": "Mark Moss Predicts Regulatory Shakeup and End of #Crypto Bull ... - #Bitcoin News https://news.google.com/__i/rss/rd/articles/CBMifGh0dHBzOi8vbmV3cy5iaXRjb2luLmNvbS9tYXJrLW1vc3MtcHJlZGljdHMtcmVndWxhdG9yeS1zaGFrZXVwLWFuZC1lbmQtb2YtY3J5cHRvLWJ1bGwtcnVucy1idXQtYmVsaWV2ZXMtYml0Y29pbi13aWxsLWVuZHVyZS_SAQA?oc=5&utm_source=dlvr.it&utm_medium=twitter\u2026",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@EJMantillo",
"permalink": "",
"time": "2022-12-31T17:06:34.000Z",
"text": "NASA in 2023: A Look Ahead - NASA",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@Stock_Market_Pr",
"permalink": "",
"time": "2022-12-31T17:06:34.000Z",
"text": "Ex-Meta #Crypto head expects #Crypto winter to drag through 2024 - #CryptoSlate https://news.google.com/__i/rss/rd/articles/CBMiV2h0dHBzOi8vY3J5cHRvc2xhdGUuY29tL2V4LW1ldGEtY3J5cHRvLWhlYWQtZXhwZWN0cy1jcnlwdG8td2ludGVyLXRvLWRyYWctdGhyb3VnaC0yMDI0L9IBXWh0dHBzOi8vY3J5cHRvc2xhdGUuY29tL2V4LW1ldGEtY3J5cHRvLWhlYWQtZXhwZWN0cy1jcnlwdG8td2ludGVyLXRvLWRyYWctdGhyb3VnaC0yMDI0Lz9hbXA9MQ?oc=5&utm_source=dlvr.it&utm_medium=twitter\u2026",
"replies": "",
"retweets": "",
"likes": ""
},
{
"handle": "@EJMantillo",
"permalink": "",
"time": "2022-12-31T17:06:32.000Z",
"text": "Ombudsman indicts more public officials in 2022 - http://Philstar.com",
"replies": "",
"retweets": "",
"likes": ""
}
]
}
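In the output above every field is a string, and the reply, retweet and like counts are empty strings when nothing was rendered for them. Before storing the data it is probably worth converting these into typed values. Below is a minimal sketch of one way to do that; normalize_tweet and to_count are hypothetical names, and treating an empty count as zero is an assumption about how the page renders counts rather than something the extraction guarantees.

```python
from datetime import datetime, timezone
from typing import Optional


def to_count(value: str) -> Optional[int]:
    """Empty string -> 0 (assumed), digits -> int, anything else -> None."""
    value = value.strip()
    if not value:
        return 0
    return int(value) if value.isdigit() else None


def normalize_tweet(raw: dict) -> dict:
    """Convert one scraped tweet dict into typed values."""
    posted_at = datetime.strptime(raw["time"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return {
        "handle": raw["handle"].lstrip("@"),
        "permalink": raw["permalink"] or None,
        "time": posted_at.replace(tzinfo=timezone.utc),
        "text": raw["text"],
        "replies": to_count(raw["replies"]),
        "retweets": to_count(raw["retweets"]),
        "likes": to_count(raw["likes"]),
    }


# One entry copied from the example output.
sample = {
    "handle": "@rongibs90229001",
    "permalink": "",
    "time": "2022-12-31T17:07:22.000Z",
    "text": "@sharadrahirril Zantac MDL2924 was dismissed, it will be appealed, this does not mean you can not file a claim",
    "replies": "1",
    "retweets": "",
    "likes": "",
}

tweet = normalize_tweet(sample)
print(tweet["handle"], tweet["time"].isoformat(), tweet["replies"], tweet["likes"])
```

Counts that are not plain digits are returned as None rather than guessed at, so they are easy to spot and handle separately.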
Limitations
With the current implementation, what usually happens is that only the latest tweets are loaded (~40 in the above example, but it really depends). You might think to try scrolling the page; however, when you scroll, older tweets are removed from the DOM, meaning they won't be in the final page. You actually get fewer tweets by scrolling than by just letting the page load initially. One possible improvement is to scroll the page, hook into the scroll action, query the DOM to see whether new tweets have come into view, and store those tweets in JS. The output of the custom JavaScript execution is available in the Scraping Bee response under the evaluate_results key.
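If you do capture tweets in chunks between scrolls, consecutive chunks will most likely overlap, so they need to be merged and de-duplicated afterwards. Here is a minimal sketch of that merge step only, assuming each chunk is a list of tweet dicts shaped like the example output. How the chunks are captured is left out, and merge_tweet_chunks is a hypothetical helper. Since the permalink field came back empty in the example output, the sketch keys on handle, time and text instead.

```python
def merge_tweet_chunks(chunks):
    """Merge chunks captured between scrolls, keeping first-seen order."""
    seen = set()
    merged = []
    for chunk in chunks:
        for tweet in chunk:
            # Assumed to identify a tweet uniquely enough for this purpose.
            key = (tweet["handle"], tweet["time"], tweet["text"])
            if key not in seen:
                seen.add(key)
                merged.append(tweet)
    return merged


# Two made-up chunks that overlap on one tweet.
first_chunk = [
    {"handle": "@a", "time": "2022-12-31T17:07:35.000Z", "text": "one"},
    {"handle": "@b", "time": "2022-12-31T17:07:31.000Z", "text": "two"},
]
second_chunk = [
    {"handle": "@b", "time": "2022-12-31T17:07:31.000Z", "text": "two"},
    {"handle": "@c", "time": "2022-12-31T17:07:30.000Z", "text": "three"},
]

merged = merge_tweet_chunks([first_chunk, second_chunk])
print([t["handle"] for t in merged])  # ['@a', '@b', '@c']
```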
Monitoring
Scraping tasks can take quite a while, regardless of which tools you choose to use. Scraping tasks for Sose take upwards of 40 hours to run in some situations. Monitoring your scripts as they run to see whether errors occur can be cumbersome and a waste of time, which is why I built mon.sh. It makes it easy to get notified when scripts fail or complete, without having to babysit them or spend a bunch of time setting up monitoring. I built mon.sh literally to monitor web scraping scripts that run on my desktop while developing my other SaaS project. Using mon.sh is as simple as piping the output of your script to mon, e.g. scrape_twitter.py |& mon, and mon will handle the rest! Check out our quickstart guide to get started if you're interested.