
How to still scrape millions of tweets with twscrape

Twitter is a great place to gather data and assess trends, and many analytics teams have used it as a source for their models.

In February 2023, Twitter set unrealistic prices for its API, giving away crumbs of data for big bucks. Some started using libraries such as snscrape, which relied on Twitter's public web APIs. But in April 2023, Twitter closed that option as well, making search available only to authorized accounts.

But data can still be collected in much the same way as before by using authorized accounts.

Introduction to twscrape

Released in May 2023, twscrape is a tool for scraping Twitter data. It collects user profiles, follower and following lists, likes and retweets, as well as keyword search results.

Getting Started with twscrape

Requirements: Python 3.10 or higher

Installing twscrape

pip install twscrape

Or the development version with the latest features:

pip install git+https://github.com/vladkens/twscrape.git

Adding accounts

twscrape needs Twitter accounts to work. Each account has a fairly small API limit, after which no requests can be made through it for some time. twscrape is designed to switch accounts when one of them becomes unavailable. This way the data flow looks continuous to the user, although internally the requests come from different accounts.

Accounts can be added in two ways: via the Python API or via a CLI command. Let's use the CLI command:

# twscrape add_accounts <file_path> <line_format>
# line_format should contain "username", "password", "email", "email_password" tokens
# the token delimiter should be the same as in the file
twscrape add_accounts accounts.txt username:password:email:email_password
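
Alternatively, the same can be done from the Python API with pool.add_account; a minimal sketch (the credentials below are placeholders):

import asyncio
from twscrape import API

async def main():
    api = API()  # account data is kept in an SQLite database (accounts.db by default)
    # placeholder credentials - replace with a real account
    await api.pool.add_account("user1", "pass1", "mail1@example.com", "mail_pass1")

if __name__ == "__main__":
    asyncio.run(main())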

Note: It is possible to register new accounts or buy them on specialized websites.

You then have to go through the login procedure to get the tokens used to call the API. It is not a quick process, but it only needs to be done once after adding new accounts. The tokens are then stored in an SQLite database and reused for subsequent requests.

twscrape login_accounts

Note: Not all accounts can pass authorization because of the antifraud system. You can try logging into those accounts again later.
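
If you add accounts via the Python API, the same login step is available as pool.login_all(); a minimal sketch:

import asyncio
from twscrape import API

async def main():
    api = API()
    # attempts to log in every account that does not have a working token yet
    await api.pool.login_all()

if __name__ == "__main__":
    asyncio.run(main())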

Using twscrape

You can use twscrape in two ways:

  1. Using the CLI (terminal) and receiving JSON objects
  2. Using the Python API (useful for custom data collection scripts)

Let's get some tweet details from the CLI:

twscrape tweet_details 1674894268912087040

Result:

{
  "id": 1674894268912087000,
  "id_str": "1674894268912087040",
  "url": "https://twitter.com/elonmusk/status/1674894268912087040",
  "date": "2023-06-30 21:34:46+00:00",
  "user": {
    "id": 44196397,
    "id_str": "44196397",
    "url": "https://twitter.com/elonmusk",
    "username": "elonmusk",
    "displayname": "Elon Musk",
    "created": "2009-06-02 20:12:29+00:00",
    // ...
    "_type": "snscrape.modules.twitter.User"
  },
  "lang": "en",
  "rawContent": "This platform hit another all-time high in user-seconds last week"
  // ...
}

It’s that simple. The data format is almost the same as it was in snscrape. So if you already have some scripts to process the data, you can continue to use them without too much trouble.
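
The same request is available from the Python API via tweet_details; a minimal sketch:

import asyncio
from twscrape import API

async def main():
    api = API()
    # fetch a single tweet by its ID
    tweet = await api.tweet_details(1674894268912087040)
    if tweet is not None:
        print(tweet.id, tweet.user.username, tweet.rawContent)

if __name__ == "__main__":
    asyncio.run(main())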

Scraping tweets from a text search query with the Python API

Using the code below, we scrape up to 5,000 tweets posted between January 1, 2023 and May 31, 2023 matching the keywords "elon musk", then print each tweet's ID, author, and content to the console.

import asyncio
from twscrape import API

async def main():
    api = API()

    q = "elon musk since:2023-01-01 until:2023-05-31"
    async for tweet in api.search(q, limit=5000):
        print(tweet.id, tweet.user.username, tweet.rawContent)


if __name__ == "__main__":
    asyncio.run(main())

Execution time for the entire script can be anywhere from 5 to 10 minutes, depending on how many tweets your username or keyword query returns.
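
If you would rather collect all results into a list instead of iterating over them, twscrape also ships a gather helper; a minimal sketch:

import asyncio
from twscrape import API, gather

async def main():
    api = API()
    q = "elon musk since:2023-01-01 until:2023-05-31"
    # gather drains the async generator and returns a list of Tweet objects
    tweets = await gather(api.search(q, limit=100))
    print(len(tweets))

if __name__ == "__main__":
    asyncio.run(main())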

Working with raw API responses

If the Tweet & User objects don't provide enough data, or you want to extract more insights from the responses, there is an option to use the raw Twitter responses. Each method has a _raw version that returns the original data.

import asyncio
from twscrape import API

async def main():
    api = API()

    q = "elon musk since:2023-01-01 until:2023-05-31"
    async for rep in api.search_raw(q, limit=5000):
        # rep is httpx.Response object
        print(rep.status_code, rep.json())

if __name__ == "__main__":
    asyncio.run(main())

Or the same from the CLI:

twscrape search "elon musk since:2023-01-01 until:2023-05-31" --raw
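
Since raw responses are large and slow to re-fetch, it can be useful to persist them as you go; a minimal sketch that appends each page to a JSONL file (the file name is illustrative):

import asyncio
import json
from twscrape import API

async def main():
    api = API()
    q = "elon musk since:2023-01-01 until:2023-05-31"
    with open("raw_pages.jsonl", "a", encoding="utf-8") as fp:
        async for rep in api.search_raw(q, limit=5000):
            # one JSON document per line, for easy reprocessing later
            fp.write(json.dumps(rep.json()) + "\n")

if __name__ == "__main__":
    asyncio.run(main())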

List of available functions

twscrape covers the main read endpoints: search, tweet_details, retweeters, favoriters, user_by_id, user_by_login, followers, following, user_tweets and user_tweets_and_replies, each with a _raw counterpart. The full, up-to-date list is documented in the project's README.
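
As a quick illustration of a couple of these methods (the username is just an example):

import asyncio
from twscrape import API

async def main():
    api = API()
    user = await api.user_by_login("elonmusk")  # look up a user by handle
    async for tweet in api.user_tweets(user.id, limit=20):  # recent tweets from that user
        print(tweet.id, tweet.rawContent)

if __name__ == "__main__":
    asyncio.run(main())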

Found a bug or need a new feature? Feel free to open an issue.


More usage examples can be found on the project's GitHub page.