From May 13 to 15, 2022, SparkToro and Followerwonk conducted a rigorous joint analysis of 44058 public Twitter accounts that had been active in the previous 90 days. The accounts were randomly selected from a pool of more than 130 million recently active public profiles. The analysis found that 19.42% met a conservative definition of fake or spam accounts. Details and methods are given in the full report below.
For the past three years, SparkToro has run a free Twitter profile tool called Fake Followers. Over the past month, many media outlets and other curious parties have used the tool to analyze the followers of Elon Musk, Twitter's prospective acquirer. On Friday, Musk tweeted that his acquisition of Twitter was "on hold" because of questions about what share of Twitter's users are fake or spam accounts.
SparkToro is a small team of only three people, and Fake Followers is meant for informal, free research (the company's actual business is audience research software). However, given the significant public interest, SparkToro and the Twitter research tool Followerwonk (whose owner, Marc Mims, is a long-time friend) jointly conducted a rigorous analysis to answer the following questions:
- What is a spam or fake Twitter account?
- What percentage of active Twitter accounts are spam or fake?
- What percentage of Musk's followers are spam, fake or inactive?
- Why should SparkToro's methodology be trusted?
SparkToro addresses these questions one by one below.
What is a spam or fake Twitter account?
SparkToro's definition (which may differ from Twitter's own) is best described as follows.
"Spam or fake Twitter accounts are those that do not regularly have a human being personally composing their tweets, reading the activity on their timeline, or participating in the Twitter ecosystem."
Many "fake" accounts under this definition are neither nefarious nor problematic. For example, a considerable number of users find value in following bots such as @newsycombinator (which automatically shares front-page posts from Hacker News) or @_restaurant_bot (which tweets photos and links for random restaurants found through Google Maps). Arguably, these accounts make Twitter a better place. They simply have no human behind the device personally participating in the Twitter ecosystem.
By contrast, most "spam" accounts are an unwelcome nuisance. Their activities range from peddling propaganda and disinformation to pushing products, driving website clicks, attempting phishing or spreading malware, manipulating stocks or cryptocurrencies, and (perhaps worst of all) harassing or intimidating users of the platform.
SparkToro's Fake Followers methodology (described in detail below) attempts to identify all of these types of inauthentic users.
However, SparkToro's system does not attempt to identify Twitter accounts that are operated by humans at irregular intervals but show some automated behavior (for example, a company account with multiple users, such as SparkToro's own @sparktoro, or a community account run by one person, such as Aleyda Solis's @crawlingmondays). The team does not know how Twitter (or Musk) might choose to classify these accounts, but it prefers a relatively conservative interpretation of "spam/fake".
What percentage of active Twitter accounts are spam or fake?
To get the most comprehensive answer, SparkToro applied a single spam/fake account analysis process to five unique data sets (described below).
The data sets represented in the chart are:
- Followerwonk random sample (44058 accounts) - Followerwonk currently indexes 1.047 billion Twitter profiles, refreshed on a rolling cycle that takes about 30 days. Any account that is deleted (by the user or by Twitter) is removed and excluded from this count. By Followerwonk's definition, 130 million of these profiles are "recently active": they have tweeted in the past nine weeks and are public rather than "protected" (Twitter's term for private accounts).
Marc wrote code to randomly select public accounts from Followerwonk's active database and passed them to SparkToro for analysis. Casey of the SparkToro team then refreshed the list and ran the 44058 public, active accounts through the Fake Followers spam analysis process. Of these, 8555 accounts had characteristics strongly correlated with fake/spam accounts. The team believes this data set offers the best single answer to the question of how many active Twitter users are likely to be spam or fake.
- Aggregate average of the Fake Followers tool (about 500000 profiles run, more than 1 billion accounts analyzed) - over the past three and a half years of operation, SparkToro's Fake Followers tool has been run on 501532 unique accounts and has analyzed several thousand followers of each, for a total of more than 1 billion profiles (although these are not necessarily unique, and SparkToro does not track which profiles were analyzed as part of the process).
This is the largest set of accounts available to the team, but it includes many old accounts that have not tweeted in the past 90 days, so it likely does not match Twitter's definition of mDAU (monetizable daily active users). The team included it for comparison, and to show that an analysis of simply random Twitter accounts (as opposed to recently active ones) is likely to be less accurate.
- All followers of @elonmusk on Twitter (93.4 million accounts) - given the unusual interest in Musk's account and its central role in prompting this report, the research team thought it wise to include a complete analysis of the nearly 100 million accounts following @elonmusk. This data set includes older profiles that have not tweeted in the past 90 days (and so do not match Twitter's definition of mDAU).
- Active followers of @elonmusk on Twitter (26.8 million accounts) - for a fairer assessment of Musk's Twitter following, this set includes only accounts that have tweeted in the past 90 days. To match Followerwonk's methodology, the SparkToro team selected only the roughly 26.8 million accounts that meet this criterion and broke them out separately in the chart.
- Random sample of 100 followers of the @twitter account (100 accounts) - in a follow-up tweet on Friday, May 13, Musk said, "my team will randomly sample 100 followers of @twitter; I invite others to repeat the same process to see what they find."
Although the SparkToro team does not consider this process a rigorous or statistically significant sample, they included it for comparison. On May 14, they manually drew a random sample of accounts from @twitter's public followers page. To get the least biased sample possible, they included only public accounts, only accounts that had tweeted in the past 90 days (after February 12, 2022), and only accounts created before May 2021, that is, accounts that had been on Twitter for more than one year (many very recent accounts, especially given Musk's activity, could bias the sample).
- Twitter's recent earnings report estimate (number of accounts unknown) - in a recent tweet, Musk cited Twitter's public earnings report, which states that <5% of mDAU (monetizable daily active users, as defined in its 2019 report) are fake or spam. SparkToro added this estimate to the chart for comparison and notes that its methodology has not been disclosed.
Other researchers will no doubt produce their own estimates, hopefully from equally large and rigorous data sets. Given the limitations of Twitter's public data, the SparkToro team believes the most accurate estimate is that 19.42% of public accounts that have tweeted in the past 90 days are fake or spam.
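The headline figure follows directly from the counts reported for the Followerwonk sample (8555 flagged accounts out of 44058). The sketch below recomputes it and, as an added assumption that the report itself does not make, attaches a 95% normal-approximation interval. That interval reflects sampling error only; it says nothing about how accurately individual accounts were classified. The function name is illustrative.

```python
import math

# Counts reported in the article for the Followerwonk random sample.
SAMPLE_SIZE = 44058
FLAGGED = 8555


def proportion_with_interval(flagged: int, total: int, z: float = 1.96):
    """Return (share, lower, upper) using the normal approximation.

    z = 1.96 corresponds to a 95% interval; this is an illustrative
    choice, not something stated in the report.
    """
    share = flagged / total
    # Standard error of a sample proportion.
    std_err = math.sqrt(share * (1 - share) / total)
    return share, share - z * std_err, share + z * std_err


share, low, high = proportion_with_interval(FLAGGED, SAMPLE_SIZE)
print(f"share = {share:.2%}, interval = [{low:.2%}, {high:.2%}]")
```

With a sample this large the interval is under one percentage point wide, so uncertainty in the estimate is dominated by the classification method rather than by the sample size.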
What percentage of Musk's Twitter followers are spam, fake or inactive?
In October 2018, SparkToro analyzed all 54788369 Twitter followers of then US President Donald Trump. For this report, SparkToro repeated that process and analyzed all 93452093 followers of Musk's profile (as of May 14, 2022).
When a Fake Followers report is run through SparkToro's public tool, it analyzes a sample (several thousand) of the account's followers. For an account with a very large following, this sampling can deviate from what a comprehensive analysis of every follower would show. On Saturday, May 14 and Sunday, May 15, Casey Henry of SparkToro ran this comprehensive analysis on Musk's account to provide the most accurate figures possible.
The above is a breakdown of some of the factors used in SparkToro's spam analysis system. Overall, 70.23% of @elonmusk followers are unlikely to be real, active users who see his tweets. This is much higher than the median share of fake followers, but it is not surprising for several reasons.
- Very large accounts tend to have more fake/spam followers than other accounts.
- Accounts that receive a lot of media coverage and public attention (such as former US President Trump and Musk) tend to attract more fake/spam followers than other accounts.
- Accounts that Twitter recommends to new users (which usually include @elonmusk) tend to get more fake/spam followers.
Compared with the distribution for other Twitter accounts, the number of fake/spam followers of @elonmusk may look abnormal, but SparkToro neither believes nor implies that Musk is directly responsible for acquiring these suspicious followers. The most likely explanation is a combination of the factors above, amplified by Musk's active use of Twitter, media coverage of his tweets and Twitter's own recommendation system.
The SparkToro team also separately analyzed only the 26.8 million @elonmusk followers who had tweeted in the past 90 days. This filter matches the one applied to the Followerwonk data set and to the random followers of @twitter.
This more selective analysis found that 23.42% of these accounts are likely fake or spam, which is not far from the estimated global average.
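The 90-day activity filter described above can be sketched as a simple date comparison. This is a minimal illustration, not SparkToro's code: the `Follower` record, its field names and the sample handles are all hypothetical, and the real analysis ran against Twitter data that is not reproduced here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Follower:
    # Hypothetical record; field names are assumptions for illustration.
    handle: str
    last_tweet_at: Optional[datetime]  # None if the account never tweeted


def recently_active(followers: List[Follower], as_of: datetime,
                    window_days: int = 90) -> List[Follower]:
    """Keep only followers whose last tweet falls inside the window."""
    cutoff = as_of - timedelta(days=window_days)
    return [
        f for f in followers
        if f.last_tweet_at is not None and f.last_tweet_at >= cutoff
    ]


# Toy data, invented purely to exercise the filter.
followers = [
    Follower("a", datetime(2022, 5, 1)),    # tweeted two weeks earlier
    Follower("b", datetime(2021, 12, 1)),   # outside the 90-day window
    Follower("c", None),                    # never tweeted
]
active = recently_active(followers, as_of=datetime(2022, 5, 14))
print([f.handle for f in active])
```

Accounts with no tweets at all are treated as inactive here, which is one reasonable reading of "tweeted in the past 90 days".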
Why trust SparkToro and Followerwonk's methodology?
The data sets in the analysis above (except for the random 100 followers of @twitter, a method the research team does not endorse) are large enough, and the process rigorous enough, that the results can be replicated by any Twitter researcher with similar public access. The research team invites anyone interested to replicate the process used here on their own data set (described in detail below). Twitter publishes information about its API products.
Followerwonk draws its random sample only from accounts that have posted public tweets in the past 90 days, which is a clear sign of "activity". In addition, Followerwonk refreshes its profile database regularly (every 30 days) to remove any protected or deleted accounts. The team believes this sample is large enough, statistically significant and curated carefully enough to come as close as possible to what Twitter might consider a monetizable daily active user (mDAU).
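Drawing a uniform random sample from a pool of active account IDs can be sketched as follows. This is an assumption-laden illustration rather than Marc's actual code, which is not published in the report: the function name, the seed parameter and the integer IDs are hypothetical.

```python
import random
from typing import List, Optional


def draw_sample(active_ids: List[int], sample_size: int,
                seed: Optional[int] = None) -> List[int]:
    """Draw a uniform random sample without replacement.

    A fixed seed makes the draw repeatable, which helps when
    others want to audit or replicate the selection.
    """
    rng = random.Random(seed)
    return rng.sample(active_ids, sample_size)


# Toy pool standing in for the much larger active database.
pool = list(range(1000))
sample = draw_sample(pool, 50, seed=1)
print(len(sample), len(set(sample)))
```

Sampling without replacement guarantees that no account is counted twice, so every account in the pool has the same chance of being selected.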
SparkToro's Fake Followers analysis treats an account as fake if it triggers many of the signals displayed in the Fake Followers tool.
SparkToro's model for identifying fake accounts comes from a machine learning process run over tens of thousands of known spam (and real) Twitter accounts. Here is how SparkToro built the model.
In July 2018, the SparkToro team purchased 35000 fake Twitter followers from three different vendors of spam and bot accounts. The vendors had these accounts follow an empty Twitter account that was created in 2016 and had 0 followers as of July 2018. It took about 3 weeks for the 35000 followers to be delivered. Over the following three weeks, the team collected data on these fake/spam accounts every day.
In addition to the 35000 known spam accounts, the team randomly selected 50000 non-spam accounts from SparkToro's large index of profiles. This gave a total of 85000 accounts, which were run through a machine learning process on Amazon Web Services.
The 85000 accounts were divided into two groups, each mixing spam and non-spam accounts. Group A was the training set, and group B was the test set used to evaluate the model's performance.
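The split into a training group and a test group can be sketched as a shuffle followed by a cut. The report does not state the split ratio, so the 50/50 default below is an assumption, as are the function name, the seed and the placeholder account IDs.

```python
import random
from typing import List, Tuple

LabeledAccount = Tuple[str, int]  # (account id, label); 1 = spam, 0 = real


def split_labeled_accounts(
    spam_ids: List[str],
    real_ids: List[str],
    train_fraction: float = 0.5,  # assumed; the report gives no ratio
    seed: int = 0,
) -> Tuple[List[LabeledAccount], List[LabeledAccount]]:
    """Label, shuffle and split accounts into groups A and B."""
    labeled = [(a, 1) for a in spam_ids] + [(a, 0) for a in real_ids]
    rng = random.Random(seed)
    rng.shuffle(labeled)  # mix spam and non-spam before cutting
    cut = int(len(labeled) * train_fraction)
    return labeled[:cut], labeled[cut:]


# Placeholder IDs sized to match the counts in the article.
spam = [f"spam_{i}" for i in range(35000)]
real = [f"real_{i}" for i in range(50000)]
group_a, group_b = split_labeled_accounts(spam, real)
print(len(group_a), len(group_b))
```

Shuffling before the cut is what ensures both groups contain a mix of spam and non-spam accounts, as the report describes.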
The following data were used to generate the initial model:
- Profile image
- Profile URL
- Verified account status
- Language
- Tweet language
- Account age (days)
- Length of bio
- Number of followers
- Number of accounts followed
- Days since last tweet
- Number of tweets
- Number of times the account appears on lists
- Location
- Display name
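One way to picture the inputs listed above is as a single record per account, with the day-based values derived from dates. The class below is only an illustrative sketch: every field and method name is hypothetical, and the report does not describe how SparkToro actually stores or encodes these values.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AccountFeatures:
    # Hypothetical field names mirroring the list of model inputs.
    has_profile_image: bool
    profile_url: Optional[str]
    is_verified: bool
    language: Optional[str]
    tweet_language: Optional[str]
    created_on: date
    bio: str
    followers_count: int
    following_count: int
    last_tweet_on: Optional[date]
    tweet_count: int
    listed_count: int
    location: Optional[str]
    display_name: str

    def account_age_days(self, as_of: date) -> int:
        """Account age in days as of the given date."""
        return (as_of - self.created_on).days

    def days_since_last_tweet(self, as_of: date) -> Optional[int]:
        """Days since the last tweet, or None if there are no tweets."""
        if self.last_tweet_on is None:
            return None
        return (as_of - self.last_tweet_on).days

    def bio_length(self) -> int:
        return len(self.bio)


# Invented example account, used only to exercise the helpers.
example = AccountFeatures(
    has_profile_image=True, profile_url=None, is_verified=False,
    language="en", tweet_language="en", created_on=date(2021, 5, 14),
    bio="hello", followers_count=120, following_count=80,
    last_tweet_on=date(2022, 5, 4), tweet_count=300, listed_count=2,
    location=None, display_name="Example",
)
print(example.account_age_days(date(2022, 5, 14)))
```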
After finding a model that fit the data, the SparkToro team analyzed the results to determine which features were closely correlated with spam. Not surprisingly, no single feature has a 1:1 correlation with spam. However, many features show promise when combined. The following are examples of features correlated with spam accounts.
- Profile image - accounts that lack one are often spam.
- Account age (days) - some patterns are clearly correlated with spam (for example, when a large number of accounts created on the same day follow a specific account or send almost identical tweets).
- Number of followers - spam accounts tend to have very few followers.
- Days since last tweet - many spam accounts tweet rarely, and when they do, they tweet in a coordinated manner.
- Number of times the account appears on lists - spam accounts almost never appear on lists.
- Display name - some keywords and patterns are closely correlated with spam.
These are not the only signals, however. Other signals with a reasonable correlation with spam (especially when several apply to one account) also help build an effective model. Through trial and error (and, of course, pattern fitting), the team designed a scoring system that correctly identifies more than 65% of spam accounts. They deliberately prefer to miss some fake/spam accounts rather than mistakenly flag any real account as fake.
The key point to remember is that no single factor can tell them an account is spam. This is crucial. The more spam signals an account triggers, the more likely it is to be spam. SparkToro's Fake Followers system requires at least a handful of its 17 spam signals, and sometimes as many as 10 (depending on which signals they are and how predictive each is), before an account is rated "low-quality" or fake.
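The idea that several signals must fire together before an account is flagged can be sketched as a weighted count with a threshold. Everything specific below is an assumption made for illustration: the four signal names, their weights, the cutoffs and the threshold are invented, and they are not SparkToro's 17 actual signals or its real scoring rule.

```python
from typing import Any, Callable, Dict, List, Tuple

Account = Dict[str, Any]
# (signal name, weight, check); all three are hypothetical examples.
Signal = Tuple[str, float, Callable[[Account], bool]]

SIGNALS: List[Signal] = [
    ("no_profile_image", 1.0, lambda a: not a.get("has_profile_image", False)),
    ("very_few_followers", 1.0, lambda a: a.get("followers_count", 0) < 10),
    ("never_listed", 0.5, lambda a: a.get("listed_count", 0) == 0),
    ("empty_bio", 0.5, lambda a: len(a.get("bio", "")) == 0),
]

# Chosen so that no single signal can reach it on its own.
THRESHOLD = 2.5


def triggered_signals(account: Account) -> List[str]:
    """Names of the signals this account triggers."""
    return [name for name, _, check in SIGNALS if check(account)]


def spam_weight(account: Account) -> float:
    """Sum of the weights of all triggered signals."""
    return sum(weight for _, weight, check in SIGNALS if check(account))


def is_low_quality(account: Account) -> bool:
    return spam_weight(account) >= THRESHOLD


# Invented accounts: one trips a single signal, the other trips all four.
lone_signal = {"has_profile_image": False, "followers_count": 500,
               "listed_count": 3, "bio": "hi"}
many_signals = {"has_profile_image": False, "followers_count": 2,
                "listed_count": 0, "bio": ""}
print(is_low_quality(lone_signal), is_low_quality(many_signals))
```

Weighting signals differently is one simple way to express that the number of signals required depends on which ones fire and how predictive they are.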
This method may undercount spam and fake accounts, but it produces almost no false positives (that is, cases where an account is labeled fake when it is not).
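The trade-off described here, catching only part of the spam in exchange for very few false positives, is usually measured with recall and the false positive rate on a labeled test set. The sketch below shows the arithmetic on a tiny invented example; the labels and predictions are made up and are not results from SparkToro's test group.

```python
from typing import Dict, Sequence


def evaluate(labels: Sequence[int], predictions: Sequence[int]) -> Dict[str, float]:
    """Compute recall and false positive rate (1 = spam, 0 = real)."""
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y == 1 and p == 1)
    false_neg = sum(1 for y, p in pairs if y == 1 and p == 0)
    false_pos = sum(1 for y, p in pairs if y == 0 and p == 1)
    true_neg = sum(1 for y, p in pairs if y == 0 and p == 0)
    spam_total = true_pos + false_neg
    real_total = false_pos + true_neg
    return {
        # Share of actual spam accounts that were caught.
        "recall": true_pos / spam_total if spam_total else 0.0,
        # Share of real accounts wrongly flagged as spam.
        "false_positive_rate": false_pos / real_total if real_total else 0.0,
    }


# Invented toy data: three spam accounts, four real ones.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_predictions = [1, 1, 0, 0, 0, 0, 0]
metrics = evaluate(toy_labels, toy_predictions)
print(metrics)
```

In this toy case one spam account is missed and no real account is flagged, which mirrors the conservative behavior the team says it prefers.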
Applying this model to the roughly 44000 random, recently active accounts provided by Followerwonk yields a quality score for each account, as shown in the figure below.
The more spam-related flags an account triggers, the lower its quality score in the system. SparkToro's conservative approach means that only scores of 3, 2 and 1 are treated as fake/spam accounts, and the combined total of these three produces the final estimate. The best interpretation is that 19.42% of recently active public Twitter profiles are likely to be fake or spam.
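Turning per-account quality scores into a single fake/spam share is a matter of counting the accounts that fall in the lowest three scores. The sketch below assumes integer scores and uses an invented list of ten scores purely for illustration; the real distribution is the one shown in the report's figure and is not reproduced here.

```python
from collections import Counter
from typing import Iterable

# Scores the article treats as fake/spam.
FAKE_SCORES = {1, 2, 3}


def fake_share(scores: Iterable[int]) -> float:
    """Fraction of accounts whose quality score is 1, 2 or 3."""
    counts = Counter(scores)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no scores provided")
    flagged = sum(counts[s] for s in FAKE_SCORES)
    return flagged / total


# Invented toy scores; three of the ten fall in the fake/spam band.
toy_scores = [1, 3, 5, 8, 9, 2, 7, 10, 6, 4]
toy_share = fake_share(toy_scores)
print(f"{toy_share:.0%}")
```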