When doing your due diligence before buying a website, you frequently find backlinks from hacked websites in Ahrefs (and probably in other tools). This is a common attack: a black hat hacks websites and then sells backlink placements on them.
Studying these hacked websites, I noticed a pattern: the malicious pages are located inside a subdirectory with a short random name.
I immediately asked myself if this could be automated. Building a list of hacked websites could be interesting – more on this later.
The process
To achieve that, I implemented the following process:
- Finding some niche websites with affiliate links, as they are more likely to have such backlinks,
- Searching for the backlinks of these sites in Ahrefs and exporting the list,
- Feeding the exported file to a custom script that extracts all the hacked websites pointing to the niche website.
Obviously, the first two steps are not difficult.
Exporting from Ahrefs
To find some niche websites, use Google with queries like “what are the best vacuum cleaners”.
To get the backlinks of these sites in Ahrefs, I use the following filters:
- I set the minimum DR to 5 (it could be higher); the goal is to filter out the least interesting websites.
- I toggle “New” and “Live links only”, with the maximum history allowed by my account (6 months).
With these options enabled, the export file will contain only live backlinks.
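The export itself is a plain CSV file with one backlink per row. As a minimal sketch of how the script later reads it (assuming the referring pages sit in a column named “Referring page URL”; the exact header may vary between Ahrefs export formats):
function readBacklinks(string $csvPath): array
{
    // Collect the referring page URLs from the Ahrefs export.
    // The column name is an assumption; adjust it to your export.
    $urls = [];
    $handle = fopen($csvPath, 'r');
    $header = fgetcsv($handle);
    $column = array_search('Referring page URL', $header, true);
    while (($row = fgetcsv($handle)) !== false) {
        $urls[] = $row[$column];
    }
    fclose($handle);
    return $urls;
}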
Developing the script
The script parses the exported file from Ahrefs and then, for each backlink, checks whether the first directory of the URL has a random name.
For instance, considering the URL https://site.com/dir1/dir2/file.html, the script will say: “dir1” is a random name, so the website has likely been hacked.
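To make this concrete, PHP’s parse_url() plus a split on “/” gives direct access to that first directory (note that the leading slash puts it at index 1):
$parsedUrl = parse_url('https://site.com/dir1/dir2/file.html');
$sections = explode('/', $parsedUrl['path']); // ['', 'dir1', 'dir2', 'file.html']
echo $sections[1]; // "dir1", the name to test for randomness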
Knowing whether a directory name is random is not trivial; it calls for some machine learning. I used the Python library Gibberish Detector, which implements a Markov chain to do the work.
As explained in the instructions, the model should be trained first, but it’s very fast:
gibberish-detector train examples/big.txt > big.model
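Once the model is trained, the tool can be queried from the command line. Based on the flags used in the function below, a check on a random-looking string such as “qjkzrtx” (a made-up example) looks like this:
gibberish-detector detect --model big.model --string "qjkzrtx"
The tool prints True for gibberish and False otherwise, which is what the PHP code below relies on.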
I put the big.model file into my PHP project. Then I have this function:
public function detect()
{
    $parsedUrl = parse_url($this->url);

    // Dismiss backlinks coming from non-HTTPS websites.
    if (($parsedUrl['scheme'] ?? '') !== 'https') {
        return false;
    }

    // The path starts with '/', so the first directory is at index 1.
    $sections = explode('/', $parsedUrl['path'] ?? '');
    if (count($sections) <= 1) {
        return false;
    }

    // Dismiss first directories shorter than 5 characters.
    if (strlen($sections[1]) < 5) {
        return false;
    }

    // Ask the Gibberish Detector whether the name looks random.
    $command = 'gibberish-detector detect --model "../big.model" --string ' . escapeshellarg($sections[1]);
    $res = trim((string) shell_exec($command));
    return $res !== 'False';
}
For each backlink, we first dismiss the ones coming from a non-HTTPS website (which may indicate that this is not a serious website).
Then, we dismiss all websites where the first directory has fewer than 5 characters, since the random directory names I observed all have at least 5 characters.
Once done, we launch our Gibberish Detector and return the boolean value.
Launching a Python command from PHP is a bit ugly, but we don’t care.
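Assuming detect() lives in a small wrapper class (a hypothetical name, BacklinkChecker, taking the URL in its constructor; only detect() itself comes from the code above), scanning a whole export could then look like this, using the readBacklinks() sketch from earlier:
// Hypothetical driver tying the pieces together.
foreach (readBacklinks('export.csv') as $url) {
    $checker = new BacklinkChecker($url);
    if ($checker->detect()) {
        echo $url . PHP_EOL; // likely hosted on a hacked website
    }
}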
The results
It works, but we still have some false positives.
To improve this script, we can also dismiss all URLs where the first directory contains the character “%”, as it indicates percent-encoded characters, usually used by non-Latin languages. And since the random names observed are short, we can dismiss directories longer than 10 characters at the same time. Therefore, I added this test:
if ((strpos($sections[1], '%') !== false) || (strlen($sections[1]) > 10)) {
    return false;
}
Let’s re-run our script and see the final results.
Statistics for this first try
The statistics are as follows:
- The Ahrefs exported file contained 4986 backlinks.
- ~ 319 of these backlinks are on hacked websites.
I don’t know yet whether this is the same for every niche website, but it means that in this example, AT LEAST 6% of the backlinks (319 / 4986 ≈ 6.4%) come from hacked websites. This is HUGE.
Next steps
I am unsure whether this data and this process for finding hacked websites can be exploited profitably, black hat or white hat. I would not do it the black hat way, but it’s still good to know that it can be done.
I will detail these aspects in a future article to push the idea further.
I will also try to build a significant sample to confirm the 6% statistic.