Showing posts with label mysql. Show all posts
Showing posts with label mysql. Show all posts

Friday, October 7, 2016

Analyzing free proxies from public websites


Recently,  I've been scraping free proxies from public websites like hidemyass and incloak.

I added another source to the scraper to pull proxies from my-proxy.com and stored the proxies to my local MySQL database.

Here's the table structure:

+------------------+--------------+------+-----+-------------------+----------------+
| Field            | Type         | Null | Key | Default           | Extra          |
+------------------+--------------+------+-----+-------------------+----------------+
| id               | int(11)      | NO   | PRI | NULL              | auto_increment |
| scrape_timestamp | datetime     | YES  |     | CURRENT_TIMESTAMP |                |
| last_update      | varchar(255) | YES  |     | NULL              |                |
| ip_address       | varchar(255) | YES  |     | NULL              |                |
| port             | varchar(255) | YES  |     | NULL              |                |
| country          | varchar(255) | YES  |     | NULL              |                |
| speed            | int(11)      | YES  |     | NULL              |                |
| connection_time  | int(11)      | YES  |     | NULL              |                |
| type             | varchar(255) | YES  |     | NULL              |                |
| anonymity        | varchar(255) | YES  |     | NULL              |                |
| source           | varchar(255) | YES  |     | NULL              |                |
| tested           | tinyint(1)   | YES  |     | NULL              |                |
| verified         | tinyint(1)   | YES  |     | NULL              |                |
| used             | tinyint(1)   | YES  |     | NULL              |                |
| date_used        | datetime     | YES  |     | NULL              |                |
+------------------+--------------+------+-----+-------------------+----------------+

Here's the data gathered after using whatismyipaddress.com for verifying the list.

SELECT t1.source,
       t2.total,
       t1.verified,
       (SELECT Truncate(( t1.verified / t2.total * 100 ) + 0.06, 0)) AS
       percent_working
FROM   (SELECT source,
               Count(*) AS verified
        FROM   proxies
        WHERE  verified = 1
        GROUP  BY source) AS t1
       LEFT JOIN (SELECT source,
                         Count(*) AS total
                  FROM   proxies
                  GROUP  BY source) AS t2
              ON t1.source = t2.source; 



+-----------+-------+----------+-----------------+
| source    | total | verified | percent_working |
+-----------+-------+----------+-----------------+
| hidemyass |   309 |       74 |              24 |
| incloak   |   436 |       78 |              17 |
| my-proxy  |   900 |       55 |               6 |
+-----------+-------+----------+-----------------+

Graph generated here: https://jsfiddle.net/L3vnuyx6/


my-proxy has the highest number of free proxies. It also has the lowest number of verified working proxies.

hidemyass has the highest percentage (verified / total)

incloak has the highest number of verified proxies in terms of count.