# Mining used-car sales¶

This is a running log of some work on used-car saled. I have no intention to use this information for financial purposes, rather I'd like to ask the question "is there regional variation in used-car prices?". To do this I will use the `BeautifulSoup`

package to look at www.pistonheads.com, my preferred way to search for used cars.

```
from BeautifulSoup import BeautifulSoup
import urllib
import pandas as pd
import seaborn
import numpy as np
import scipy.optimize as so
%matplotlib inline
import seaborn as sns
sns.set_style(rc={'font.family': ['sans-serif'],'axis.labelsize': 25})
sns.set_context("notebook")
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['axes.labelsize'] = 18
```

After some fiddling I have wrapped up the method to get a page of results (100 at a time) in a function:

```
def strip_results_from_ad(ad):
""" Strip out the information from a single advert and add it to the results dictionary"""
desc = ad.find("div", attrs={"class": "listing-headline"}).find("h3").text
loc = ad.findAll("p", attrs={"class": "location"})[0].text
price = int(ad.find("div", attrs={"class": "price"}).text.replace(u"£", "").replace(",", ""))
specs = ad.find("ul", attrs={"class": "specs"}).findAll("li")
if len(specs) == 4:
miles = int(specs[0].text.rstrip(" miles").replace(",", ""))
fuel = specs[1].text
bhp = int(specs[2].text.rstrip(" bhp"))
transmission = specs[3].text
else:
fuel = "NA"
bhp = np.nan
transmission = "NA"
try:
miles = int(specs[0].text.rstrip(" miles").replace(",", ""))
except:
# Except any error...!
miles = np.nan
return desc, loc, price, miles, fuel, bhp, transmission
def create_url(page=1, M=269, rpp=100):
base = ("http://www.pistonheads.com/classifieds?Category=used-cars"
"&M={M}&ResultsPerPage={rpp}&Page={page}")
return base.format(page=page, rpp=rpp, M=M)
def get_results(*args, **kwargs):
url = create_url(*args, **kwargs)
f = urllib.urlopen(url).read()
soup = BeautifulSoup(f)
ads = soup.findAll("div", attrs={"class": "ad-listing"})
results = {"desc":[], "loc":[], "price":[], "miles":[], "fuel":[], "bhp":[], "transmission":[]}
for ad in ads:
try:
desc, loc, price, miles, fuel, bhp, transmission = strip_results_from_ad(ad)
except:
break
results["desc"].append(desc)
results["loc"].append(loc)
results["price"].append(price)
results["miles"].append(miles)
results["fuel"].append(fuel)
results["bhp"].append(bhp)
results["transmission"].append(transmission)
return results
```

### An example¶

Here we get 100 results from the url with "M" 269. On the pistonheads website this corresponds to a BMW 1 series. I choose this search in particular as there isn't huge variation in the specs, most have a similar engine size and there are few special runs. Let's have a look at the first few results

```
r = get_results(M=269, rpp=20)
df = pd.DataFrame(r)
df
```

This looks good so far, now from the source website we see there is 2,580 entries so let's pull down a few and see what we can do with the data. The best way to do this is to iterate through the page counts and concatanate the resulting data frames:

```
dfs = []
for page in xrange(1, 25):
r = get_results(M=269, rpp=100, page=page)
dfs.append(pd.DataFrame(r))
```

```
df = pd.concat(dfs)
len(df)
```

Okay so now we have a good number of adverts to look at.

### Mileage VS Price¶

```
ax = df.sort("price").plot("miles", "price", kind="scatter")
```

This is kind of messy and I would be interested to know what those low-milage high price outliers indicate

### Outliers¶

We will use the `pandas`

query module (which requires `numexpr`

to be installed) to have a look at these outliers.

```
df.query("miles < 50000 and price > 30000")
```

As we can see these are all the "1M" series, a much faster version. The BHP is 335 which is significantly greater than the average:

```
df.bhp.mean()
```

### Averaged Price VS Mileage¶

Let's clean the data by removing all `NaN`

(this is typically ill-advised, but this is just for fun) and also get rid of those high-performance cars

```
df_clean = df[~np.isnan(df.price)]
df_clean = df_clean[~np.isnan(df_clean.miles)]
df_clean = df_clean[df_clean.price < 30000]
df_clean = df_clean.sort("miles")
```

And now we have a clean collection, we will apply a rolling mean to the price and milage:

```
ax = plt.subplot(111)
window = 100
mean_miles = pd.rolling_mean(df_clean.miles, window, center=True)
mean_price = pd.rolling_mean(df_clean.price, window, center=True)
# Drop the nans created in the rolling_window
mean_miles = mean_miles[~np.isnan(mean_miles)]
mean_price = mean_price[~np.isnan(mean_price)]
ax.plot(df_clean.miles, df_clean.price, "o", alpha=0.3, markersize=5)
ax.plot(mean_miles, mean_price, lw=3, color=seaborn.xkcd_rgb["pale red"])
ax.set_xlabel("Mileage")
ax.set_ylabel(u"Price (£)")
plt.show()
```

### Fitting to the data¶

We may now imagine coming up with a model for the price. The obvious answer is to begin by stating that the depreciation is proportional to the mileage, therefore:

$$ \frac{dP}{dm} = -k P $$where $P$ is the price, $m$ is the mileage and $k$ is a constant of proportionality. Solving in the usual way yields:

$$ P(m) = P_{0} e^{-km} $$Let's try fitting this:

```
import scipy.optimize as so
ax = plt.subplot(111)
def P(m, P0, k):
return P0 * np.exp(-k * m)
popt, pcov = so.curve_fit(P, mean_miles, mean_price, p0=[20000, 1e-10])
ax.plot(df_clean.miles, df_clean.price, "o", alpha=0.3, markersize=5)
ax.plot(mean_miles, mean_price, lw=3, color=seaborn.xkcd_rgb["pale red"])
fit_miles = np.linspace(df_clean.miles.min(), df_clean.miles.max(), 1000)
ax.plot(fit_miles, P(fit_miles, *popt), zorder=10)
ax.annotate(xy=(0.6, 0.7), xycoords="figure fraction",
s=("$P_0={:5.0f} \pm {:5.0f}$\n$k={:1.2e} \pm {:1.2e}$".format(
popt[0], np.sqrt(pcov[0, 0]), popt[1], np.sqrt(pcov[1, 1]))),
fontsize=20)
ax.set_xlabel("Mileage")
ax.set_ylabel(u"Price (£)")
plt.show()
```

This model seems fairly able to predict the average used car sale price. The parameter $P_0$ measures the value at 0 miles, of course all used-cars should have at least some mileage to qualify as used, but a surprising number have very low mileage:

```
len(df.query("miles < 500"))
```

As a result we have predicted the initial price to be $20259 \pm 23$ which isn't all that far off the actual new-car prices $21180-21710$.

```
```