Pandas - read_html()

When I ran the code below: df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_population') I got this error: HTTPError: HTTP Error 403: Forbidden. I can open that URL in a browser.
 
ChatGPT said:

This is a common issue when using pandas.read_html() on Wikipedia.
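Even though you can open the page in a browser, Wikipedia tends to reject requests whose User-Agent header is missing or identifies a generic script. read_html() fetches the URL with a plain request underneath (via urllib), which announces itself as Python-urllib rather than as a browser, so it often gets rejected with HTTP 403 Forbidden.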

✅ Why your code failed

  • Browser requests include headers such as User-Agent and cookies. → Wikipedia allows them.

  • The default request from pandas.read_html() looks like an automated bot. → Wikipedia blocks it. → 403 Forbidden.
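To see what a default urllib request announces itself as, here is a minimal sketch using only the standard library. The helper name default_urllib_user_agent is made up for illustration, and the exact version number in the output depends on your Python installation.

```python
import urllib.request


def default_urllib_user_agent():
    """Return the User-Agent that a plain urllib opener sends by default.

    Hypothetical helper for illustration; not part of pandas or urllib.
    """
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs applied to every request
    headers = dict(opener.addheaders)
    return headers["User-agent"]


# Prints a value such as Python-urllib/3.x (version depends on your Python)
print(default_urllib_user_agent())
```

A value like this identifies the client as a script, not a browser, which is consistent with the 403 response described above.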

✅ How to fix it

You need to supply a proper User-Agent header by downloading the page yourself (using requests) and then passing the HTML to read_html().

✔ Working code

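import pandas as pd
import requests
from io import StringIO

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)
response.raise_for_status()   # raise an exception on HTTP errors such as 403

# Wrap the HTML in StringIO: recent pandas versions deprecate passing a raw HTML string
dfs = pd.read_html(StringIO(response.text))
print(len(dfs))   # number of tables found on the page
dfs[0].head()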

This works because:

  • requests.get() sends the browser-like User-Agent we set in headers (its own default, python-requests/<version>, does not look like a browser).

  • Wikipedia returns the page.

  • pd.read_html() parses tables from the HTML content.

You should see the first table as expected.
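If you would rather not depend on requests, the same idea can be sketched with only the standard library's urllib. This is an illustrative variant, not the only way to do it: build_request and fetch_tables are hypothetical helper names, the 30-second timeout is an arbitrary choice, and the fallback to UTF-8 is an assumption about the page's encoding.

```python
import urllib.request
from io import StringIO

URL = "https://en.wikipedia.org/wiki/List_of_countries_by_population"


def build_request(url, user_agent="Mozilla/5.0"):
    """Create a urllib Request carrying an explicit User-Agent header.

    Hypothetical helper; building the Request does not touch the network.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_tables(url=URL):
    """Download the page with a custom User-Agent and parse its tables.

    Hypothetical helper; requires pandas plus an HTML parser such as lxml.
    """
    import pandas as pd  # imported here so build_request works without pandas

    with urllib.request.urlopen(build_request(url), timeout=30) as resp:
        # Assumption: fall back to UTF-8 if the server does not declare a charset
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset)
    # Pass a file-like object instead of a raw HTML string
    return pd.read_html(StringIO(html))
```

Calling dfs = fetch_tables() should give the same kind of list of DataFrames as the requests version above.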

 

You said:
In the code you gave: headers = {"User-Agent": "Mozilla/5.0"} But my browser is Chrome, will it still work or do I need to modify the code?
 
ChatGPT said:

You do not need to modify anything.
"Mozilla/5.0" is a generic, browser-like User-Agent value, and it is generally enough for Wikipedia.

Why it works

Most modern browsers (including Chrome, Edge, Firefox) send a User-Agent string that begins with:

Mozilla/5.0

For example, your actual Chrome User-Agent probably looks like:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36

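But Wikipedia does not require the full long string. In practice it appears to reject mainly requests whose User-Agent is missing or names a generic script library, so a short browser-like value gets through. For regular automated use, Wikimedia's User-Agent policy asks for a descriptive User-Agent that identifies your tool and includes contact information.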

✔ Conclusion

  • headers = {"User-Agent": "Mozilla/5.0"} is enough

  • No need to pretend to be Chrome specifically

  • The code works the same regardless of which browser you use