Question

We have this code that extracts data from an iframe (thanks to Cody):

import requests
from bs4 import BeautifulSoup

s = requests.Session()
r = s.get("https://www.aliexpress.com/store/feedback-score/1665279.html")

soup = BeautifulSoup(r.content, "html.parser")
iframe_src = soup.select_one("#detail-displayer").attrs["src"]

r = s.get(f"https:{iframe_src}")

soup = BeautifulSoup(r.content, "html.parser")
for row in soup.select(".history-tb tr"):
    print("\t".join([e.text for e in row.select("th, td")]))

It returns this:

Feedback    1 Month 3 Months    6 Months
Positive (4-5 Stars)    154 562 1,550
Neutral (3 Stars)   8   19  65
Negative (1-2 Stars)    8   20  57
Positive feedback rate  95.1%   96.6%   96.5%

We need this output with everything on a single row:

How can we do that?

Answer 1

Just use set_index and unstack:

df:

                 Feedback 1 Month 3 Months 6 Months    store
0    Positive (4-5 Stars)     154      562    1,550  1665279
1       Neutral (3 Stars)       8       19       65  1665279
2    Negative (1-2 Stars)       8       20       57  1665279
3  Positive feedback rate   95.1%    96.6%    96.5%  1665279

Then:

df = df[~df['Feedback'].str.contains('Positive feedback rate')]
new = df.set_index(['store', 'Feedback']).unstack(level=1)
# flatten the (period, feedback) column pairs with f-strings in a list comprehension
new.columns = [f'{x} {y[:3]}' for x, y in new.columns]
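
As a self-contained sketch of the steps above, the frame can be built by hand from the values shown in the question and then reshaped. The literal data and the string store id are assumptions made for illustration; in practice df would come from the scraped rows.

```python
import pandas as pd

# Sample data copied from the table shown in the question.
df = pd.DataFrame({
    "Feedback": ["Positive (4-5 Stars)", "Neutral (3 Stars)",
                 "Negative (1-2 Stars)", "Positive feedback rate"],
    "1 Month": ["154", "8", "8", "95.1%"],
    "3 Months": ["562", "19", "20", "96.6%"],
    "6 Months": ["1,550", "65", "57", "96.5%"],
    "store": ["1665279"] * 4,
})

# Drop the percentage row so only the three star-count rows remain.
df = df[~df["Feedback"].str.contains("Positive feedback rate")]

# Move 'Feedback' from the row index into the columns: one row per store.
new = df.set_index(["store", "Feedback"]).unstack(level=1)

# Flatten the (period, feedback) column pairs, e.g. '1 Month Pos'.
new.columns = [f"{x} {y[:3]}" for x, y in new.columns]

print(new.to_string())
```

The result has a single row indexed by the store id and nine columns, one for each period and feedback type.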

Or you can use pivot:

df = df[~df['Feedback'].str.contains('Positive feedback rate')]
# keyword arguments are required by recent pandas versions
new = df.pivot(index='store', columns='Feedback')
new.columns = [f'{x} {y[:3]}' for x, y in new.columns]
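
To check that the two approaches agree, a small sketch can run both on the same hand-built frame and compare the results. The data below is copied from the question with the rate row already left out; the names via_pivot and via_unstack are illustrative only.

```python
import pandas as pd

# Star-count rows from the question's table (the rate row is left out).
df = pd.DataFrame(
    [
        ["Positive (4-5 Stars)", "154", "562", "1,550", "1665279"],
        ["Neutral (3 Stars)", "8", "19", "65", "1665279"],
        ["Negative (1-2 Stars)", "8", "20", "57", "1665279"],
    ],
    columns=["Feedback", "1 Month", "3 Months", "6 Months", "store"],
)

# Reshape the same data with both approaches.
via_pivot = df.pivot(index="store", columns="Feedback")
via_unstack = df.set_index(["store", "Feedback"]).unstack(level=1)

# Compare column labels and cell values of the two one-row frames.
same_columns = via_pivot.columns.tolist() == via_unstack.columns.tolist()
same_values = via_pivot.values.tolist() == via_unstack.values.tolist()
print(same_columns, same_values)
```

With no values argument, pivot keeps every remaining column, which is why it yields the same (period, feedback) column pairs as unstack here.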

Performance of the two is roughly the same:

unstack: 3.61 ms ± 186 µs per loop (mean ± std. dev. of 3 runs, 1000 loops each)

pivot: 3.59 ms ± 114 µs per loop (mean ± std. dev. of 3 runs, 1000 loops each)

Answer 2

Here is the complete code that does the job.

import pandas as pd
import requests
from bs4 import BeautifulSoup

pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 50)

url = "https://www.aliexpress.com/store/feedback-score/1665279.html"
s = requests.Session()
r = s.get(url)

soup = BeautifulSoup(r.content, "html.parser")
iframe_src = soup.select_one("#detail-displayer").attrs["src"]

r = s.get(f"https:{iframe_src}")

soup = BeautifulSoup(r.content, "html.parser")
rows = []
for row in soup.select(".history-tb tr"):
    print("\t".join([e.text for e in row.select("th, td")]))
    rows.append([e.text for e in row.select("th, td")])
print()

df = pd.DataFrame.from_records(
    rows,
    columns=['Feedback', '1 Month', '3 Months', '6 Months'],
)

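# remove the first row, which holds the column names; copy so that
# adding a column below does not trigger SettingWithCopyWarning
df = df.iloc[1:].copy()
# the shop id is the file name part of the URL, e.g. '1665279'
df['Shop'] = url.split('/')[-1].split('.')[0]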

pivot = df.pivot(index='Shop', columns='Feedback')
pivot.columns = [' '.join(col).strip() for col in pivot.columns.values]

column_mapping = dict(
    zip(pivot.columns.tolist(), [col[:12] for col in pivot.columns.tolist()]))
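# column_mapping
# (note: truncating to 12 characters makes the 'Positive (4-5 Stars)' and
# 'Positive feedback rate' names collide; choose distinct names in practice)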
# {'1 Month Negative (1-2 Stars)': '1 Month Nega',
#  '1 Month Neutral (3 Stars)': '1 Month Neut',
#  '1 Month Positive (4-5 Stars)': '1 Month Posi',
#  '1 Month Positive feedback rate': '1 Month Posi',
#  '3 Months Negative (1-2 Stars)': '3 Months Neg',
#  '3 Months Neutral (3 Stars)': '3 Months Neu',
#  '3 Months Positive (4-5 Stars)': '3 Months Pos',
#  '3 Months Positive feedback rate': '3 Months Pos',
#  '6 Months Negative (1-2 Stars)': '6 Months Neg',
#  '6 Months Neutral (3 Stars)': '6 Months Neu',
#  '6 Months Positive (4-5 Stars)': '6 Months Pos',
#  '6 Months Positive feedback rate': '6 Months Pos'}
pivot.columns = [column_mapping[col] for col in pivot.columns]

pivot.to_excel('Report.xlsx')

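You may want to sort pivot.columns manually, because they are sorted alphabetically ('1 Month Negative (1-2 Stars)' comes before '1 Month Neutral (3 Stars)'). Once the column mapping is set up, you only need to choose a suitable name for each column and they will be mapped accordingly (so you don't have to reorder them every time you decide, for instance, to swap the Neutral and Negative positions). This works thanks to the dictionary lookup.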
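
One possible way to combine manual ordering with the name mapping is sketched below. It assumes a hand-written dictionary whose insertion order is the desired column order; the short names ('1M Positive' and so on) and the one-row sample frame are hypothetical, not part of the original answer.

```python
import pandas as pd

# Hypothetical one-row frame using the flattened names produced above
# (only the 1 Month columns, to keep the sketch short).
pivot = pd.DataFrame(
    [["8", "8", "154"]],
    index=pd.Index(["1665279"], name="Shop"),
    columns=[
        "1 Month Negative (1-2 Stars)",
        "1 Month Neutral (3 Stars)",
        "1 Month Positive (4-5 Stars)",
    ],
)

# Illustrative names chosen by hand. The dict's insertion order is the
# desired output order, so swapping two entries swaps the two columns.
column_mapping = {
    "1 Month Positive (4-5 Stars)": "1M Positive",
    "1 Month Neutral (3 Stars)": "1M Neutral",
    "1 Month Negative (1-2 Stars)": "1M Negative",
}

# Select the columns in the chosen order, then rename via dictionary lookup.
ordered = pivot[list(column_mapping)].rename(columns=column_mapping)
print(ordered.to_string())
```

Because both the selection and the renaming are driven by the same dictionary, changing the order or a name means editing only that one place.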

Python Pandas - rearranging table data into 1 row
